Example, Latency and Tuning - Technology Performance Pulse

Foundation Model for Personalized Recommendation

The Netflix TechBlog

MARCH 28, 2025

Yet, many are confined to a brief temporal window due to constraints in serving latency or training costs. It facilitates the distribution of these learnings to other models, either through shared model weights for fine tuning or directly through embeddings. However, certain features require special attention.

Tuning

Tuning Efficiency Latency Strategy

Netflix’s Distributed Counter Abstraction

The Netflix TechBlog

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Latency

Latency Cache Infrastructure Strategy

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Its partitioned log architecture supports both queuing and publish-subscribe models, allowing it to handle large-scale event processing with minimal latency. Apache Kafka uses a custom TCP/IP protocol for high throughput and low latency. Apache Kafka, designed for distributed event streaming, maintains low latency at scale.

Latency

Latency Analytics Architecture Storage

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

Analyzing impression history, for example, might help determine how well a specific row on the home page is functioning or assess the effectiveness of a merchandising strategy. Automating Performance Tuning with Autoscalers Tuning the performance of our Apache Flink jobs is currently a manual process.

Tuning

Tuning Latency Efficiency Storage

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

The Challenge of Title Launch Observability: As engineers, we’re wired to track system metrics like error rates, latencies, and CPU utilization, but what about metrics that matter to a title’s success? Some examples: Why is title X not showing on the Coming Soon row for a particular member?

Traffic

Traffic Scalability Strategy Monitoring

Performance Tuning Java Applications in Linux

DZone

DECEMBER 4, 2019

You may also like: How to Properly Plan JVM Performance Tuning. When performance tuning an application, both the code and the hardware running it should be accounted for. Polling threads are an example where you might want to do this. For low latency, applications use the Concurrent Mark Sweep (CMS) or G1 garbage collector.

Tuning

Tuning Java Performance Hardware

Migrating Critical Traffic At Scale with No Downtime — Part 1

The Netflix TechBlog

MAY 4, 2023

Migrating Critical Traffic At Scale with No Downtime — Part 1. By: Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. For example, if some fields in the responses are timestamps, those will differ.
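The snippet's point about timestamp fields can be illustrated with a minimal sketch of comparing two responses while ignoring fields expected to differ. This is not the method described in the article; the function names, the `ignored` parameter, and the sample field names are all hypothetical.

```python
from typing import Any, FrozenSet


def strip_fields(payload: Any, ignored: FrozenSet[str]) -> Any:
    """Return a copy of payload with the ignored keys removed at every nesting level."""
    if isinstance(payload, dict):
        return {
            key: strip_fields(value, ignored)
            for key, value in payload.items()
            if key not in ignored
        }
    if isinstance(payload, list):
        return [strip_fields(item, ignored) for item in payload]
    return payload


def responses_match(
    old: Any, new: Any, ignored: FrozenSet[str] = frozenset({"timestamp"})
) -> bool:
    """Compare two responses after dropping fields that are expected to differ."""
    return strip_fields(old, ignored) == strip_fields(new, ignored)


# Hypothetical responses from an existing path and a replacement path.
old_response = {"id": 7, "items": [{"name": "a", "timestamp": "2023-05-04T10:00:00Z"}]}
new_response = {"id": 7, "items": [{"name": "a", "timestamp": "2023-05-04T10:00:02Z"}]}
print(responses_match(old_response, new_response))
```

Dropping the volatile fields before comparing keeps the equality check simple, at the cost of not validating those fields at all.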

Traffic

Traffic Latency Tuning Systems

Optimizing your Kubernetes clusters without breaking the bank

Dynatrace

JANUARY 14, 2022

Kubernetes microservices applications are a striking example of the complexity of today’s modern application and IT stacks. Tuning thousands of parameters has become an impossible task to achieve via a manual and time-consuming approach. SREcon21 – Automating Performance Tuning with Machine Learning. lower than 2%.).

Latency

Latency Tuning Efficiency AWS

Introducing Netflix’s Key-Value Data Abstraction Layer

The Netflix TechBlog

SEPTEMBER 18, 2024

These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. It also serves as central configuration of access patterns such as consistency or latency targets.

Latency

Latency Storage Cache Servers

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

By: Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data, often reaching petabytes, with millisecond access latency has become increasingly vital.

Latency

Latency Storage Traffic Tuning

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Dynatrace

JULY 22, 2024

Traces are used for performance analysis, latency optimization, and root cause analysis. For example, a company using a log aggregation tool can use OpenTelemetry to gain additional trace data without disrupting its setup, thus enabling a gradual and smooth transition from legacy systems to modern observability. Contextualize data.

Latency

Latency Best Practices Metrics Open Source

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

The Netflix TechBlog

OCTOBER 27, 2020

For example, in order to enhance our user experience, one online application fetches subscribers’ preferences data to recommend movies and TV shows. The data warehouse is not designed to serve point requests from microservices with low latency. Let’s look at an example of a Bulldozer YAML configuration (Figure 3).

Latency

Latency Storage Big Data Tuning

Bending pause times to your will with Generational ZGC

The Netflix TechBlog

MARCH 5, 2024

Reduced tail latencies In both our GRPC and DGS Framework services, GC pauses are a significant source of tail latencies. For a given CPU utilization target, ZGC improves both average and P99 latencies with equal or better CPU utilization when compared to G1. No explicit tuning has been required to achieve these results.

Latency

Latency Java Tuning Efficiency

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render. For example, is it more correct for an array to be empty or null, or is it just noise? The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system.

Traffic

Traffic Latency Metrics Cache

The Netflix Cosmos Platform

The Netflix TechBlog

MARCH 1, 2021

It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. For example, a video encoding service is built of components that are scale-agnostic: API, workflow, and functions.

Serverless

Serverless Media Latency Social Media

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

The Netflix TechBlog

SEPTEMBER 10, 2024

Dynomite is a Netflix open source wrapper around Redis that provides a few additional features like auto-sharding and cross-region replication, and it provided Pushy with low latency and easy record expiry, both of which are critical for Pushy’s workload. As Pushy’s portfolio grew, we experienced some pain points with Dynomite.

Latency

Latency Cache Tuning Efficiency

How Dynatrace boosts production resilience with Site Reliability Guardian

Dynatrace

MAY 17, 2023

How Dynatrace uses Site Reliability Guardian In each of these Dynatrace examples, insight is made in a production-like environment. These examples can help you define your starting point for establishing DevOps and SRE best practices in your organization. The queries are depicted below (sensitive data has been removed).

DevOps

DevOps Traffic Latency Best Practices

What is serverless computing? Driving efficiency without sacrificing observability

Dynatrace

JANUARY 26, 2021

AWS Lambda functions are an example of how a serverless framework works: Developers write a function in a supported language or platform. When an application is triggered, it can cause latency as the application starts. Building APIs (for example, Amazon API Gateway). Creating a prototype (for example, on Azure).

Serverless

Serverless Efficiency Lambda Azure

Faster time to value with enhanced handling of OneAgent runtime data

Dynatrace

SEPTEMBER 23, 2020

Storage mount points in a system might be larger or smaller, local or remote, with high or low latency, and various speeds. For example: All subfolders of the /opt directory are mounted as local, low latency, high-throughput drives, with relatively low storage capacity. Stay tuned for upcoming news about these changes.

Storage

Storage Latency Operating System Network

Telltale: Netflix Application Monitoring Simplified

The Netflix TechBlog

AUGUST 13, 2020

You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning? Telltale learns what constitutes typical health for an application, no alert tuning required. For example, a latency increase is less critical than error rate increase and some error codes are less critical than others.

Monitoring

Monitoring Tuning Traffic Metrics

Rebuilding Netflix Video Processing Pipeline with Microservices

The Netflix TechBlog

JANUARY 10, 2024

This architecture shift greatly reduced the processing latency and increased system resiliency. We expanded pipeline support to serve our studio/content-development use cases, which had different latency and resiliency requirements as compared to the traditional streaming use case. divide the input video into small chunks 2.

Processing

Processing Media Latency Innovation

Netflix Cloud Packaging in the Terabyte Era

The Netflix TechBlog

SEPTEMBER 24, 2021

As an example, cloud-based post-production editing and collaboration pipelines demand a complex set of functionalities, including the generation and hosting of high quality proxy content. The following table gives us an example of file sizes for 4K ProRes 422 HQ proxies.

Cloud

Cloud Media Storage Cache

Automated observability, security, and reliability at scale

Dynatrace

JULY 18, 2023

While developers edit files, a simple CLI command applies configurations to Dynatrace and, for example, automates the setup of a quality gate, including workflows and Site Reliability Guardians. Stay tuned for more examples and easy-to-adopt automations provided in our public Github project.

Best Practices

Best Practices Code Infrastructure Latency

Applying Netflix DevOps Patterns to Windows

The Netflix TechBlog

AUGUST 22, 2019

For example, it became cumbersome to ensure users of Packer received updates. This process is automated via a Spinnaker pipeline: Example Spinnaker pipeline showing the bake, canary, and deployment stages. The canary stage will determine a score based on metrics such as CPU, threads, latency, and GC pauses.

DevOps

DevOps AWS Tuning Infrastructure

Get up to 300 new metrics out of the box with AWS supporting services (GA)

Dynatrace

DECEMBER 22, 2019

You can observe the metrics across service instances split by region (in this example, API Gateways in us-east-1 and us-east-2). Metrics for each service instance are presented in detailed charts; see the example for ECS below. The example below visualizes average latency by API name and stage for a specific AWS API Gateway.

AWS

AWS Metrics IoT Storage

How BizDevOps can “shift left” using SLOs to automate quality gates

Dynatrace

MAY 5, 2021

For example, improving latency by as little as 0.1 latency is the number one reason consumers abandon mobile sites. Fine-tuning the service-level indicators that make up quality gates will improve with the help of upcoming features. Organizations can feel the impact of even a minor roadblock in the user experience.

Benchmarking

Benchmarking Latency Speed Software

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

For example, a member-triggered event such as “change in a profile’s maturity level” should have a much higher priority than a “system diagnostic signal”. This separation allows us to tune system configuration and scaling policies independently for different event priorities and traffic patterns.

Systems

Systems Traffic Architecture Mobile

Build and operate multicloud FaaS with enhanced, intelligent end-to-end observability

Dynatrace

APRIL 25, 2023

For example, to handle traffic spikes and pay only for what they use. Higher latency and cold start issues due to the initialization time of the functions. Understanding cold-start behavior is essential to tune your cloud applications cost or performance to meet your operational needs.

Serverless

Serverless Lambda Azure AWS

How digital experience monitoring helps deliver business observability

Dynatrace

APRIL 26, 2022

STM generates traffic that replicates the typical path or behavior of a user on a network to measure performance (for example, response times, availability, packet loss, latency, jitter, and other variables).

Monitoring

Monitoring Social Media IoT Metrics

Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data…

The Netflix TechBlog

MARCH 4, 2024

Operational automation–including but not limited to, auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing–is key to the success of modern data platforms. We have also noted a great potential for further improvement by model tuning (see the section of Rollout in Production).

Tuning

Tuning Efficiency Big Data Engineering

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Dynatrace

MAY 17, 2023

Think about items such as general system metrics (for example, CPU utilization, free memory, number of services), the connectivity status, details of our web server, or even more granular in-application tasks like database queries. Let’s take our previous screenshot as an example.

Metrics

Metrics Database Monitoring Network

Data Movement in Netflix Studio via Data Mesh

The Netflix TechBlog

JULY 26, 2021

Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities and processes of a business domain. See example below. Example: Filter Processor, Sink Processors Opt in to schema Evolution example 2. Two Types of Processors 1.

Big Data

Big Data Government Processing Analytics

Keeping Netflix Reliable Using Prioritized Load Shedding

The Netflix TechBlog

NOVEMBER 2, 2020

All such failures can put a system under unexpected load, and at some point in the past, every single one of these examples has prevented our members’ ability to play. Logs and background requests are examples of this type of traffic. Those two metrics are approximate indicators of failures and latency.

Traffic

Traffic Metrics Infrastructure Architecture

Streaming SQL in Data Mesh

The Netflix TechBlog

NOVEMBER 3, 2023

An example of a Data Mesh pipeline which moves and transforms data using Union, GraphQL Enrichment, and Column Rename Processor before writing to an Iceberg table. For example, filtering and projection can be expressed in SQL through SELECT and WHERE clauses. Stay tuned for more updates!
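To make the SELECT and WHERE remark concrete, here is a minimal sketch of projection and filtering in SQL. It uses Python's built-in `sqlite3` module purely as a stand-in, not the streaming SQL engine the article describes; the `events` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def filter_and_project(event_type: str) -> List[Tuple[int, str]]:
    """Run one query that filters rows (WHERE) and projects two columns (SELECT)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE events (title_id INTEGER, country TEXT, event_type TEXT)"
        )
        conn.executemany(
            "INSERT INTO events VALUES (?, ?, ?)",
            [
                (1, "US", "play"),
                (2, "US", "pause"),
                (3, "BR", "play"),
            ],
        )
        # Projection: only title_id and country are returned.
        # Filtering: only rows whose event_type matches are kept.
        cursor = conn.execute(
            "SELECT title_id, country FROM events "
            "WHERE event_type = ? ORDER BY title_id",
            (event_type,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


print(filter_and_project("play"))
```

The same two clauses replace what would otherwise be a separate filter step and a separate projection step in a pipeline.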

Processing

Processing Engineering Infrastructure Latency

What are SLOs? How service-level objectives work with SLIs to deliver on SLAs

Dynatrace

DECEMBER 2, 2021

For example, an SLA between a web host provider and customer can guarantee 99.95% uptime for all web services of a company over a year. For example, if the SLA for a website is 99.95% uptime, its corresponding SLO could be 99.95% availability of the login services. For example, if your SLO guarantees 99.5% What are SLOs?
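As a worked illustration of what such percentages mean in practice, the sketch below converts an availability target into the downtime it permits over a period. The function name and the 365-day default are assumptions made for this example, not part of the quoted article.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, implied by an availability target."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The budget is the fraction of the period that may be unavailable.
    return total_minutes * (100.0 - slo_percent) / 100.0


for target in (99.95, 99.5):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% over 365 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, 99.95% over a year leaves roughly 263 minutes (about 4.4 hours) of downtime, while 99.5% leaves roughly 2,628 minutes (about 43.8 hours).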

Metrics

Metrics Best Practices DevOps Infrastructure

AI Essentials for Tech Executives

O'Reilly

FEBRUARY 18, 2025

A Case Study in Misleading AI Advice: An example of this disconnect in action comes from an interview with Jake Heller, CEO of Casetext. For example, the metrics that come built-in to many tools rarely correlate with what you actually care about. We’ll give examples and encourage the AI to think before it answers.

Latency

Latency Tuning Metrics Testing

DevOps automation: From event-driven automation to answer-driven automation [with causal AI]

Dynatrace

JULY 24, 2023

For example, consider an e-commerce website that automatically sends personalized discount codes to customers who abandon their shopping carts. Let’s examine a few examples of answer-driven automation using AI and context. DevOps automation example #2: Threat detection and response Let’s build on the previous example.

DevOps

DevOps Traffic Efficiency Servers

Migrating Critical Traffic At Scale with No Downtime — Part 2

The Netflix TechBlog

JUNE 13, 2023

By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements. They enable us to further fine-tune and configure the system, ensuring the new changes are integrated smoothly and seamlessly.

Traffic

Traffic Metrics Systems Strategy

Optimizing data warehouse storage

The Netflix TechBlog

DECEMBER 21, 2020

For example, we allow a partition to have a few small files instead of always merging files in perfect sizes. These principles reduce resource usage by being more efficient and effective while lowering the end-to-end latency in data processing. Orient: Gather tuning parameters for a particular table that changed.

Storage

Storage Latency Efficiency Data Engineering

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

O'Reilly

MARCH 25, 2025

Throughout this article, we’ll explore real-world examples of LLM application development and then consolidate what we’ve learned into a set of first principles, covering areas like nondeterminism, evaluation approaches, and iteration cycles, that can guide your work regardless of which models or frameworks you choose.

Systems

Systems Development Tuning Monitoring

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

OCTOBER 19, 2020

a Netflix member via Twitter. This is an example of a question our on-call engineers need to answer to help resolve a member issue, which Our engineering teams tuned their services for performance after factoring in increased resource utilization due to tracing. which is difficult when troubleshooting distributed systems.

Infrastructure

Infrastructure Transportation Storage Open Source

Netflix Video Quality at Scale with Cosmos Microservices

The Netflix TechBlog

NOVEMBER 2, 2021

For example, when we design a new version of VMAF, we need to effectively roll it out throughout the entire Netflix catalog of movies and TV shows. This enables us to use our scale to increase throughput and reduce latencies. Here, based on the video length, the throughput and latency requirements, available scale etc.,

Media

Media Innovation Metrics Latency

Crucial Redis Monitoring Metrics You Must Watch

Scalegrid

JANUARY 25, 2024

Key Takeaways Critical performance indicators such as latency, CPU usage, memory utilization, hit rate, and number of connected clients/slaves/evictions must be monitored to maintain Redis’s high throughput and low latency capabilities. Similarly, an increased throughput signifies an intensive workload on a server and a larger latency.

Metrics

Metrics Monitoring Latency Cache
