Latency, Systems and Tuning - Technology Performance Pulse

How to Optimize CPU Performance Through Isolation and System Tuning

DZone

MAY 1, 2023

CPU isolation and efficient system management are critical for any application that requires low-latency, high-performance computing. These measures are especially important for high-frequency trading systems, where split-second decisions on buying and selling stocks must be made.

Tuning

Tuning Systems Latency Performance

Netflix’s Distributed Counter Abstraction

The Netflix TechBlog

NOVEMBER 12, 2024

By Rajiv Shringi, Oleksii Tkachuk, and Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Latency

Latency Cache Infrastructure Strategy

Foundation Model for Personalized Recommendation

The Netflix TechBlog

MARCH 28, 2025

By Ko-Jen Hsiao, Yesu Feng, and Sudarshan Lamkhede. Motivation: Netflix’s personalized recommender system is a complex system, boasting a variety of specialized machine-learned models, each catering to distinct needs, including Continue Watching and Today’s Top Picks for You. (Refer to our recent overview for more details.)

Tuning

Tuning Efficiency Latency Strategy

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

To this end, we developed a Rapid Event Notification System (RENO) to support use cases that require server initiated communication with devices in a scalable and extensible manner. In this blog post, we will give an overview of the Rapid Event Notification System at Netflix and share some of the learnings we gained along the way.

Systems

Systems Traffic Architecture Mobile

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile’s exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.

Tuning

Tuning Latency Efficiency Storage

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Introduction to Message Brokers: Message brokers enable applications, services, and systems to communicate by acting as intermediaries between senders and receivers. This decoupling simplifies system architecture and supports scalability in distributed environments.

Latency

Latency Analytics Architecture Storage

Migrating Critical Traffic At Scale with No Downtime - Part 1

The Netflix TechBlog

MAY 4, 2023

By Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. This technique facilitates validation on multiple fronts.

Traffic

Traffic Latency Tuning Systems

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

To achieve this, we are committed to building robust systems that deliver comprehensive observability, enabling us to take full accountability for every title on our service. Each title represents countless hours of effort and creativity, and our systems need to honor that uniqueness. Yet, these pages couldn’t be more different.

Traffic

Traffic Scalability Strategy Monitoring

Why applying chaos engineering to data-intensive applications matters

Dynatrace

MAY 23, 2024

Stream processing: One approach to such a challenging scenario is stream processing, a computing paradigm and software architectural style for data-intensive software systems that emerged to cope with requirements for near real-time processing of massive amounts of data. This significantly increases event latency.

Engineering

Engineering Tuning Latency Open Source

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

O'Reilly

MARCH 25, 2025

The system is inconsistent, slow, hallucinating, and that amazing demo starts collecting digital dust. Two big things: They bring the messiness of the real world into your system through unstructured data. When your system is both ingesting messy real-world data AND producing nondeterministic outputs, you need a different approach.

Systems

Systems Development Tuning Monitoring

Introducing Netflix’s Key-Value Data Abstraction Layer

The Netflix TechBlog

SEPTEMBER 18, 2024

These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. It also serves as central configuration of access patterns such as consistency or latency targets. Useful for keeping “n-newest” or prefix path deletion.

Latency

Latency Storage Cache Efficiency

Optimizing your Kubernetes clusters without breaking the bank

Dynatrace

JANUARY 14, 2022

Tuning thousands of parameters has become an impossible task to achieve via a manual and time-consuming approach. The optimization goal was to improve the application efficiency, that is to improve the ratio between service throughput and cloud costs while not increasing the application latency (e.g. The Akamas approach.

Latency

Latency Tuning Efficiency AWS

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

Scaling RabbitMQ ensures your system can handle growing traffic and maintain high performance. Key Takeaways: RabbitMQ improves scalability and fault tolerance in distributed systems by decoupling applications, enabling reliable message exchanges.

Best Practices

Best Practices Traffic Strategy Efficiency

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render. The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. Zuul, our primary edge gateway, assigns traffic to either cluster based on the experiment parameters.

Traffic

Traffic Latency Metrics Cache

Best Practice for Creating Indexes on your MySQL Tables

Scalegrid

NOVEMBER 20, 2019

During this time, you are also likely to experience a degraded performance of queries as your system resources are busy in index-creation work as well. 95th Percentile Latency. The 95th percentile latency of queries was also 1.8 Stay tuned for my follow-on blog post with more details! Index Creation on Master.

Best Practices

Best Practices Latency Tuning Database

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Dynatrace

JULY 22, 2024

Using OpenTelemetry, developers can collect and process telemetry data from applications, services, and systems. Observability: Observability is the ability to determine a system’s health by analyzing the data it generates, such as logs, metrics, and traces. There are three main types of telemetry data: Metrics.

Latency

Latency Best Practices Metrics Open Source

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

By Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, and Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data, often reaching petabytes, with millisecond access latency has become increasingly vital.

Latency

Latency Storage Traffic Tuning

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

The Netflix TechBlog

SEPTEMBER 10, 2024

Sample system diagram for an Alexa voice command. The other main use case was RENO, the Rapid Event Notification System mentioned above. Rewriting always comes with a risk, and it’s never the first solution we reach for, particularly when working with a system that’s in place and working well.

Latency

Latency Cache Tuning Efficiency

The Netflix Cosmos Platform

The Netflix TechBlog

MARCH 1, 2021

It supports both high-throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The first generation of this system went live with the streaming launch in 2007.

Serverless

Serverless Media Latency Social Media

Faster time to value with enhanced handling of OneAgent runtime data

Dynatrace

SEPTEMBER 23, 2020

Operating Systems are not always set up in the same way. Storage mount points in a system might be larger or smaller, local or remote, with high or low latency, and various speeds. Another consequence of the recent discontinuation of support for 32-bit operating systems is the new default location of OneAgent for Windows.

Storage

Storage Latency Operating System Network

Migrating Critical Traffic At Scale with No Downtime - Part 2

The Netflix TechBlog

JUNE 13, 2023

This is where large-scale system migrations come into play. By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements. But what happens when this machinery needs a transformation?

Traffic

Traffic Metrics Systems Strategy

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

The Netflix TechBlog

OCTOBER 27, 2020

The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly-reliable key-value store. How Bulldozer leverages Spark, Protobuf and KV DAL for moving the data.

Latency

Latency Storage Big Data Tuning

Rebuilding Netflix Video Processing Pipeline with Microservices

The Netflix TechBlog

JANUARY 10, 2024

This architecture shift greatly reduced the processing latency and increased system resiliency. By integrating with studio content systems, we enabled the pipeline to leverage rich metadata from the creative side and create more engaging member experiences like interactive storytelling.

Processing

Processing Media Latency Innovation

How to Improve MySQL AWS Performance 2X Over Amazon RDS at The Same Cost

Scalegrid

OCTOBER 24, 2019

As organizations continue to migrate to the cloud, it’s important to get in front of performance issues, such as high latency, low throughput, and replication lag with higher distances between your users and cloud infrastructure. AWS High Performance XLarge (see system details below). MySQL on AWS Performance Test. Amazon RDS.

AWS

AWS Latency Performance Performance Testing

Telltale: Netflix Application Monitoring Simplified

The Netflix TechBlog

AUGUST 13, 2020

You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning?” Our streaming teams need a monitoring system that enables them to quickly diagnose and remediate problems; seconds count! Our Node team needs a system that empowers a small group to operate a large fleet. By Andrei U.,

Monitoring

Monitoring Tuning Traffic Metrics

What is serverless computing? Driving efficiency without sacrificing observability

Dynatrace

JANUARY 26, 2021

Traditional computing models rely on virtual or physical machines, where each instance includes a complete operating system, CPU cycles, and memory. There is no need to plan for extra resources, update operating systems, or install frameworks. The provider is essentially your system administrator. What is serverless computing?

Serverless

Serverless Efficiency Lambda AWS

Enhancing Kubernetes cluster management key to platform engineering success

Dynatrace

MARCH 29, 2024

As organizations continue to modernize their technology stacks, many turn to Kubernetes , an open source container orchestration system for automating software deployment, scaling, and management. You can ask for the best configuration to reduce latency or improve the user experience.” It’s not just a cost-reduction tool.

Engineering

Engineering DevOps Operating System Cloud

Netflix Cloud Packaging in the Terabyte Era

The Netflix TechBlog

SEPTEMBER 24, 2021

Lastly, the packager kicks in, adding a system layer to the asset, making it ready to be consumed by the clients. Uploading and downloading data always come with a penalty, namely latency. There are existing distributed file systems for the cloud as well as off-the-shelf FUSE modules for S3.

Cloud

Cloud Media Storage Cache

PostgreSQL Connection Pooling: Part 1 – Pros & Cons

Scalegrid

OCTOBER 17, 2019

On modern Linux systems, the difference in overhead between forking a process and creating a thread is much smaller than it used to be. It creates yet another component that must be maintained, fine-tuned for your workload, security-patched often, and upgraded as required. A middleware becomes a single point of failure.

Architecture

Architecture Database Latency Servers

Build and operate multicloud FaaS with enhanced, intelligent end-to-end observability

Dynatrace

APRIL 25, 2023

Observability is essential to ensure the reliability, security, and quality of any software system. Higher latency and cold start issues due to the initialization time of the functions. Observability is typically achieved by collecting three types of data from a system: metrics, logs, and traces.

Serverless

Serverless Lambda Azure AWS

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

OCTOBER 19, 2020

which is difficult when troubleshooting distributed systems. If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls.

Infrastructure

Infrastructure Transportation Storage Open Source

Keeping Netflix Reliable Using Prioritized Load Shedding

The Netflix TechBlog

NOVEMBER 2, 2020

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure. By Manuel Correa, Arthur Gonigberg, and Daniel West. Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Those two metrics are approximate indicators of failures and latency.

Traffic

Traffic Metrics Infrastructure Architecture

Get up to 300 new metrics out of the box with AWS supporting services (GA)

Dynatrace

DECEMBER 22, 2019

Amazon Elastic File System (EFS). The example below visualizes average latency by API name and stage for a specific AWS API Gateway. Choose any service, for example, the Elastic File System (EFS) service, to view the list of configured metrics. Stay tuned for updates in Q1 2020. Amazon Aurora. Amazon API Gateway.

AWS

AWS Metrics IoT Storage

Applying Netflix DevOps Patterns to Windows

The Netflix TechBlog

AUGUST 22, 2019

With their new Docker image, users launch their Packer baking jobs using Titus , our container management system. The canary stage will determine a score based on metrics such as CPU, threads, latency, and GC pauses. We can easily test server tuning changes, software upgrades, and other modifications to the runtime environment.

DevOps

DevOps AWS Tuning Infrastructure

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Dynatrace

MAY 17, 2023

Think about items such as general system metrics (for example, CPU utilization, free memory, number of services), the connectivity status, details of our web server, or even more granular in-application tasks like database queries. DNS query time indicates the average response times of DNS requests across the system.

Metrics

Metrics Database Monitoring Network

Optimizing data warehouse storage

The Netflix TechBlog

DECEMBER 21, 2020

Use cases: We found several use cases where a system like AutoOptimize can bring tons of value. Merge: As the data lands in the data warehouse through real-time data ingestion systems, it comes in different sizes. Orient: Gather tuning parameters for a particular table that changed.

Storage

Storage Latency Efficiency Data Engineering

Towards a Reliable Device Management Platform

The Netflix TechBlog

AUGUST 30, 2021

System Setup, Architecture: The following diagram summarizes the architecture description: Figure 1: Event-sourcing architecture of the Device Management Platform. Fault Tolerance: If the underlying KafkaConsumer crashes due to ephemeral system or network events, it should be automatically restarted. million elements.

Latency

Latency Traffic Transportation Cloud

Achieving observability in async workflows

The Netflix TechBlog

MAY 14, 2021

However, they are scattered across multiple systems, and there isn’t an easy way to tie related messages together. You’re joining tables, resolving status types, cross-referencing data manually with other systems, and by the end of it all you ask yourself why? Things got hairy.

Traffic

Traffic Java Latency Google

DevOps automation: From event-driven automation to answer-driven automation [with causal AI]

Dynatrace

JULY 24, 2023

It enables them to adapt to user feedback swiftly, fine-tune feature releases, and deliver exceptional user experiences, all while maintaining control and minimizing disruption. Consider an event-driven automation system designed for incident management. But it doesn’t stop there. All these actions aim to avert future incidents.

DevOps

DevOps Traffic Efficiency Servers

Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data…

The Netflix TechBlog

MARCH 4, 2024

Operational automation–including but not limited to, auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing–is key to the success of modern data platforms. We have also noted a great potential for further improvement by model tuning (see the section of Rollout in Production).

Tuning

Tuning Efficiency Big Data Engineering

Zero Configuration Service Mesh with On-Demand Cluster Discovery

The Netflix TechBlog

AUGUST 29, 2023

To improve availability, we designed systems where components could fail separately and avoid single points of failure. Critically, this system allows us to seamlessly migrate services to service mesh with no configuration required, satisfying one of our main adoption constraints. We’re still early in our service mesh journey.

Traffic

Traffic Latency Cloud C++

Tuning SQL Server Reporting Services

SQL Performance

SEPTEMBER 17, 2019

This article will cover many areas that database administrators need to be aware of in order to properly license, recover, and tune a Reporting Services installation. Unlike the system database tempdb, ReportServerTempDB is not recreated at startup. Tuning Options. Tuning SSRS is much like any other application.

Tuning

Tuning Servers Database Best Practices

Netflix Video Quality at Scale with Cosmos Microservices

The Netflix TechBlog

NOVEMBER 2, 2021

The coupling problem: Until recently, video quality measurements were generated as part of our Reloaded production system. This system is responsible for processing incoming media files, such as video, audio and subtitles, and making them playable on the streaming service. We call this system Cosmos. The workflow is initiated.

Media

Media Innovation Metrics Latency
