Latency and Systems - Technology Performance Pulse

Netflix’s Distributed Counter Abstraction

The Netflix TechBlog

NOVEMBER 12, 2024

By: Rajiv Shringi , Oleksii Tkachuk , Kartik Sathyanarayanan Introduction In our previous blog post, we introduced Netflix’s TimeSeries Abstraction , a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Latency

Latency Cache Infrastructure Strategy

Comparing Approaches to Durability in Low Latency Messaging Queues

DZone

AUGUST 2, 2022

I have generally held the view that replicating data to a secondary system is faster than sync-ing to disk, assuming the round trip network delay wasn’t high due to quality networks and co-located redundant servers. Little’s Law and Why Latency Matters. However, latency is often a key factor in why the throughput isn’t high enough.

Latency

Latency Benchmarking Network Infrastructure

Optimising for High Latency Environments

CSS Wizardry

SEPTEMBER 16, 2024

This gives fascinating insights into the network topography of our visitors, and how much we might be impacted by high latency regions. Round-trip-time (RTT) is basically a measure of latency—how long did it take to get from one endpoint to another and back again? What is RTT? RTT isn’t a you-thing, it’s a them-thing.

Latency

Latency Cache Transportation Mobile

Timestone: Netflix’s High-Throughput, Low-Latency Priority Queueing System with Built-in Support…

The Netflix TechBlog

SEPTEMBER 29, 2022

Timestone: Netflix’s High-Throughput, Low-Latency Priority Queueing System with Built-in Support for Non-Parallelizable Workloads by Kostas Christidis Introduction Timestone is a high-throughput, low-latency priority queueing system we built in-house to support the needs of Cosmos , our media encoding platform.

Latency

Latency Systems Media Serverless

Optimizing Database Performance in Middleware Applications

DZone

FEBRUARY 14, 2025

In the realm of modern software architecture, middleware plays a pivotal role in connecting various components of distributed systems. Efficient database operations in middleware can dramatically improve overall system performance, reduce latency, and enhance user experience.

Database

Database Performance Software Architecture Latency

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

To this end, we developed a Rapid Event Notification System (RENO) to support use cases that require server initiated communication with devices in a scalable and extensible manner. In this blog post, we will give an overview of the Rapid Event Notification System at Netflix and share some of the learnings we gained along the way.

Systems

Systems Traffic Architecture Mobile

Single-core memory bandwidth: Latency, Bandwidth, and Concurrency

John McCalpin

FEBRUARY 17, 2025

Understanding sustained memory bandwidth in these systems starts with assuming 100% utilization and then reviewing the factors that get in the way (e.g., This requires a completely different approach to modeling the memory system — one based on Little’s Law from queueing theory.

Latency

Latency Hardware Cache Systems

Build systems more reliably with Dynatrace: Chaos Engineering

Dynatrace

AUGUST 21, 2024

These releases often assumed ideal conditions such as zero latency, infinite bandwidth, and no network loss, as highlighted in Peter Deutsch’s eight fallacies of distributed systems. With Dynatrace, teams can seamlessly monitor the entire system, including network switches, database storage, and third-party dependencies.

Engineering

Engineering Systems Latency Metrics

How to Optimize CPU Performance Through Isolation and System Tuning

DZone

MAY 1, 2023

CPU isolation and efficient system management are critical for any application which requires low-latency and high-performance computing. These measures are especially important for high-frequency trading systems, where split-second decisions on buying and selling stocks must be made.

Tuning

Tuning Systems Latency Performance

Next-level interaction and customization of data visualizations in Dynatrace Dashboards and Notebooks

Dynatrace

OCTOBER 10, 2024

New: identify hotspots with the honeycomb visualization Honeycombs are great for visualizing health in complex and distributed systems, enabling you to visualize countless entities effectively and at scale. Now, let’s introduce you to our two newest entries to our visualization catalog and tell you about the great things you can do with them.

Latency

Latency Infrastructure Monitoring Metrics

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Introduction to Message Brokers Message brokers enable applications, services, and systems to communicate by acting as intermediaries between senders and receivers. This decoupling simplifies system architecture and supports scalability in distributed environments.

Latency

Latency Analytics Architecture Storage

Supporting Diverse ML Systems at Netflix

The Netflix TechBlog

MARCH 7, 2024

The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow , an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems. ETL workflows), as well as downstream (e.g.

Systems

Systems Media Cache Open Source

Foundation Model for Personalized Recommendation

The Netflix TechBlog

MARCH 28, 2025

By Ko-Jen Hsiao , Yesu Feng and Sudarshan Lamkhede Motivation Netflixs personalized recommender system is a complex system, boasting a variety of specialized machine learned models each catering to distinct needs including Continue Watching and Todays Top Picks for You. Refer to our recent overview for more details).

Tuning

Tuning Efficiency Latency Strategy

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profiles exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.

Tuning

Tuning Latency Efficiency Storage

Mastering Latency With P90, P99, and Mean Response Times

DZone

FEBRUARY 5, 2024

In the fast-paced digital world, where every millisecond counts, understanding the nuances of network latency becomes paramount for developers and system architects. Latency, the delay before a transfer of data begins following an instruction for its transfer, can significantly impact user experience and system performance.

Latency

Latency Metrics Network Systems

Efficient Multimodal Data Processing: A Technical Deep Dive

DZone

FEBRUARY 27, 2025

Multimodal data processing is the evolving need of the latest data platforms powering applications like recommendation systems, autonomous vehicles, and medical diagnostics. Handling multimodal data spanning text, images, videos, and sensor inputs requires resilient architecture to manage the diversity of formats and scale.

Efficiency

Efficiency Processing Latency Storage

Migrating Critical Traffic At Scale with No Downtime?—?Part 1

The Netflix TechBlog

MAY 4, 2023

Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are consistently being evolved and optimized to meet and exceed customer and product expectations. It provides a good read on the availability and latency ranges under different production conditions.

Traffic

Traffic Latency Tuning Systems

Build automated self-healing systems with xMatters and Dynatrace (Part 2 of 3)

Dynatrace

AUGUST 27, 2019

In this alert, xMatters includes all the important incident information from Dynatrace, so there’s no need for you to visit additional system dashboards. Based on this contextual data, resources are prompted with their pre-configured response options, each of which kicks off a workflow across systems (based on the severity of the issue).

Systems

Systems DevOps Latency Azure

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

To achieve this, we are committed to building robust systems that deliver comprehensive observability, enabling us to take full accountability for every title on ourservice. Each title represents countless hours of effort and creativity, and our systems need to honor that uniqueness. Yet, these pages couldnt be more different.

Traffic

Traffic Scalability Strategy Monitoring

Bandwidth or Latency: When to Optimise for Which

CSS Wizardry

JANUARY 31, 2019

When it comes to network performance, there are two main limiting factors that will slow you down: bandwidth and latency. Latency is defined as…. Where bandwidth deals with capacity, latency is more about speed of transfer 2. and reduction in latency. and reduction in latency. Bandwidth is defined as….

Latency

Latency Network Speed Servers

Cut costs and complexity: 5 strategies for reducing tool sprawl with Dynatrace

Dynatrace

APRIL 10, 2025

As a result, many organizations are facing the burden of managing separate systems for network security and application observability, leading to redundant configurations, duplicated data collection, and operational overhead. While network security remains relevant, the emphasis is now on application observability and threat detection.

Strategy

Strategy Storage Network Architecture

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

O'Reilly

MARCH 25, 2025

The system is inconsistent, slow, hallucinatingand that amazing demo starts collecting digital dust. Two big things: They bring the messiness of the real world into your system through unstructured data. When your system is both ingesting messy real-world data AND producing nondeterministic outputs, you need a different approach.

Systems

Systems Development Tuning Monitoring

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

Scaling RabbitMQ ensures your system can handle growing traffic and maintain high performance. Key Takeaways RabbitMQ improves scalability and fault tolerance in distributed systems by decoupling applications, enabling reliable message exchanges.

Best Practices

Best Practices Traffic Strategy Scalability

Noisy Neighbor Detection with eBPF

The Netflix TechBlog

SEPTEMBER 10, 2024

On Titus , our multi-tenant compute platform, a "noisy neighbor" refers to a container or system service that heavily utilizes the server's resources, causing performance degradation in adjacent containers. To emit a run queue latency metric, we leveraged three eBPF hooks: sched_wakeup, sched_wakeup_new, and sched_switch.

Latency

Latency Metrics Programming Monitoring

Solve hybrid Kubernetes performance and reliability problems with unified observability

Dynatrace

APRIL 10, 2025

In modern containerized environments, teams often deploy Kubernetes across mixed operating systems, creating a situation where both Linux and Windows nodes reside in the same cluster. This eliminates the need for separate log management systems for each OS, streamlining your observability workflow.

Performance

Performance Java Operating System Infrastructure

Introducing Netflix’s Key-Value Data Abstraction Layer

The Netflix TechBlog

SEPTEMBER 18, 2024

These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. It also serves as central configuration of access patterns such as consistency or latency targets. Useful for keeping “n-newest” or prefix path deletion.

Latency

Latency Storage Cache Servers

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render. The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. Zuul , our primary edge gateway, assigns traffic to either cluster based on the experiment parameters.

Traffic

Traffic Latency Metrics Cache

Consistent caching mechanism in Titus Gateway

The Netflix TechBlog

NOVEMBER 3, 2022

As the number of Titus users increased over the years, the load and pressure on the system increased substantially. cell): Titus Job Coordinator is a leader elected process managing the active state of the system. For example, a batch workflow orchestration system may create multiple jobs which are part of a single workflow execution.

Cache

Cache Latency Traffic Systems

Cloud infrastructure monitoring in action: Dynatrace on Dynatrace

Dynatrace

SEPTEMBER 29, 2020

Sydney, we have a disk write latency problem! It was on August 25 th at 14:00 when Davis initially alerted on a disk write latency issues to Elastic File System (EFS) on one of our EC2 instances in AWS’s Sydney Data Center. The problem didn’t last long or have any impact on our services.

Infrastructure

Infrastructure Cloud Monitoring AWS

Why applying chaos engineering to data-intensive applications matters

Dynatrace

MAY 23, 2024

Stream processing One approach to such a challenging scenario is stream processing, a computing paradigm and software architectural style for data-intensive software systems that emerged to cope with requirements for near real-time processing of massive amounts of data. This significantly increases event latency.

Engineering

Engineering Tuning Latency Open Source

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Dynatrace

JULY 22, 2024

Using OpenTelemetry, developers can collect and process telemetry data from applications, services, and systems. Observability Observability is the ability to determine a system’s health by analyzing the data it generates, such as logs, metrics, and traces. There are three main types of telemetry data: Metrics.

Latency

Latency Best Practices Metrics Open Source

Resilience Pattern: Circuit Breaker

DZone

NOVEMBER 16, 2023

In this article, we will explore one of the most common and useful resilience patterns in distributed systems: the circuit breaker. The circuit breaker is a design pattern that prevents cascading failures and improves the overall availability and performance of a system. What Is a Circuit Breaker?

Latency

Latency Network Database Monitoring

Implementing service-level objectives to improve software quality

Dynatrace

DECEMBER 27, 2022

According to Google’s SRE handbook , best practices, there are “ Four Golden Signals ” we can convert into four SLOs for services: reliability, latency, availability, and saturation. Latency is the time that it takes a request to be served. Define SLOs for each service. Reliability. Saturation.

Software

Software Software Benchmarking Latency

Best Practice for Creating Indexes on your MySQL Tables

Scalegrid

NOVEMBER 20, 2019

During this time, you are also likely to experience a degraded performance of queries as your system resources are busy in index-creation work as well. 95th Percentile Latency. The 95th percentile latency of queries was also 1.8 Workload Throughput (Queries Per Second). Index Creation on Master. Rolling Index Creation.

Best Practices

Best Practices Latency Tuning Database

The Netflix Cosmos Platform

The Netflix TechBlog

MARCH 1, 2021

It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The first generation of this system went live with the streaming launch in 2007. Delivery?—?A

Serverless

Serverless Media Latency Social Media

Who will watch the watchers? Extended infrastructure observability for WSO2 API Manager

Dynatrace

SEPTEMBER 18, 2020

High latency or lack of responses. You receive an alert message from Dynatrace (your infrastructure observability hub) letting you know that the average response latency of all deployed APIs has tripled. This increase is clearly correlated with the increased response latencies. Soaring number of active connections.

Infrastructure

Infrastructure Latency Metrics Cloud

Optimize Citrix platform performance and user experience with Dynatrace (GA)

Dynatrace

JANUARY 15, 2020

Therefore, it requires multidimensional and multidisciplinary monitoring: Infrastructure health —automatically monitor the compute, storage, and network resources available to the Citrix system to ensure a stable platform. Citrix platform performance—optimize your Citrix landscape with insights into user load and screen latency per server.

Latency

Latency Performance Virtualization Infrastructure

Optimizing your Kubernetes clusters without breaking the bank

Dynatrace

JANUARY 14, 2022

JMeter, MicroFocus LoadRunner, and Tricentis Neoload) can be used to test the target system against the workloads and where Dynatrace is the single telemetry provider for all the KPIs measuring the results of applying that load to a specific configuration. below 500ms) and error rates (e.g. lower than 2%.). below 500ms) and error rates (e.g.

Latency

Latency Tuning Efficiency AWS

Predictive CPU isolation of containers at Netflix

The Netflix TechBlog

JUNE 4, 2019

Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.

Cache

Cache Latency Airlines Logistics

How to Install Pixie for Kubernetes Monitoring: The Complete Guide

DZone

SEPTEMBER 4, 2022

It is important to highlight that most older monitoring systems were considered inefficient due to their operational overhead. Pixie offers monitoring, telemetry, metrics, and more with less than 5% CPU overhead and latency degradation during data collection.

Monitoring

Monitoring Latency Metrics Cloud

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

Rajiv Shringi Vinay Chella Kaidan Fullerton Oleksii Tkachuk Joey Lynch Introduction As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming , the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.

Latency

Latency Storage Traffic Tuning

Not a Single Trace

DZone

OCTOBER 4, 2023

Your team celebrates a success story where a trace identified a pesky latency issue in your application's authentication service. While the initial trace provided valuable insights into the authentication service, it didn’t explain why the system was built in this way. But the celebrations are short-lived.

Latency

Latency Cache Software Software

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

The Netflix TechBlog

SEPTEMBER 10, 2024

Sample system diagram for an Alexa voice command. The other main use case was RENO, the Rapid Event Notification System mentioned above. Rewriting always comes with a risk, and it’s never the first solution we reach for, particularly when working with a system that’s in place and working well.

Latency

Latency Cache Tuning Efficiency

API Design Principles for Optimal Performance and Scalability

DZone

JUNE 22, 2023

It involves a combination of techniques and best practices aimed at reducing latency, improving user experience, and increasing the overall efficiency of the system. API performance optimization is the process of improving the speed, scalability, and reliability of APIs.

Scalability

Scalability Design Best Practices Performance

Netflix’s Distributed Counter Abstraction

Comparing Approaches to Durability in Low Latency Messaging Queues

Trending Sources

Optimising for High Latency Environments

Timestone: Netflix’s High-Throughput, Low-Latency Priority Queueing System with Built-in Support…

Optimizing Database Performance in Middleware Applications

Rapid Event Notification System at Netflix

Single-core memory bandwidth: Latency, Bandwidth, and Concurrency

Build systems more reliably with Dynatrace: Chaos Engineering

How to Optimize CPU Performance Through Isolation and System Tuning

Next-level interaction and customization of data visualizations in Dynatrace Dashboards and Notebooks

RabbitMQ vs. Kafka: Key Differences

Supporting Diverse ML Systems at Netflix

Foundation Model for Personalized Recommendation

Introducing Impressions at Netflix

Mastering Latency With P90, P99, and Mean Response Times

Efficient Multimodal Data Processing: A Technical Deep Dive

Migrating Critical Traffic At Scale with No Downtime?—?Part 1

Build automated self-healing systems with xMatters and Dynatrace (Part 2 of 3)

Title Launch Observability at Netflix Scale

Bandwidth or Latency: When to Optimise for Which

Cut costs and complexity: 5 strategies for reducing tool sprawl with Dynatrace

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

Best Practices for Scaling RabbitMQ

Noisy Neighbor Detection with eBPF

Solve hybrid Kubernetes performance and reliability problems with unified observability

Introducing Netflix’s Key-Value Data Abstraction Layer

Migrating Netflix to GraphQL Safely

Consistent caching mechanism in Titus Gateway

Cloud infrastructure monitoring in action: Dynatrace on Dynatrace

Why applying chaos engineering to data-intensive applications matters

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Resilience Pattern: Circuit Breaker

Implementing service-level objectives to improve software quality

Best Practice for Creating Indexes on your MySQL Tables

The Netflix Cosmos Platform

Who will watch the watchers? Extended infrastructure observability for WSO2 API Manager

Optimize Citrix platform performance and user experience with Dynatrace (GA)

Optimizing your Kubernetes clusters without breaking the bank

Predictive CPU isolation of containers at Netflix

How to Install Pixie for Kubernetes Monitoring: The Complete Guide

Introducing Netflix TimeSeries Data Abstraction Layer

Not a Single Trace

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

API Design Principles for Optimal Performance and Scalability

Stay Connected