By Rajiv Shringi, Oleksii Tkachuk, and Kartik Sathyanarayanan. In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.
Yet many models are confined to a brief temporal window due to constraints on serving latency or training costs. A foundation model facilitates the distribution of these learnings to other models, either through shared model weights for fine-tuning or directly through embeddings.
This dual-path approach leverages Kafka’s capability for low-latency streaming and Iceberg’s efficient management of large-scale, immutable datasets, ensuring both real-time responsiveness and comprehensive historical data availability. The pipeline handles millions of impression events globally every second, with each event approximately 1.2 KB in size.
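As a rough sketch of the streaming leg of such a dual path, the snippet below assumes the kafka-python client; the topic name, event shape, and broker address are illustrative, not the actual pipeline. The Iceberg leg would typically be populated from the same stream in batch.

```python
# Hot path of a hypothetical dual-path pipeline: low-latency publish to Kafka.
# The cold path (Iceberg) would consume the same topic and append in batch.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks=1,        # single-broker ack keeps publish latency low
    linger_ms=5,   # small batching window; larger values trade latency for throughput
)

event = {"profile_id": 123, "title_id": 456, "ts": time.time()}  # illustrative shape
producer.send("impressions", value=event)  # asynchronous send
producer.flush()                           # block until the event is on the wire
```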
Its design prioritizes high availability and efficient data transfer with minimal overhead, making it a practical choice for handling real-time data pipelines and distributed event processing. It follows a push-based approach, ensuring messages are distributed to consumers as soon as they become available.
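To make the push-based model concrete, here is a minimal sketch using RabbitMQ’s Python client, pika (the queue name and prefetch value are illustrative): the broker invokes the callback as soon as a message is available, rather than the consumer polling for it.

```python
# Push-based consumption: the broker delivers messages to the callback
# as soon as they arrive; start_consuming() just runs the I/O loop.
import pika  # pip install pika

def on_message(channel, method, properties, body):
    print("received:", body)
    channel.basic_ack(delivery_tag=method.delivery_tag)  # confirm processing

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="events", durable=True)
channel.basic_qos(prefetch_count=100)  # cap how many unacked messages are pushed ahead
channel.basic_consume(queue="events", on_message_callback=on_message)
channel.start_consuming()  # blocks; deliveries are pushed, not polled
```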
Stream processing systems, designed for continuous, low-latency processing, demand swift recovery mechanisms to tolerate and mitigate failures effectively; a slow recovery path significantly increases event latency. Spark Structured Streaming can also provide consistent fault recovery for applications where latency is not a critical requirement.
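A minimal PySpark sketch of that recovery model (topic, paths, and sink are placeholders): the checkpoint directory records source offsets and sink state, so a restarted query resumes where it left off instead of reprocessing from scratch.

```python
# Checkpoint-based fault recovery in Spark Structured Streaming: on restart,
# the query replays from the offsets persisted under checkpointLocation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recoverable-stream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("subscribe", "events")
    .load()
)

query = (
    events.writeStream.format("parquet")
    .option("path", "/data/events")                       # placeholder sink
    .option("checkpointLocation", "/checkpoints/events")  # enables recovery on restart
    .start()
)
query.awaitTermination()
```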
Migrating Critical Traffic At Scale with No Downtime — Part 1, by Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. Logging is selective to cases where the old and new responses do not match.
Implementing clustering and quorum queues in RabbitMQ significantly improves load distribution and data redundancy, ensuring high availability and fault tolerance for messaging services. Classic queues can also be used in clusters, though their behavior during node failures, particularly regarding durability and availability, deserves scrutiny.
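As a sketch, declaring a quorum queue from Python takes one extra argument (the queue name is illustrative); RabbitMQ then replicates the queue across cluster nodes via Raft.

```python
# Declare a replicated quorum queue; the x-queue-type argument is how
# RabbitMQ distinguishes quorum queues from classic ones.
import pika  # pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(
    queue="orders",
    durable=True,                          # quorum queues are always durable
    arguments={"x-queue-type": "quorum"},  # Raft-replicated across nodes
)
```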
However, setting the right parameters for Kubernetes clusters to ensure application availability, performance, and resilience while avoiding overspending isn’t a walk in the park. Tuning thousands of parameters has become an impossible task to achieve via a manual, time-consuming approach; this is where the Akamas approach comes in.
Compare Latency: ScaleGrid achieves lower latency than DigitalOcean for PostgreSQL. Now, let’s take a look at the throughput and latency performance of our comparison; next, we test and compare the latency performance between ScaleGrid and DigitalOcean for PostgreSQL. PostgreSQL DigitalOcean Latency Averages (ms).
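For intuition, a benchmark of this kind boils down to timing round trips. Here is a minimal sketch with psycopg2 (connection string and sample count are placeholders; a real benchmark would use a tool such as sysbench or pgbench):

```python
# Measure round-trip query latency: average and p99 over repeated trivial queries.
import statistics
import time

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("host=localhost dbname=bench user=bench")  # placeholder DSN
cur = conn.cursor()

samples_ms = []
for _ in range(1000):
    start = time.perf_counter()
    cur.execute("SELECT 1")
    cur.fetchone()
    samples_ms.append((time.perf_counter() - start) * 1000)

p99 = sorted(samples_ms)[int(len(samples_ms) * 0.99)]
print(f"avg={statistics.mean(samples_ms):.2f} ms  p99={p99:.2f} ms")
```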
Central to this infrastructure is our use of multiple online distributed databases such as Apache Cassandra, a NoSQL database known for its high availability and scalability. It also serves as a central configuration point for access patterns such as consistency or latency targets, and is useful for patterns like keeping the “n-newest” entries or prefix-path deletion.
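As an illustration of exposing consistency and latency targets per access pattern, here is a minimal sketch with the DataStax Python driver; the keyspace, table, and specific consistency levels are illustrative, not Netflix’s actual abstraction.

```python
# Per-query consistency tuning: trade read latency against write safety.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect("kv")    # placeholder keyspace

# Latency-oriented read: a single replica acknowledgment suffices.
fast_read = SimpleStatement(
    "SELECT value FROM items WHERE key = %s",
    consistency_level=ConsistencyLevel.ONE,
)
# Consistency-oriented write: require a quorum in the local datacenter.
safe_write = SimpleStatement(
    "INSERT INTO items (key, value) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

session.execute(safe_write, ("user:42", "hello"))
row = session.execute(fast_read, ("user:42",)).one()
```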
To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render. The Replay Testing framework leverages the @override directive available in GraphQL Federation. The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. How does it work?
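In spirit, replay testing is a diff between two backends. The following minimal sketch (endpoints and use of the requests library are assumptions; this is not the actual @override-based framework) shows the selective-logging idea:

```python
# Replay a production-shaped request against legacy and new endpoints,
# and log only when the responses diverge.
import logging

import requests  # pip install requests

log = logging.getLogger("replay")

def replay(path: str, params: dict) -> bool:
    old = requests.get(f"https://legacy.internal{path}", params=params).json()
    new = requests.get(f"https://graphql.internal{path}", params=params).json()
    if old != new:
        # Selective logging: only mismatches are recorded for analysis.
        log.warning("mismatch on %s params=%s old=%s new=%s", path, params, old, new)
        return False
    return True
```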
Reduced tail latencies: in both our gRPC and DGS Framework services, GC pauses are a significant source of tail latencies. For a given CPU utilization target, ZGC improves both average and P99 latencies with equal or better CPU utilization when compared to G1. No explicit tuning has been required to achieve these results.
Compare Latency. On average, ScaleGrid achieves almost 30% lower latency over DigitalOcean for the same deployment configurations. Now that we’ve compared throughput performance, let’s take a look at ScaleGrid vs. DigitalOcean latency for MySQL. Read-Intensive Latency Benchmark. Balanced Workload Latency Benchmark.
Rajiv Shringi Vinay Chella Kaidan Fullerton Oleksii Tkachuk Joey Lynch Introduction As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming , the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.
Throughout this evolution, we’ve been able to maintain high availability and a consistent message delivery rate, with Pushy successfully maintaining 99.999% reliability for message delivery over the last few months. In our case, we value low latency — the faster we can read from KeyValue, the faster these messages can get delivered.
As organizations continue to migrate to the cloud, it’s important to get in front of performance issues, such as high latency, low throughput, and replication lag with higher distances between your users and cloud infrastructure. ScaleGrid also maintains 53% lower latency on average throughout the entire MySQL AWS performance tests.
Storage mount points in a system might be larger or smaller, local or remote, with high or low latency, and various speeds. Sometimes these locations landed on mount points which, due to capacity, availability, or access constraints, weren’t well suited for large runtime storage. Stay tuned for upcoming news about these changes.
The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly-reliable key-value store. How Bulldozer leverages Spark, Protobuf and KV DAL for moving the data.
You’re half awake and wondering, “Is there really a problem, or is this just an alert that needs tuning?” Telltale learns what constitutes typical health for an application, so no alert tuning is required. For example, a latency increase is less critical than an error-rate increase, and some error codes are less critical than others.
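One common way to “learn typical health” without hand-tuned thresholds is to score each metric against its own recent history. The sketch below is illustrative only (window size and cutoff are arbitrary choices, and Telltale’s actual models are more sophisticated): it flags only large deviations from the learned baseline.

```python
# A self-tuning baseline: flag a data point only when it deviates strongly
# from the metric's own rolling history (no per-alert threshold tuning).
import statistics
from collections import deque

class Baseline:
    def __init__(self, window: int = 1440):      # e.g., a day of minutely points
        self.history = deque(maxlen=window)

    def is_anomalous(self, value: float, cutoff: float = 4.0) -> bool:
        anomalous = False
        if len(self.history) >= 30:              # wait for enough history
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) / stdev > cutoff
        self.history.append(value)
        return anomalous

latency_p99 = Baseline()
for observed in (101, 98, 103, 99, 102, 97, 100):
    latency_p99.is_anomalous(observed)           # builds up the baseline
```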
It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The subsystems all communicate with each other asynchronously via Timestone, a high-scale, low-latency priority queuing system. Warm capacity.
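The core data structure behind such a system is an ordered queue keyed on priority. A minimal in-process sketch with Python’s heapq (Timestone itself is a distributed service; this only illustrates the ordering semantics):

```python
# Priority queue semantics: lowest priority number first; a sequence
# counter keeps FIFO order among items of equal priority.
import heapq
import itertools

class PriorityQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, item, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), item))

    def pop(self):
        _, _, item = heapq.heappop(self._heap)
        return item

q = PriorityQueue()
q.push("batch-encode", priority=10)        # throughput job, can wait
q.push("interactive-render", priority=1)   # a human is waiting
assert q.pop() == "interactive-render"
```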
Every time the trigger executes, the function runs on an available resource. Serverless vendors make resources available exactly when you need them, but when an application is triggered after idling, the restart creates latency while the application starts. Monitoring serverless applications.
Whether tracking internal, workload-centric indicators such as errors, duration, or saturation or focusing on the golden signals and other user-centric views such as availability, latency, traffic, or engagement, SLOs-as-code enables coherent and consistent monitoring throughout the environment at scale.
For example, consider the impact of improving latency by as little as 0.1 seconds. Meanwhile, in the U.S., latency is the number one reason consumers abandon mobile sites. Requirements surrounding the availability of both services and data are common, and they clearly define the consequences for failure to perform. The value of fixing issues up-front.
After being available in an Early Adopter Release, we’re happy to announce that AWS supporting services are now Generally Available (GA). Supporting services include every service that isn’t available with out-of-the-box Dynatrace monitoring. Stay tuned for updates in Q1 2020. You can also create custom charts.
Once the instance was available, the engineer would use a remote administration tool like RDP to log in to the instance to install software and customize settings. The canary stage will determine a score based on metrics such as CPU, threads, latency, and GC pauses.
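A canary score of this kind can be as simple as a weighted pass/fail over the compared metrics. In this hypothetical sketch, metric names, weights, and the 10% regression tolerance are all illustrative, and lower values are better for every metric:

```python
# Canary scoring: compare the canary's metrics against the baseline's and
# aggregate a weighted 0-100 score; a metric passes if the canary stays
# within `tolerance` of the baseline.
METRIC_WEIGHTS = {"cpu": 0.25, "threads": 0.15, "latency_p99": 0.4, "gc_pause": 0.2}

def canary_score(baseline: dict, canary: dict, tolerance: float = 0.10) -> float:
    score = 0.0
    for metric, weight in METRIC_WEIGHTS.items():
        if canary[metric] <= baseline[metric] * (1 + tolerance):
            score += weight * 100
    return score

score = canary_score(
    baseline={"cpu": 0.55, "threads": 210, "latency_p99": 180, "gc_pause": 12},
    canary={"cpu": 0.57, "threads": 215, "latency_p99": 260, "gc_pause": 11},
)
print(score)  # latency regressed >10%, so its weight is withheld -> 60.0
```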
The computation is done as a first step so that it is available for the rest of the request lifecycle. Those two metrics are approximate indicators of failures and latency. Requests with higher priority will retry more aggressively than lower ones, also increasing streaming availability.
These can include business metrics, such as conversion rates, uptime, and availability; service metrics, such as application performance; or technical metrics, such as dependencies on third-party services, underlying CPU, and the cost of running a service. If you promise 99.95% availability of a website over a year, your error budget is 0.05%.
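The arithmetic is worth making explicit; a quick computation (assuming a 365-day year) of what a 0.05% error budget allows:

```python
# A 99.95% availability target leaves 0.05% of the year as error budget.
minutes_per_year = 365 * 24 * 60        # 525,600 minutes
error_budget = 1 - 0.9995               # 0.0005, i.e., 0.05%
budget_minutes = minutes_per_year * error_budget
print(f"{budget_minutes:.0f} minutes (~{budget_minutes / 60:.1f} hours) per year")
# -> 263 minutes (~4.4 hours) of allowable downtime per year
```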
For that, we focused on OpenTelemetry as the underlying technology and showed how you can use the available SDKs and libraries to instrument applications across different languages and platforms. Here, we can find statistics on the overall availability of the database, connections, queries, and errors. What is OneAgent?
This separation allows us to tune system configuration and scaling policies independently for different event priorities and traffic patterns. The core to bringing these engineering solutions to life is our direct collaboration with our colleagues and using the most impactful tools and technologies available.
DEM provides an outside-in approach to user monitoring that measures user experience (UX) in real time to ensure applications and services are available, functional, and well-performing across all channels of the digital experience, including web, mobile, and IoT.
Since there were no existing solutions available, we needed to build them ourselves. To improve availability, we designed systems where components could fail separately and avoid single points of failure. There is a downside to fetching this data on-demand: this adds latency to the first request to a cluster.
By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements. They enable us to further fine-tune and configure the system, ensuring the new changes are integrated smoothly and seamlessly.
These principles reduce resource usage by being more efficient and effective while lowering the end-to-end latency in data processing. It is responsible for listening to incoming events and requests, and for prioritizing different tables and actions to make the best use of the available resources. More processing resources.
For example, when running tests, the state of the device will change from “available for testing” to “in test.” Build a Spring @Configuration class that autowires the KafkaProperties bean injected by the Netflix Spring runtime and, using the Kafka settings available from that bean, construct an Alpakka-Kafka ConsumerSettings bean.
From the moment a Netflix film or series is pitched and long before it becomes available on Netflix, it goes through many phases. Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities¹ and processes of a business domain.
If we were given a freshly installed MySQL or Percona Server for MySQL and could tune only a single MySQL variable, which one would it be? MySQL comes pre-configured to be conservative instead of making the most of the resources available on the server. Why is that?
This article will cover many areas that database administrators need to be aware of in order to properly license, recover, and tune a Reporting Services installation. Tuning Options: tuning SSRS is much like tuning any other application. Disk latency for ReportServer and ReportServerTempDB is very important. General Tuning.
In PostgreSQL, replication lag can occur for various reasons, such as network latency, slow disk I/O, and long-running transactions. Replication lag can have serious consequences in high-availability systems where standby databases are used for failover. Addressing these factors can help improve replication performance and reduce replication lag.
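On a standby, a first-cut lag measurement is one query away. A minimal sketch with psycopg2 (the DSN is a placeholder, and the value is only meaningful while the primary is actively committing):

```python
# Approximate replication lag on a PostgreSQL standby: time elapsed since
# the commit of the most recently replayed transaction.
import psycopg2  # pip install psycopg2-binary

with psycopg2.connect("host=standby dbname=app user=monitor") as conn:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag"
        )
        print("replication lag:", cur.fetchone()[0])
```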
This enables us to use our scale to increase throughput and reduce latencies. Here, the strategy is chosen based on the video length, the throughput and latency requirements, the available scale, and so on. The quality results are then available to the caller via the getQuality endpoint. Stay tuned for more details on these algorithmic innovations.
The eval process combines human review, model-based evaluation, and A/B testing. The results then inform two parallel streams: fine-tuning with carefully curated data, and prompt-engineering improvements. Both feed into model improvements, which starts the cycle again. “We’re experiencing high latency in responses.”
Key Takeaways: critical performance indicators such as latency, CPU usage, memory utilization, hit rate, and the number of connected clients/slaves/evictions must be monitored to maintain Redis’s high-throughput, low-latency capabilities. Similarly, increased throughput signifies a more intensive workload on a server, and typically comes with higher latency.
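Most of these indicators are available directly from Redis itself. A minimal redis-py sketch (host and port are placeholders) that derives the hit rate from the keyspace counters:

```python
# Pull monitoring indicators from INFO and compute the cache hit rate.
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)
info = r.info()

hits, misses = info["keyspace_hits"], info["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0

print(f"hit rate:          {hit_rate:.2%}")
print(f"used memory:       {info['used_memory_human']}")
print(f"connected clients: {info['connected_clients']}")
```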
Nowadays, solid-state drives (SSDs) or non-volatile memory express (NVMe) drives are preferred over traditional hard disk drives (HDDs) for database servers due to their faster read and write speeds, lower latency, and improved reliability. Typically a good value is 70%-80% of available memory. I hope this helps!
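As a quick arithmetic sketch of that 70%-80% guideline, assuming it refers to a database cache such as MySQL’s innodb_buffer_pool_size on a dedicated server (psutil is used only to read total memory):

```python
# Translate the "70%-80% of available memory" guideline into concrete numbers.
import psutil  # pip install psutil

total_gib = psutil.virtual_memory().total / 2**30
print(f"suggested cache size: {0.70 * total_gib:.1f}-{0.80 * total_gib:.1f} GiB")
```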
Rather than listing the concepts, function calls, etc., available in Citus, which frankly is a bit boring, I’m going to explore scaling out a database system starting with a single host. I won’t cover all the features, but will show just enough that you’ll want to see more of what you can learn to accomplish for yourself.
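As a taste of where that exploration leads, sharding a table in Citus is a single function call. A minimal sketch from Python (table, column, and DSN are illustrative):

```python
# Distribute a table across Citus worker nodes, sharded by tenant_id.
import psycopg2  # pip install psycopg2-binary

with psycopg2.connect("host=coordinator dbname=app user=app") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                tenant_id bigint NOT NULL,
                payload   jsonb,
                ts        timestamptz DEFAULT now()
            )
        """)
        # create_distributed_table() is the Citus call that shards the
        # table across workers by the given distribution column.
        cur.execute("SELECT create_distributed_table('events', 'tenant_id')")
```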