By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. In our previous blog post, we introduced Netflix's TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we're excited to present the Distributed Counter Abstraction.
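To make the counter pattern concrete, here is a minimal, hypothetical sketch of a counter-style client. The class and method names are illustrative assumptions, not Netflix's actual API; a real distributed implementation would partition writes and aggregate reads across many nodes.

```python
# Hypothetical in-process stand-in for a remote distributed counter service.
# Counters are identified by (namespace, name); writes are cheap increments,
# and reads may be eventually consistent aggregates in a real system.
from collections import defaultdict
import threading

class CounterClient:
    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def add_count(self, namespace: str, name: str, delta: int = 1) -> None:
        # In a real service this would be a lightweight write event;
        # here we just apply it locally under a lock.
        with self._lock:
            self._counts[(namespace, name)] += delta

    def get_count(self, namespace: str, name: str) -> int:
        # A distributed counter may return an eventually consistent value
        # aggregated across many write partitions.
        with self._lock:
            return self._counts[(namespace, name)]

client = CounterClient()
client.add_count("user_interactions", "video_plays", delta=3)
print(client.get_count("user_interactions", "video_plays"))  # 3
```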
Migrating Critical Traffic At Scale with No Downtime — Part 1, by Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience.
The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries. To launch Phase 1 safely, we used AB Testing. To launch Phase 2 safely, we used Replay Testing and Sticky Canaries. We knew we could test the same query with the same inputs and consistently expect the same results.
Tuning thousands of parameters via a manual, time-consuming approach has become an impossible task. The following figure shows the high-level architecture, into which any load testing solution can fit (see SREcon21 – Automating Performance Tuning with Machine Learning for the Akamas approach).
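To illustrate the idea of automated tuning, here is a minimal black-box optimization loop. The parameter names, ranges, and the stubbed benchmark are assumptions for illustration, not the Akamas implementation.

```python
# A minimal sketch of automated parameter tuning as black-box optimization:
# propose a configuration, run a load test, keep the best result.
import random

SEARCH_SPACE = {
    "innodb_buffer_pool_size_gb": (1, 64),   # illustrative parameter
    "max_connections": (100, 2000),          # illustrative parameter
}

def run_benchmark(config: dict) -> float:
    """Stand-in for a real load test; returns p95 latency in ms (lower is better)."""
    # Replace with a call to your load testing tool of choice.
    return random.uniform(5.0, 50.0)

def random_search(trials: int = 20) -> tuple[dict, float]:
    best_config, best_latency = None, float("inf")
    for _ in range(trials):
        config = {k: random.randint(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}
        latency = run_benchmark(config)
        if latency < best_latency:
            best_config, best_latency = config, latency
    return best_config, best_latency

config, latency = random_search()
print(f"best config {config} -> p95 {latency:.1f} ms")
```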
PostgreSQL DigitalOcean Performance Test. Now, let's take a look at the throughput and latency performance of our comparison. Next, we are going to test and compare the latency performance between ScaleGrid and DigitalOcean for PostgreSQL, where ScaleGrid delivers lower latency.
The test utilized a MySQL dataset created using Sysbench, which had 3 tables with 50 million rows each (see the MySQL test bed configuration). The 95th percentile latency of queries was also 1.8 […]. Stay tuned for my follow-on blog post with more details on the performance benefits of rolling index creation!
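For readers who want to reproduce a similar test bed, here is a hedged sketch of driving Sysbench from Python. Connection settings are placeholders, and the flags assume sysbench 1.0's bundled oltp_read_write workload.

```python
# Recreate a test bed like the one described: 3 Sysbench tables of 50 million
# rows each, then an OLTP run whose summary reports 95th percentile latency.
import subprocess

COMMON = [
    "--mysql-host=127.0.0.1", "--mysql-user=sbtest", "--mysql-password=secret",
    "--mysql-db=sbtest", "--tables=3", "--table-size=50000000",
]

# Create the 3 x 50M-row dataset.
subprocess.run(["sysbench", "oltp_read_write", *COMMON, "prepare"], check=True)

# Run for 5 minutes with 64 client threads; the summary includes p95 latency.
subprocess.run(
    ["sysbench", "oltp_read_write", *COMMON,
     "--threads=64", "--time=300", "--percentile=95", "run"],
    check=True,
)
```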
These include challenges with tail latency and idempotency, managing "wide" partitions with many rows, handling single large "fat" columns, and slow response pagination. It also serves as a central configuration point for access patterns such as consistency or latency targets, and is useful for keeping only the "n-newest" records or for prefix-path deletion.
Now that we've compared throughput performance, let's take a look at ScaleGrid vs. DigitalOcean latency for MySQL. On average, ScaleGrid achieves almost 30% lower latency than DigitalOcean for the same deployment configurations, across both the read-intensive and the balanced workload latency benchmarks.
As organizations continue to migrate to the cloud, it's important to get in front of performance issues such as high latency, low throughput, and replication lag that grows with the distance between your users and your cloud infrastructure. The post covers MySQL on AWS performance test scenarios and results, including Amazon RDS.
Validation tasks are then extended left to cover performance testing and release validation in a pre-production environment. Resilient applications with chaos testing in pre-production: another Dynatrace team uses a guardian as a safeguard during chaos testing. The queries are depicted below (sensitive data has been removed).
Keptn closes the loop of planning, testing, deployment, and analysis in Agile-like environments with the help of quality gates defined by service- and business-level indicators. For example, improving latency by as little as 0.1 seconds can have a measurable business impact; latency is the number one reason consumers abandon mobile sites. Meanwhile, in the U.S., […]
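As a concrete illustration of a quality gate, here is a minimal sketch that compares measured service-level indicators against objectives. The metric names and thresholds are illustrative assumptions, not Keptn's actual evaluation format.

```python
# A minimal SLO-based quality gate: each indicator is compared against its
# objective, and the release is promoted only if every objective passes.
SLOS = {
    "p95_latency_ms": {"measured": 180.0, "objective": 200.0},
    "error_rate_pct": {"measured": 0.4, "objective": 1.0},
}

def evaluate_quality_gate(slos: dict) -> bool:
    passed = True
    for name, slo in slos.items():
        ok = slo["measured"] <= slo["objective"]
        print(f"{name}: measured={slo['measured']} "
              f"objective<={slo['objective']} -> {'pass' if ok else 'fail'}")
        passed &= ok
    return passed

if evaluate_quality_gate(SLOS):
    print("quality gate passed: promote the release")
else:
    print("quality gate failed: hold the release")
```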
It supports both high-throughput services that consume hundreds of thousands of CPUs at a time and latency-sensitive workloads where humans are waiting for the results of a computation. Productivity: local development tools, including specialized test runners, code generators, and a command line interface.
Our previous blog post presented replay traffic testing — a crucial instrument in our toolkit that allows us to implement these transformations with precision and reliability. Compared to replay testing, canaries allow us to extend the validation scope beyond the service level.
Traces are used for performance analysis, latency optimization, and root cause analysis. Capture critical performance indicators such as request latency, error rates, and resource usage. OpenTelemetry provides extensive documentation and examples to help you fine-tune your configuration for maximum effectiveness.
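Here is a short example of capturing those indicators with the OpenTelemetry Python API. Exporter and SDK setup is omitted (without it, these calls are safe no-ops), and the span and metric names are illustrative.

```python
# Record request latency as a histogram, errors as a counter, and wrap the
# request in a span for tracing.
import time
from opentelemetry import trace, metrics

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
latency_ms = meter.create_histogram("http.server.request.latency", unit="ms")
errors = meter.create_counter("http.server.request.errors")

def handle_request(path: str) -> None:
    start = time.perf_counter()
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("http.route", path)
        try:
            pass  # ... real request handling goes here ...
        except Exception:
            errors.add(1, {"http.route": path})
            raise
        finally:
            elapsed = (time.perf_counter() - start) * 1000.0
            latency_ms.record(elapsed, {"http.route": path})

handle_request("/api/titles")
```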
Artisan Crafted Images. In the Netflix full cycle DevOps culture, the team responsible for building a service is also responsible for deploying, testing, and operating that service and its infrastructure. Now each change in the infrastructure is tested, canaried, and deployed like any other code change.
This architecture shift greatly reduced processing latency and increased system resiliency. We expanded pipeline support to serve our studio/content-development use cases, which had different latency and resiliency requirements compared to the traditional streaming use case. This testing stage took about two weeks.
You can ask for the best configuration to reduce latency or improve the user experience.” And with automatic application tuning, teams spend less time on manually testing and reviewing configurations, resulting in up to five times the productivity of performance engineers, DevOps, and SREs when it comes to application optimization.
While there are plenty of well-documented benefits to using a connection pooler, there are some arguments to be made against using one: introducing a middleware in the communication path inevitably adds some latency. Our tests show, however, that even a small number of clients can significantly benefit from using a connection pooler.
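To make the trade-off concrete, here is a brief sketch using psycopg2's built-in client-side pool: the pool adds a small indirection, but reusing connections avoids repeated TCP and authentication handshakes. The connection parameters are placeholders.

```python
# Client-side connection pooling with psycopg2.
from psycopg2.pool import SimpleConnectionPool

pool = SimpleConnectionPool(
    minconn=1,
    maxconn=10,
    host="127.0.0.1",
    dbname="appdb",
    user="app",
    password="secret",
)

conn = pool.getconn()       # borrow a pooled connection (no new handshake)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    pool.putconn(conn)      # return it for reuse instead of closing

pool.closeall()
```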
By Benson Ma , Alok Ahuja Introduction At Netflix, hundreds of different device types, from streaming sticks to smart TVs, are tested every day through automation to ensure that new software releases continue to deliver the quality of the Netflix experience that our customers enjoy. In this blog post, we will focus on the latter feature set.
High-level playback architecture with priority throttling and chaos testing. Building a request taxonomy: we decided to focus on three dimensions in order to categorize request traffic: throughput, functionality, and criticality. Those two metrics are approximate indicators of failures and latency.
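A toy sketch of such a taxonomy, used to decide what to shed first under load, might look like this. The categories and routes are illustrative assumptions, not Netflix's actual classification.

```python
# Classify requests by criticality and shed the least critical first.
from enum import IntEnum

class Criticality(IntEnum):
    NON_CRITICAL = 1   # e.g., logging, telemetry
    DEGRADED_OK = 2    # e.g., personalization with a fallback
    CRITICAL = 3       # e.g., playback licensing

ROUTE_CRITICALITY = {
    "/log": Criticality.NON_CRITICAL,
    "/recommendations": Criticality.DEGRADED_OK,
    "/playback/start": Criticality.CRITICAL,
}

def should_throttle(route: str, shedding_level: Criticality) -> bool:
    """Shed any request whose criticality is at or below the current level."""
    return ROUTE_CRITICALITY.get(route, Criticality.DEGRADED_OK) <= shedding_level

# Under pressure, shed non-critical traffic first:
print(should_throttle("/log", Criticality.NON_CRITICAL))             # True
print(should_throttle("/playback/start", Criticality.NON_CRITICAL))  # False
```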
If we had an ID for each streaming session, distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. Our engineering teams tuned their services for performance after factoring in the increased resource utilization due to tracing.
During the interview, Jake made a statement about AI testing that was widely shared: one of the things we learned is that after it passes 100 tests, the odds that it will pass a random distribution of 100k user inputs with 100% accuracy is very high. If you're not hands-on with AI, this advice might sound reasonable.
If we were to select the most important MySQL setting — given a freshly installed MySQL or Percona Server for MySQL and the ability to tune only a single variable — which one would it be? To be fair, that is also true of PostgreSQL: out of the box it hasn't been tuned either, and it, too, can perform much better.
We are expected to process 1,000 watermarks for a single distribution in a minute, with non-linear latency growth as the number of watermarks increases. Even though Cosmos was developed for asynchronous media processing, we worked with them to expand to generic file processing and tune their workflow platform for our near real-time use case.
Operational automation, including but not limited to auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing, is key to the success of modern data platforms. We have also noted great potential for further improvement through model tuning (see the Rollout in Production section).
STM generates traffic that replicates the typical path or behavior of a user on a network to measure performance (for example, response times, availability, packet loss, latency, jitter, and other variables). One use case for STM is to model the behavior of a customer in the form of a flow of transactions along the buyer's journey.
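A minimal HTTP-level synthetic transaction might look like the following sketch. The URLs model a hypothetical buyer's journey; network-layer measures such as packet loss and jitter are out of scope for this simple probe.

```python
# Replay a scripted user journey and record per-step response time and
# availability.
import time
import requests

JOURNEY = [
    ("home", "https://shop.example.com/"),
    ("search", "https://shop.example.com/search?q=widget"),
    ("checkout", "https://shop.example.com/checkout"),
]

def run_synthetic_transaction() -> list[dict]:
    results = []
    for step, url in JOURNEY:
        start = time.perf_counter()
        try:
            resp = requests.get(url, timeout=5)
            ok = resp.status_code < 400
        except requests.RequestException:
            ok = False
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results.append({"step": step, "ok": ok, "response_ms": round(elapsed_ms, 1)})
    return results

for sample in run_synthetic_transaction():
    print(sample)
```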
Keeping feature parity between all of these implementations and ensuring that they all behave the same way is challenging: what we want is a single, well-tested implementation of all of this functionality, so we can make changes and fix bugs in one place. This is the first in a series of posts on our journey to service mesh, so stay tuned.
Stable, well-calibrated SLOs pave the way for teams to automate more processes and testing throughout the software delivery life cycle (SDLC). You can set SLOs based on individual indicators, such as batch throughput, request latency, and failures-per-second, and use them to promote automation. SLO best practices follow.
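As one concrete form of such an indicator-based SLO, this sketch computes compliance from raw latency samples. The threshold and target are illustrative assumptions.

```python
# Check an availability-style SLO: what fraction of requests met the latency
# objective, and does that clear the target?
LATENCIES_MS = [120, 95, 310, 88, 140, 97, 105, 2200, 110, 101]

SLO_THRESHOLD_MS = 300   # a request "succeeds" if it finishes within 300 ms
SLO_TARGET = 0.99        # 99% of requests must succeed

def slo_compliance(samples_ms: list, threshold_ms: float) -> float:
    good = sum(1 for s in samples_ms if s <= threshold_ms)
    return good / len(samples_ms)

compliance = slo_compliance(LATENCIES_MS, SLO_THRESHOLD_MS)
print(f"compliance: {compliance:.1%} (target {SLO_TARGET:.0%})")
print("SLO met" if compliance >= SLO_TARGET else "SLO violated: error budget burning")
```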
What breaks your app in production isn't always what you tested for in dev! The way out? We've seen this across dozens of companies, and the teams that break out of this trap all adopt some version of Evaluation-Driven Development (EDD), where testing, monitoring, and evaluation drive every decision from the start.
For a more detailed comparison of performance features between different versions, refer to: [link]. Benchmarking methodology: Sysbench is a versatile, open-source benchmarking tool ideal for testing OLTP (Online Transaction Processing) database workloads. You can access the benchmark here: [link].
In PostgreSQL, replication lag can occur for various reasons, such as network latency, slow disk I/O, and long-running transactions. Network latency, for example, is the delay caused by the time it takes for data to travel between the primary and standby databases.
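Replication lag on a standby can be observed directly. Here is a short sketch using psycopg2 and PostgreSQL's built-in replay timestamp; the connection details are placeholders, and the query must run against the standby, not the primary.

```python
# Measure replication lag as time since the last replayed transaction.
import psycopg2

conn = psycopg2.connect(host="standby.example.com", dbname="appdb",
                        user="monitor", password="secret")
with conn.cursor() as cur:
    # NULL until the standby has replayed at least one transaction.
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS lag")
    (lag,) = cur.fetchone()
    print(f"replication lag: {lag}")
conn.close()
```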
Key takeaways: critical performance indicators such as latency, CPU usage, memory utilization, hit rate, and the number of connected clients, replicas, and evictions must be monitored to maintain Redis's high-throughput, low-latency capabilities. Redis can achieve impressive performance, handling up to 50 million operations per second.
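Those indicators can be pulled from a live instance with redis-py's INFO command, as in this sketch. Host and port are placeholders; hit rate is derived from keyspace_hits and keyspace_misses.

```python
# Pull key Redis health indicators via INFO.
import redis

r = redis.Redis(host="127.0.0.1", port=6379)
info = r.info()

hits = info["keyspace_hits"]
misses = info["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0

print(f"connected_clients: {info['connected_clients']}")
print(f"used_memory_human: {info['used_memory_human']}")
print(f"evicted_keys:      {info['evicted_keys']}")
print(f"hit rate:          {hit_rate:.1%}")
```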
Hola Mexico! We've launched our new point of presence (POP) in Mexico City; in this case, the POP's identifier is mxmc. The POP is strategically located within the country and lowers latency overall. KeyCDN is always on the lookout for ways to minimize latency and accelerate asset delivery worldwide.
Perceptual quality measurements are used to drive video encoding optimizations, perform video codec comparisons, carry out A/B testing, and optimize streaming QoE decisions, to mention a few. This enables us to use our scale to increase throughput and reduce latencies. Stay tuned for more details on these algorithmic innovations.
Improved performance: MongoDB continually fine-tunes its database engine, resulting in faster query execution and reduced latency. Safeguarding your data, testing in a controlled environment, and having rollback plans are all part of this stage. We walk you through the essential steps required.
Nowadays, solid-state drives (SSDs) or non-volatile memory express (NVMe) drives are preferred over traditional hard disk drives (HDDs) for database servers due to their faster read and write speeds, lower latency, and improved reliability. The optimal value can be decided after testing multiple settings; starting from eight is a good choice.
Using its default tpc-b benchmark, one can stress test a database of any size, ranging from a few clients to thousands of simulated clients interacting with a system sized into the terabytes if need be. Provisioning: the first step is to provision the four nodes with both PostgreSQL and Citus.
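For reference, here is a sketch of driving pgbench from Python. The scale factor, client count, and database name are illustrative.

```python
# Initialize a tpc-b style dataset with pgbench, then run a timed stress test.
import subprocess

DB = "appdb"

# Initialize: each scale-factor unit creates 100,000 rows in pgbench_accounts,
# so -s 1000 yields a 100M-row accounts table.
subprocess.run(["pgbench", "-i", "-s", "1000", DB], check=True)

# Stress test: 50 concurrent clients on 4 worker threads for 5 minutes.
subprocess.run(["pgbench", "-c", "50", "-j", "4", "-T", "300", DB], check=True)
```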
As developers, we rightfully obsess about the customer experience, relentlessly working to squeeze every millisecond out of the critical rendering path, optimize input latency, and eliminate jank. Stay tuned for more in 2022! By Ilya Grigorik.
Dealing with ambiguities and missing data: sometimes the entries in BDP are contaminated with testing entries and NULL values, along with ambiguous values that have no meaning, or simply contradictory values due to unreal test environments. Restricting testing and analysis to one day and one device at a time.
For the single-core case, the bandwidth reported by the STREAM benchmark kernels is very close to the bandwidth for the all-read tests reported here. To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses.
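The interaction being pointed to follows Little's Law: sustained bandwidth equals bytes in flight divided by latency. A small worked example, with assumed, illustrative numbers rather than measurements from the article:

```python
# Little's Law for memory: sustained_bandwidth = in_flight_bytes / latency.
LATENCY_NS = 80.0          # assumed memory access latency
TARGET_BW_GBS = 100.0      # assumed target sustained bandwidth
CACHE_LINE_BYTES = 64

# Bytes that must be in flight at all times to sustain the target bandwidth:
in_flight_bytes = TARGET_BW_GBS * 1e9 * (LATENCY_NS * 1e-9)
outstanding_lines = in_flight_bytes / CACHE_LINE_BYTES

print(f"bytes in flight:         {in_flight_bytes:.0f}")     # 8000
print(f"outstanding cache lines: {outstanding_lines:.0f}")   # 125
```

With these numbers, a core that can keep only 10 cache-line misses outstanding sustains a small fraction of the target bandwidth, which is why concurrency, not just latency, limits single-core bandwidth.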
Netflix relies on data to power its business in all phases. Whether in analyzing A/B tests, optimizing studio production, training algorithms, investing in content acquisition, detecting security breaches, or optimizing payments, well-structured and accurate data is foundational.
Fortunately, the HammerDB TPC-C/OLTP workload intentionally has a great deal of contention between threads and is therefore ideal for testing spin-locks. As the quoted guidance notes, "as the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss."
A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. There's also a test and println() in the loop to, hopefully, convince the compiler not to optimize out an otherwise empty loop. (This will slow this test a little.) Trying it out: centos$ time java TimeBench
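The described microbenchmark is Java, but a rough Python analogue of the same idea looks like this: time the clock itself in a tight loop, keeping a check and print in the loop body to mirror the article's trick for defeating dead-code elimination. The iteration count is arbitrary.

```python
# Measure clock-read overhead, which a slow clocksource (e.g., xen vs. tsc)
# inflates.
import time

N = 1_000_000

def time_bench() -> None:
    sink = 0
    start = time.perf_counter_ns()
    for _ in range(N):
        t = time.perf_counter_ns()
        if t < 0:            # never true; mirrors the Java anti-elimination check
            print(t)
        sink ^= t
    elapsed = time.perf_counter_ns() - start
    print(f"{elapsed / N:.1f} ns per clock read (sink={sink & 1})")

time_bench()
```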
This boils down to a single-digit µs latency tolerance in the tail for far memory, which, in addition to security and privacy concerns, rules out remote memory solutions. Thus we're fundamentally trading (de)compression latency at access time for the ability to pack more data in memory. The paper also covers ML-based auto-tuning and evaluation.
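A back-of-the-envelope sketch of that trade-off, with all numbers as illustrative assumptions rather than figures from the paper:

```python
# Compressing cold pages multiplies effective capacity at the cost of a
# decompression delay on each far-memory access.
DRAM_ACCESS_NS = 100.0        # ~100 ns for a regular DRAM access
DECOMPRESS_NS = 3000.0        # ~3 µs to fault in and decompress a page
COMPRESSION_RATIO = 3.0       # cold pages shrink to ~1/3 of their size
COLD_ACCESS_FRACTION = 0.001  # fraction of accesses that hit far memory

# Average access latency when a small fraction of accesses pays the
# decompression penalty:
avg_ns = ((1 - COLD_ACCESS_FRACTION) * DRAM_ACCESS_NS
          + COLD_ACCESS_FRACTION * DECOMPRESS_NS)
print(f"average access latency: {avg_ns:.1f} ns")  # ~102.9 ns

# Effective capacity gain if 20% of memory is cold and compressed 3x:
cold_share = 0.20
saved = cold_share * (1 - 1 / COMPRESSION_RATIO)
print(f"memory saved: {saved:.1%} of total")       # ~13.3%
```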