Infrastructure, Latency and Tuning - Technology Performance Pulse

Netflix’s Distributed Counter Abstraction

The Netflix TechBlog

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Latency

Latency Cache Infrastructure Strategy

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Its partitioned log architecture supports both queuing and publish-subscribe models, allowing it to handle large-scale event processing with minimal latency. Apache Kafka uses a custom TCP/IP protocol for high throughput and low latency. Apache Kafka, designed for distributed event streaming, maintains low latency at scale.

Latency

Latency Analytics Architecture Storage

Foundation Model for Personalized Recommendation

The Netflix TechBlog

MARCH 28, 2025

Yet, many are confined to a brief temporal window due to constraints in serving latency or training costs. It facilitates the distribution of these learnings to other models, either through shared model weights for fine tuning or directly through embeddings.

Tuning

Tuning Efficiency Latency Strategy

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

OCTOBER 19, 2020

Now let’s look at how we designed the tracing infrastructure that powers Edgar. If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls.

Infrastructure

Infrastructure Transportation Storage Open Source

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

The Challenge of Title Launch Observability: As engineers, we’re wired to track system metrics like error rates, latencies, and CPU utilization, but what about metrics that matter to a title’s success? This approach provides a few advantages: Low burden on existing systems: Log processing imposes minimal changes to existing infrastructure.

Traffic

Traffic Scalability Strategy Monitoring

Why applying chaos engineering to data-intensive applications matters

Dynatrace

MAY 23, 2024

Failures can occur unpredictably across various levels, from physical infrastructure to software layers. Stream processing systems, designed for continuous, low-latency processing, demand swift recovery mechanisms to tolerate and mitigate failures effectively. This significantly increases event latency.

Engineering

Engineering Tuning Latency Open Source

Comparing PostgreSQL DigitalOcean Performance & Pricing – ScaleGrid vs. DigitalOcean Managed Databases

Scalegrid

JUNE 4, 2020

As an open source database, it’s a highly popular choice for enterprise applications looking to modernize their infrastructure and reduce their total cost of ownership, along with startup and developer applications looking for a powerful, flexible and cost-effective database to work with. Compare Latency. At a glance – TLDR.

Database

Database Latency Benchmarking Performance

Optimizing your Kubernetes clusters without breaking the bank

Dynatrace

JANUARY 14, 2022

Its ability to densely schedule containers into the underlying machines translates to low infrastructure costs. Tuning thousands of parameters has become an impossible task to achieve via a manual and time-consuming approach. SREcon21 – Automating Performance Tuning with Machine Learning. The Akamas approach.

Latency

Latency Tuning Efficiency AWS

Introducing Netflix’s Key-Value Data Abstraction Layer

The Netflix TechBlog

SEPTEMBER 18, 2024

Vidhya Arvind, Rajasekhar Ummadisetty, Joey Lynch, Vinay Chella. Introduction: At Netflix, our ability to deliver seamless, high-quality streaming experiences to millions of users hinges on robust, global backend infrastructure. It also serves as central configuration of access patterns such as consistency or latency targets.

Latency

Latency Storage Cache Efficiency

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data, often reaching petabytes, with millisecond access latency has become increasingly vital.

Latency

Latency Storage Traffic Tuning

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Dynatrace

JULY 22, 2024

Text-based records of events and activities generated by applications and infrastructure components. Traces are used for performance analysis, latency optimization, and root cause analysis. OpenTelemetry provides extensive documentation and examples to help you fine-tune your configuration for maximum effectiveness.

Latency

Latency Best Practices Metrics Open Source

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render. A Sticky Canary is an infrastructure experiment where customers are assigned either to a canary or baseline host for the entire duration of an experiment. Are things loading in time before the user loses interest?

Traffic

Traffic Latency Metrics Cache

Best MySQL DigitalOcean Performance – ScaleGrid vs. DigitalOcean Managed Databases

Scalegrid

JUNE 22, 2020

Compare Latency. On average, ScaleGrid achieves almost 30% lower latency over DigitalOcean for the same deployment configurations. Now that we’ve compared throughput performance, let’s take a look at ScaleGrid vs. DigitalOcean latency for MySQL. Read-Intensive Latency Benchmark. Balanced Workload Latency Benchmark.

Database

Database Benchmarking Latency Performance

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

The Netflix TechBlog

SEPTEMBER 10, 2024

This is particularly important as we build out new functionality that relies on Pushy; a strong, stable infrastructure foundation allows our partners to continue to build on top of Pushy with confidence. In our case, we value low latency — the faster we can read from KeyValue, the faster these messages can get delivered.

Latency

Latency Cache Tuning Efficiency

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

While clustering across wide-area networks (WANs) is discouraged due to latency issues, leased links can mitigate some connectivity challenges. With 24/7 expert support, ScaleGrid assists with troubleshooting, performance tuning, and migration processes. Keeping queues short maintains a responsive and efficient RabbitMQ setup.

Best Practices

Best Practices Traffic Strategy Efficiency

How to Improve MySQL AWS Performance 2X Over Amazon RDS at The Same Cost

Scalegrid

OCTOBER 24, 2019

As organizations continue to migrate to the cloud, it’s important to get in front of performance issues, such as high latency, low throughput, and replication lag with higher distances between your users and cloud infrastructure. ScaleGrid also maintains 53% lower latency on average throughout the entire MySQL AWS performance tests.

AWS

AWS Latency Performance Performance Testing

The Netflix Cosmos Platform

The Netflix TechBlog

MARCH 1, 2021

It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The subsystems all communicate with each other asynchronously via Timestone, a high-scale, low-latency priority queuing system. Warm capacity.

Serverless

Serverless Media Latency Social Media

What is serverless computing? Driving efficiency without sacrificing observability

Dynatrace

JANUARY 26, 2021

When an application is triggered, it can cause latency as the application starts. Cloud-hosted managed services eliminate the minute day-to-day tasks associated with hosting IT infrastructure on-premises. This creates latency when they need to restart. The platform builds the trigger to initiate the app.

Serverless

Serverless Efficiency Lambda AWS

Automated observability, security, and reliability at scale

Dynatrace

JULY 18, 2023

While infrastructure has historically been treated as a bottleneck where proper scaling and compute power are applied to improve performance, these aspects are now typically addressed by hyperscalers that offer cloud-based infrastructure and infrastructure as a service.

Best Practices

Best Practices Code Infrastructure Latency

Build and operate multicloud FaaS with enhanced, intelligent end-to-end observability

Dynatrace

APRIL 25, 2023

These functions are executed by a serverless platform or provider (such as AWS Lambda, Azure Functions or Google Cloud Functions) that manages the underlying infrastructure, scaling and billing. Enable faster development and deployment cycles by abstracting away the infrastructure complexity.

Serverless

Serverless Lambda Azure AWS

Applying Netflix DevOps Patterns to Windows

The Netflix TechBlog

AUGUST 22, 2019

Artisan Crafted Images: In the Netflix full cycle DevOps culture, the team responsible for building a service is also responsible for deploying, testing, infrastructure, and operation of that service. Now each change in the infrastructure is tested, canaried, and deployed like any other code change.

DevOps

DevOps AWS Tuning Infrastructure

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

The Netflix TechBlog

OCTOBER 27, 2020

The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly-reliable key-value store. Bulldozer abstracts the underlying infrastructure on how the data moves.

Latency

Latency Storage Big Data Tuning

Enhancing Kubernetes cluster management key to platform engineering success

Dynatrace

MARCH 29, 2024

“We use AI to optimize the configuration of the software stack,” Doni said, highlighting how Akamas works by taking into account infrastructure and application metrics at the same time to achieve its optimization goals. “You can ask for the best configuration to reduce latency or improve the user experience.”

Engineering

Engineering DevOps Operating System Open Source

Rebuilding Netflix Video Processing Pipeline with Microservices

The Netflix TechBlog

JANUARY 10, 2024

This architecture shift greatly reduced the processing latency and increased system resiliency. We expanded pipeline support to serve our studio/content-development use cases, which had different latency and resiliency requirements as compared to the traditional streaming use case. divide the input video into small chunks 2.

Processing

Processing Media Latency Innovation

Telltale: Netflix Application Monitoring Simplified

The Netflix TechBlog

AUGUST 13, 2020

You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning?” Telltale learns what constitutes typical health for an application, no alert tuning required. Infrastructure change events. Intelligent Monitoring: Every service operator knows the difficulty of alert tuning.

Monitoring

Monitoring Tuning Traffic Metrics

What’s New at ScaleGrid – September 2024

Scalegrid

SEPTEMBER 10, 2024

We’re proud to introduce AWS Outposts support, allowing you to manage cloud infrastructure on-premises while maintaining full AWS integration. Additionally, we’ve added the Philadelphia AWS Local Zone, helping to reduce latency for customers operating in the eastern U.S. Stay tuned for more exciting updates in the months to come!

Latency

Latency AWS Storage Tuning

Streaming SQL in Data Mesh

The Netflix TechBlog

NOVEMBER 3, 2023

On the Data Platform team, we build the infrastructure used across the company to process data at scale. This includes features such as autoscaling, the ability to manage pipelines declaratively via Infrastructure as Code, and a rich connector ecosystem. Stay tuned for more updates!

Processing

Processing Engineering Infrastructure Latency

Keeping Netflix Reliable Using Prioritized Load Shedding

The Netflix TechBlog

NOVEMBER 2, 2020

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure. By Manuel Correa, Arthur Gonigberg, and Daniel West. Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Those two metrics are approximate indicators of failures and latency.

Traffic

Traffic Metrics Infrastructure Architecture

How digital experience monitoring helps deliver business observability

Dynatrace

APRIL 26, 2022

Gartner estimates that by 2025, 70% of digital business initiatives will require infrastructure and operations (I&O) leaders to include digital experience metrics in their business reporting. With DEM solutions, organizations can operate over on-premise network infrastructure or private or public cloud SaaS or IaaS offerings.

Monitoring

Monitoring Social Media IoT Metrics

What are SLOs? How service-level objectives work with SLIs to deliver on SLAs

Dynatrace

DECEMBER 2, 2021

SLOs can be a great way for DevOps and infrastructure teams to use data and performance expectations to make decisions, such as whether to release, and where engineers should focus their time. You can set SLOs based on individual indicators, such as batch throughput, request latency, and failures-per-second. Help with decision making.

Metrics

Metrics Best Practices DevOps Infrastructure

DevOps automation: From event-driven automation to answer-driven automation [with causal AI]

Dynatrace

JULY 24, 2023

It enables them to adapt to user feedback swiftly, fine-tune feature releases, and deliver exceptional user experiences, all while maintaining control and minimizing disruption. Change impact analysis is an indispensable process for effectively managing changes within an organization’s infrastructure and applications.

DevOps

DevOps Traffic Efficiency Servers

Towards a Reliable Device Management Platform

The Netflix TechBlog

AUGUST 30, 2021

The Partner Infrastructure team at Netflix provides solutions to support these two significant efforts by enabling device management at scale. Together, they form the Device Management Platform, which is the infrastructural foundation for Netflix Test Studio (NTS). million elements. this is configurable through enable.auto.commit.

Latency

Latency Traffic Transportation Cloud

Optimizing data warehouse storage

The Netflix TechBlog

DECEMBER 21, 2020

These principles reduce resource usage by being more efficient and effective while lowering the end-to-end latency in data processing. Orient: Gather tuning parameters for a particular table that changed. AutoAnalyze: In short, AutoAnalyze finds the best tuning/configuration parameters for a table. More processing resources.

Storage

Storage Latency Efficiency Data Engineering

Tuning SQL Server Reporting Services

SQL Performance

SEPTEMBER 17, 2019

This article will cover many areas that database administrators need to be aware of in order to properly license, recover, and tune a Reporting Services installation. Tuning Options. Tuning SSRS is much like any other application. Reporting Services Infrastructure. General Tuning.

Tuning

Tuning Servers Database Best Practices

Achieving observability in async workflows

The Netflix TechBlog

MAY 14, 2021

We are expected to process 1,000 watermarks for a single distribution in a minute, with non-linear latency growth as the number of watermarks increases. Even though Cosmos was developed for asynchronous media processing, we worked with them to expand to generic file processing and tune their workflow platform for our near real-time use case.

Traffic

Traffic Java Latency Google

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Dynatrace

MAY 17, 2023

Heading over to `Infrastructure` / `Hosts` in your dashboard, you should now have an entry for the host where you installed OneAgent. The other sections on that page (such as Disk analysis) provide further information and charts on topics such as available disk space, latency, dropped network packets, refused connections, and more.

Metrics

Metrics Database Monitoring Network

Data Movement in Netflix Studio via Data Mesh

The Netflix TechBlog

JULY 26, 2021

Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities¹ and processes of a business domain. Most of the business views created on top of the Iceberg tables can tolerate a few minutes of latency. Please stay tuned! Dehghani, Zhamak.

Big Data

Big Data Government Processing Analytics

Netflix Video Quality at Scale with Cosmos Microservices

The Netflix TechBlog

NOVEMBER 2, 2021

This enables us to use our scale to increase throughput and reduce latencies. Here, based on the video length, the throughput and latency requirements, available scale etc., Stay tuned for more details on these algorithmic innovations. VQS is called using the measureQuality endpoint. The workflow is initiated.

Media

Media Innovation Metrics Latency

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

O'Reilly

MARCH 25, 2025

What we see here, though, is the emergence of the first iterations of the LLM SDLC: We’re not yet changing our embeddings, fine-tuning, or business logic; we’re not using unit tests, CI/CD, or even a serious evaluation framework, but we’re building, deploying, monitoring, evaluating, and iterating! We tested both retrieval quality (e.g.,

Systems

Systems Development Tuning Monitoring

Meet Hydrogen: A React Framework For Dynamic, Contextual And Personalized E-Commerce

Smashing Magazine

NOVEMBER 8, 2021

As developers, we rightfully obsess about the customer experience, relentlessly working to squeeze every millisecond out of the critical rendering path, optimize input latency, and eliminate jank. Stay tuned for more in 2022! Ilya Grigorik.

Cache

Cache Best Practices Strategy Servers

Most Common RabbitMQ Use Cases

Scalegrid

AUGUST 27, 2024

It is versatile enough for deployment in cloud-based infrastructures, on-premise data centers, or local setups, delivering a dependable and adaptable messaging framework. Furthermore, RabbitMQ embraces an acknowledgment pattern within its infrastructure, ensuring reliable message processing. Take Softonic’s platform as an example.

IoT

IoT Ecommerce Games Scalability

MySQL Performance Tuning 101: Key Tips to Improve MySQL Database Performance

Percona

SEPTEMBER 1, 2023

While there is no magic bullet for MySQL performance tuning, there are a few areas that can be focused on upfront that can dramatically improve the performance of your MySQL installation. What are the Benefits of MySQL Performance Tuning? A finely tuned database processes queries more efficiently, leading to swifter results.

Tuning

Tuning Database Performance Hardware

Accelerate Machine Learning with Amazon SageMaker

All Things Distributed

NOVEMBER 29, 2017

After this, there is often a long process of training that includes tuning the knobs and levers, called hyperparameters, that control the different aspects of the training algorithm. Built-in model tuning (hyperparameter optimization) that can automatically adjust hundreds of different combinations of algorithm parameters.

Tuning

Tuning AWS Scalability Infrastructure

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications

All Things Distributed

OCTOBER 2, 2017

Our straining database infrastructure on Oracle led us to evaluate if we could develop a purpose-built database that would support our business needs for the long term. Performant – DynamoDB consistently delivers single-digit millisecond latencies even as your traffic volume increases.

Internet

Internet AWS Performance
