By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix's TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we're excited to present the Distributed Counter Abstraction.
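Netflix's counter abstraction is a distributed service whose internals aren't shown in the excerpt above. As a rough illustration of the underlying idea of spreading write contention, here is a minimal in-memory sketch of a sharded counter; the shard count, class, and API names are assumptions for illustration, not Netflix's design:

```python
import random
from collections import defaultdict

NUM_SHARDS = 16  # assumed; real systems tune this to write volume

class ShardedCounter:
    """Split one hot counter across shards so concurrent writers rarely collide."""

    def __init__(self):
        self.shards = defaultdict(int)  # shard_id -> partial count

    def increment(self, delta=1):
        # Writes land on a random shard, spreading contention across keys.
        self.shards[random.randrange(NUM_SHARDS)] += delta

    def value(self):
        # Reads aggregate all shards; in a replicated system this view may lag.
        return sum(self.shards.values())

counter = ShardedCounter()
for _ in range(1000):
    counter.increment()
print(counter.value())  # 1000
```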
Design a photo-sharing platform similar to Instagram where users can upload their photos and share them with their followers. High Level Design. FUN FACT: In this talk, Rodrigo Schmidt, director of engineering at Instagram, talks about the different challenges they have faced in scaling the data infrastructure at Instagram.
RabbitMQ is designed for flexible routing and message reliability, while Kafka handles high-throughput event streaming and real-time data processing. Kafka's design prioritizes high availability and efficient data transfer with minimal overhead, making it a practical choice for handling real-time data pipelines and distributed event processing.
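As a sketch of what RabbitMQ's "flexible routing" looks like in practice, here is a topic-exchange example using the pika client; it assumes a broker running on localhost, and the exchange, queue, and routing-key names are illustrative:

```python
import pika  # third-party RabbitMQ client: pip install pika

# Connect to a broker assumed to be running locally.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A topic exchange routes each message by pattern-matching its routing key.
channel.exchange_declare(exchange="events", exchange_type="topic")

channel.queue_declare(queue="billing")
# This queue receives only order events, regardless of region.
channel.queue_bind(queue="billing", exchange="events", routing_key="order.*")

channel.basic_publish(
    exchange="events",
    routing_key="order.eu",   # matches "order.*", so "billing" receives it
    body=b"order #42 placed",
)
connection.close()
```

Kafka, by contrast, would have producers append to a partitioned topic and let consumer groups pull at their own pace, which is what suits it to high-throughput streaming.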
Now let’s look at how we designed the tracing infrastructure that powers Edgar. If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls.
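The core idea is that one ID minted at the edge is threaded through every downstream call, so the tracing backend can join all spans after the fact. A minimal sketch, assuming a hypothetical header name and service names (the excerpt doesn't give Netflix's actual tracing headers):

```python
import uuid

# Hypothetical header name, for illustration only.
SESSION_HEADER = "x-session-id"

def start_session(headers: dict) -> dict:
    # Mint one ID at the edge; every downstream call carries it.
    headers = dict(headers)
    headers.setdefault(SESSION_HEADER, str(uuid.uuid4()))
    return headers

def call_downstream(service: str, headers: dict) -> None:
    # A real client would attach these headers to the outbound request;
    # the tracing backend can then join all spans on the session ID.
    print(f"{service} <- {headers[SESSION_HEADER]}")

headers = start_session({})
for service in ("playback-api", "license-service", "cdn-selector"):
    call_downstream(service, headers)
```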
With the rise of microservices architecture, there has been a rapid acceleration in the modernization of legacy platforms, leveraging cloud infrastructure to deliver highly scalable, low-latency, and more responsive services. Why Use Spring WebFlux?
Yet, many are confined to a brief temporal window due to constraints in serving latency or training costs. These insights have shaped the design of our foundation model, enabling a transition from maintaining numerous small, specialized models to building a scalable, efficient system.
The Challenge of Title Launch Observability: As engineers, we're wired to track system metrics like error rates, latencies, and CPU utilization, but what about the metrics that matter to a title's success? How can we design systems that recognize these nuances and empower every title to shine and bring joy to our members?
Microsoft Azure is one of the most popular cloud providers in the world, and a natural fit for database hosting on applications leveraging Microsoft across their infrastructure. We benchmark ScaleGrid MySQL on Azure so you can see which provider offers the best throughput and latency performance, measuring latency as 95th-percentile latency in milliseconds.
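For reference, 95th-percentile latency can be computed directly from raw timing samples; the values below are made up for illustration:

```python
import numpy as np

latencies_ms = [12, 15, 14, 13, 220, 16, 15, 14, 18, 300]  # sample timings
p95 = np.percentile(latencies_ms, 95)
print(f"p95 latency:  {p95:.1f} ms")

# The mean hides the slow tail; the 95th percentile surfaces it, which is
# why database benchmarks report percentiles rather than averages.
print(f"mean latency: {np.mean(latencies_ms):.1f} ms")
```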
Vidhya Arvind, Rajasekhar Ummadisetty, Joey Lynch, Vinay Chella. Introduction: At Netflix, our ability to deliver seamless, high-quality streaming experiences to millions of users hinges on robust, global backend infrastructure. It also serves as a central configuration point for access patterns such as consistency or latency targets.
One of the quickest wins, and one of the first things I recommend my clients do, to make websites faster can at first seem counter-intuitive: you should self-host all of your static assets, forgoing others' CDNs/infrastructure. On a slower, higher-latency connection, the story is much, much worse. You're going to suffer, too.
Plotted on the same horizontal axis of 1.6s, the waterfalls speak for themselves: 201ms of cumulative latency; 109ms of cumulative download. 4,362ms of cumulative latency; 240ms of cumulative download. When we talk about downloading files, we—generally speaking—have two things to consider: latency and bandwidth. It gets worse.
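As a rough model of how those waterfall numbers arise, here is a back-of-envelope calculation. The RTT, bandwidth, and file sizes are assumptions rather than the article's measurements, and fetches are treated as sequential for simplicity:

```python
RTT_MS = 100          # assumed round-trip time on a high-latency connection
BANDWIDTH_MBPS = 40   # assumed downlink

def fetch_time_ms(size_kb: float, round_trips: int) -> float:
    latency = round_trips * RTT_MS
    download = (size_kb * 8) / (BANDWIDTH_MBPS * 1000) * 1000  # ms
    return latency + download

# Many small cross-origin files pay the latency cost over and over
# (assuming sequential fetches for simplicity)...
small_files = sum(fetch_time_ms(size_kb=20, round_trips=2) for _ in range(10))
# ...while one self-hosted bundle of the same total size pays it once.
one_bundle = fetch_time_ms(size_kb=200, round_trips=2)
print(f"10 small cross-origin files: {small_files:.0f} ms")  # ~2040 ms
print(f"1 same-origin bundle:        {one_bundle:.0f} ms")   # ~240 ms
```

The download time barely moves; the repeated round trips are what dominate, which is exactly the latency-versus-bandwidth split the waterfalls show.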
Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data (often reaching petabytes) with millisecond access latency has become increasingly vital.
The architecture of RabbitMQ is meticulously designed for complex message routing, enabling dynamic and flexible interactions between producers and consumers. While clustering across wide-area networks (WANs) is discouraged due to latency issues, leased links can mitigate some connectivity challenges.
Failures can occur unpredictably across various levels, from physical infrastructure to software layers. Stream processing systems, designed for continuous, low-latency processing, demand swift recovery mechanisms to tolerate and mitigate failures effectively. We designed experimental scenarios inspired by chaos engineering.
In these modern environments, every hardware, software, and cloud infrastructure component and every container, open-source tool, and microservice generates records of every activity. The architects and developers who create the software must design it to be observed. Benefits of observability.
Data dependencies and framework intricacies require observing the lifecycle of an AI-powered application end to end, from infrastructure and model performance to semantic caches and workflow orchestration. Estimates show that NVIDIA, a semiconductor manufacturer, could release 1.5 million AI server units annually by 2027, consuming 75.4+
The network latency between cluster nodes should be around 10 ms or less. For Premium HA, this has been extended from 10 ms latency (in the same network region) to around 100 ms network latency due to asynchronous data replication between regions. In the image below, three downed nodes make an entire cluster unavailable.
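A quick way to sanity-check inter-node latency against those thresholds is to time a TCP connection handshake, which costs roughly one round trip. A minimal sketch; the host and port below are placeholders for a peer cluster node:

```python
import socket
import time

def tcp_rtt_ms(host: str, port: int, timeout: float = 1.0) -> float:
    # A TCP connect completes after one SYN/SYN-ACK exchange, so the
    # elapsed time approximates one network round trip.
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000

rtt = tcp_rtt_ms("peer-node.example.internal", 5672)  # placeholder peer
if rtt > 10:
    print(f"WARN: {rtt:.1f} ms exceeds the ~10 ms same-region guidance")
```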
Welcome back to the blog series in which we share how you can easily solve three common problem scenarios by using Dynatrace and xMatters Flow Designer. Step 1: Let Dynatrace analyze your infrastructure health in real time. This is where xMatters Flow Designer comes into play, automating remediation steps at the touch of a button.
A typical design pattern is the use of a semantic search over a domain-specific knowledge base, like internal documentation, to provide the required context in the prompt. With these latency, reliability, and cost measurements in place, your operations team can now define their own OpenAI dashboards and SLOs.
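A minimal sketch of that pattern, assuming an embed() function is supplied by your embedding provider; the ranking and prompt construction below are illustrative, not a specific product's API:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_context(query: str, docs: list[str], embed, top_k: int = 2) -> str:
    # Score every document against the query and keep the closest matches.
    q_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(embed(d), q_vec), reverse=True)
    return "\n".join(ranked[:top_k])

def build_prompt(query: str, context: str) -> str:
    # The retrieved passages ground the model's answer in internal docs.
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

A production system would cache document embeddings and use a vector index rather than re-scoring every document per query, but the shape of the pattern is the same.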
To support this growth, we’ve revisited Pushy’s past assumptions and design decisions with an eye towards both Pushy’s future role and future stability. In our case, we value low latency — the faster we can read from KeyValue, the faster these messages can get delivered.
The big difference from the monolith, though, is that this is now a standalone service deployed as a separate “application” (service) in our cloud infrastructure. Being able to canary a new route let us verify latency and error rates were within acceptable limits. For the migration, testing was a first-class citizen.
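A canary gate of that kind reduces to a simple comparison of the new route's metrics against the baseline. A minimal sketch; the 10% p95 allowance and 1% error-rate limit are assumed thresholds, not the article's:

```python
import numpy as np

def canary_passes(baseline_ms, canary_ms, canary_errors, canary_total) -> bool:
    p95_baseline = np.percentile(baseline_ms, 95)
    p95_canary = np.percentile(canary_ms, 95)
    error_rate = canary_errors / canary_total
    # Allow up to a 10% p95 regression and a 1% error rate (assumed limits).
    return p95_canary <= 1.10 * p95_baseline and error_rate <= 0.01

ok = canary_passes(
    baseline_ms=[40, 42, 45, 50, 41], canary_ms=[41, 43, 44, 46, 52],
    canary_errors=3, canary_total=1000,
)
print("promote canary" if ok else "roll back")
```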
It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The subsystems all communicate with each other asynchronously via Timestone, a high-scale, low-latency priority queuing system.
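Timestone itself is not public code, but the core idea of serving latency-sensitive work ahead of batch work is a priority queue. A minimal in-process analogue, where lower priority numbers are served first:

```python
import heapq
import itertools

class PriorityQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, priority: int, item):
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def pop(self):
        _, _, item = heapq.heappop(self._heap)
        return item

q = PriorityQueue()
q.push(2, "batch-render")        # high-throughput background work
q.push(0, "interactive-query")   # a human is waiting on this one
q.push(1, "model-training")
print(q.pop())  # interactive-query comes out first
```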
Organizations can offload much of the burden of managing app infrastructure and transition many functions to the cloud by going serverless with the help of Lambda. AWS continues to improve how it handles latency issues. An application could rely on dozens or even hundreds of Lambdas and other infrastructure.
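For reference, a Python Lambda function is just a handler with this signature; the echoed payload is illustrative, and the function's configuration is assumed to point at module.handler:

```python
import json

def handler(event, context):
    # Lambda injects `context` with metadata such as the request ID;
    # `event` is whatever the triggering service sends.
    return {
        "statusCode": 200,
        "body": json.dumps(
            {"received": event, "requestId": context.aws_request_id}
        ),
    }
```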
SRE is the transformation of traditional operations practices by using software engineering and DevOps principles to improve the availability, performance, and scalability of releases by building resiliency into apps and infrastructure. Designating and managing Service Level Objectives (SLOs) as availability targets for a service.
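An availability SLO translates directly into an error budget. A small worked example; the 99.9% target and observed downtime are illustrative:

```python
SLO_TARGET = 0.999               # assumed availability target
MINUTES_IN_30_DAYS = 30 * 24 * 60

error_budget_min = (1 - SLO_TARGET) * MINUTES_IN_30_DAYS
print(f"Allowed downtime per 30 days: {error_budget_min:.1f} minutes")  # 43.2

downtime_so_far = 12.0  # minutes of observed downtime this window
remaining = error_budget_min - downtime_so_far
print(f"Error budget remaining: {remaining:.1f} minutes")
```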
ITOps is an IT discipline involving actions and decisions made by the operations team responsible for an organization’s IT infrastructure. ITOps refers to the process of acquiring, designing, deploying, configuring, and maintaining equipment and services that support an organization’s desired business outcomes.
Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.
Our approach to NN-based video downscaling: The deep downscaler is a neural network architecture designed to improve the end-to-end video quality by learning a higher-quality video downscaler. We employed an adaptive network design that is applicable to the wide variety of resolutions we use for encoding.
Providing insight into service latency helps developers identify poorly performing code, and UX designers can use that data to better understand how users interact with an application and how developers can streamline the interface. There are also some limitations of real user monitoring.
Netflix shares how Amazon EC2 Auto Scaling allows its infrastructure to automatically adapt to changing traffic patterns in order to keep its audience entertained and its costs on target. In this talk, we share how Netflix deploys systems to meet its demands, Ceph’s design for high availability, and results from our benchmarking.
This architecture shift greatly reduced the processing latency and increased system resiliency. We expanded pipeline support to serve our studio/content-development use cases, which had different latency and resiliency requirements as compared to the traditional streaming use case: 1. divide the input video into small chunks; 2.
Berg, Romain Cledat, Kayla Seeley, Shashank Srikanth, Chaoying Wang, Darin Yu. Netflix uses data science and machine learning across all facets of the company, powering a wide range of business applications from our internal infrastructure and content demand modeling to media understanding.
Data lakehouses deliver the query response with minimal latency. Designed to provide a single source of truth for structured data, they offer a way for organizations to simplify data management by centralizing inputs. The performance of these queries needs to be at a level where they can support ad-hoc analytics use cases.
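Lakehouse engines differ, but as an illustrative sketch of low-latency, ad-hoc queries over open-format data, DuckDB can scan a Parquet file directly with SQL; the file path and columns below are hypothetical:

```python
import duckdb  # pip install duckdb

# Query a Parquet file in place: no load step, no separate warehouse.
result = duckdb.sql(
    """
    SELECT region, count(*) AS orders, avg(amount) AS avg_amount
    FROM 'orders.parquet'
    GROUP BY region
    ORDER BY orders DESC
    """
)
print(result)
```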
Generally speaking, cloud migration involves moving from on-premises infrastructure to cloud-based services. In cloud computing environments, infrastructure and services are maintained by the cloud vendor, allowing you to focus on how best to serve your customers. However, it can also mean migrating from one cloud to another.
Amazon DynamoDB: a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications. Today is a very exciting day as we release Amazon DynamoDB, a fast, highly reliable, and cost-effective NoSQL database service designed for internet-scale applications. Amazon DynamoDB offers low, predictable latencies at any scale.
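For a feel of the API, a minimal read/write with boto3 might look like this; it assumes a table named "users" with partition key "user_id" already exists and AWS credentials are configured:

```python
import boto3

# The table name and key schema here are assumptions for illustration.
table = boto3.resource("dynamodb").Table("users")

table.put_item(Item={"user_id": "42", "name": "Ada"})
resp = table.get_item(Key={"user_id": "42"})
print(resp.get("Item"))  # {'user_id': '42', 'name': 'Ada'}
```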
The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly-reliable key-value store. Bulldozer abstracts the underlying infrastructure on how the data moves.
By adopting a cloud- and edge-based AI approach, teams can benefit from the flexibility, scalability, and pay-per-use model of the cloud while also reducing the latency, bandwidth, and cost of sending AI data to cloud-based operations. Use containerization. Continuously monitor AI models’ performance.
If you use AWS cloud services to build and run your applications, you may be familiar with the AWS Well-Architected Framework: a set of best practices and guidelines that help you design and operate reliable, secure, efficient, cost-effective, and sustainable systems in the cloud.
SLOs can be a great way for DevOps and infrastructure teams to use data and performance expectations to make decisions, such as whether to release, and where engineers should focus their time. You can set SLOs based on individual indicators, such as batch throughput, request latency, and failures-per-second. Help with decision making.
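A minimal sketch of evaluating those individual indicators against target thresholds; all values and targets below are illustrative, not a particular team's SLOs:

```python
slis = {
    "batch_throughput_per_s": 1200,
    "request_latency_p95_ms": 180,
    "failures_per_s": 0.4,
}
targets = {
    "batch_throughput_per_s": lambda v: v >= 1000,   # assumed floor
    "request_latency_p95_ms": lambda v: v <= 200,    # assumed ceiling
    "failures_per_s": lambda v: v <= 1.0,            # assumed ceiling
}

for name, value in slis.items():
    status = "OK  " if targets[name](value) else "MISS"
    print(f"{status} {name} = {value}")
```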
This freshness measurement can then be used by out-of-the-box Dynatrace anomaly detection to actively alert on abnormal changes within the data ingest latency to ensure the expected freshness of all the data records. This requires monitoring of the upstream infrastructure, applications, or platform supporting those data streams.
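The measurement itself reduces to comparing the newest ingested record's timestamp to "now". A minimal sketch, assuming a 15-minute threshold for illustration (Dynatrace's actual ingest metrics aren't shown here):

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLO = timedelta(minutes=15)  # assumed threshold

def freshness_lag(latest_record_ts: datetime) -> timedelta:
    # How far behind real time is the newest record we've ingested?
    return datetime.now(timezone.utc) - latest_record_ts

latest = datetime.now(timezone.utc) - timedelta(minutes=22)  # sample record
lag = freshness_lag(latest)
if lag > FRESHNESS_SLO:
    print(f"ALERT: data is {lag.total_seconds() / 60:.0f} min stale")
```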
The Partner Infrastructure team at Netflix provides solutions to support these two significant efforts by enabling device management at scale. Together, they form the Device Management Platform, which is the infrastructural foundation for Netflix Test Studio (NTS).
Subsequent posts will go into more details on experimentation across Netflix, how Netflix has invested in infrastructure to support and scale experimentation, and the importance of the culture of experimentation within Netflix. Have a look at Part 1 (Decision Making at Netflix), Part 2 (What is an A/B Test?),
This article will list some of the use cases of AutoOptimize, discuss the design principles that help enhance efficiency, and present the high-level architecture. These principles, including transparency to end users, reduce resource usage by being more efficient and effective while lowering the end-to-end latency in data processing.
Gartner estimates that by 2025, 70% of digital business initiatives will require infrastructure and operations (I&O) leaders to include digital experience metrics in their business reporting. With DEM solutions, organizations can operate over on-premise network infrastructure or private or public cloud SaaS or IaaS offerings.
This architecture affords Amazon ECS high availability, low latency, and high throughput because the data store is never pessimistically locked. As you can see, the latency remains relatively jitter-free despite large fluctuations in the cluster size. Most of this infrastructure was deployed as a large monolithic application.
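The ECS data store's code isn't public, but the "never pessimistically locked" idea is optimistic concurrency: writers hold no locks, and an update succeeds only if the version it read is still current. A minimal sketch of the general technique, not Amazon's implementation:

```python
class VersionConflict(Exception):
    pass

store = {}  # key -> (version, value)

def read(key):
    return store.get(key, (0, None))

def write(key, expected_version, value):
    # Compare-and-swap: commit only if no one else wrote since our read.
    current_version, _ = store.get(key, (0, None))
    if current_version != expected_version:
        raise VersionConflict(
            f"{key}: expected v{expected_version}, found v{current_version}"
        )
    store[key] = (current_version + 1, value)

version, _ = read("cluster-state")
write("cluster-state", version, {"instances": 12})      # succeeds, now v1
try:
    write("cluster-state", version, {"instances": 13})  # stale read
except VersionConflict as e:
    print("retry after conflict:", e)
```

Under contention, losers simply re-read and retry, so no writer ever blocks another; that is what keeps latency low and jitter-free as the cluster scales.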