Infrastructure, Latency and Systems - Technology Performance Pulse

Netflix’s Distributed Counter Abstraction

The Netflix TechBlog

NOVEMBER 12, 2024

By: Rajiv Shringi , Oleksii Tkachuk , Kartik Sathyanarayanan Introduction In our previous blog post, we introduced Netflix’s TimeSeries Abstraction , a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Latency

Latency Cache Infrastructure Strategy

Comparing Approaches to Durability in Low Latency Messaging Queues

DZone

AUGUST 2, 2022

A significant feature of Chronicle Queue Enterprise is support for TCP replication across multiple servers to ensure the high availability of application infrastructure. Little’s Law and Why Latency Matters. In many cases, the assumption is that as long as throughput is high enough, the latency won’t be a problem.

Latency

Latency Benchmarking Network Infrastructure

Next-level interaction and customization of data visualizations in Dynatrace Dashboards and Notebooks

Dynatrace

OCTOBER 10, 2024

Take your monitoring, data exploration, and storytelling to the next level with outstanding data visualization All your applications and underlying infrastructure produce vast volumes of data that you need to monitor or analyze for insights. Infrastructure health: A honeycomb chart is often used to visualize infrastructure health.

Latency

Latency Infrastructure Monitoring Metrics

Cloud infrastructure monitoring in action: Dynatrace on Dynatrace

Dynatrace

SEPTEMBER 29, 2020

A critical component to this success was that the Dynatrace Team itself uses the Dynatrace Platform to monitor every single Dynatrace cluster in the cloud and trusts the Dynatrace Davis AI to alert in case there are any issues, either with a new feature, a configuration change or with the infrastructure our servers are running on.

Infrastructure

Infrastructure Cloud Monitoring AWS

Cut costs and complexity: 5 strategies for reducing tool sprawl with Dynatrace

Dynatrace

APRIL 10, 2025

Dynatrace integrates application performance monitoring (APM), infrastructure monitoring, and real-user monitoring (RUM) into a single platform, with its Foundation & Discovery mode offering a cost-effective, unified view of the entire infrastructure, including non-critical applications previously monitored using legacy APM tools.

Strategy

Strategy Storage Network Architecture

Who will watch the watchers? Extended infrastructure observability for WSO2 API Manager

Dynatrace

SEPTEMBER 18, 2020

Sure, cloud infrastructure requires comprehensive performance visibility, as Dynatrace provides , but the services that leverage cloud infrastructures also require close attention. Extend infrastructure observability to WSO2 API Manager. High latency or lack of responses. Soaring number of active connections.

Infrastructure

Infrastructure Latency Metrics Cloud

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Introduction to Message Brokers Message brokers enable applications, services, and systems to communicate by acting as intermediaries between senders and receivers. This decoupling simplifies system architecture and supports scalability in distributed environments.

Latency

Latency Analytics Architecture Storage

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

OCTOBER 19, 2020

which is difficult when troubleshooting distributed systems. Now let’s look at how we designed the tracing infrastructure that powers Edgar. This insight led us to build Edgar: a distributed tracing infrastructure and user experience. Investigating a video streaming failure consists of inspecting all aspects of a member account.

Infrastructure

Infrastructure Transportation Storage Open Source

Build systems more reliably with Dynatrace: Chaos Engineering

Dynatrace

AUGUST 21, 2024

These releases often assumed ideal conditions such as zero latency, infinite bandwidth, and no network loss, as highlighted in Peter Deutsch’s eight fallacies of distributed systems. With Dynatrace, teams can seamlessly monitor the entire system, including network switches, database storage, and third-party dependencies.

Engineering

Engineering Systems Latency Metrics

Foundation Model for Personalized Recommendation

The Netflix TechBlog

MARCH 28, 2025

By Ko-Jen Hsiao , Yesu Feng and Sudarshan Lamkhede Motivation Netflixs personalized recommender system is a complex system, boasting a variety of specialized machine learned models each catering to distinct needs including Continue Watching and Todays Top Picks for You. Refer to our recent overview for more details).

Tuning

Tuning Efficiency Latency Strategy

Supporting Diverse ML Systems at Netflix

The Netflix TechBlog

MARCH 7, 2024

Berg , Romain Cledat , Kayla Seeley , Shashank Srikanth , Chaoying Wang , Darin Yu Netflix uses data science and machine learning across all facets of the company, powering a wide range of business applications from our internal infrastructure and content demand modeling to media understanding. ETL workflows), as well as downstream (e.g.

Systems

Systems Media Cache Open Source

Solve hybrid Kubernetes performance and reliability problems with unified observability

Dynatrace

APRIL 10, 2025

In modern containerized environments, teams often deploy Kubernetes across mixed operating systems, creating a situation where both Linux and Windows nodes reside in the same cluster. This inconsistency leads to gaps in monitoring and alerting, making it difficult to maintain a unified view of the cluster’s health.

Performance

Performance Java Operating System Infrastructure

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

To achieve this, we are committed to building robust systems that deliver comprehensive observability, enabling us to take full accountability for every title on ourservice. Each title represents countless hours of effort and creativity, and our systems need to honor that uniqueness. Yet, these pages couldnt be more different.

Traffic

Traffic Scalability Strategy Monitoring

Build automated self-healing systems with xMatters and Dynatrace (Part 2 of 3)

Dynatrace

AUGUST 27, 2019

Step 1 – Let Dynatrace analyze your infrastructure health in real-time. The Dynatrace all-in-one software intelligence platform gives your team real-time visibility into your underlying infrastructure —be it on bare metal, VMware, OpenStack, AWS, Azure, or a hybrid solution. xMatters creates and updates Jira issues.

Systems

Systems DevOps Latency Azure

Noisy Neighbor Detection with eBPF

The Netflix TechBlog

SEPTEMBER 10, 2024

The first step is determining whether the problem originates from the application or the underlying infrastructure. On Titus , our multi-tenant compute platform, a "noisy neighbor" refers to a container or system service that heavily utilizes the server's resources, causing performance degradation in adjacent containers.

Latency

Latency Metrics Programming Monitoring

Optimize Citrix platform performance and user experience with Dynatrace (GA)

Dynatrace

JANUARY 15, 2020

Therefore, it requires multidimensional and multidisciplinary monitoring: Infrastructure health —automatically monitor the compute, storage, and network resources available to the Citrix system to ensure a stable platform. OneAgent: Citrix infrastructure performance. OneAgent: SAP infrastructure performance. Citrix VDA.

Latency

Latency Performance Virtualization Infrastructure

Introducing Netflix’s Key-Value Data Abstraction Layer

The Netflix TechBlog

SEPTEMBER 18, 2024

Vidhya Arvind , Rajasekhar Ummadisetty , Joey Lynch , Vinay Chella Introduction At Netflix our ability to deliver seamless, high-quality, streaming experiences to millions of users hinges on robust, global backend infrastructure. Data Model At its core, the KV abstraction is built around a two-level map architecture.

Latency

Latency Storage Cache Servers

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

Scaling RabbitMQ ensures your system can handle growing traffic and maintain high performance. Key Takeaways RabbitMQ improves scalability and fault tolerance in distributed systems by decoupling applications, enabling reliable message exchanges.

Best Practices

Best Practices Traffic Strategy Scalability

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Dynatrace

JULY 22, 2024

Using OpenTelemetry, developers can collect and process telemetry data from applications, services, and systems. Observability Observability is the ability to determine a system’s health by analyzing the data it generates, such as logs, metrics, and traces. There are three main types of telemetry data: Metrics.

Latency

Latency Best Practices Metrics Open Source

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

O'Reilly

MARCH 25, 2025

The system is inconsistent, slow, hallucinatingand that amazing demo starts collecting digital dust. Two big things: They bring the messiness of the real world into your system through unstructured data. When your system is both ingesting messy real-world data AND producing nondeterministic outputs, you need a different approach.

Systems

Systems Development Tuning Monitoring

Implementing service-level objectives to improve software quality

Dynatrace

DECEMBER 27, 2022

SLOs can be a great way for DevOps and infrastructure teams to use data and performance expectations to make decisions, such as whether to release and where engineers should focus their time. Latency is the time that it takes a request to be served. SLOs aid decision making. SLOs promote automation. Define SLOs for each service.

Software

Software Software Benchmarking Latency

Optimizing your Kubernetes clusters without breaking the bank

Dynatrace

JANUARY 14, 2022

Its ability to densely schedule containers into the underlying machines translates to low infrastructure costs. The optimization goal was to improve the application efficiency, that is to improve the ratio between service throughput and cloud costs while not increasing the application latency (e.g. below 500ms) and error rates (e.g.

Latency

Latency Tuning Efficiency AWS

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render. The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. Are things loading in time before the user loses interest?

Traffic

Traffic Latency Metrics Cache

What is observability? Not just logs, metrics and traces

Dynatrace

OCTOBER 1, 2021

As dynamic systems architectures increase in complexity and scale, IT teams face mounting pressure to track and respond to conditions and issues across their multi-cloud environments. Dynatrace news. To get an analyst’s perspective, you can hear how Nancy Gohring from 451 Research defines observability: Why is observability important?

Metrics

Metrics Open Source Monitoring Cloud

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

Rajiv Shringi Vinay Chella Kaidan Fullerton Oleksii Tkachuk Joey Lynch Introduction As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming , the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.

Latency

Latency Storage Traffic Tuning

Site reliability engineering: 5 things you need to know

Dynatrace

FEBRUARY 4, 2021

Site reliability engineering (SRE) is the practice of applying software engineering principles to operations and infrastructure processes to help organizations create highly reliable and scalable software systems. SRE applies DevOps principles to developing systems and software that help increase site reliability and performance.

Engineering

Engineering DevOps Government Latency

Why applying chaos engineering to data-intensive applications matters

Dynatrace

MAY 23, 2024

Stream processing One approach to such a challenging scenario is stream processing, a computing paradigm and software architectural style for data-intensive software systems that emerged to cope with requirements for near real-time processing of massive amounts of data. This significantly increases event latency.

Engineering

Engineering Tuning Latency Open Source

The Netflix Cosmos Platform

The Netflix TechBlog

MARCH 1, 2021

It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The first generation of this system went live with the streaming launch in 2007. Delivery?—?A

Serverless

Serverless Media Latency Social Media

What is full stack observability?

Dynatrace

APRIL 6, 2022

Endpoints include on-premises servers, Kubernetes infrastructure, cloud-hosted infrastructure and services, and open-source technologies. Observability across the full technology stack gives teams comprehensive, real-time insight into the behavior, performance, and health of applications and their underlying infrastructure.

DevOps

DevOps Innovation Infrastructure Cloud

Best practices and key metrics for improving mobile app performance

Dynatrace

DECEMBER 13, 2023

User demographics , such as app version, operating system, location, and device type, can help tailor an app to better meet users’ needs and preferences. By monitoring metrics such as error rates, response times, and network latency, developers can identify trends and potential issues, so they don’t become critical. Issue remediation.

Best Practices

Best Practices Mobile Metrics Performance

Site reliability engineering: 5 things to you need to know

Dynatrace

FEBRUARY 4, 2021

Site reliability engineering (SRE) is the practice of applying software engineering principles to operations and infrastructure processes to help organizations create highly reliable and scalable software systems. SRE applies DevOps principles to developing systems and software that help increase site reliability and performance.

Engineering

Engineering DevOps Government Latency

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

The Netflix TechBlog

SEPTEMBER 10, 2024

Sample system diagram for an Alexa voice command. The other main use case was RENO, the Rapid Event Notification System mentioned above. Rewriting always comes with a risk, and it’s never the first solution we reach for, particularly when working with a system that’s in place and working well.

Latency

Latency Cache Tuning Efficiency

Service level objectives: 5 SLOs to get started

Dynatrace

JUNE 1, 2023

It represents the percentage of time a system or service is expected to be accessible and functioning correctly. Response time Response time refers to the total time it takes for a system to process a request or complete an operation. Note : you might hear the term latency used instead of response time.

Latency

Latency Website Traffic Virtualization

Site reliability done right: 5 SRE best practices that deliver on business objectives

Dynatrace

MAY 31, 2023

How site reliability engineering affects organizations’ bottom line SRE applies the disciplines of software engineering to infrastructure management, both on-premises and in the cloud. There are now many more applications, tools, and infrastructure variables that impact an application’s performance and availability.

Best Practices

Best Practices DevOps Latency Metrics

How Park ‘N Fly eliminated silos and improved customer experience with Dynatrace cloud monitoring

Dynatrace

APRIL 7, 2021

But your infrastructure teams don’t see any issue on their AWS or Azure monitoring tools, your platform team doesn’t see anything too concerning in Kubernetes logging, and your apps team says there are green lights across the board. This scenario has become all too common as digital infrastructure has grown increasingly complex.

Cloud

Cloud Monitoring Latency Games

Dynatrace accelerates business transformation with new AI observability solution

Dynatrace

JANUARY 31, 2024

GenAI is prone to erratic behavior due to unforeseen data scenarios or underlying system issues. Data dependencies and framework intricacies require observing the lifecycle of an AI-powered application end to end, from infrastructure and model performance to semantic caches and workflow orchestration.

Cache

Cache Azure Infrastructure Monitoring

Optimize your environment: Unveiling Dynatrace Hyper-V extension for enhanced performance and efficient troubleshooting

Dynatrace

OCTOBER 23, 2023

Microsoft Hyper-V is a virtualization platform that manages virtual machines (VMs) on Windows-based systems. It enables multiple operating systems to run simultaneously on the same physical hardware and integrates closely with Windows-hosted services. This leads to a more efficient and streamlined experience for users.

Efficiency

Efficiency Virtualization Hardware Performance

What are quality gates? How to use quality gates to deliver better software at speed and scale

Dynatrace

FEBRUARY 21, 2024

To remain competitive in today’s fast-paced market, organizations must not only ensure that their digital infrastructure is functioning optimally but also that software deployments and updates are delivered rapidly and consistently. In this example, unlike latency, the remaining three signals did not receive a “pass.”

Speed

Speed Software Software Latency

What is AWS Lambda?

Dynatrace

APRIL 5, 2021

Organizations can offload much of the burden of managing app infrastructure and transition many functions to the cloud by going serverless with the help of Lambda. You will likely need to write code to integrate systems and handle complex tasks or incoming network requests. AWS continues to improve how it handles latency issues.

Lambda

Lambda AWS Serverless Hardware

Dynatrace Managed turnkey Premium High Availability for globally distributed data centers (Early Adopter)

Dynatrace

JUNE 25, 2020

The network latency between cluster nodes should be around 10 ms or less. For Premium HA, this has been extended from 10 ms latency (in the same network region) to around 100 ms network latency due to asynchronous data replication between regions. In the image below, three downed nodes make an entire cluster unavailable.

Availability

Availability Hardware Latency Traffic

Predictive CPU isolation of containers at Netflix

The Netflix TechBlog

JUNE 4, 2019

Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.

Cache

Cache Latency Airlines Logistics

Designing Instagram

High Scalability

JANUARY 11, 2022

The streaming data store makes the system extensible to support other use-cases (e.g. FUN FACT : In this talk , Rodrigo Schmidt, director of engineering at Instagram talks about the different challenges they have faced in scaling the data infrastructure at Instagram. System Components. Fetching User Feed. Optimization.

Design

Design Media Storage Logistics

What is ITOps? Why IT operations is more crucial than ever in a multicloud world

Dynatrace

DECEMBER 15, 2022

This transition to public, private, and hybrid cloud is driving organizations to automate and virtualize IT operations to lower costs and optimize cloud processes and systems. ITOps is an IT discipline involving actions and decisions made by the operations team responsible for an organization’s IT infrastructure.

Artificial Intelligence

Artificial Intelligence DevOps Hardware Virtualization

What is serverless computing? Driving efficiency without sacrificing observability

Dynatrace

JANUARY 26, 2021

Traditional computing models rely on virtual or physical machines, where each instance includes a complete operating system, CPU cycles, and memory. There is no need to plan for extra resources, update operating systems, or install frameworks. The provider is essentially your system administrator. What is serverless computing?

Serverless

Serverless Efficiency Lambda Azure

Netflix’s Distributed Counter Abstraction

Comparing Approaches to Durability in Low Latency Messaging Queues

Trending Sources

Next-level interaction and customization of data visualizations in Dynatrace Dashboards and Notebooks

Cloud infrastructure monitoring in action: Dynatrace on Dynatrace

Cut costs and complexity: 5 strategies for reducing tool sprawl with Dynatrace

Who will watch the watchers? Extended infrastructure observability for WSO2 API Manager

RabbitMQ vs. Kafka: Key Differences

Building Netflix’s Distributed Tracing Infrastructure

Build systems more reliably with Dynatrace: Chaos Engineering

Foundation Model for Personalized Recommendation

Supporting Diverse ML Systems at Netflix

Solve hybrid Kubernetes performance and reliability problems with unified observability

Title Launch Observability at Netflix Scale

Build automated self-healing systems with xMatters and Dynatrace (Part 2 of 3)

Noisy Neighbor Detection with eBPF

Optimize Citrix platform performance and user experience with Dynatrace (GA)

Introducing Netflix’s Key-Value Data Abstraction Layer

Best Practices for Scaling RabbitMQ

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

Implementing service-level objectives to improve software quality

Optimizing your Kubernetes clusters without breaking the bank

Migrating Netflix to GraphQL Safely

What is observability? Not just logs, metrics and traces

Introducing Netflix TimeSeries Data Abstraction Layer

Site reliability engineering: 5 things you need to know

Why applying chaos engineering to data-intensive applications matters

The Netflix Cosmos Platform

What is full stack observability?

Best practices and key metrics for improving mobile app performance

Site reliability engineering: 5 things to you need to know

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

Service level objectives: 5 SLOs to get started

Site reliability done right: 5 SRE best practices that deliver on business objectives

How Park ‘N Fly eliminated silos and improved customer experience with Dynatrace cloud monitoring

Dynatrace accelerates business transformation with new AI observability solution

Optimize your environment: Unveiling Dynatrace Hyper-V extension for enhanced performance and efficient troubleshooting

What are quality gates? How to use quality gates to deliver better software at speed and scale

What is AWS Lambda?

Dynatrace Managed turnkey Premium High Availability for globally distributed data centers (Early Adopter)

Predictive CPU isolation of containers at Netflix

Designing Instagram

What is ITOps? Why IT operations is more crucial than ever in a multicloud world

What is serverless computing? Driving efficiency without sacrificing observability

Stay Connected