By Rajiv Shringi, Oleksii Tkachuk, and Kartik Sathyanarayanan. In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.
This article outlines the key differences in architecture, performance, and use cases to help determine the best fit for your workload: RabbitMQ follows a message broker model with advanced routing, while Kafka’s event streaming architecture uses partitioned logs for distributed processing.
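As a rough illustration of the two models, here is a minimal sketch (hypothetical hosts, exchange, and topic names) that publishes the same event both ways, using the pika and kafka-python client libraries:

```python
# RabbitMQ: the broker routes each message through an exchange to queues.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.exchange_declare(exchange="orders", exchange_type="topic")
# The broker decides which queues receive this, based on queue bindings.
channel.basic_publish(exchange="orders", routing_key="order.created",
                      body=b'{"order_id": 42}')
conn.close()

# Kafka: the producer appends to a partitioned log; consumers track offsets.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Records with the same key land in the same partition, preserving order.
producer.send("orders", key=b"42", value=b'{"order_id": 42}')
producer.flush()
```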
“Latency” is the duration between the execution of a load instruction (to an address that misses in all the caches) and the completion of that load instruction when the data is returned from memory. The example below is for a 2005-era processor with 60 ns memory latency and 6.4 GB/s of peak memory bandwidth.
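The point of pairing those two figures is Little’s Law: the latency-bandwidth product tells you how much data must be in flight to keep the memory bus busy. A minimal worked computation, assuming the 6.4 figure is GB/s of peak bandwidth and 64-byte cache lines:

```python
# Little's Law applied to memory: concurrency = latency x bandwidth.
# Assumed figures: 60 ns latency, 6.4 GB/s peak bandwidth, 64-byte lines.
latency_s = 60e-9
bandwidth_bytes_per_s = 6.4e9
cache_line_bytes = 64

in_flight_bytes = latency_s * bandwidth_bytes_per_s   # 384 bytes
in_flight_lines = in_flight_bytes / cache_line_bytes  # 6 cache lines

print(f"{in_flight_bytes:.0f} bytes ({in_flight_lines:.1f} cache lines) "
      "must be in flight to saturate memory bandwidth")
```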
Architecture overview: the first pivotal step in managing impressions begins with the creation of a Source-of-Truth (SOT) dataset. This dual availability ensures immediate processing capabilities alongside comprehensive long-term data retention. Thus, all data in one region is processed by the Flink job deployed within that region.
Implementing clustering and quorum queues in RabbitMQ significantly improves load distribution and data redundancy, ensuring high availability and fault tolerance for messaging services. This decoupling is crucial in modern architectures where scalability and fault tolerance are paramount.
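For illustration, a minimal pika sketch (hypothetical queue name) of declaring a quorum queue; RabbitMQ replicates quorum queues across cluster nodes via Raft, and they must be declared durable:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
# 'x-queue-type: quorum' asks RabbitMQ to replicate this queue across
# cluster nodes using Raft; durable=True is required for quorum queues.
channel.queue_declare(
    queue="payments",
    durable=True,
    arguments={"x-queue-type": "quorum"},
)
conn.close()
```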
This scenario underscored the need for a new recommender system architecture where member preference learning is centralized, enhancing accessibility and utility across different models. Yet many models are confined to a brief temporal window due to constraints in serving latency or training costs.
When undertaking system migrations, one of the main challenges is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. It provides a good read on the availability and latency ranges under different production conditions.
As more organizations embrace microservices-based architecture to deliver goods and services digitally, maintaining customer satisfaction has become exponentially more challenging. Latency is the time that it takes a request to be served. Define SLOs for each service across dimensions such as latency, availability, and reliability.
The service should be able to serve real-time, aka UI, applications, so CRUD and search operations must be achieved with low latency; it will be used by a lot of internal UI applications. All data should also be available for offline analytics in Hive/Iceberg.
The new Amazon capability enables customers to improve the startup latency of their functions from several seconds to as low as sub-second (up to 10 times faster) at P99 (the 99th latency percentile). Cold starts can cause latency outliers and may lead to a poor end-user experience for latency-sensitive applications.
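To make the P99 terminology concrete, here is a small sketch with synthetic, made-up latency data showing how a slow cold-start tail dominates the 99th percentile:

```python
# What "P99" means: 99% of requests complete at or below this latency.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic latencies in ms: mostly fast, plus a slow cold-start tail.
latencies_ms = np.concatenate([rng.normal(120, 20, 9_900),
                               rng.normal(3_000, 500, 100)])

p50, p99 = np.percentile(latencies_ms, [50, 99])
print(f"P50 = {p50:.0f} ms, P99 = {p99:.0f} ms")
# A 10x improvement at P99 would bring the slow tail near sub-second.
```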
Central to this infrastructure is our use of multiple online distributed databases such as Apache Cassandra, a NoSQL database known for its high availability and scalability. Data model: at its core, the KV abstraction is built around a two-level map architecture.
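As a rough sketch of what a two-level map offers (not Netflix’s actual API; the sortedcontainers dependency and the key formats are assumptions), the first level selects a record and the second level holds sorted items that support range reads:

```python
# Sketch of a two-level map: record id -> (sorted item key -> value).
from sortedcontainers import SortedDict  # assumed dependency

store: dict[str, SortedDict] = {}

def put(record_id: str, item_key: str, value: bytes) -> None:
    store.setdefault(record_id, SortedDict())[item_key] = value

def get_range(record_id: str, start: str, end: str):
    """Items for one record whose keys fall in [start, end)."""
    items = store.get(record_id, SortedDict())
    return [(k, items[k])
            for k in items.irange(start, end, inclusive=(True, False))]

put("user:42", "2024-01-01#click", b"...")
put("user:42", "2024-01-02#play", b"...")
print(get_range("user:42", "2024-01-01", "2024-01-03"))
```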
Every organization’s goal is to keep its systems available and resilient to support business demands. Example 1: architecture boundaries. This view shows the availability SLO for key application functions, like login and vehicle list, across a set of timeframes, like last 30 minutes, last hour, today, and last six days.
Stream processing One approach to such a challenging scenario is stream processing, a computing paradigm and software architectural style for data-intensive software systems that emerged to cope with requirements for near real-time processing of massive amounts of data. This significantly increases event latency.
However, setting the right parameters for Kubernetes clusters to ensure application availability, performance, and resilience while avoiding overspending isn’t a walk in the park. The following figure shows the high-level architecture, where any load testing solution can be used to drive traffic while validating targets such as latency (e.g., below 500 ms) and error rates (e.g., lower than 2%).
Compare latency: ScaleGrid offers lower latency compared to DigitalOcean for PostgreSQL. Now, let’s take a look at the throughput and latency performance of our comparison. Next, we are going to test and compare the latency performance between ScaleGrid and DigitalOcean for PostgreSQL, reporting latency averages in milliseconds.
As a discipline, SRE focuses on improving software system reliability across key categories including availability, performance, latency, efficiency, capacity, and incident response. Collaboration between developers, operations, and product owners enables site reliability engineers to define and meet uptime and availability targets.
Trace your application: imagine a microservices architecture with hundreds of dependencies. Without distributed tracing, pinpointing the cause of increased latency could take hours or even days. Try it out yourself: the capabilities highlighted in this blog post will be available in Dynatrace SaaS environments in the coming weeks.
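A minimal distributed-tracing sketch using the OpenTelemetry Python SDK (console exporter for demonstration; the span names are invented): nested spans record where time is spent along a request path:

```python
# Nested spans expose which dependency contributes the latency.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (SimpleSpanProcessor,
                                            ConsoleSpanExporter)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("checkout-demo")

with tracer.start_as_current_span("checkout"):            # parent span
    with tracer.start_as_current_span("inventory-call"):  # child span
        pass  # downstream dependency; each span records its own duration
```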
Within this paradigm, it is possible to run entire architectures without touching a traditional virtual server, either locally or in the cloud. In a serverless architecture, applications are distributed to meet demand and scale requirements efficiently. Every time the trigger executes, the function runs on an available resource.
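For illustration, a minimal AWS Lambda-style Python handler (the event shape here is hypothetical); the platform invokes it once per trigger and runs it on whatever resource is available:

```python
import json

def handler(event, context):
    # Invoked by the platform on each trigger; no server to manage.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```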
The original assumptions and architectural choices were no longer viable. Overview: the figure below depicts a simplified high-level architecture of a single Titus cluster. We started seeing increased response latencies and leader servers running at dangerously high utilization.
Keeping pace with modern digital transformation requires ensuring that applications are responsive, resilient, and always available amid increased complexity. Microservices-based architectures and software containers enable organizations to deploy and modify applications with unprecedented speed.
Now, customers can use streamed responses to build more responsive applications by sending partial responses to clients as the response becomes available. Customers can use AWS Lambda Response Streaming to improve performance for latency-sensitive applications and return larger payload sizes.
Because of its scalability and distributed architecture, thousands of companies trust it to run their cloud and hybrid-based workloads at high availability without compromising performance. Below is an example Dynatrace problem card, which shows how a spike in Cassandra write latency impacts your application.
Motivation: with the rapid growth in Netflix member base and the increasing complexity of our systems, our architecture has evolved into an asynchronous one that enables both online and offline computation. Architecture: as shown in the diagram above, the RENO service can be broken down into the following components.
Reduced tail latencies: in both our GRPC and DGS Framework services, GC pauses are a significant source of tail latencies. We considered the overhead an acceptable trade-off, as avoiding pauses provided benefits that would outweigh it; in fact, we’ve found for our services and architecture that there is no such trade-off.
by Jason Koch, with Martin Spier, Brendan Gregg, and Ed Hunter. Improving the tools available to our engineers to help them diagnose, triage, and work through software performance challenges in the cloud is a key goal for the cloud performance engineering team at Netflix.
But we cannot search or serve low-latency retrievals from files, etc. Marken architecture: Marken’s architecture diagram is as follows. Using memcache allows us to keep latencies for our search low (most of our queries are less than 100 ms).
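A cache-aside sketch of the memcache pattern described, using pymemcache (the search backend is a stand-in, and the key-hashing scheme is an assumption since memcached keys cannot contain spaces):

```python
import hashlib
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def run_expensive_search(query: str) -> bytes:
    return b"..."  # stand-in for the real index/database lookup

def search(query: str) -> bytes:
    key = "search:" + hashlib.sha1(query.encode()).hexdigest()
    result = cache.get(key)          # fast path: sub-millisecond on a hit
    if result is None:
        result = run_expensive_search(query)
        cache.set(key, result, expire=300)  # keep results warm for 5 min
    return result
```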
Organizations are rapidly adopting multicloud architectures to achieve the agility needed to drive customer success through new digital service channels. For example, if there is a latency on a particular service, Dynatrace will flag this and trace its source – even if the source is a third party.
By Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, and Joey Lynch. As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.
Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.
Retrieval-augmented generation emerges as the standard architecture for LLM-based applications Given that LLMs can generate factually incorrect or nonsensical responses, retrieval-augmented generation (RAG) has emerged as an industry standard for building GenAI applications.
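A toy sketch of the RAG flow (the embeddings are random stand-ins and the LLM call is elided, so this shows only the retrieve-then-ground shape, not a real pipeline):

```python
import numpy as np

docs = ["Kafka uses partitioned logs.", "RabbitMQ routes via exchanges."]
doc_vecs = np.random.default_rng(0).normal(size=(len(docs), 8))  # fake embeddings

def embed(text: str) -> np.ndarray:
    return np.random.default_rng(len(text)).normal(size=8)  # fake embedding

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

query = "How does Kafka store events?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# The prompt then goes to the LLM; grounding reduces fabricated answers.
```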
It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The subsystems all communicate with each other asynchronously via Timestone, a high-scale, low-latency priority queuing system.
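This is not Timestone itself, but the priority-queuing semantics described can be sketched with Python’s heapq (priorities and task names are invented; a counter breaks ties so equal priorities stay FIFO):

```python
import heapq
import itertools

counter = itertools.count()  # tie-breaker keeps FIFO order within a priority
queue: list[tuple[int, int, str]] = []

def enqueue(priority: int, task: str) -> None:
    heapq.heappush(queue, (priority, next(counter), task))

def dequeue() -> str:
    return heapq.heappop(queue)[2]

enqueue(2, "batch render")
enqueue(0, "interactive preview")  # a human is waiting: highest priority
enqueue(1, "nightly transcode")
print([dequeue() for _ in range(3)])  # preview, transcode, then render
```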
Because Google offers its own Google Cloud Architecture Framework and Microsoft its Azure Well-Architected Framework, organizations that use a combination of these platforms face a tripled challenge in integrating their performance frameworks into a cohesive strategy. SRG validates the status of the resiliency SLOs for the experiment period.
The Clouds app provides a view of all available cloud-native services. Logs in context, along with other details, are instantly available after selecting a resource. The reasons are easy to find, looking at the latest improvements that went live along with the general availability of the Logs app.
As dynamic systems architectures increase in complexity and scale, IT teams face mounting pressure to track and respond to conditions and issues across their multi-cloud environments. As teams begin collecting and working with observability data, they are also realizing its benefits to the business, not just IT.
Organizations are depending more and more on distributed architectures to provide application services. For example, when monitoring a database, you’ll want to know about any latency when writing data to a disk or average query response time. DevOps practitioners struggle to maintain highly available and scalable applications.
We tried a few iterations of what this new service should look like, and eventually settled on a modern architecture that aimed to give more control of the API experience to the client teams. For us, it means that we now need to have ~15 MDN tabs open when writing routes :) Let’s briefly discuss the architecture of this microservice.
High-level architecture: the idea, at a high level, was to avoid the need to query the Atlas database almost entirely and transition most alert queries to streaming evaluation. First and foremost, we have successfully alleviated our initial scalability problem with the polling-based architecture.
Lambda’s highly efficient, on-demand computing environment aligns with today’s microservices-centric architectures, and readily integrates with other popular AWS offerings that an organization may already be using. AWS continues to improve how it handles latency issues. How do AWS Lambda functions impact monitoring?
Today, I want to explore the Amazon ECS architecture and what this architecture enables. A cluster is just a pool of compute resources available to a customer’s applications. The agent is written in Go, has a minimal footprint, and is available on GitHub under an Apache license.
High-level playback architecture with priority throttling and chaos testing. Building a request taxonomy: we decided to focus on three dimensions in order to categorize request traffic: throughput, functionality, and criticality. The computation is done as a first step so that it is available for the rest of the request lifecycle.
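A hedged sketch of how such a taxonomy might drive throttling (the fields, endpoint names, and thresholds are hypothetical, not Netflix’s actual rules):

```python
from dataclasses import dataclass

@dataclass
class Request:
    endpoint: str         # functionality
    user_initiated: bool  # criticality: is a member actively waiting?

def priority(req: Request) -> int:
    """Lower value = more critical = shed last."""
    if req.user_initiated and req.endpoint == "playback/start":
        return 0  # critical: directly blocks playback
    if req.user_initiated:
        return 1  # degraded experience if dropped
    return 2      # background/prefetch traffic: shed first

def admit(req: Request, shed_level: int) -> bool:
    # shed_level rises with load; requests at or above it are throttled.
    return priority(req) < shed_level

print(admit(Request("playback/start", True), shed_level=1))   # True
print(admit(Request("prefetch/nextup", False), shed_level=1))  # False
```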
Since there were no existing solutions available, we needed to build them ourselves. To improve availability, we designed systems where components could fail separately and avoid single points of failure. In this architecture, service to service communication no longer goes through the single point of failure of a load balancer.
As organizations adopt microservices-based architecture, service-level objectives (SLOs) have become a vital way for teams to set specific, measurable targets that ensure users are receiving agreed-upon service levels. What are error budgets? If you promise 99.95% availability of a website over a year, your error budget is 0.05%.
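A worked computation of that error budget (the 99.95% SLO follows from the 0.05% figure above; the year length is a convention):

```python
slo = 0.9995
error_budget = 1 - slo                     # 0.0005, i.e. 0.05%

minutes_per_year = 365.25 * 24 * 60
allowed_downtime_min = minutes_per_year * error_budget

print(f"error budget: {error_budget:.2%}")
print(f"allowed downtime: {allowed_downtime_min:.0f} minutes/year "
      f"(~{allowed_downtime_min / 60:.1f} hours)")
```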
When using relational databases, traversing relationships requires expensive table JOIN operations, causing significantly increased latency as table size and query complexity grow. Enter graph databases: Titan has a pluggable storage architecture, using existing NoSQL databases as underlying storage for the graph data.
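A small illustration of why traversal scales better than repeated JOINs: with an adjacency structure, each hop is a constant-time lookup regardless of how large the overall dataset grows (the data is invented):

```python
friends = {
    "ana": ["bo", "cy"],
    "bo": ["cy", "dee"],
    "cy": ["dee"],
    "dee": [],
}

def friends_of_friends(person: str) -> set[str]:
    # SQL equivalent: a self-JOIN on the friendships table, re-scanning
    # indexes per hop; here each hop is a dictionary lookup.
    direct = set(friends.get(person, []))
    return {fof for f in direct for fof in friends.get(f, [])} - direct - {person}

print(friends_of_friends("ana"))  # {'dee'}
```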