By Rajiv Shringi, Oleksii Tkachuk, and Kartik Sathyanarayanan. In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.
A significant feature of Chronicle Queue Enterprise is support for TCP replication across multiple servers to ensure the high availability of application infrastructure. Little’s Law explains why latency matters: in many cases, the assumption is that as long as throughput is high enough, latency won’t be a problem.
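Little’s Law makes that trade-off concrete: the average number of requests in flight equals arrival rate times average time in the system (L = λW). A minimal sketch, with illustrative numbers:

```python
# Little's Law: L = lambda * W
# L   = average number of requests in the system (in-flight concurrency)
# lam = arrival rate (requests/second)
# w   = average time each request spends in the system (seconds)

def concurrency(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Average in-flight requests implied by Little's Law."""
    return arrival_rate_rps * avg_latency_s

# At a fixed 10,000 req/s, cutting latency from 50 ms to 5 ms shrinks the
# number of requests held in the system (threads, buffers, queue slots) 10x.
print(concurrency(10_000, 0.050))  # 500.0
print(concurrency(10_000, 0.005))  # 50.0
```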
What is RTT? Round-trip time (RTT) is basically a measure of latency: how long did it take to get from one endpoint to another and back again? RTT isn’t a you-thing, it’s a them-thing. It gives fascinating insights into the network topography of our visitors, and how much we might be impacted by high-latency regions.
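One rough way to get an RTT number yourself is to time a TCP handshake, since a connect takes approximately one round trip. A minimal sketch (the host and port are placeholders):

```python
# Minimal RTT probe: time a TCP handshake to a remote endpoint.
import socket
import time

def tcp_rtt(host: str, port: int = 443) -> float:
    """Return seconds for one TCP connect, roughly one network round trip."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass  # connection established: SYN sent, SYN/ACK received
    return time.perf_counter() - start

print(f"RTT ~ {tcp_rtt('example.com') * 1000:.1f} ms")
```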
Timestone: Netflix’s High-Throughput, Low-Latency Priority Queueing System with Built-in Support for Non-Parallelizable Workloads. By Kostas Christidis. Timestone is a high-throughput, low-latency priority queueing system we built in-house to support the needs of Cosmos, our media encoding platform.
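Timestone’s internals aren’t shown here, but the core ordering semantics of any priority queue can be sketched in a few lines (a generic illustration, not Timestone’s interface; task names are made up):

```python
import heapq

# Generic priority queue: lower number = higher priority.
queue = []
heapq.heappush(queue, (2, "transcode-chunk-042"))
heapq.heappush(queue, (1, "studio-rush-job"))
heapq.heappush(queue, (3, "batch-backfill"))

while queue:
    priority, task = heapq.heappop(queue)
    print(priority, task)  # dequeues in priority order: 1, 2, 3
```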
To this end, we developed a Rapid Event Notification System (RENO) to support use cases that require server-initiated communication with devices in a scalable and extensible manner. In this blog post, we will give an overview of the Rapid Event Notification System at Netflix and share some of the learnings we gained along the way.
New: identify hotspots with the honeycomb visualization. Honeycombs are great for visualizing health in complex and distributed systems, enabling you to visualize countless entities effectively and at scale. Now, let’s introduce you to the two newest entries in our visualization catalog and tell you about the great things you can do with them.
The Multicore Era: over the past ~15 years, server processors from Intel and AMD have evolved from the early quad-core processors to the current monsters with over 50 cores per socket. Understanding sustained memory bandwidth in these systems starts with assuming 100% utilization and then reviewing the factors that get in the way.
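The natural starting point is the theoretical peak: channels × transfer rate × bytes per transfer. A back-of-the-envelope sketch with illustrative (not vendor-verified) numbers:

```python
# Theoretical peak DRAM bandwidth, before the factors that "get in the way".
# Hypothetical configuration: 8 channels of DDR5-4800 per socket.
transfers_per_sec = 4800e6   # 4800 MT/s
bytes_per_transfer = 8       # 64-bit channel
channels = 8

peak_gb_s = transfers_per_sec * bytes_per_transfer * channels / 1e9
print(f"Peak: {peak_gb_s:.1f} GB/s")  # 307.2 GB/s; sustained is always lower
```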
Message brokers enable applications, services, and systems to communicate by acting as intermediaries between senders and receivers. This decoupling simplifies system architecture and supports scalability in distributed environments.
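The decoupling is easiest to see in code: publishers address a topic, never a receiver. A minimal in-memory sketch of the idea (not any particular broker’s API):

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """Toy broker: routes published messages to topic subscribers."""

    def __init__(self):
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        # The sender knows only the topic name, not who is listening.
        for handler in self._subscribers[topic]:
            handler(message)

broker = Broker()
broker.subscribe("orders", lambda m: print("billing saw:", m))
broker.subscribe("orders", lambda m: print("shipping saw:", m))
broker.publish("orders", "order#123 created")
```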
Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are continually evolved and optimized to meet and exceed customer and product expectations. Observing them in production gives a good read on the availability and latency ranges under different production conditions.
It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile’s exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.
Before moving to GraphQL, our API layer consisted of a monolithic server built with Falcor. A single API team maintained both the Java implementation of the Falcor framework and the API server. To launch Phase 1 safely, we used AB testing.
These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. It also serves as a central configuration point for access patterns such as consistency or latency targets, and is useful for keeping the “n-newest” records or for prefix-path deletion.
In modern containerized environments, teams often deploy Kubernetes across mixed operating systems, creating a situation where both Linux and Windows nodes reside in the same cluster. Unified log collection eliminates the need for separate log management systems for each OS, streamlining your observability workflow.
When it comes to network performance, there are two main limiting factors that will slow you down: bandwidth and latency. Where bandwidth deals with capacity, latency is more about the speed of transfer.
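The two combine into total transfer time: roughly one round trip plus payload size divided by bandwidth. A small illustrative sketch showing why small payloads are latency-bound:

```python
# time = round_trip_latency + payload_bits / bandwidth
def transfer_time_s(size_bytes: float, bandwidth_bits_per_s: float, rtt_s: float) -> float:
    return rtt_s + (size_bytes * 8) / bandwidth_bits_per_s

# A 10 KB response over 100 Mbps with 80 ms RTT: ~0.0808 s total,
# of which 0.080 s is latency -- bandwidth barely matters here.
print(transfer_time_s(10_000, 100e6, 0.080))
```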
On Titus , our multi-tenant compute platform, a "noisy neighbor" refers to a container or system service that heavily utilizes the server's resources, causing performance degradation in adjacent containers. To emit a run queue latency metric, we leveraged three eBPF hooks: sched_wakeup, sched_wakeup_new, and sched_switch.
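The real measurement runs in the kernel as eBPF programs; purely as a conceptual sketch (plain Python, not eBPF), the bookkeeping those three hooks perform looks roughly like this:

```python
import time

# Record a timestamp when a task is woken (sched_wakeup / sched_wakeup_new),
# and emit the delta when it actually gets the CPU (sched_switch).
wakeup_ts: dict[int, float] = {}

def on_sched_wakeup(pid: int) -> None:
    wakeup_ts[pid] = time.monotonic()

def on_sched_switch(next_pid: int) -> None:
    start = wakeup_ts.pop(next_pid, None)
    if start is not None:
        runq_latency = time.monotonic() - start
        print(f"pid {next_pid} waited {runq_latency * 1e6:.0f} us on the run queue")

on_sched_wakeup(1234)
on_sched_switch(1234)
```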
Citrix is a sophisticated, efficient, and highly scalable application delivery platform that is itself composed of anywhere from hundreds to thousands of servers, delivering vital enterprise applications to thousands of users. Dynatrace Extension: database performance as experienced by the SAP ABAP server.
A critical component of this success was that the Dynatrace team itself uses the Dynatrace Platform to monitor every single Dynatrace cluster in the cloud, and trusts the Dynatrace Davis AI to alert in case there are any issues, whether with a new feature, a configuration change, or the infrastructure our servers are running on.
As the number of Titus users increased over the years, the load and pressure on the system increased substantially. Titus Job Coordinator is a leader-elected process managing the active state of the system. For example, a batch workflow orchestration system may create multiple jobs which are part of a single workflow execution.
During this time, you are also likely to experience degraded query performance, as your system resources are busy with index-creation work. The 95th-percentile latency of queries was 1.8 times higher when the index creation happened on the master server.
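Percentile latencies matter precisely because averages hide tail pain. A small sketch of the nearest-rank method over illustrative samples:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value >= pct% of samples."""
    ordered = sorted(samples)
    k = math.ceil(pct / 100 * len(ordered)) - 1
    return ordered[k]

latencies_ms = [12, 15, 11, 240, 14, 13, 16, 380, 12, 15]
print(percentile(latencies_ms, 50))  # 14  -- the median looks healthy
print(percentile(latencies_ms, 95))  # 380 -- the tail tells the real story
```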
In this example, “Reverse proxy” and “Front-end server” are clearly in the critical path: without them, the application won’t work. According to best practices from Google’s SRE handbook, there are “Four Golden Signals” we can convert into four SLOs for services: reliability, latency, availability, and saturation.
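As a hedged sketch of what that conversion might look like in configuration form (the names and thresholds below are illustrative, not from the article):

```python
# Illustrative SLO thresholds derived from the golden signals.
slos = {
    "latency":      {"p95_ms": 300},            # 95% of requests under 300 ms
    "availability": {"min_success_rate": 0.999},# at most 0.1% failed requests
    "reliability":  {"max_incidents_month": 1}, # sustained correct behavior
    "saturation":   {"max_cpu_util": 0.80},     # headroom before queueing begins
}

for signal, target in slos.items():
    print(f"{signal}: {target}")
```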
High latency or a lack of responses is often the first symptom you notice, and such spikes are clearly correlated with increased response latencies. API manager monitoring from the application-server perspective, which is what Dynatrace delivers with the WSO2 API Manager monitoring extension, can save you hours of bug-hunting time. Read on to see how it works.
By Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, and Joey Lynch. As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data, often reaching petabytes, with millisecond access latency has become increasingly vital.
By Karthik Yagna, Baskar Odayarkoil, and Alex Ellis. Pushy is Netflix’s WebSocket server that maintains persistent WebSocket connections with devices running the Netflix application. One sample use case is handling an Alexa voice command; the other main use case was RENO, the Rapid Event Notification System mentioned above.
What is AWS Lambda? The 2014 launch of AWS Lambda marked a milestone in how organizations use cloud services to deliver their applications more efficiently, by running functions at the edge of the cloud without the cost and operational overhead of on-premises servers. AWS continues to improve how it handles latency issues.
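For readers who haven’t used it, a Lambda function is just a handler the service invokes on demand; a minimal Python sketch (the event fields here are illustrative):

```python
import json

def lambda_handler(event, context):
    """Entry point Lambda invokes; no server to provision or manage."""
    name = event.get("name", "world")  # hypothetical input field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```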
As an engineer, you probably know that server performance under heavy load is crucial for maintaining the availability and responsiveness of your services. But what happens when traffic bursts overwhelm your system? Queueing requests is a common solution, but what's the best approach: FIFO or LIFO?
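The trade-off in miniature: the same buffer drained in two orders. Under sustained overload, FIFO keeps serving the oldest (often already-abandoned) requests, while LIFO serves the freshest ones, which are the ones a client is still waiting on. A small sketch:

```python
from collections import deque

# One buffer, two drain policies.
queue = deque(["req-1 (oldest)", "req-2", "req-3 (newest)"])

fifo_next = queue[0]    # popleft() order: req-1 is served first
lifo_next = queue[-1]   # pop() order: req-3 is served first
print("FIFO serves:", fifo_next)
print("LIFO serves:", lifo_next)
```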
Every organization’s goal is to keep its systems available and resilient to support business demands. Error budgets, the difference between a current state and the target, represent the maximum amount of time a system can fail per the contractual agreement without repercussions.
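In concrete terms, the budget is simply one minus the SLO target, spread over the measurement period; for example:

```python
# Error budget = 1 - SLO target, expressed as allowed downtime per period.
slo_target = 0.999                   # "three nines" availability
period_minutes = 30 * 24 * 60        # a 30-day month

budget_minutes = (1 - slo_target) * period_minutes
print(f"Allowed downtime: {budget_minutes:.1f} minutes/month")  # ~43.2 minutes
```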
It supports both high-throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The first generation of this system went live with the streaming launch in 2007.
The system will comprise several micro-services, each performing a separate task. The streaming data store makes the system extensible to support other use cases. When the server receives a request for an action (post, like, etc.)…
A distributed storage system is foundational in today’s data-driven landscape, ensuring data spread over multiple servers is reliable, accessible, and manageable. This guide delves into how these systems work, the challenges they solve, and their essential role in businesses and technology.
But what happens when this machinery needs a transformation? This is where large-scale system migrations come into play. By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements.
On modern Linux systems, the difference in overhead between forking a process and creating a thread is much smaller than it used to be. Using a connection pool in each module is hardly efficient: even with a relatively small number of modules, and a small pool size in each, you end up with a lot of server processes.
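One common remedy is a single shared pool that all modules borrow from, so the total connection count stays bounded regardless of module count. A minimal thread-safe sketch (the make_conn factory is a placeholder for a real connection constructor):

```python
import queue

class SharedPool:
    """One pool for all modules: total connections never exceed pool_size."""

    def __init__(self, make_conn, pool_size: int):
        self._free = queue.Queue()
        for _ in range(pool_size):
            self._free.put(make_conn())

    def acquire(self):
        return self._free.get()   # blocks instead of opening a new connection

    def release(self, conn) -> None:
        self._free.put(conn)

pool = SharedPool(make_conn=lambda: object(), pool_size=5)
conn = pool.acquire()
# ... run queries on conn ...
pool.release(conn)
```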
This transition to public, private, and hybrid cloud is driving organizations to automate and virtualize IT operations to lower costs and optimize cloud processes and systems. Besides the traditional system hardware, storage, routers, and software, ITOps also includes virtual components of the network and cloud infrastructure.
Microsoft Hyper-V is a virtualization platform that manages virtual machines (VMs) on Windows-based systems. It enables multiple operating systems to run simultaneously on the same physical hardware and integrates closely with Windows-hosted services. This leads to a more efficient and streamlined experience for users.
The network latency between cluster nodes should be around 10 ms or less. For Premium HA, this has been extended from 10 ms latency (in the same network region) to around 100 ms network latency, due to asynchronous data replication between regions. Three downed nodes can make an entire cluster unavailable.
What is serverless computing? Traditional computing models rely on virtual or physical machines, where each instance includes a complete operating system, CPU cycles, and memory. Within the serverless paradigm, it is possible to run entire architectures without touching a traditional virtual server, either locally or in the cloud, and to pay only per use.
Uptime Institute’s 2022 Outage Analysis report found that over 60% of system outages resulted in at least $100,000 in total losses, up from 39% in 2019. At the lowest level, SLIs provide a view of service availability, latency, performance, and capacity across systems. Make SLOs realistic.
The whole system was quite complex and was starting to become brittle. The API server orchestrates backend systems to authenticate the user. Upstream systems had to reopen the tokens to identify the user logging in and potentially manage multiple parallel identity data structures, which could easily get out of sync.
Too many concurrent server requests can lead to website crashes if you’re not equipped to deal with them, and this can result in a loss of revenue and reputation damage. You can free up space and reduce the load on your server by compressing and optimizing images, often using built-in apps on your operating system.
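If you’d rather script it, image libraries make batch recompression straightforward; a small sketch using Pillow (the images/ directory, output naming, and quality setting are all illustrative):

```python
from pathlib import Path
from PIL import Image

# Recompress every JPEG in images/ at quality 80, writing *_optimized copies.
for path in Path("images").glob("*.jpg"):
    with Image.open(path) as img:
        out = path.with_name(path.stem + "_optimized.jpg")
        img.save(out, "JPEG", quality=80, optimize=True)
```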
Organizations also need to consider AI observability: the use of artificial intelligence to capture the performance and cost details generated by various systems in an IT environment. An AI observability strategy, which monitors IT system performance and costs, may help organizations achieve that balance.
Achieving 100 Gbps intrusion prevention on a single server, Zhao et al. This stems from a combination of Jevons paradox and the interconnectedness of systems: doing more in one area often leads to a need for more elsewhere too. Today’s paper choice is a wonderful example of pushing the state of the art on a single server.
By using OpenLLMetry and Dynatrace, anyone can get complete visibility into their system, including gen-AI parts, with 5 minutes of work. Resource consumption: observing computational resource availability and saturation, whether deployed in cloud-native environments like Kubernetes or on CPU-enabled servers.
Observability is essential to ensure the reliability, security, and quality of any software system. However, serverless applications have unique characteristics that make observability more difficult than in traditional server-based applications, such as higher latency and cold-start issues due to the initialization time of the functions.
Observability is the ability to understand a system’s internal state by analyzing the data it generates, such as logs, metrics, and traces. Monitoring typically provides a limited view of system data focused on individual metrics, while logging provides additional data but is typically viewed in isolation from the broader system context.
Think about items such as general system metrics (for example, CPU utilization, free memory, number of services), the connectivity status, details of our web server, or even more granular in-application tasks like database queries. Let’s click “Apache Web Server apache” now.