Latency, Servers and Tuning - Technology Performance Pulse

Netflix’s Distributed Counter Abstraction

The Netflix TechBlog

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Latency

Latency Cache Infrastructure Strategy

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

These events are promptly relayed from the client side to our servers, entering a centralized event processing queue. Automating Performance Tuning with Autoscalers Tuning the performance of our Apache Flink jobs is currently a manual process. This queue ensures we are consistently capturing raw events from our global userbase.

Tuning

Tuning Latency Efficiency Storage

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Its partitioned log architecture supports both queuing and publish-subscribe models, allowing it to handle large-scale event processing with minimal latency. Kafka clusters can be deployed in Kubernetes using Helm charts to simplify scaling and management across multiple servers.

Latency

Latency Analytics Architecture Storage

Migrating Critical Traffic At Scale with No Downtime — Part 1

The Netflix TechBlog

MAY 4, 2023

Migrating Critical Traffic At Scale with No Downtime — Part 1. By Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. We will examine these alternatives in the upcoming sections.

Traffic

Traffic Latency Tuning Systems

Introducing Netflix’s Key-Value Data Abstraction Layer

The Netflix TechBlog

SEPTEMBER 18, 2024

These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. It also serves as central configuration of access patterns such as consistency or latency targets. Useful for keeping “n-newest” or prefix path deletion.

Latency

Latency Storage Cache Efficiency

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

Before GraphQL: Monolithic Falcor API implemented and maintained by the API Team Before moving to GraphQL, our API layer consisted of a monolithic server built with Falcor. A single API team maintained both the Java implementation of the Falcor framework and the API Server. To launch Phase 1 safely, we used AB Testing.

Traffic

Traffic Latency Metrics Cache

Best Practice for Creating Indexes on your MySQL Tables

Scalegrid

NOVEMBER 20, 2019

95th Percentile Latency. The 95th percentile latency of queries was also 1.8 times higher when the index creation happened on the master server. Workload Throughput (Queries Per Second).
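For readers unfamiliar with the metric named in this excerpt, here is a minimal sketch of how a 95th percentile latency can be computed from raw query timings. It uses the nearest-rank method, which is one common definition and not necessarily the one used in the Scalegrid benchmark; the function name and the sample values are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples (nearest-rank method)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical query latencies in milliseconds, for illustration only.
    latencies_ms = [12, 15, 11, 240, 14, 13, 16, 18, 12, 17]
    print("p95 latency (ms):", percentile_nearest_rank(latencies_ms, 95))
```

A tail percentile like this is sensitive to a few slow queries, which is why it is often reported alongside throughput when comparing configurations.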

Best Practices

Best Practices Latency Tuning Database

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

By Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data, often reaching petabytes, with millisecond access latency has become increasingly vital.

Latency

Latency Storage Traffic Tuning

Bending pause times to your will with Generational ZGC

The Netflix TechBlog

MARCH 5, 2024

Reduced tail latencies In both our GRPC and DGS Framework services, GC pauses are a significant source of tail latencies. That’s particularly true of our GRPC clients and servers, where request cancellations due to timeouts interact with reliability features such as retries, hedging and fallbacks.

Latency

Latency Java Tuning Efficiency

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

The Netflix TechBlog

SEPTEMBER 10, 2024

By Karthik Yagna, Baskar Odayarkoil, and Alex Ellis. Pushy is Netflix’s WebSocket server that maintains persistent WebSocket connections with devices running the Netflix application. In our case, we value low latency: the faster we can read from KeyValue, the faster these messages can get delivered.

Latency

Latency Cache Tuning Efficiency

How Dynatrace boosts production resilience with Site Reliability Guardian

Dynatrace

MAY 17, 2023

In this case, the four golden signals (latency, traffic, errors, and saturation) are derived from span attributes and DQL metric queries via Dynatrace Grail™. Based on those insights, they implemented automated validation tasks, and shifted left in their software delivery pipeline.

DevOps

DevOps Traffic Latency Best Practices

PostgreSQL Connection Pooling: Part 1 – Pros & Cons

Scalegrid

OCTOBER 17, 2019

Using a connection pool in each module is hardly efficient: Even with a relatively small number of modules, and a small pool size in each, you end up with a lot of server processes. You either need an extra server (or 3), or your database server(s) must have enough resources to support a connection pooler, in addition to PostgreSQL.
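The arithmetic behind this excerpt can be sketched briefly. PostgreSQL serves each client connection with its own server process, so per-module pools multiply quickly. The function name and the example numbers below are hypothetical and only illustrate the scaling, assuming every pool is filled to its configured size.

```python
def total_backend_processes(modules: int, pool_size: int) -> int:
    """Worst-case PostgreSQL server processes when every module keeps its own pool.

    Assumes each pooled connection maps to one server process and that
    every pool is full.
    """
    if modules < 0 or pool_size < 0:
        raise ValueError("modules and pool_size must be non-negative")
    return modules * pool_size


if __name__ == "__main__":
    # Hypothetical deployment: 12 modules, each with a pool of 10 connections.
    per_module = total_backend_processes(modules=12, pool_size=10)
    print("server processes with per-module pools:", per_module)
```

A shared external pooler instead caps the number of server processes at its own configured limit, regardless of how many modules connect to it, at the cost of running the extra component the excerpt mentions.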

Architecture

Architecture Database Latency Servers

Tuning SQL Server Reporting Services

SQL Performance

SEPTEMBER 17, 2019

Many database administrators find themselves having to support instances of SQL Server Reporting Services (SSRS), or at least the backend databases that are required for SSRS. This article will cover many areas that database administrators need to be aware of in order to properly license, recover, and tune a Reporting Services installation.

Tuning

Tuning Servers Database Best Practices

Build and operate multicloud FaaS with enhanced, intelligent end-to-end observability

Dynatrace

APRIL 25, 2023

However, serverless applications have unique characteristics that make observability more difficult than in traditional server-based applications. Serverless applications have several benefits over server-based applications: Eliminate the need to provision, manage and maintain servers or containers.

Serverless

Serverless Lambda Azure AWS

What is serverless computing? Driving efficiency without sacrificing observability

Dynatrace

JANUARY 26, 2021

Within this paradigm, it is possible to run entire architectures without touching a traditional virtual server, either locally or in the cloud. When an application is triggered, it can cause latency as the application starts. Unlike on-premises machines, shared servers, or rented virtual machines, there is no cost for downtime.

Serverless

Serverless Efficiency Lambda AWS

The Netflix Cosmos Platform

The Netflix TechBlog

MARCH 1, 2021

It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The subsystems all communicate with each other asynchronously via Timestone, a high-scale, low-latency priority queuing system. Warm capacity.

Serverless

Serverless Media Latency Social Media

Applying Netflix DevOps Patterns to Windows

The Netflix TechBlog

AUGUST 22, 2019

The canary stage will determine a score based on metrics such as CPU, threads, latency, and GC pauses. Running a canary for each change and testing the AMI in production allows us to capture insights around impact on Windows updates, script changes, tuning web server configuration, among others.

DevOps

DevOps AWS Tuning Infrastructure

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

To this end, we developed a Rapid Event Notification System (RENO) to support use cases that require server initiated communication with devices in a scalable and extensible manner. This separation allows us to tune system configuration and scaling policies independently for different event priorities and traffic patterns.

Systems

Systems Traffic Architecture Mobile

The Most Important MySQL Setting

Percona

APRIL 7, 2023

If we were to select the most important MySQL setting, if we were given a freshly installed MySQL or Percona Server for MySQL and could only tune a single MySQL variable, which one would it be? Sysbench ran on a third server, which I’ll refer to as the application server (APP).

Tuning

Tuning Cache Servers Benchmarking

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Dynatrace

MAY 17, 2023

Think about items such as general system metrics (for example, CPU utilization, free memory, number of services), the connectivity status, details of our web server, or even more granular in-application tasks like database queries. Let’s click “Apache Web Server apache” now.

Metrics

Metrics Database Monitoring Network

How digital experience monitoring helps deliver business observability

Dynatrace

APRIL 26, 2022

STM generates traffic that replicates the typical path or behavior of a user on a network to measure performance (for example, response times, availability, packet loss, latency, jitter, and other variables). Endpoints can be physical (i.e., PC, smartphone, server) or virtual (virtual machines, cloud gateways).

Monitoring

Monitoring Social Media IoT Metrics

Discord Scales to 1 Million+ Online MidJourney Users in a Single Server

InfoQ

JANUARY 26, 2024

Discord optimized its platform to serve over one million online users in a single server while maintaining a responsive user experience. By Rafal Gancarz

Servers

Servers Tuning Scalability Performance

DevOps automation: From event-driven automation to answer-driven automation [with causal AI]

Dynatrace

JULY 24, 2023

It enables them to adapt to user feedback swiftly, fine-tune feature releases, and deliver exceptional user experiences, all while maintaining control and minimizing disruption. Using advanced causal AI and context-aware decision-making, it identifies the root cause behind server failures.

DevOps

DevOps Traffic Efficiency Servers

Crucial Redis Monitoring Metrics You Must Watch

Scalegrid

JANUARY 25, 2024

You will need to know which monitoring metrics for Redis to watch and a tool to monitor these critical server metrics to ensure its health. Understanding Redis Performance Indicators Redis is designed to handle high traffic and low latency with its in-memory data store and efficient data structures.

Metrics

Metrics Monitoring Latency Cache

Migrating Critical Traffic At Scale with No Downtime — Part 2

The Netflix TechBlog

JUNE 13, 2023

By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements. They enable us to further fine-tune and configure the system, ensuring the new changes are integrated smoothly and seamlessly.

Traffic

Traffic Metrics Systems Strategy

InnoDB Performance Optimization Basics

Percona

MARCH 23, 2023

Hardware Memory The amount of RAM to be provisioned for database servers can vary greatly depending on the size of the database and the specific requirements of the company. Some servers may need a few GBs of RAM, while others may need hundreds of GBs or even terabytes of RAM. If you see concurrency issues, you can tune this variable.

Performance

Performance Hardware Tuning Storage

It’s All About Replication Lag in PostgreSQL

Percona

APRIL 13, 2023

In PostgreSQL, replication lag can occur due to various reasons such as network latency, slow disk I/O, long-running transactions, etc. Replication lag can occur due to various reasons, such as: Network latency: Network latency is the delay caused by the time it takes for data to travel between the primary and standby databases.

Latency

Latency Tuning Open Source Network

Zero Configuration Service Mesh with On-Demand Cluster Discovery

The Netflix TechBlog

AUGUST 29, 2023

IPC clients are instantiated targeting that VIP or SVIP, and the Eureka client code handles the translation of that VIP to a set of IP and port pairs by fetching them from the Eureka server. There is a downside to fetching this data on-demand: this adds latency to the first request to a cluster.

Traffic

Traffic Latency Cloud C++

KeyCDN Launches New POP in Mexico

KeyCDN

NOVEMBER 4, 2021

The POP is strategically located within the country and lowers latency overall. KeyCDN is always on the lookout for ways to minimize latency and accelerate asset delivery worldwide. For more POPs planned, check our current network for a list of both active and planned edge server locations. Hola Mexico!

Latency

Latency Cache Tuning Traffic

KeyCDN Launches New POPs in 2021

KeyCDN

MARCH 10, 2021

With Tel Aviv being the technology capital of Israel, it's the ideal edge server location. The image below shows a significant drop in latency once we've launched the new point of presence in Israel. In fact, latency has been reduced by almost 50%! Performance report Brisbane - Australia Brisbane is our 4th POP in Australia.

Latency

Latency Internet Internet Speed

Meet Hydrogen: A React Framework For Dynamic, Contextual And Personalized E-Commerce

Smashing Magazine

NOVEMBER 8, 2021

As developers, we rightfully obsess about the customer experience, relentlessly working to squeeze every millisecond out of the critical rendering path, optimize input latency, and eliminate jank. Hydrogen fuels dynamic commerce by uniting React Server Components, streaming server-side rendering, and smart caching controls.

Cache

Cache Best Practices Strategy Servers

Save Money in AWS RDS: Don’t Trust the Defaults

Percona

MAY 1, 2023

I’ll show you some MySQL settings to tune to get better performance, and cost savings, with AWS RDS. After some time of receiving these messages, eventually, they hit performance issues to the point that the server becomes unresponsive for a few minutes. This was exactly what was happening on this server.

AWS

AWS Hardware Storage Tuning

How To Scale a Single-Host PostgreSQL Database With Citus

Percona

NOVEMBER 3, 2023

PostgreSQL Cluster One coordinator node citus-coord-01 Three worker nodes citus1 citus2 citus3 Hardware AWS Instance Ubuntu Server 20.04, SSD volume type 64-bit (x86) c5.xlarge Steps Provisioning The first step is to provision the four nodes with both PostgreSQL and Citus. psql pgbench <<_eof1_ qecho adding node citus3.

Database

Database Benchmarking Latency C++

Most Common RabbitMQ Use Cases

Scalegrid

AUGUST 27, 2024

The software also extends capabilities allowing fine-tuning consumption parameters through QoS (Quality of Service) prefetch limits catered toward balancing load among numerous consumers, thus preventing overwhelming any single consumer entity. RabbitMQ ensures that this data exchange is smooth and uninterrupted across different platforms.

IoT

IoT Ecommerce Games Scalability

MySQL Performance Tuning 101: Key Tips to Improve MySQL Database Performance

Percona

SEPTEMBER 1, 2023

While there is no magic bullet for MySQL performance tuning, there are a few areas that can be focused on upfront that can dramatically improve the performance of your MySQL installation. What are the Benefits of MySQL Performance Tuning? A finely tuned database processes queries more efficiently, leading to swifter results.

Tuning

Tuning Database Performance Hardware

Formulating ‘Out of Memory Kill’ Prediction on the Netflix App as a Machine Learning Problem

The Netflix TechBlog

JULY 21, 2022

device characteristics come from our on-field knowledge and runtime memory data comes from real-time user data pushed to our servers. Stay tuned for further posts on memory management and the use of ML modeling to deal with systemic and low latency data collected at the device level.

Big Data

Big Data Cache Engineering Data Engineering

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications

All Things Distributed

OCTOBER 2, 2017

percent availability in the event of a server, a rack of servers, or an Availability Zone failure. DynamoDB automatically re-distributes your data to healthy servers to ensure there are always multiple replicas of your data without you needing to intervene.

Internet

Internet Internet AWS Performance

DBLog: A Generic Change-Data-Capture Framework

The Netflix TechBlog

DECEMBER 17, 2019

Passive instances across regions are also possible, though it is recommended to operate in the same region as the database host in order to keep the change capture latencies low. Stay Tuned DBLog has additional capabilities which are not covered by this blog post, such as: Ability to capture table schemas without using locks.

Database

Database Traffic Transportation Open Source

Growth Engineering at Netflix- Creating a Scalable Offers Platform

The Netflix TechBlog

FEBRUARY 9, 2021

The action corresponds to the button on the page and the withFields specify which fields the server expects to have sent back when the button is clicked. Stay tuned for more details on this, as well as more details on the internals of the new SKU Platform in one of our upcoming blog posts. Step 3 & 4 — Determine

Engineering

Engineering Scalability Architecture Innovation

KeyCDN Launches New POPs in 2023

KeyCDN

JANUARY 25, 2023

With a dedicated POP, latency for visitors is reduced even further, resulting in better loading times. This makes Dublin the ideal location for an edge server. How to check a POP location Each edge server adds the HTTP response header X-Edge-Location delivered by KeyCDN. Lima - Peru Lima is our 6th POP in Latin America.

Latency

Latency Cache Tuning Speed

HTTP/3 From A To Z: Core Concepts (Part 1)

Smashing Magazine

AUGUST 9, 2021

You’ve probably heard things like: “HTTP/3 is much faster than HTTP/2 when there is packet loss”, or “HTTP/3 connections have less latency and take less time to set up”, and probably “HTTP/3 can send data more quickly and can send more resources in parallel”. Websites would magically become 50% faster with the flip of a switch!

Transportation

Transportation Internet Internet Network

The Speed of Time

Brendan Gregg

SEPTEMBER 25, 2021

A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. This server is spending about a third of its CPU cycles just checking the time! I've shared many posts about superpower observability tools, but often humble hacking is just as effective. 30.14% in the middle of the flame graph.

Speed

Speed Java AWS Virtualization

HTTP/3: Practical Deployment Options (Part 3)

Smashing Magazine

SEPTEMBER 6, 2021

Next, we’ll look at how to set up servers and clients (that’s the hard part unless you’re using a content delivery network (CDN)). This difference by itself doesn’t do all that much (it mainly reduces the overhead on the server-side), but it leads to most of the following points. Server Sharding and Connection Coalescing.

Network

Network Servers Cache Traffic
