By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.
CPU isolation and efficient system management are critical for any application that requires low-latency, high-performance computing. To achieve this level of performance, such systems require dedicated CPU cores that are free from interruption by other processes, together with wider system tuning.
Yet, many are confined to a brief temporal window due to constraints in serving latency or training costs. It facilitates the distribution of these learnings to other models, either through shared model weights for fine-tuning or directly through embeddings. However, as with LLMs, the quality of data often outweighs its sheer volume.
It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile’s exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.
RabbitMQ is designed for flexible routing and message reliability, while Kafka handles high-throughput event streaming and real-time data processing. RabbitMQ follows a message broker model with advanced routing, while Kafka’s event streaming architecture uses partitioned logs for distributed processing. What is Apache Kafka?
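To make the partitioned-log model concrete, here is a minimal Kafka producer sketch in Java. It assumes the kafka-clients library and a broker at localhost:9092; the topic and key are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class PartitionedLogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key always land in the same partition,
            // which is how Kafka scales throughput while preserving per-key order.
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        }
    }
}
```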
Future blogs will provide deeper dives into each service, sharing insights and lessons learned from this process. The Netflix video processing pipeline went live with the launch of our streaming service in 2007.
Stream processing: One approach to such a challenging scenario is stream processing, a computing paradigm and software architectural style for data-intensive software systems that emerged to cope with requirements for near real-time processing of massive amounts of data. This significantly increases event latency.
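As a rough illustration of the paradigm, here is a minimal Kafka Streams sketch that processes each record as it arrives rather than in periodic batches. The application id, topics, and broker address are placeholder assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class EventFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-filter");      // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-events");          // hypothetical topics
        raw.filter((key, value) -> value != null && !value.isEmpty())
           .to("clean-events");

        // Each record flows through the topology individually, keeping end-to-end
        // latency low compared with batch-style processing.
        new KafkaStreams(builder.build(), props).start();
    }
}
```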
The Challenge of Title Launch Observability: As engineers, we’re wired to track system metrics like error rates, latencies, and CPU utilization, but what about the metrics that matter to a title’s success? Option 1: Log Processing. Log processing offers a straightforward solution for monitoring and analyzing title launches.
You may also like: How to Properly Plan JVM Performance Tuning. When performance-tuning an application, both the code and the hardware running it should be accounted for. For low latency, applications use the Concurrent Mark and Sweep (CMS) or G1 garbage collectors. Ensure there is enough RAM to hold your Java process.
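As a quick sanity check on the RAM point, the JVM can report heap headroom at runtime; the launch flags in the comments below are illustrative placeholders, not tuning recommendations.

```java
// Illustrative launch flags (values are placeholders, not recommendations):
//   java -Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -jar app.jar
// Note: CMS (-XX:+UseConcMarkSweepGC) was deprecated in JDK 9 and removed in
// JDK 14, so G1 (or ZGC) is the usual low-latency choice on modern JVMs.
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb  = rt.maxMemory() / (1024 * 1024);
        // If used memory routinely approaches the max, the heap (and RAM) is undersized.
        System.out.printf("heap used: %d MB of %d MB max%n", usedMb, maxMb);
    }
}
```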
Migrating Critical Traffic At Scale with No Downtime — Part 1. Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience.
by Jun He, Yingyi Zhang, and Pawan Dixit. Incremental processing is an approach to processing only new or changed data in workflows. The key advantage is that it incrementally processes only the data that has been newly added or updated in a dataset, instead of re-processing the complete dataset.
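A minimal sketch of the idea, assuming a table whose rows carry an update timestamp (Java 16+ for the record syntax); production systems typically track change ranges in table metadata rather than scanning timestamps.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class IncrementalJob {
    record Row(String id, Instant updatedAt) {}   // hypothetical schema

    // Process only rows modified since the last checkpoint, then advance it,
    // so each run touches the delta instead of the full dataset.
    static Instant runIncrement(List<Row> table, Instant lastCheckpoint) {
        List<Row> delta = table.stream()
                .filter(r -> r.updatedAt().isAfter(lastCheckpoint))
                .collect(Collectors.toList());
        delta.forEach(r -> System.out.println("processing " + r.id()));
        return Instant.now(); // new checkpoint for the next run
    }
}
```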
Compare Latency: lower latency compared to DigitalOcean for PostgreSQL. Now, let’s take a look at the throughput and latency performance of our comparison. We measure PostgreSQL throughput in terms of transactions processed; latency is the average transaction execution time of your PostgreSQL queries. Compare Pricing.
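One rough way to reproduce such a latency measurement is a timed loop over JDBC. The sketch below assumes the PostgreSQL JDBC driver on the classpath; connection details are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PgLatencyProbe {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/bench", "bench", "secret")) {
            long totalNanos = 0;
            int runs = 100;
            for (int i = 0; i < runs; i++) {
                long start = System.nanoTime();
                try (PreparedStatement ps = conn.prepareStatement("SELECT 1")) {
                    ps.execute();
                }
                totalNanos += System.nanoTime() - start;
            }
            // Average round-trip time, a rough stand-in for transaction latency.
            System.out.printf("avg latency: %.2f ms%n", totalNanos / (double) runs / 1e6);
        }
    }
}
```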
These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. It also serves as central configuration of access patterns such as consistency or latency targets. Useful for keeping “n-newest” or prefix path deletion.
While clustering across wide-area networks (WANs) is discouraged due to latency issues, leased links can mitigate some connectivity challenges. Proper setup involves creating a configuration process that accounts for hostname changes, which could prevent nodes from rejoining the cluster. Erlang is the backbone of RabbitMQ clustering.
In this blog post, we discuss an approach to optimize the MySQL index creation process in such a way that your regular workload is not impacted. 95th Percentile Latency. The 95th percentile latency of queries was also 1.8 Stay tuned for my follow-on blog post with more details! MySQL Rolling Index Creation.
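On a single node, MySQL’s InnoDB online DDL can build an index without blocking the regular workload; a rolling approach repeats this replica by replica before failing over. A minimal JDBC sketch, assuming MySQL Connector/J, with placeholder connection, table, and column names.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OnlineIndexCreation {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/app", "admin", "secret");
             Statement st = conn.createStatement()) {
            // InnoDB online DDL: builds the index in place without blocking
            // concurrent reads and writes on the table.
            st.execute("ALTER TABLE orders ADD INDEX idx_created (created_at), "
                     + "ALGORITHM=INPLACE, LOCK=NONE");
        }
    }
}
```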
Reduced tail latencies In both our GRPC and DGS Framework services, GC pauses are a significant source of tail latencies. For a given CPU utilization target, ZGC improves both average and P99 latencies with equal or better CPU utilization when compared to G1. No explicit tuning has been required to achieve these results.
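To compare collectors on your own service, the JVM exposes cumulative collection counts and pause time through JMX. A minimal sketch; launch it with -XX:+UseZGC or -XX:+UseG1GC to compare.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseReport {
    public static void main(String[] args) {
        // Each registered collector reports how often it ran and the total
        // time spent collecting since JVM start.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```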
Using OpenTelemetry, developers can collect and process telemetry data from applications, services, and systems. Traces are used for performance analysis, latency optimization, and root cause analysis. It enhances observability by providing standardized tools and APIs for collecting, processing, and exporting metrics, logs, and traces.
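A minimal manual-instrumentation sketch using the OpenTelemetry Java API; the tracer name, span name, and attribute are arbitrary examples, and exporter configuration is assumed to be done elsewhere during SDK setup.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutHandler {
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("shop.checkout"); // instrumentation name is arbitrary

    void handleCheckout(String orderId) {
        Span span = tracer.spanBuilder("handleCheckout").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);
            // ... business logic; child spans created here attach automatically
        } finally {
            span.end(); // the configured exporter ships this span for analysis
        }
    }
}
```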
The voice service then constructs a message for the device and places it on the message queue, which is then processed and sent to Pushy to deliver to the device. The previous version of the message processor was a Mantis stream-processing job that processed messages from the message queue.
It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The subsystems all communicate with each other asynchronously via Timestone, a high-scale, low-latency priority queuing system.
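Timestone itself is a distributed service, but its core semantics resemble a priority queue. A minimal in-process Java sketch (task names and priorities are hypothetical):

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class PriorityDispatch {
    record Task(String name, int priority) {}   // lower number = more urgent

    public static void main(String[] args) throws InterruptedException {
        PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>(
                16, Comparator.comparingInt(Task::priority));
        queue.put(new Task("batch-recompute", 9));
        queue.put(new Task("interactive-render", 1));
        // The latency-sensitive task jumps ahead of bulk work.
        System.out.println(queue.take().name()); // -> interactive-render
    }
}
```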
Usually, data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video. The processed data is typically stored as data warehouse tables in AWS S3.
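A minimal Spark-in-Java sketch of such a periodic aggregation job; the S3 paths and column names are illustrative placeholders, not an actual schema.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class DailyViewAggregates {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-view-aggregates").getOrCreate();

        // Read raw events, roll them up per member, and write warehouse output.
        Dataset<Row> events = spark.read().parquet("s3a://warehouse/raw/playback_events/");
        Dataset<Row> perMember = events
                .groupBy(col("member_id"))
                .agg(sum(col("watch_seconds")).alias("total_watch_seconds"));

        perMember.write().mode("overwrite")
                 .parquet("s3a://warehouse/agg/member_watch_daily/");
        spark.stop();
    }
}
```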
Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.
Storage mount points in a system might be larger or smaller, local or remote, with high or low latency, and various speeds. As a consequence, the automatic updates as well as the automatic deep-code monitoring injection processes are even more stable. Stay tuned for upcoming news about these changes.
To ensure high standards, it’s essential that your organization establish automated validations in an early phase of the software development process—ideally when code is written. In this case, the four golden signals (latency, traffic, errors, and saturation) are derived from span attributes and DQL metric queries via Dynatrace Grail™.
As more organizations respond to the pressure to release better software faster, there is an increasing need to build quality gates into every stage of BizDevOps processes , from early development to deployment. Automating quality gates creates reliable checks and balances and speeds up the process by avoiding manual intervention.
By Xiaomei Liu, Rosanna Lee, Cyril Concolato. Introduction: Behind the scenes of the beloved Netflix streaming service and content, there are many technology innovations in media processing. Packaging has always been an important step in media processing. Uploading and downloading data always come with a penalty, namely latency.
In that environment, the first PostgreSQL developers decided that forking a process for each connection to the database was the safest choice. It is difficult to fault their argument, as it is absolutely true that each client having its own process prevents a poorly behaving client from crashing the entire database.
REST APIs, authentication, databases, email, and video processing all have a home on serverless platforms. The Serverless Process. When an application is triggered, it can cause latency as the application starts. The average request is handled, processed, and returned quickly. Services scale to meet demand.
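Cold-start latency is tied to per-container initialization, so keeping heavy setup outside the request handler pays off on warm invocations. A minimal AWS Lambda sketch in Java; the handler and configuration loading are hypothetical.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.util.Map;

public class ThumbnailHandler implements RequestHandler<Map<String, String>, String> {

    // Static initialization runs once per container, during the cold start;
    // warm invocations reuse it, which is why heavy setup belongs here
    // rather than inside handleRequest.
    private static final String CONFIG = loadConfig();

    private static String loadConfig() { return "placeholder-config"; }

    @Override
    public String handleRequest(Map<String, String> event, Context context) {
        return "processed " + event.getOrDefault("key", "unknown") + " with " + CONFIG;
    }
}
```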
Event Prioritization: Considering that the use cases were wide-ranging both in terms of their sources and their importance, we built segmentation into the event processing. We thus assigned a priority to each use case and sharded event traffic by routing to priority-specific queues and the corresponding event processing clusters.
The Dynatrace Kubernetes app also provides process ownership data, ensuring information is directed to the right team and root causes are addressed. “You can ask for the best configuration to reduce latency or improve the user experience.” “We can see that one node has memory pressure. It’s using 1.5
Replay traffic testing gives us the initial foundation of validation, but as our migration process unfolds, we are met with the need for a carefully controlled migration process. A process that doesn’t just minimize risk, but also facilitates a continuous evaluation of the rollout’s impact.
Baking Windows with Packer By Justin Phelps and Manuel Correa Customizing Windows images at Netflix was a manual, error-prone, and time consuming process. We looked at our process for creating a Windows AMI and discovered it was error-prone and full of toil. Last year, we decided to improve the AMI baking process.
You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning?” Telltale learns what constitutes typical health for an application, no alert tuning required. For example, a latency increase is less critical than an error rate increase, and some error codes are less critical than others.
There are several benefits of such optimizations like saving on storage, faster query time, cheaper downstream processing, and an increase in developer productivity by removing additional ETLs written only for query performance improvement. Orient: Gather tuning parameters for a particular table that changed.
As software development grows more complex, managing components using an automated onboarding process becomes increasingly important. The validation process is automated based on events that occur, while the objectives’ configuration, which is validated by the Site Reliability Guardian , is stored in a separate file.
Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving the detailed day-to-day activities and processes of a business domain. Teams who want to move their data no longer need to learn and write customized Stream Processing jobs. Two Types of Processors: 1.
Reconstructing a streaming session was a tedious and time consuming process that involved tracing all interactions (requests) between the Netflix app, our Content Delivery Network (CDN), and backend microservices. The process started with manual pull of member account information that was part of the session.
Higher latency and cold start issues due to the initialization time of the functions. Data analysis: how to process, aggregate, and query observability data from serverless functions effectively, accurately, and comprehensively? Enable faster development and deployment cycles by abstracting away the infrastructure complexity.
The challenge, then, is to be able to ingest and process these events in a scalable manner, i.e., scaling with the number of devices, which will be the focus of this blog post. In-Order Processing The semantics of correct device information updates ingestion requires that messages be consumed in the order that they are produced.
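With a partitioned log such as Kafka, per-device ordering is typically achieved by keying every message with the device id, so that all of a device's events land in the same partition. A minimal sketch with a hypothetical topic and key; the in-flight and idempotence settings guard against reordering on retries.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class DeviceUpdatePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // One in-flight request plus idempotence: retries cannot reorder
        // or duplicate a device's events.
        props.put("max.in.flight.requests.per.connection", "1");
        props.put("enable.idempotence", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String deviceId = "device-123"; // hypothetical key
            // Same key -> same partition -> consumers see this device's
            // updates in exactly the order they were produced.
            producer.send(new ProducerRecord<>("device-updates", deviceId, "fw=2.4.1"));
            producer.send(new ProducerRecord<>("device-updates", deviceId, "fw=2.4.2"));
        }
    }
}
```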
The other sections on that page (such as Disk analysis) provide further information and charts on topics such as available disk space, latency, dropped network packets, refused connections, and more. This leads us to the process page of our specific Apache instance. On the other hand, if we checked out the process page for our Node.js
And why have SLOs and SLIs become so important as teams automate processes to consistently meet SLAs and error budgets? As defined by Gartner , service-level objectives are an agreed-upon target within an SLA that must be achieved for each activity, function, and process to provide the best opportunity for customer success.
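As a worked example of the arithmetic behind error budgets: a 99.9% availability objective leaves 0.1% of the measurement window as budget, roughly 43 minutes per 30 days. A minimal calculation:

```java
public class ErrorBudget {
    public static void main(String[] args) {
        double sloTarget = 0.999;            // 99.9% availability objective
        double minutesIn30Days = 30 * 24 * 60;
        // The error budget is the allowed unreliability: 0.1% of the window.
        double budgetMinutes = (1 - sloTarget) * minutesIn30Days;
        System.out.printf("Allowed downtime per 30 days: %.1f minutes%n", budgetMinutes);
    }
}
```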
Focusing on tools over processes is a red flag and the biggest mistake I see executives make when it comes to AI. Improvement Requires Process Assuming that buying a tool will solve your AI problems is like joining a gym but not actually going. You also need to develop and follow processes.
This article will cover many areas that database administrators need to be aware of in order to properly license, recover, and tune a Reporting Services installation. The ReportServer and ReportServerTempDB databases are SQL Server databases and should be part of a regular backup process, just like other user databases. Tuning Options.
We are expected to process 1,000 watermarks for a single distribution in a minute, with non-linear latency growth as the number of watermarks increases. The goal is to process these documents as fast as possible and reliably deliver them to recipients while offering strong observability to both our users and internal teams.
Operational automation–including but not limited to, auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing–is key to the success of modern data platforms. In this way, no human intervention is required in the remediation process. Multi-objective optimizations.