Latency, Metrics and Tuning - Technology Performance Pulse

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

This dual-path approach leverages Kafka's capability for low-latency streaming and Iceberg's efficient management of large-scale, immutable datasets, ensuring both real-time responsiveness and comprehensive historical data availability. million impression events globally every second, with each event approximately 1.2KB in size.

Tuning

Tuning Latency Efficiency Storage

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Its partitioned log architecture supports both queuing and publish-subscribe models, allowing it to handle large-scale event processing with minimal latency. Apache Kafka uses a custom TCP/IP protocol for high throughput and low latency. Apache Kafka, designed for distributed event streaming, maintains low latency at scale.

Latency

Latency Analytics Architecture Storage

Migrating Critical Traffic At Scale with No Downtime - Part 1

The Netflix TechBlog

MAY 4, 2023

Migrating Critical Traffic At Scale with No Downtime - Part 1. Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience.

Traffic

Traffic Latency Tuning Systems

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

The Challenge of Title Launch Observability: As engineers, we're wired to track system metrics like error rates, latencies, and CPU utilization, but what about metrics that matter to a title's success? Stay tuned for a closer look at the innovation behind the scenes!

Traffic

Traffic Scalability Strategy Monitoring

Why applying chaos engineering to data-intensive applications matters

Dynatrace

MAY 23, 2024

Stream processing systems, designed for continuous, low-latency processing, demand swift recovery mechanisms to tolerate and mitigate failures effectively. This significantly increases event latency. Spark Structured Streaming can also provide consistent fault recovery for applications where latency is not a critical requirement.

Engineering

Engineering Tuning Latency Open Source

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

So, we relied on higher-level metrics-based testing: AB Testing and Sticky Canaries. To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render. Wins High-Level Health Metrics: AB Testing provided the assurance we needed in our overall client-side GraphQL implementation.

Traffic

Traffic Latency Metrics Cache

Get up to 300 new metrics out of the box with AWS supporting services (GA)

Dynatrace

DECEMBER 22, 2019

To reduce your CloudWatch costs and throttling, you can now select from additional services and metrics to monitor. Get up to 300 new AWS metrics out of the box. Dynatrace ingests AWS CloudWatch metrics for multiple preselected services. Select Add service to pick the service that has the metric you want to add.

AWS

AWS Metrics IoT Storage

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

While clustering across wide-area networks (WANs) is discouraged due to latency issues, leased links can mitigate some connectivity challenges. With 24/7 expert support, ScaleGrid assists with troubleshooting, performance tuning, and migration processes. Keeping queues short maintains a responsive and efficient RabbitMQ setup.

Best Practices

Best Practices Traffic Strategy Efficiency

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Dynatrace

JULY 22, 2024

Observability Observability is the ability to determine a system’s health by analyzing the data it generates, such as logs, metrics, and traces. There are three main types of telemetry data: Metrics. Metrics are typically aggregated and stored in time series databases for monitoring and alerting purposes.

Latency

Latency Best Practices Metrics Open Source

Crucial Redis Monitoring Metrics You Must Watch

Scalegrid

JANUARY 25, 2024

You will need to know which monitoring metrics for Redis to watch and a tool to monitor these critical server metrics to ensure its health. Redis returns a big list of database metrics when you run the info command on the Redis shell. You can pick a smart selection of relevant metrics from these.

Metrics

Metrics Monitoring Latency Cache
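
As a rough illustration of the Redis entry above, the text returned by the info command can be parsed into key/value pairs so that a small selection of metrics can be picked out. This is a minimal sketch, not the article's tooling: the helper names and the sample output are hypothetical, and it assumes the usual `key:value` line format with `#` section headers. The `keyspace_hits` and `keyspace_misses` fields are standard INFO statistics.

```python
def parse_redis_info(text: str) -> dict[str, str]:
    """Parse the text returned by the Redis INFO command into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def cache_hit_ratio(fields: dict[str, str]) -> float:
    """Fraction of key lookups served from the keyspace (0.0 if no lookups)."""
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0


# Made-up sample output, for illustration only.
SAMPLE_INFO = """\
# Memory
used_memory:1048576
# Stats
keyspace_hits:900
keyspace_misses:100
# Clients
connected_clients:12
"""

fields = parse_redis_info(SAMPLE_INFO)
print(fields["connected_clients"], cache_hit_ratio(fields))
```

In practice the input would come from a live server, and the selection of fields worth watching depends on the workload.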

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

The Netflix TechBlog

SEPTEMBER 10, 2024

Dynomite is a Netflix open source wrapper around Redis that provides a few additional features like auto-sharding and cross-region replication, and it provided Pushy with low latency and easy record expiry, both of which are critical for Pushy’s workload. As Pushy’s portfolio grew, we experienced some pain points with Dynomite.

Latency

Latency Cache Tuning Efficiency

Telltale: Netflix Application Monitoring Simplified

The Netflix TechBlog

AUGUST 13, 2020

A metric crossed a threshold. You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning?” Telltale learns what constitutes typical health for an application, no alert tuning required. Metrics are a key part of understanding application health. Client metrics and QoE changes.

Monitoring

Monitoring Tuning Traffic Metrics

Migrating Critical Traffic At Scale with No Downtime - Part 2

The Netflix TechBlog

JUNE 13, 2023

By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements. They enable us to further fine-tune and configure the system, ensuring the new changes are integrated smoothly and seamlessly.

Traffic

Traffic Metrics Systems Strategy

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data, often reaching petabytes, with millisecond access latency has become increasingly vital.

Latency

Latency Storage Traffic Tuning

Build and operate multicloud FaaS with enhanced, intelligent end-to-end observability

Dynatrace

APRIL 25, 2023

Higher latency and cold start issues due to the initialization time of the functions. Observability is typically achieved by collecting three types of data from a system: metrics, logs, and traces. Understanding cold-start behavior is essential to tune your cloud application's cost or performance to meet your operational needs.

Serverless

Serverless Lambda Azure AWS

How Dynatrace boosts production resilience with Site Reliability Guardian

Dynatrace

MAY 17, 2023

In this case, the four golden signals (latency, traffic, errors, and saturation) are derived from span attributes and DQL metric queries via Dynatrace Grail™. Based on those insights, they implemented automated validation tasks, and shifted left in their software delivery pipeline.

DevOps

DevOps Traffic Latency Best Practices

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

This separation allows us to tune system configuration and scaling policies independently for different event priorities and traffic patterns. Furthermore, in addition to real-time alerting, we added trend analysis for important metrics to help catch longer term degradations.

Systems

Systems Traffic Architecture Mobile

Enhancing Kubernetes cluster management key to platform engineering success

Dynatrace

MARCH 29, 2024

“We use AI to optimize the configuration of the software stack,” Doni said, highlighting how Akamas works by taking into account infrastructure and application metrics at the same time to achieve its optimization goals. “You can ask for the best configuration to reduce latency or improve the user experience.”

Engineering

Engineering DevOps Operating System Cloud

How digital experience monitoring helps deliver business observability

Dynatrace

APRIL 26, 2022

Fast, consistent application delivery creates a positive user experience that can ultimately drive customer loyalty and improve business metrics like conversion rate and user retention. Expanding on the traditional observability pillars of metrics, logs, and traces, DEM collects user experience data to complete the end-to-end picture.

Monitoring

Monitoring Social Media IoT Metrics

What is serverless computing? Driving efficiency without sacrificing observability

Dynatrace

JANUARY 26, 2021

When an application is triggered, it can cause latency as the application starts. This creates latency when they need to restart. Your team should incorporate performance metrics, errors, and access logs into your monitoring platform. The platform builds the trigger to initiate the app. Monitoring serverless applications.

Serverless

Serverless Efficiency Lambda AWS

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Dynatrace

MAY 17, 2023

Making applications observable—relying on metrics, logs, and traces to understand what software is doing and how it’s performing—has become increasingly important as workloads are shifting to multicloud environments. We also introduced our demo app and explained how to define the metrics and traces it uses.

Metrics

Metrics Database Monitoring Network

Applying Netflix DevOps Patterns to Windows

The Netflix TechBlog

AUGUST 22, 2019

In the canary stage, Kayenta is used to compare metrics between a baseline (current AMI) and the canary (new AMI). The canary stage will determine a score based on metrics such as CPU, threads, latency, and GC pauses. If this score is within a healthy threshold the AMI is deployed to each environment.

DevOps

DevOps AWS Tuning Infrastructure

What are SLOs? How service-level objectives work with SLIs to deliver on SLAs

Dynatrace

DECEMBER 2, 2021

These can include business metrics, such as conversion rates, uptime, and availability; service metrics, such as application performance; or technical metrics, such as dependencies to third-party services, underlying CPU, and the cost of running a service. What are SLIs? For example, if your SLO is to deliver 99.5%

Metrics

Metrics Best Practices DevOps Infrastructure

Keeping Netflix Reliable Using Prioritized Load Shedding

The Netflix TechBlog

NOVEMBER 2, 2020

Those two metrics are approximate indicators of failures and latency. When the threshold percentage for one of these two metrics is crossed, we reduce load on the service by throttling traffic. The key metrics used to trigger global throttling are CPU utilization, concurrent requests, and connection count.

Traffic

Traffic Metrics Infrastructure Architecture
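
The threshold idea in the load-shedding entry above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not Netflix's implementation: the priority scale, the threshold values, and the request names are all hypothetical, and only CPU utilization is used as the trigger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    priority: int  # hypothetical scale: 0 = most critical, higher = less critical


# Hypothetical thresholds: a priority is shed once CPU utilization (0.0 to 1.0)
# reaches the listed value. Priority 0 is never shed in this sketch.
SHED_THRESHOLDS = {0: float("inf"), 1: 0.90, 2: 0.75}


def should_shed(request: Request, cpu_utilization: float) -> bool:
    """Return True if the request should be throttled at this utilization."""
    # Unknown priorities are treated like the least critical known priority.
    lowest = min(SHED_THRESHOLDS.values())
    threshold = SHED_THRESHOLDS.get(request.priority, lowest)
    return cpu_utilization >= threshold


for req in (Request("playback", 0), Request("logging", 1), Request("prefetch", 2)):
    print(req.name, should_shed(req, cpu_utilization=0.80))
```

Shedding the least critical traffic first keeps the most important requests served as utilization climbs; a real system would combine several signals rather than a single static table.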

How BizDevOps can “shift left” using SLOs to automate quality gates

Dynatrace

MAY 5, 2021

For example, improving latency by as little as 0.1 latency is the number one reason consumers abandon mobile sites. Fine-tuning the service-level indicators that make up quality gates will improve with the help of upcoming features. Organizations can feel the impact of even a minor roadblock in the user experience.

Benchmarking

Benchmarking Latency Speed Software

Towards a Reliable Device Management Platform

The Netflix TechBlog

AUGUST 30, 2021

Thus, the implemented solution must integrate with Netflix Spring facilities for authentication and metrics support at the very minimum. By the following morning, alerts were received regarding high memory consumption and GC latencies, to the point where the service was unresponsive to HTTP requests. million elements.

Latency

Latency Traffic Transportation Cloud

Netflix Video Quality at Scale with Cosmos Microservices

The Netflix TechBlog

NOVEMBER 2, 2021

In particular, the VMAF metric lies at the core of improving the Netflix member’s streaming video quality. This enables us to use our scale to increase throughput and reduce latencies. Here, based on the video length, the throughput and latency requirements, available scale etc., Assembly for two of the metrics (e.g.

Media

Media Innovation Metrics Latency

AI Essentials for Tech Executives

O'Reilly

FEBRUARY 18, 2025

Here are some key takeaways to keep in mind: Be skeptical of advice or metrics that sound too good to be true. For example, the metrics that come built-in to many tools rarely correlate with what you actually care about. Of course, there's more to making improvements than just relying on tools and metrics.

Latency

Latency Tuning Metrics Testing

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

O'Reilly

MARCH 25, 2025

Business value: Once we have a rubric for evaluating our systems, how do we tie our macro-level business value metrics to our micro-level LLM evaluations? Iteration: Iterate quickly using prompt engineering, embeddings, tool use, fine-tuning, business logic, and more! How do we do so? We tested both retrieval quality (e.g.,

Systems

Systems Development Tuning Monitoring

PostgreSQL Benchmark: ScaleGrid vs. Amazon RDS

Scalegrid

NOVEMBER 4, 2024

We use Sysbench to benchmark key performance metrics under different workloads and thread configurations, including Transactions Per Second (TPS) and Queries Per Second (QPS). Key metrics include TPS and QPS. However, to ensure a level playing field regarding connection handling, we tuned ScaleGrid’s instances to allow 830 connections.

Benchmarking

Benchmarking AWS Tuning Metrics

DevOps automation: From event-driven automation to answer-driven automation [with causal AI]

Dynatrace

JULY 24, 2023

Observability data provides a treasure trove of performance, stability, and user experience metrics encompassing error rates, response times, and user engagement. With swift precision, an answer-driven automation solution that uses causal AI can transform these metrics into invaluable insights.

DevOps

DevOps Traffic Efficiency Servers

Tuning SQL Server Reporting Services

SQL Performance

SEPTEMBER 17, 2019

This article will cover many areas that database administrators need to be aware of in order to properly license, recover, and tune a Reporting Services installation. Tuning Options. Tuning SSRS is much like any other application. Disk latency for ReportServer and ReportServerTempDB are very important. General Tuning.

Tuning

Tuning Servers Database Best Practices

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

OCTOBER 19, 2020

If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. Our engineering teams tuned their services for performance after factoring in increased resource utilization due to tracing.

Infrastructure

Infrastructure Transportation Storage Open Source

The Most Important MySQL Setting

Percona

APRIL 7, 2023

If we were to select the most important MySQL setting, if we were given a freshly installed MySQL or Percona Server for MySQL and could only tune a single MySQL variable, which one would it be? To be fair, that is also true with PostgreSQL; it hasn’t been tuned either, and it, too, can also perform much better.

Tuning

Tuning Cache Servers Benchmarking

Data Movement in Netflix Studio via Data Mesh

The Netflix TechBlog

JULY 26, 2021

Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities and processes of a business domain. Data Quality: Data Mesh provides metrics and dashboards at both the processor and pipeline level for operational observability. Please stay tuned!

Big Data

Big Data Government Processing Analytics

How To Scale a Single-Host PostgreSQL Database With Citus

Percona

NOVEMBER 3, 2023

And now, execute the benchmark: -- execute the following on the coordinator node pgbench -c 20 -j 3 -T 60 -P 3 pgbench The results are not pretty. psql pgbench <<_eof1_ \qecho adding node citus3. select citus_add_node('citus3', 5432); \qecho rebalancing shards across TWO nodes. psql pgbench <<_eof1_ \qecho adding node citus3.

Database

Database Benchmarking Latency C++

The evolution of single-core bandwidth in multicore processors

John McCalpin

APRIL 25, 2023

The primary metric for memory bandwidth in multicore processors is the maximum sustained performance when using many cores. This metric is interesting because we don’t always have the luxury of parallelizing every application we run, and our operating systems almost always process each call (e.g., Stay tuned!

Benchmarking

Benchmarking Cache Latency Tuning

MySQL Performance Tuning 101: Key Tips to Improve MySQL Database Performance

Percona

SEPTEMBER 1, 2023

While there is no magic bullet for MySQL performance tuning, there are a few areas that can be focused on upfront that can dramatically improve the performance of your MySQL installation. What are the Benefits of MySQL Performance Tuning? A finely tuned database processes queries more efficiently, leading to swifter results.

Tuning

Tuning Database Performance Hardware

Incremental Processing using Netflix Maestro and Apache Iceberg

The Netflix TechBlog

NOVEMBER 20, 2023

As our business scales globally, the demand for data is growing and the needs for scalable low latency incremental processing begin to emerge. We are taking Big Data Orchestration to the next level and constantly solving new problems and challenges, please stay tuned. There are three common issues that the dataset owners usually face.

Processing

Processing Big Data Efficiency Engineering

Expanding the Cloud: Amazon Machine Learning Service, the Amazon Elastic Filesystem and more

All Things Distributed

APRIL 9, 2015

The Amazon ML console and API provide data and model visualization tools, as well as wizards to guide you through the process of creating machine learning models, measuring their quality and fine-tuning the predictions to match your application requirements.

Lambda

Lambda Cloud IoT AWS

Software engineering for machine learning: a case study

The Morning Paper

JULY 7, 2019

In addition to availability, our respondents focus most heavily on supporting the following data attributes: “accessibility, accuracy, authoritativeness, freshness, latency, structuredness, ontological typing, connectedness, and semantic joinability.” To address this, rigorous rollout processes are required.

Software Engineering

Software Engineering Engineering Software Software

Software-defined far memory in warehouse scale computers

The Morning Paper

MAY 21, 2019

This boils down to a single digit µs latency toleration in the tail for far memory, and in addition to security and privacy concerns, rules out remote memory solutions. Thus we’re fundamentally trading (de)-compression latency at access time for the ability to pack more data in memory. ML-based auto-tuning. Evaluation.

Software

Software Software Google Hardware

Monitoring Serverless Applications

Dotcom-Monitor

NOVEMBER 11, 2020

Developers don’t have to put in additional time fine-tuning the system, or rely on other teams for support, as it’s done automatically by the cloud provider. The primary challenge is not being able to access the underlying infrastructure metrics. The time it takes between an action and a response is latency. Monitoring.

Serverless

Serverless Monitoring Lambda Latency
