To this end, we developed a Rapid Event Notification System (RENO) to support use cases that require server-initiated communication with devices in a scalable and extensible manner. In this blog post, we will give an overview of the Rapid Event Notification System at Netflix and share some of the learnings we gained along the way.
This gives fascinating insights into the network topography of our visitors, and how much we might be impacted by high latency regions. Round-trip time (RTT) is basically a measure of latency—how long did it take to get from one endpoint to another and back again? RTT data should be seen as an insight and not a metric.
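For illustration, here is a minimal Python sketch that approximates RTT by timing TCP handshakes; the helper name measure_rtt and its defaults are invented here, and real RUM data would come from the browser rather than a script like this:

```python
import socket
import time

def measure_rtt(host: str, port: int = 443, samples: int = 5) -> float:
    """Estimate round-trip time by timing TCP handshakes: a rough proxy,
    not a substitute for RUM data collected in the browser."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass  # connection established; handshake complete
        timings.append((time.perf_counter() - start) * 1000)
    return min(timings)  # min filters out local scheduling noise

print(f"RTT: {measure_rtt('example.com'):.1f} ms")
```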
In IT and cloud computing, observability is the ability to measure a system’s current state based on the data it generates, such as logs, metrics, and traces. If you’ve read about observability, you likely know that logs, metrics, and distributed traces are the three key pillars of success.
Collecting Raw Impression Events
As Netflix members explore our platform, their interactions with the user interface spark a vast array of raw events. These events are promptly relayed from the client side to our servers, entering a centralized event processing queue.
What is Apache Kafka?
RabbitMQ is designed for flexible routing and message reliability, while Kafka handles high-throughput event streaming and real-time data processing. Kafka is optimized for high-throughput event streaming, excelling in real-time analytics and large-scale data ingestion.
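To make the contrast concrete, here is a hedged sketch of publishing to each broker from Python, assuming local brokers and the third-party pika and kafka-python client libraries; the queue and topic names are hypothetical:

```python
import pika                      # RabbitMQ client
from kafka import KafkaProducer  # Kafka client (kafka-python)

# RabbitMQ: flexible routing and per-message reliability knobs.
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="orders", durable=True)
channel.basic_publish(
    exchange="",                 # default exchange routes by queue name
    routing_key="orders",
    body=b"order-123",
    properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
)
conn.close()

# Kafka: append to a partitioned log, optimized for throughput.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("clickstream", b"event-456")  # async; batched under the hood
producer.flush()
producer.close()
```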
Continuous Instrumentation of the Linux Scheduler
To ensure the reliability of our workloads that depend on low latency responses, we instrumented the run queue latency for each container, which measures the time processes spend in the scheduling queue before being dispatched to the CPU.
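The Netflix post describes eBPF-based instrumentation; as a far simpler stand-in, on Linux hosts with scheduler statistics enabled, a process's cumulative run-queue wait time can be sampled from /proc/<pid>/schedstat (the second field, in nanoseconds). A rough sketch with invented helper names:

```python
import time

def runqueue_wait_ns(pid: int) -> int:
    """Read cumulative run-queue wait from /proc/<pid>/schedstat.
    Fields: time on CPU (ns), time waiting on the run queue (ns),
    number of timeslices run."""
    with open(f"/proc/{pid}/schedstat") as f:
        _, wait_ns, _ = f.read().split()
    return int(wait_ns)

def sample_runq_latency(pid: int, interval_s: float = 1.0) -> float:
    """Delta of run-queue wait over an interval, in milliseconds."""
    before = runqueue_wait_ns(pid)
    time.sleep(interval_s)
    return (runqueue_wait_ns(pid) - before) / 1e6

print(f"run-queue wait: {sample_runq_latency(1):.3f} ms over 1s")
```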
They need event-driven automation that not only responds to events and triggers but also analyzes and interprets the context to deliver precise and proactive actions. These initial automation endeavors paved the way for greater advancements, leading to the next evolution of event-driven automation.
The Challenge of Title Launch Observability
As engineers, we're wired to track system metrics like error rates, latencies, and CPU utilization, but what about the metrics that matter to a title's success? Using the source of truth: logs serve as a reliable source of truth by providing a comprehensive record of system events.
A quick canary test was free of errors and showed lower latency, which is expected given that our standard canary setup routes an equal amount of traffic to both the baseline running on 4xl and the canary on 12xl. What’s worse, average latency degraded by more than 50%, with both CPU and latency patterns becoming more “choppy.”
While clustering across wide-area networks (WANs) is discouraged due to latency issues, leased links can mitigate some connectivity challenges. Event-driven architecture in RabbitMQ supports horizontal scalability by decoupling services, enabling them to process messages independently.
By implementing service-level objectives, teams can avoid collecting and checking a huge amount of metrics for each service. According to the best practices in Google’s SRE handbook, there are “Four Golden Signals” we can convert into four SLOs for services: reliability, latency, availability, and saturation.
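As a sketch of the idea (with made-up numbers and thresholds, not Google's or any vendor's definitions), SLIs can be computed from raw signals and compared against objectives:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    objective: float  # e.g. 0.999 means 99.9% of requests must comply

def availability_sli(total: int, errors: int) -> float:
    return (total - errors) / total

def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    fast = sum(1 for l in latencies_ms if l <= threshold_ms)
    return fast / len(latencies_ms)

# Hypothetical numbers for illustration only.
requests, errors = 1_000_000, 420
latencies = [12.0, 35.5, 480.0, 90.1, 15.2]

for slo, sli in [
    (SLO("availability", 0.999), availability_sli(requests, errors)),
    (SLO("latency<=300ms", 0.95), latency_sli(latencies, 300.0)),
]:
    status = "OK" if sli >= slo.objective else "BREACHED"
    print(f"{slo.name}: SLI={sli:.4f} objective={slo.objective} -> {status}")
```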
So, we relied on higher-level metrics-based testing: AB Testing and Sticky Canaries. To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render.
Wins
High-Level Health Metrics: AB Testing provided the assurance we needed in our overall client-side GraphQL implementation.
Observability
Observability is the ability to determine a system’s health by analyzing the data it generates, such as logs, metrics, and traces. There are three main types of telemetry data; the first is metrics, which are typically aggregated and stored in time series databases for monitoring and alerting purposes.
Citrix platform performance—optimize your Citrix landscape with insights into user load and screen latency per server. As a part of the Citrix monitoring extension for Dynatrace, we deliver a OneAgent plugin that adds several Citrix-specific WMI counters to the set of metrics reported by OneAgent.
The second phase involves migrating the traffic over to the new systems in a manner that mitigates the risk of incidents while continually monitoring and confirming that we are meeting crucial metrics tracked at multiple levels. It provides a good read on the availability and latency ranges under different production conditions.
The new Amazon capability enables customers to improve the startup latency of their functions from several seconds to as low as sub-second (up to 10 times faster) at P99 (the 99th latency percentile). This can cause latency outliers and may lead to a poor end-user experience for latency-sensitive applications.
This approach enhances key DORA metrics and enables early detection of failures in the release process, allowing SREs more time for innovation. This blog post explores the Reliability metric, which measures modern operational practices. “Your org’s challenge is to get ROI on those events.”
Why reliability?
Stream processing systems, designed for continuous, low-latency processing, demand swift recovery mechanisms to tolerate and mitigate failures effectively. This significantly increases event latency. (Figures: recovery time of the throughput metric; recovery time of the latency p90.)
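The recovery mechanism itself isn't shown in the excerpt; a generic illustration of one common approach is checkpointing consumed offsets so a restarted worker resumes near where it crashed, giving at-least-once semantics. All names here are hypothetical:

```python
import json, os, tempfile

CHECKPOINT = "offsets.json"

def load_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"offset": 0}

def save_checkpoint(state: dict) -> None:
    # Write atomically so a crash mid-write cannot corrupt the checkpoint.
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def handle(event: str) -> None:
    pass  # placeholder for real processing

def process(events: list[str]) -> None:
    state = load_checkpoint()
    for i, event in enumerate(events[state["offset"]:], start=state["offset"]):
        handle(event)                   # at-least-once: may re-run after a crash
        state["offset"] = i + 1
        if state["offset"] % 100 == 0:  # checkpoint every 100 events
            save_checkpoint(state)
    save_checkpoint(state)

process([f"evt-{i}" for i in range(250)])
```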
Customers can use AWS Lambda Response Streaming to improve performance for latency-sensitive applications and return larger payload sizes. Triggering the Lambda function is event-driven and could include changes in state or an update to a file. To learn more about the AWS Lambda features, visit the Lambda features page.
Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch
Introduction
As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.
One of the crucial success factors for delivering cost-efficient and high-quality AI-agent services, following the approach described above, is to closely observe their cost, latency, and reliability. With these latency, reliability, and cost measurements in place, your operations team can now define their own OpenAI dashboards and SLOs.
Get ready for Nutanix insights: Here’s how Dynatrace helps
The extension comes with a comprehensive set of essential metrics that can quickly identify the root causes of performance issues, saving time and minimizing disruptions. With Dynatrace, Nutanix metrics can be leveraged for various use cases.
Dynatrace AutomationEngine workflows automate release validation using AWS Well-Architected pillars
With Dynatrace, you can create workflows that automate various tasks based on events, schedules, or Davis problem triggers. Workflows are powered by a core platform technology of Dynatrace called the AutomationEngine.
What is AWS Lambda?
AWS Lambda is a serverless compute service that can run code in response to predetermined events or conditions and automatically manage all the computing resources required for those processes. Real-time stream processing to perform live activity tracking, data cleansing, metrics generation, and more.
A few years ago, we were paged by our SRE team due to our Metrics Alerting System falling behind — critical application health alerts reached engineers 45 minutes late! Hence, we started down the path of alert evaluation via real-time streaming metrics. This has proven to be valuable towards reducing Mean Time to Recover (MTTR).
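As a toy illustration of evaluating alerts over streaming metrics rather than polling a metrics store (not Netflix's actual design), a sliding-window evaluator might look like this:

```python
from collections import deque
import time

class SlidingWindowAlert:
    """Evaluate an alert over a sliding window of streaming datapoints,
    instead of repeatedly querying a metrics store."""
    def __init__(self, threshold: float, window_s: float = 60.0):
        self.threshold = threshold
        self.window_s = window_s
        self.points: deque[tuple[float, float]] = deque()

    def observe(self, value: float, now: float | None = None) -> bool:
        now = now if now is not None else time.time()
        self.points.append((now, value))
        # Drop datapoints that have aged out of the window.
        while self.points and self.points[0][0] < now - self.window_s:
            self.points.popleft()
        avg = sum(v for _, v in self.points) / len(self.points)
        return avg > self.threshold  # True -> fire the alert

alert = SlidingWindowAlert(threshold=0.05)  # e.g. a 5% error-rate ceiling
if alert.observe(0.09):
    print("page on-call: error rate above threshold")
```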
Certain SLOs can help organizations get started on measuring and delivering metrics that matter. With this objective, the app ensures that users experience real-time feedback and immediate updates when logging workouts, recording sets and reps, or tracking performance metrics. Latency primarily focuses on the time spent in transit.
Monitoring focuses on watching specific metrics. Observability is the ability to understand a system’s internal state by analyzing the data it generates, such as logs, metrics, and traces. For example, we can actively watch a single metric for changes that indicate a problem — this is monitoring.
In the Device Management Platform, this is achieved by having device updates be event-sourced through the control plane to the cloud so that NTS will always have the most up-to-date information about the devices available for testing. The RAE is configured to be effectively a router that devices under test (DUTs) are connected to.
The other main use case was RENO, the Rapid Event Notification System mentioned above. Dynomite is a Netflix open source wrapper around Redis that provides a few additional features like auto-sharding and cross-region replication, and it provided Pushy with low latency and easy record expiry, both of which are critical for Pushy’s workload.
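As a small illustration of the record-expiry side, using plain Redis via the redis-py client rather than Dynomite, a local server, and an invented key layout:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Register a device's push connection with a TTL, so stale records
# expire automatically instead of requiring explicit cleanup.
r.set("pushy:device:abc123", "server-17", ex=300)  # expire after 300s

print(r.ttl("pushy:device:abc123"))  # seconds remaining, or -2 if gone
```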
SREs use Service-Level Indicators (SLI) to see the complete picture of service availability, latency, performance, and capacity across various systems, especially revenue-critical systems. Additionally, you can easily use any previously defined metrics and SLOs from your environments.
Davis AI contextually aligns all relevant data points—such as logs, traces, and metrics—enabling teams to act quickly and accurately while still providing power users with the flexibility and depth they desire and need.
Figure 11: The Clouds app provides a view of all available cloud-native services.
When an incident occurs, developers need to know what data to look at, where the incident occurred, and other relevant metrics. In this example, Grabner saw that the adservice workload was running on EKS and could see the relevant metrics, logs, services, events, error logs, and more. “Then, I add a breakpoint.”
What are serverless applications?
Serverless applications are composed of event-driven functions that run on demand in response to triggers from various sources, such as HTTP requests, messages, or timers. Higher latency and cold start issues due to the initialization time of the functions.
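A minimal sketch of one such function, assuming a Python AWS Lambda handler behind an API Gateway proxy integration (the event's "body" field follows that integration's shape; the response fields are what the proxy integration expects back):

```python
import json

def handler(event: dict, context) -> dict:
    """Entry point invoked per event; the function only runs (and bills)
    while handling a trigger such as an HTTP request or queue message."""
    # API Gateway proxy events carry the HTTP payload under "body".
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```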
- Data sharding strategy in Elasticsearch is updated to provide low search latency (as described in the blog post).
- Design of new Cassandra reverse indices to support different sets of queries.
After reading the asset ids in one of these ways, an event is created per asset id to be processed synchronously or asynchronously based on the use case.
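As a generic sketch of that per-asset fan-out (hypothetical names, not the post's actual code), asyncio makes the synchronous/asynchronous choice explicit:

```python
import asyncio

async def process_asset(asset_id: str) -> None:
    # Placeholder for per-asset work such as reverse-index updates.
    await asyncio.sleep(0)
    print(f"processed {asset_id}")

async def dispatch(asset_ids: list[str], synchronous: bool) -> None:
    """Create one event per asset id and process it either in order
    (synchronously) or concurrently, depending on the use case."""
    if synchronous:
        for asset_id in asset_ids:
            await process_asset(asset_id)
    else:
        await asyncio.gather(*(process_asset(a) for a in asset_ids))

asyncio.run(dispatch(["asset-1", "asset-2", "asset-3"], synchronous=False))
```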
Dynatrace Managed is intrinsically highly available as it stores three copies of all events, user sessions, and metrics across its cluster nodes. The network latency between cluster nodes should be around 10 ms or less. In the image below, three downed nodes make an entire cluster unavailable.
RUM gathers information on a variety of performance metrics. Data collected on page load events, for example, can include navigation start (when performance begins to be measured), request start (right before the user makes a request from the server), and speed index metrics (measure page load speed). Tools may be limited.
Making applications observable—relying on metrics, logs, and traces to understand what software is doing and how it’s performing—has become increasingly important as workloads are shifting to multicloud environments. We also introduced our demo app and explained how to define the metrics and traces it uses.
Unlike a traditional virtual machine model, where customers must build and manage an entire VM, serverless computing provides the ability to purchase only the CPU cycles and memory needed to support an application using an event-based, pay-per-use model. When an application is triggered, it can cause latency as the application starts.
With request tracing and additional data from logs, events, metadata, and analysis, Edgar is able to show the flow of a request through our distributed system.
Tracing as a foundation
Logs, metrics, and traces are the three pillars of observability. Is this an anomaly or are we dealing with a pattern?
That’s because it does not require any pre-prepared schemas, and access to cold/hot storage is fully automatic and with zero latency. Tens or even hundreds of DIY and commercial tools are being used to handle logs, metrics, traces, security events, and vulnerabilities all in their own way.
Bringing together metrics, logs, traces, problem analytics, and root-cause information in dashboards and notebooks, Dynatrace offers an end-to-end unified operational view of cloud applications. Organizations need to stay on top of AI developments, and AI adoption is not a one-time event for which they can plan.
Annie leads the Chrome Speed Metrics team at Google, which has arguably had the most significant impact on web performance of the past decade. We've gotten to know Annie through frequent discussions, feedback sessions, and hallway talks at various events.
What is the charter of the Chrome Speed Metrics team?
Nice job, everyone!
The Workflows screenshot below shows that a task is triggered by a change event related to the application, execution of the guardians, and final aggregation of the results. In this case, the four golden signals (latency, traffic, errors, and saturation) are derived from span attributes and DQL metric queries via Dynatrace Grail™.
Observability is made up of three key pillars: metrics, logs, and traces. Metrics are measures of critical system values, such as CPU utilization or average write latency to persistent storage. Logs are files that record events in a system, such as the start of a subprocess or the trapping of an error.
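A toy example tying the three pillars together in one place; real systems would use OpenTelemetry or a similar SDK rather than this hand-rolled span, and every name here is invented:

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

metrics: dict[str, float] = {}  # stand-in for a time-series database

def traced(span_name: str, trace_id: str):
    """Minimal hand-rolled span that emits all three pillars on exit."""
    class Span:
        def __enter__(self):
            self.start = time.perf_counter()
            return self
        def __exit__(self, *exc):
            elapsed_ms = (time.perf_counter() - self.start) * 1000
            metrics[f"{span_name}.latency_ms"] = elapsed_ms    # metric
            log.info("trace=%s span=%s took %.2fms",           # log + trace
                     trace_id, span_name, elapsed_ms)
    return Span()

trace_id = uuid.uuid4().hex
with traced("charge_card", trace_id):
    time.sleep(0.01)  # simulated work
print(metrics)
```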