Event, Latency and Strategy - Technology Performance Pulse

Netflix’s Distributed Counter Abstraction

The Netflix TechBlog

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Latency

Latency Cache Infrastructure Strategy

Cut costs and complexity: 5 strategies for reducing tool sprawl with Dynatrace

Dynatrace

APRIL 10, 2025

Here are five strategies executives can pursue to reduce tool sprawl, lower costs, and increase operational efficiency. All data in context: By bringing together metrics, logs, traces, user behavior, and security events into one platform, Dynatrace eliminates silos and delivers real-time, end-to-end visibility.

Strategy

Strategy Storage Network Architecture

Optimising for High Latency Environments

CSS Wizardry

SEPTEMBER 16, 2024

This gives fascinating insights into the network topography of our visitors, and how much we might be impacted by high latency regions. Round-trip-time (RTT) is basically a measure of latency—how long did it take to get from one endpoint to another and back again? What is RTT? RTT isn’t a you-thing, it’s a them-thing.

Latency

Latency Cache Transportation Mobile

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

We can experiment with different content placements or promotional strategies to boost visibility and engagement. Analyzing impression history, for example, might help determine how well a specific row on the home page is functioning or assess the effectiveness of a merchandising strategy.

Tuning

Tuning Latency Efficiency Storage

Why growing AI adoption requires an AI observability strategy

Dynatrace

JANUARY 17, 2024

An AI observability strategy—which monitors IT system performance and costs—may help organizations achieve that balance. They can do so by establishing a solid FinOps strategy. Predictive AI uses machine learning to identify patterns in past events and make predictions about future events. What is AI observability?

Strategy

Strategy Artificial Intelligence Storage Cloud

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

RabbitMQ is designed for flexible routing and message reliability, while Kafka handles high-throughput event streaming and real-time data processing. Kafka is optimized for high-throughput event streaming, excelling in real-time analytics and large-scale data ingestion. What is Apache Kafka?

Latency

Latency Analytics Architecture Storage

Foundation Model for Personalized Recommendation

The Netflix TechBlog

MARCH 28, 2025

Yet, many are confined to a brief temporal window due to constraints in serving latency or training costs. Key insights from this shift include: A Data-Centric Approach: Shifting focus from model-centric strategies, which heavily rely on feature engineering, to a data-centric one.

Tuning

Tuning Efficiency Latency Strategy

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

You’ll also learn strategies for maintaining data safety and managing node failures so your RabbitMQ setup is always up to the task. They can be mirrored and configured for either availability or consistency, providing different strategies for managing network partitions.

Best Practices

Best Practices Traffic Strategy Scalability

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

The Challenge of Title Launch Observability: As engineers, we’re wired to track system metrics like error rates, latencies, and CPU utilization, but what about metrics that matter to a title’s success? Using the source of truth: Logs serve as a reliable source of truth by providing a comprehensive record of system events.

Traffic

Traffic Scalability Strategy Monitoring

Noisy Neighbor Detection with eBPF

The Netflix TechBlog

SEPTEMBER 10, 2024

Continuous Instrumentation of the Linux Scheduler: To ensure the reliability of our workloads that depend on low latency responses, we instrumented the run queue latency for each container, which measures the time processes spend in the scheduling queue before being dispatched to the CPU.

Latency

Latency Metrics Programming Monitoring

Using Pausers in Event Loops

DZone

SEPTEMBER 14, 2022

Typically in low-latency development, a trade-off must be made between minimizing latency and avoiding excessive CPU utilization. In a typical application stack, multiple threads are used for servicing events, processing data, pipelining, and so on. Description of the Problem.
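The trade-off described above can be illustrated with a minimal sketch. It assumes a three-stage backoff (busy-spin, then yield, then sleep); the class name `BackoffPauser`, the helper `run_loop`, and all thresholds are hypothetical choices made for illustration and are not taken from the article or from any particular library.

```python
import time


class BackoffPauser:
    """Hypothetical pauser: spin first, then yield, then sleep.

    Spinning keeps latency low while work is likely to arrive soon;
    yielding and sleeping reduce CPU use once the loop stays idle.
    """

    def __init__(self, spin_limit=100, yield_limit=200, sleep_seconds=0.001):
        self.spin_limit = spin_limit        # idle passes spent busy-spinning
        self.yield_limit = yield_limit      # idle passes before sleeping starts
        self.sleep_seconds = sleep_seconds  # sleep length in the final stage
        self.idle_count = 0

    def pause(self):
        """Call after a loop pass that found no work; returns the stage used."""
        self.idle_count += 1
        if self.idle_count <= self.spin_limit:
            return "spin"                   # return at once: lowest latency
        if self.idle_count <= self.yield_limit:
            time.sleep(0)                   # give other threads a chance to run
            return "yield"
        time.sleep(self.sleep_seconds)      # back off to save CPU
        return "sleep"

    def reset(self):
        """Call after a loop pass that handled an event."""
        self.idle_count = 0


def run_loop(poll, pauser, max_passes):
    """Poll for events; pause on idle passes and reset after busy ones."""
    handled = 0
    for _ in range(max_passes):
        event = poll()
        if event is None:
            pauser.pause()
        else:
            handled += 1
            pauser.reset()
    return handled
```

Raising `spin_limit` favors latency at the cost of CPU time, while lowering it does the opposite; the right values depend on the workload.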

Latency

Latency Open Source Strategy Design

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

By: Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data, often reaching petabytes, with millisecond access latency has become increasingly vital.

Latency

Latency Storage Traffic Tuning

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries. Let’s discuss the three testing strategies in further detail. To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render.

Traffic

Traffic Latency Metrics Cache

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Dynatrace

JULY 22, 2024

Text-based records of events and activities generated by applications and infrastructure components. Traces are used for performance analysis, latency optimization, and root cause analysis. Logs are detailed records of events that happen within an application. Logs are used for debugging, troubleshooting, and auditing purposes.

Latency

Latency Best Practices Metrics Open Source

Predictive CPU isolation of containers at Netflix

The Netflix TechBlog

JUNE 4, 2019

Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.

Cache

Cache Latency Airlines Logistics

Why applying chaos engineering to data-intensive applications matters

Dynatrace

MAY 23, 2024

Stream processing systems, designed for continuous, low-latency processing, demand swift recovery mechanisms to tolerate and mitigate failures effectively. After failures, Kafka Streams’ partition assignment strategy, triggered by rebalances, causes its executions to accumulate more lag. This significantly increases event latency.

Engineering

Engineering Tuning Latency Open Source

Migrating Critical Traffic At Scale with No Downtime - Part 1

The Netflix TechBlog

MAY 4, 2023

This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal. In this testing strategy, we execute a copy (replay) of production traffic against a system’s existing and new versions to perform relevant validations. This approach has a handful of benefits.
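The replay idea in this excerpt can be sketched in a few lines. This is an illustration under stated assumptions, not Netflix's tooling: the function `replay_and_compare` and the two handler functions are hypothetical stand-ins for the existing and new versions of a system, and responses are assumed to be directly comparable with `==`.

```python
def replay_and_compare(requests, existing, candidate):
    """Run each recorded request against both versions and collect mismatches."""
    mismatches = []
    for request in requests:
        expected = existing(request)   # response from the current system
        actual = candidate(request)    # response from the new system
        if expected != actual:
            mismatches.append((request, expected, actual))
    return mismatches


# Hypothetical handlers: the new version differs for title id 2 only.
def existing_handler(title_id):
    return {"id": title_id, "title": f"title-{title_id}"}


def candidate_handler(title_id):
    title = "TITLE-2" if title_id == 2 else f"title-{title_id}"
    return {"id": title_id, "title": title}


recorded_requests = [1, 2, 3]
for request, expected, actual in replay_and_compare(
    recorded_requests, existing_handler, candidate_handler
):
    print(f"request {request}: expected {expected}, got {actual}")
```

A real comparison would usually also need to ignore fields that legitimately differ between runs, such as timestamps.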

Traffic

Traffic Latency Tuning Systems

Introducing Netflix’s Key-Value Data Abstraction Layer

The Netflix TechBlog

SEPTEMBER 18, 2024

These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. It also serves as central configuration of access patterns such as consistency or latency targets. For simpler use cases, it also represents flat key-value Maps (e.g.

Latency

Latency Storage Cache Servers

Consistent caching mechanism in Titus Gateway

The Netflix TechBlog

NOVEMBER 3, 2022

In that scenario, the system would need to deal with the data propagation latency directly, for example, by use of timeouts or client-originated update tracking mechanisms. We started seeing increased response latencies and leader servers running at dangerously high utilization. Let’s assume a sequence of events E?…E??,

Cache

Cache Latency Traffic Systems

Implementing AWS well-architected pillars with automated workflows

Dynatrace

SEPTEMBER 13, 2023

Because Google offers its own Google Cloud Architecture Framework and Microsoft its Azure Well-Architected Framework, organizations that use a combination of these platforms triple the challenge of integrating their performance frameworks into a cohesive strategy. SRG validates the status of the resiliency SLOs for the experiment period.

AWS

AWS Efficiency Azure Cloud

What is AWS Lambda?

Dynatrace

APRIL 5, 2021

AWS Lambda is a serverless compute service that can run code in response to predetermined events or conditions and automatically manage all the computing resources required for those processes. Many events can trigger a lambda function. AWS continues to improve how it handles latency issues. What is AWS Lambda?

Lambda

Lambda AWS Serverless Hardware

Automated Change Impact Analysis with Site Reliability Guardian

Dynatrace

FEBRUARY 15, 2023

Streamline development and delivery processes: Nowadays, digital transformation strategies are executed by almost every organization across all industries. SREs use Service-Level Indicators (SLI) to see the complete picture of service availability, latency, performance, and capacity across various systems, especially revenue-critical systems.

DevOps

DevOps Latency Traffic Best Practices

What is ITOps? Why IT operations is more crucial than ever in a multicloud world

Dynatrace

DECEMBER 15, 2022

This includes response time, accuracy, speed, throughput, uptime, CPU utilization, and latency. To ensure resilience, ITOps teams simulate disasters and implement strategies to mitigate downtime and reduce financial loss. This is the number of failures that affect users’ ability to use an application divided by the total time in service.

Artificial Intelligence

Artificial Intelligence DevOps Hardware Virtualization

Data Movement in Netflix Studio via Data Mesh

The Netflix TechBlog

JULY 26, 2021

Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities and processes of a business domain. Change Data Capture (CDC) source connector reads from studio applications’ database transaction logs and emits the change events.

Big Data

Big Data Government Processing Analytics

Dynatrace accelerates business transformation with new AI observability solution

Dynatrace

JANUARY 31, 2024

For production models, this provides observability of service-level agreement (SLA) performance metrics, such as token consumption, latency, availability, response time, and error count. Finding a balance between complexity and impact must be a priority for organizations that adopt AI strategies.

Cache

Cache Azure Infrastructure Monitoring

SRE vs DevOps: What you need to know

Dynatrace

FEBRUARY 24, 2021

The events of 2020 accelerated the trend of organizations shifting to cloud-native technologies in response to the dramatic increase in demand for online services. Reduced latency. If you haven’t implemented either, a best practice to get started is to develop a strategy that incorporates both DevOps and SRE practices.

DevOps

DevOps Software Engineering Speed Google

The Best Way to Host MongoDB on DigitalOcean

Scalegrid

DECEMBER 16, 2019

In this post, we’ll walk you through the best way to host MongoDB on DigitalOcean, including the best instance types to use, disk types, replication strategy, and managed service providers. MongoDB Replication Strategies. DigitalOcean Advantages for MongoDB. DigitalOcean Droplets. MongoDB Sharding.

Azure

Azure AWS Database Latency

DevOps observability: A guide for DevOps and DevSecOps teams

Dynatrace

JANUARY 18, 2023

This methodology aims to improve software system reliability using several key categories such as availability, performance, latency, efficiency, capacity, and incident response. Organizations that are new to both practices will want to adopt a strategy that incorporates both. Site reliability engineers, or SREs, lead these efforts.

DevOps

DevOps Best Practices Innovation Strategy

Unlock the power of contextual log analytics

Dynatrace

OCTOBER 2, 2024

Grail is built for exabyte scale and leverages massively parallel processing (MPP) as well as advanced automated cold/hot data management to ensure that data remains fully accessible at all times, with zero latency, and full hydration.

Analytics

Analytics AWS DevOps Cloud

Real user monitoring vs. synthetic monitoring: Understanding best practices

Dynatrace

JUNE 27, 2022

Data collected on page load events, for example, can include navigation start (when performance begins to be measured), request start (right before the user makes a request from the server), and speed index metrics (measure page load speed). connectivity, access, user count, latency) of geographic regions. Tools may be limited.

Best Practices

Best Practices Monitoring Wireless Traffic

What is observability? Not just logs, metrics and traces

Dynatrace

OCTOBER 1, 2021

Many organizations also adopt an observability solution to help them detect and analyze the significance of events to their operations, software development life cycles, application security, and end-user experiences. Metrics: These are the values represented as counts or measures that are often calculated or aggregated over a period of time.

Metrics

Metrics Open Source Monitoring Cloud

Observability vs. monitoring: What’s the difference?

Dynatrace

NOVEMBER 3, 2021

For example, when monitoring a database, you’ll want to know about any latency when writing data to a disk or average query response time. Log entries describe events, such as starting a process, handling an error, or simply completing some part of a workload. Here’s a closer look at logs, metrics, and distributed traces.

Monitoring

Monitoring Metrics DevOps Scalability

Making Cloud.typography Fast(er)

CSS Wizardry

AUGUST 13, 2019

Although this response has a 0B filesize, we will always take the latency hit on every single page view (and this response is basically 100% latency). com , which introduces yet more latency for the connection setup. not replacement —the current method would remain fully functional and valid) non-blocking loading strategy.

Latency

Latency Cache Strategy Media

Optimizing data warehouse storage

The Netflix TechBlog

DECEMBER 21, 2020

These principles reduce resource usage by being more efficient and effective while lowering the end-to-end latency in data processing. Both automatic (event-driven) as well as manual (ad-hoc) optimization. It decides what to do and when to do it in response to an incoming event. Transparency to end-users.

Storage

Storage Latency Efficiency Data Engineering

The Three Types of Performance Testing

CSS Wizardry

OCTOBER 27, 2018

Simply knowing the different forms of performance testing that we have available to us, and where they sit in the product development process, makes it much easier for businesses to adopt a performance strategy and keep on top of things. Unfortunately, most issues do not get captured at this point. Who: Engineers, Product Owners, Marketing.

Performance Testing

Performance Testing Testing Performance Strategy

Performance Hero: Annie Sullivan

Speed Curve

JANUARY 19, 2025

We've gotten to know Annie through frequent discussions, feedback sessions, and hallway talks at various events. So in addition to all the optimization work we did for Google Docs, I got to spend a lot of time and energy working on the measurement problem: how can we get end-to-end latency numbers?

Performance

Performance Google Speed Metrics

Introducing Dynatrace built-in data observability on Davis AI and Grail

Dynatrace

JANUARY 31, 2024

Data is the foundation upon which strategies are built, directions are chosen, and innovations are pursued. For DevOps teams that inform deployment strategies, optimize processes, and drive continuous improvement, the integrity and timeliness of data are of significant importance.

DevOps

DevOps Analytics Airlines Metrics

DBLog: A Generic Change-Data-Capture Framework

The Netflix TechBlog

DECEMBER 17, 2019

In databases like MySQL and PostgreSQL, transaction logs are the source of CDC events. Some of DBLog’s features are: Processes captured log events in order. Interleaves log with dump events, by taking dumps in chunks. Hence, downstream consumers can be confident that they receive change events as they occur on a source.

Database

Database Traffic Transportation Open Source

Managing risk for financial services: The secret to visibility and control during times of volatility

Dynatrace

APRIL 8, 2024

In summary, the Dynatrace platform enables banks to do the following: Capture any data type: logs, metrics, traces, topology, behavior, code, metadata, network, security, web, and real-user monitoring data, and business events. Maximize performance for high-frequency and low-latency trading strategies. Break down data silos.

Analytics

Analytics Infrastructure Efficiency Technology

Most Common RabbitMQ Use Cases

Scalegrid

AUGUST 27, 2024

RabbitMQ excels at managing asynchronous processing and reducing latency while distributing workloads effectively across the system. By prioritizing such messages, RabbitMQ delivers notifications with minimal latency, thus improving the user experience while sustaining the efficacy of communication systems.

Ecommerce

Ecommerce IoT Games Scalability

Data Compression for Large-Scale Streaming Experimentation

The Netflix TechBlog

DECEMBER 2, 2019

The development of an effective data compression strategy enabled us to deploy bootstrapping methods at dramatically greater scale, allowing experimenters to analyze their A/B test results faster and with clearer insights. To understand differences in metrics that occur rarely, we developed a class of methods we call the rare event bootstrap.

Metrics

Metrics Strategy Testing Efficiency

In-Stream Big Data Processing

Highly Scalable

AUGUST 20, 2013

At Grid Dynamics, we recently faced a necessity to build an in-stream data processing system that aimed to crunch about 8 billion events daily, providing fault-tolerance and strict transactionality, i.e., none of these events can be lost or duplicated. Lineage Tracking.

Big Data

Big Data Processing Lambda Database
