Processing, Traffic and Tuning - Technology Performance Pulse

Migrating Critical Traffic At Scale with No Downtime — Part 1

The Netflix TechBlog

MAY 4, 2023

By Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. This approach has a handful of benefits.

Traffic

Traffic Latency Tuning Systems

Migrating Critical Traffic At Scale with No Downtime — Part 2

The Netflix TechBlog

JUNE 13, 2023

By Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah. Picture yourself enthralled by the latest episode of your beloved Netflix series, delighting in an uninterrupted, high-definition streaming experience.

Traffic

Traffic Metrics Systems Strategy

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

As Netflix expanded globally and the volume of title launches skyrocketed, the operational challenges of maintaining this manual process became undeniable. Metadata and assets must be correctly configured, data must flow seamlessly, microservices must process titles without error, and algorithms must function as intended.

Traffic

Traffic Scalability Strategy Monitoring

Title Launch Observability at Netflix Scale

The Netflix TechBlog

MARCH 4, 2025

Accurately Reflecting Production Behavior: A key part of our solution is insight into production behavior, which requires that our requests to the endpoint result in traffic to the real service functions that mimics the same pathways the traffic would take if it came from the usual callers. We call this capability TimeTravel.

Traffic

Traffic Strategy Entertainment Innovation

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile's exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.

Tuning

Tuning Latency Efficiency Storage

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

RabbitMQ is designed for flexible routing and message reliability, while Kafka handles high-throughput event streaming and real-time data processing. RabbitMQ follows a message broker model with advanced routing, while Kafka's event streaming architecture uses partitioned logs for distributed processing. What is Apache Kafka?

Latency

Latency Analytics Architecture Storage

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

Scaling RabbitMQ ensures your system can handle growing traffic and maintain high performance. Optimizing RabbitMQ performance through strategies such as keeping queues short, enabling lazy queues, and monitoring health checks is essential for maintaining system efficiency and effectively managing high traffic loads.

Best Practices

Best Practices Traffic Strategy Efficiency

Rebuilding Netflix Video Processing Pipeline with Microservices

The Netflix TechBlog

JANUARY 10, 2024

Future blogs will provide deeper dives into each service, sharing insights and lessons learned from this process. The Netflix video processing pipeline went live with the launch of our streaming service in 2007.

Processing

Processing Media Latency Innovation

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

Event Prioritization: Considering the use cases were wide-ranging both in terms of their sources and their importance, we built segmentation into the event processing. We thus assigned a priority to each use case and sharded event traffic by routing to priority-specific queues and the corresponding event processing clusters.
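The priority-based sharding described in this excerpt can be sketched roughly as follows. This is a minimal in-memory illustration, not the system from the article: the use-case names, the three priority levels, and the `PriorityRouter` class are all hypothetical, and a real deployment would route to separate message queues and processing clusters rather than Python deques.

```python
from collections import deque
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    HIGH = 0
    MEDIUM = 1
    LOW = 2


# Hypothetical mapping from use case to priority; names are illustrative only.
USE_CASE_PRIORITY = {
    "profile_update": Priority.HIGH,
    "watch_activity": Priority.MEDIUM,
    "diagnostics": Priority.LOW,
}


@dataclass
class Event:
    use_case: str
    payload: str


class PriorityRouter:
    """Routes each event to the queue that matches its use case's priority."""

    def __init__(self):
        # One queue per priority; each would feed its own processing cluster.
        self.queues = {p: deque() for p in Priority}

    def route(self, event):
        # Unknown use cases fall back to the lowest priority (an assumption).
        priority = USE_CASE_PRIORITY.get(event.use_case, Priority.LOW)
        self.queues[priority].append(event)
        return priority

    def drain(self, priority, limit):
        # Pull up to `limit` events from one priority's queue, oldest first.
        queue = self.queues[priority]
        batch = []
        while queue and len(batch) < limit:
            batch.append(queue.popleft())
        return batch
```

Keeping one queue per priority means a backlog of low-priority events cannot delay high-priority ones, since each queue is drained independently.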

Systems

Systems Traffic Architecture Mobile

Unlock end-to-end observability insights with Dynatrace PurePath 4 seamless integration of OpenTracing for Java

Dynatrace

DECEMBER 9, 2020

With Dynatrace OneAgent you also benefit from support for traffic routing and traffic control. OneAgent implements network zones to create traffic routing rules and limit cross data-center traffic. Stay tuned for upcoming announcements around OpenTracing and OpenTelemetry. What’s next?

Java

Java Traffic Architecture Strategy

Using Dynatrace to master the 5 pillars of the AWS Well-Architected Framework (Part 1)

Dynatrace

NOVEMBER 24, 2020

Tracking changes to automated processes, including auditing impacts to the system, and reverting to the previous environment states seamlessly. The ultimate goal of each of these reviews is to identify gaps, quantify risk, and develop recommendations for improving the team, processes, and architecture with each of the five pillars.

AWS

AWS Artificial Intelligence Best Practices Lambda

9 key DevOps metrics for success

Dynatrace

SEPTEMBER 28, 2021

Your next challenge is ensuring your DevOps processes, pipelines, and tooling meet the intended goal. For example, by measuring deployment frequency daily or weekly, you can determine how efficiently your team is responding to process changes. Lead time for changes helps teams understand how effective their processes are.

DevOps

DevOps Metrics Traffic Efficiency

Telltale: Netflix Application Monitoring Simplified

The Netflix TechBlog

AUGUST 13, 2020

You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning?” Telltale learns what constitutes typical health for an application, no alert tuning required. Regional traffic evacuations. A regional traffic shift means one region ends up with zero traffic while another region has double.

Monitoring

Monitoring Tuning Traffic Metrics

Keeping Netflix Reliable Using Prioritized Load Shedding

The Netflix TechBlog

NOVEMBER 2, 2020

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure. By Manuel Correa, Arthur Gonigberg, and Daniel West. Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Logs and background requests are examples of this type of traffic.

Traffic

Traffic Metrics Infrastructure Architecture

How Dynatrace boosts production resilience with Site Reliability Guardian

Dynatrace

MAY 17, 2023

To ensure high standards, it’s essential that your organization establish automated validations in an early phase of the software development process—ideally when code is written. While the first guardian validates the traffic, the second guardian checks the business transactions generated during the observation period.

DevOps

DevOps Traffic Latency Best Practices

What is web application security? Everything you need to know.

Dynatrace

JUNE 9, 2021

Web application security is the process of protecting web applications against various types of threats that are designed to exploit vulnerabilities in an application’s code. Web Application Firewall (WAF) helps protect a web application against malicious HTTP traffic. Whether the process is exposed to the Internet.

Open Source

Open Source Entertainment Tuning Internet

Easy SLA and SLO reporting for all your API endpoints with public synthetic HTTP monitors

Dynatrace

JUNE 26, 2020

Dynatrace Synthetic Monitoring helps you quickly verify if your application is delivering the expected end user experience by offering an outside-in view of all your applications and services, independent of real traffic. So stay tuned! Automated SLA/SLO monitoring using the HTTP monitoring API.

Monitoring

Monitoring Azure AWS Traffic

Build and operate multicloud FaaS with enhanced, intelligent end-to-end observability

Dynatrace

APRIL 25, 2023

For example, to handle traffic spikes and pay only for what they use. Scale automatically based on the demand and traffic patterns. Data analysis: how to process, aggregate and query observability data from serverless functions effectively, accurately, and comprehensively? Such anomalies can be caused by function cold-starts.

Serverless

Serverless Lambda Azure AWS

Bending pause times to your will with Generational ZGC

The Netflix TechBlog

MARCH 5, 2024

Each of these errors is a canceled request resulting in a retry, so this reduction further reduces overall service traffic by this rate: [Figure: error rates per second.] Operational simplicity: Service owners often reach out to us with questions about excessive pause times and for help with tuning.

Latency

Latency Java Tuning Efficiency

Dynatrace simplifies StatsD, Telegraf, and Prometheus observability with Davis AI

Dynatrace

OCTOBER 7, 2020

Stay tuned for an upcoming blog series where we’ll give you a more hands-on walkthrough of how to ingest any kind of data from StatsD, Telegraf, Prometheus, scripting languages, or our integrated REST API. Telegraf is a plugin-based system for collecting, processing, aggregating, and writing metrics. Stay tuned.

Open Source

Open Source Metrics Analytics Tuning

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

Handling Bursty Traffic: Managing significant traffic spikes during high-demand events, such as new content launches or regional failovers. Sharded Infrastructure: Leveraging the Data Gateway Platform, we can deploy single-tenant and/or multi-tenant infrastructure with the necessary access and traffic isolation.

Latency

Latency Storage Traffic Tuning

OneAgent for Linux on IBM Z (General Availability)

Dynatrace

NOVEMBER 20, 2019

Network measurements with per-interface and per-process resolution. OneAgent for Z/Linux collects a number of network metrics: input and output traffic measured in bytes and packets, retransmissions, and connectivity. Network metrics are also collected for detected processes. Stay tuned for more announcements on this topic.

Availability

Availability Hardware Java Tuning

Automated observability, security, and reliability at scale

Dynatrace

JULY 18, 2023

As software development grows more complex, managing components using an automated onboarding process becomes increasingly important. The validation process is automated based on events that occur, while the objectives’ configuration, which is validated by the Site Reliability Guardian, is stored in a separate file.

Best Practices

Best Practices Code Infrastructure Latency

Introducing Netflix’s Key-Value Data Abstraction Layer

The Netflix TechBlog

SEPTEMBER 18, 2024

To address this, we use a static limit for the initial queries to the backing store, query with this limit, and process the results. To mitigate these issues, we implemented adaptive pagination which dynamically tunes the limits based on observed data. While processing this request, the server retrieves data from the backing store.
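The idea of tuning the query limit from observed data can be sketched as follows. This is only an illustration under stated assumptions, not the implementation from the article: it assumes pages are bounded by a byte budget, uses a plain Python list as a stand-in for the backing store, and the function names `fetch_page` and `next_limit` are hypothetical.

```python
def fetch_page(store, start, page_bytes_target, limit):
    """Fill one page up to a byte budget, querying `limit` rows at a time.

    `store` is a list of bytes objects standing in for the backing store.
    Returns (page, next_cursor, number_of_queries_issued).
    """
    page, size, cursor, queries = [], 0, start, 0
    while cursor < len(store) and size < page_bytes_target:
        rows = store[cursor:cursor + limit]  # one "query" to the store
        queries += 1
        for row in rows:
            # Stop once the next row would overflow a non-empty page.
            if page and size + len(row) > page_bytes_target:
                return page, cursor, queries
            page.append(row)
            size += len(row)
            cursor += 1
    return page, cursor, queries


def next_limit(page_bytes_target, observed_sizes, current_limit,
               min_limit=1, max_limit=10_000):
    """Pick the next query limit from the average observed row size."""
    if not observed_sizes:
        return current_limit  # nothing observed yet, keep the static limit
    avg = sum(observed_sizes) / len(observed_sizes)
    estimate = int(page_bytes_target // max(avg, 1))
    return max(min_limit, min(max_limit, estimate))
```

With a static limit that is too small for the row size, filling one page takes several round trips; feeding the observed row sizes into `next_limit` lets the following page be filled in roughly one query.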

Latency

Latency Storage Cache Efficiency

DevOps automation: From event-driven automation to answer-driven automation [with causal AI]

Dynatrace

JULY 24, 2023

We will also explore the evolution of DevOps automation and the significance of data-driven answers in unlocking streamlined, automated DevOps and SRE processes. Business process automation: Business process automation is the foundation for improving operational efficiency.

DevOps

DevOps Traffic Efficiency Servers

Data lakehouse innovations advance the three pillars of observability for more collaborative analytics

Dynatrace

FEBRUARY 16, 2023

The goal is to turn more data into insights so the whole organization can make data-driven decisions and automate processes. Grail data lakehouse delivers massively parallel processing for answers at scale: Modern cloud-native computing is constantly upping the ante on data volume, variety, and velocity.

Analytics

Analytics Innovation Metrics Database

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

OCTOBER 19, 2020

Reconstructing a streaming session was a tedious and time consuming process that involved tracing all interactions (requests) between the Netflix app, our Content Delivery Network (CDN), and backend microservices. The process started with manual pull of member account information that was part of the session.

Infrastructure

Infrastructure Transportation Storage Open Source

DBLog: A Generic Change-Data-Capture Framework

The Netflix TechBlog

DECEMBER 17, 2019

Nonetheless, we found a number of limitations that could not satisfy our requirements, e.g., stalling the processing of log events until a dump is complete, missing ability to trigger dumps on demand, or implementations that block write traffic by using table locks. Some of DBLog’s features are: Processes captured log events in-order.

Database

Database Traffic Transportation Open Source

How digital experience monitoring helps deliver business observability

Dynatrace

APRIL 26, 2022

STM generates traffic that replicates the typical path or behavior of a user on a network to measure performance (for example, response times, availability, packet loss, latency, jitter, and other variables). What DEM and business observability mean for the bottom line.

Monitoring

Monitoring Social Media IoT Metrics

Towards a Reliable Device Management Platform

The Netflix TechBlog

AUGUST 30, 2021

The challenge, then, is to be able to ingest and process these events in a scalable manner, i.e., scaling with the number of devices, which will be the focus of this blog post. As such, we can see that the traffic load on the Device Management Platform’s control plane is very dynamic over time.

Latency

Latency Traffic Transportation Cloud

Orchestrating Data/ML Workflows at Scale With Netflix Maestro

The Netflix TechBlog

OCTOBER 18, 2022

We started seeing signs of scale issues, like: Slowness during peak traffic moments like 12 AM UTC, leading to increased operational burden. At Netflix, the peak traffic load can be a few orders of magnitude higher than the average load. Hence, the system has to withstand bursts in traffic while still maintaining the SLO requirements.

Java

Java Scalability Traffic Architecture

Achieving observability in async workflows

The Netflix TechBlog

MAY 14, 2021

Prodicle Distribution: Our service is required to be elastic and handle bursty traffic. We are expected to process 1,000 watermarks for a single distribution in a minute, with non-linear latency growth as the number of watermarks increases. Things got hairy. We wanted a scalable service that was near real-time, 2.

Traffic

Traffic Java Latency Google

Best practices for alerting

Dynatrace

JULY 22, 2019

Dynatrace automatically detects processes and services and will observe their behaviour. For instance, when there isn’t enough traffic (late at night), the AI will not act to avoid alert spamming. If you want to understand how Dynatrace detects errors, read my other blog on how to fine-tune it! How does it work?

Best Practices

Best Practices Artificial Intelligence Monitoring Tuning

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Dynatrace

MAY 17, 2023

Application and service monitoring: What will be of particular interest to us here is “Process analysis.” One of the key features of OneAgent is its ability not only to monitor the host itself and the system metrics but also to gain deep insight into the applications and services the machine is running.

Metrics

Metrics Database Monitoring Network

Dynatrace Cloud Automation Module provides observability-driven automation across the full lifecycle

Dynatrace

FEBRUARY 10, 2021

Dynatrace Cloud Automation leverages the AI and automation capabilities of the Dynatrace Software Intelligence Platform to enhance development, DevOps, and SRE teams’ processes with: Automated SLO validation and quality gates, to ensure high-quality code moves smoothly through the delivery pipeline and does not violate error budgets in production.

Cloud

Cloud DevOps Speed Metrics

Get out-of-the-box visibility into your ARM platform (Early Adopter)

Dynatrace

MAY 1, 2020

Network measurements with per-interface and per-process resolution. OneAgent for the ARM platform collects a number of network metrics: input and output traffic measured in bytes and packets, retransmissions, and connectivity. Network metrics are also collected for detected processes. Stay tuned for more details.

Java

Java Hardware Metrics Tuning

OneAgent for Linux on IBM Z now available in Early Adopter Release

Dynatrace

AUGUST 8, 2019

Network measurements with per-interface and per-process resolution. OneAgent for Z/Linux collects a number of network metrics: input and output traffic measured in bytes and packets, retransmissions, and connectivity. Network metrics are also collected for detected processes. Stay tuned for more announcements on this topic.

Availability

Availability Hardware Java Tuning

Data Movement in Netflix Studio via Data Mesh

The Netflix TechBlog

JULY 26, 2021

Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities and processes of a business domain. CDC and data source: Change data capture, or CDC, is a semantic for processing changes in a source for the purpose of replicating those changes to a sink.

Big Data

Big Data Government Processing Analytics

Zero Configuration Service Mesh with On-Demand Cluster Discovery

The Netflix TechBlog

AUGUST 29, 2023

For Inter-Process Communication (IPC) between services, we needed the rich feature set that a mid-tier load balancer typically provides. In order for a service to talk to another, it needs to know two things: the name of the destination service, and whether or not the traffic should be secure.

Traffic

Traffic Latency Cloud C++

Dynatrace Application Security protects your applications in complex cloud environments

Dynatrace

DECEMBER 8, 2020

As a result, existing application security approaches can’t keep up with this speed and variability of modern development processes. With DevSecOps processes having shifted security testing “left”, will the teams have enough time to manually analyze, assess, and manage risks based on sampled or scheduled scan results?

Cloud

Cloud Open Source Internet

Dynatrace PurePath 4 integrates OpenTelemetry and the latest cloud-native technologies and provides analytics and AI at scale

Dynatrace

NOVEMBER 17, 2020

Dynatrace analyzes the response time of each service running within each process, displaying findings out-of-the-box for further investigation. So please stay tuned for updates. Technical scalability without limits. Highest availability and security out-of-the-box.

Analytics

Analytics Technology Cloud

OneAgent for Windows—Enhancements to *.msi-based deployment

Dynatrace

MAY 9, 2019

And it added to the network traffic in terms of new version distribution. “How does the new process compare to the old one exactly?” Stay tuned for more announcements on other changes and improvements related to OneAgent installer for Windows, coming to this Dynatrace Blog page shortly! “How can I get the *.exe

Storage

Storage Tuning Traffic Architecture

In-product guidance accelerates Service Level Objectives (SLO) setup for confident deployments

Dynatrace

DECEMBER 9, 2020

Scale and automate SRE into your delivery processes with Dynatrace. Now let’s dive deeper into how Dynatrace can be used to support your Site Reliability Engineering process. This can be detected during any canary deployment or blue/green traffic routing to a new version.

Metrics

Metrics Engineering Google Monitoring
