Systems, Traffic and Tuning - Technology Performance Pulse

Migrating Critical Traffic At Scale with No Downtime - Part 1

The Netflix TechBlog

MAY 4, 2023

By Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience.

Traffic

Traffic Latency Tuning Systems

Migrating Critical Traffic At Scale with No Downtime - Part 2

The Netflix TechBlog

JUNE 13, 2023

By Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Picture yourself enthralled by the latest episode of your beloved Netflix series, delighting in an uninterrupted, high-definition streaming experience. This is where large-scale system migrations come into play.

Traffic

Traffic Metrics Systems Strategy

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

To this end, we developed a Rapid Event Notification System (RENO) to support use cases that require server initiated communication with devices in a scalable and extensible manner. In this blog post, we will give an overview of the Rapid Event Notification System at Netflix and share some of the learnings we gained along the way.

Systems

Systems Traffic Architecture Mobile

Title Launch Observability at Netflix Scale

The Netflix TechBlog

MARCH 4, 2025

Part 3: System Strategies and Architecture. By Varun Khaitan, with special thanks to my stunning colleagues Mallika Rao, Esmir Mesic, and Hugo Marques. This blog post is a continuation of Part 2, where we cleared the ambiguity around title launch observability at Netflix. The request schema for the observability endpoint.

Traffic

Traffic Strategy Entertainment Innovation

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

To achieve this, we are committed to building robust systems that deliver comprehensive observability, enabling us to take full accountability for every title on our service. Each title represents countless hours of effort and creativity, and our systems need to honor that uniqueness. Yet, these pages couldn’t be more different.

Traffic

Traffic Scalability Strategy Monitoring

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile’s exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.

Tuning

Tuning Latency Efficiency Storage

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

Scaling RabbitMQ ensures your system can handle growing traffic and maintain high performance. Key Takeaways: RabbitMQ improves scalability and fault tolerance in distributed systems by decoupling applications, enabling reliable message exchanges.

Best Practices

Best Practices Traffic Strategy Efficiency

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Introduction to Message Brokers: Message brokers enable applications, services, and systems to communicate by acting as intermediaries between senders and receivers. This decoupling simplifies system architecture and supports scalability in distributed environments.

Latency

Latency Analytics Architecture Storage

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

The control group’s traffic utilized the legacy Falcor stack, while the experiment population leveraged the new GraphQL client and was directed to the GraphQL Shim. The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. The Replay Tester tool samples raw traffic streams from Mantis.

Traffic

Traffic Latency Metrics Cache

Keeping Netflix Reliable Using Prioritized Load Shedding

The Netflix TechBlog

NOVEMBER 2, 2020

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure. By Manuel Correa, Arthur Gonigberg, and Daniel West. Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. CRITICAL: This traffic affects the ability to play.

Traffic

Traffic Metrics Infrastructure Architecture

Telltale: Netflix Application Monitoring Simplified

The Netflix TechBlog

AUGUST 13, 2020

You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning?” Our streaming teams need a monitoring system that enables them to quickly diagnose and remediate problems; seconds count! Our Node team needs a system that empowers a small group to operate a large fleet. By Andrei U.,

Monitoring

Monitoring Tuning Traffic Metrics

Kubernetes vs Docker: What’s the difference?

Dynatrace

SEPTEMBER 29, 2021

Think of containers as the packaging for microservices that separate the content from its environment – the underlying operating system and infrastructure. This opens the door to auto-scalable applications, which effortlessly match the demands of rapidly growing and varying user traffic. What is Docker? Networking.

Open Source

Open Source DevOps Traffic Cloud

9 key DevOps metrics for success

Dynatrace

SEPTEMBER 28, 2021

As we look at today’s applications, microservices, and DevOps teams, we see leaders are tasked with supporting complex distributed applications using new technologies spread across systems in multiple locations. For most systems, an optimum MTTR could be less than one hour while others have an MTTR of less than one day.

DevOps

DevOps Metrics Traffic Efficiency

Using Dynatrace to master the 5 pillars of the AWS Well-Architected Framework (Part 1)

Dynatrace

NOVEMBER 24, 2020

Tracking changes to automated processes, including auditing impacts to the system, and reverting to the previous environment states seamlessly. Easy deployment of Dynatrace OneAgent with AWS Systems Manager Distributor, AWS Elastic Beanstalk, and AWS CloudFormation. Stay tuned. Fully conceptualizing capacity requirements.

AWS

AWS Artificial Intelligence Best Practices Lambda

Dynatrace Application Security detects and blocks attacks automatically in real-time

Dynatrace

FEBRUARY 10, 2022

WAFs protect the network perimeter and monitor, filter, or block HTTP traffic. Compared to intrusion detection systems (IDS/IPS), WAFs are focused on the application traffic. RASP solutions sit in or near applications and analyze application behavior and traffic. How to get started.

Traffic

Traffic Benchmarking Innovation Java

Efficient SLO event integration powers successful AIOps

Dynatrace

APRIL 5, 2024

For instance, consider how fine-tuned failure rate detection can provide insights for comprehensive understanding. Please refer to How to fine-tune failure detection (dynatrace.com) for further information. SLOs must be evaluated at 100%, even when there is currently no traffic. What characterizes a weak SLO?

Efficiency

Efficiency Traffic Tuning Metrics

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

Note: Contrary to what the name may suggest, this system is not built as a general-purpose time series database. Those use cases are well served by the Netflix Atlas telemetry system. Effectively managing this data at scale to extract valuable insights is crucial for ensuring optimal user experiences and system reliability.

Latency

Latency Storage Traffic Tuning

What is web application security? Everything you need to know.

Dynatrace

JUNE 9, 2021

The Marriott data breach, in which one of its reservation systems had been compromised and hundreds of millions of customer records, including credit card and passport numbers, were stolen. Web Application Firewall (WAF) helps protect a web application against malicious HTTP traffic. million Americans, 15.2

Open Source

Open Source Entertainment Tuning Internet

Build and operate multicloud FaaS with enhanced, intelligent end-to-end observability

Dynatrace

APRIL 25, 2023

For example, to handle traffic spikes and pay only for what they use. Observability is essential to ensure the reliability, security and quality of any software system. Scale automatically based on the demand and traffic patterns. The elasticity of serverless services helps organizations scale as needed.

Serverless

Serverless Lambda Azure AWS

Dynatrace simplifies StatsD, Telegraf, and Prometheus observability with Davis AI

Dynatrace

OCTOBER 7, 2020

Stay tuned for an upcoming blog series where we’ll give you a more hands-on walkthrough of how to ingest any kind of data from StatsD, Telegraf, Prometheus, scripting languages, or our integrated REST API. Telegraf is a plugin-based system for collecting, processing, aggregating, and writing metrics. Stay tuned.

Open Source

Open Source Metrics Analytics Tuning

Orchestrating Data/ML Workflows at Scale With Netflix Maestro

The Netflix TechBlog

OCTOBER 18, 2022

Due to its popularity, the number of workflows managed by the system has grown exponentially. We started seeing signs of scale issues, like: Slowness during peak traffic moments like 12 AM UTC, leading to increased operational burden. The scheduler on-call has to closely monitor the system during non-business hours.

Java

Java Scalability Traffic Architecture

OneAgent for Linux on IBM Z (General Availability)

Dynatrace

NOVEMBER 20, 2019

Typically, these shops run the z/OS operating system, but more recently, it’s not uncommon to see the Z hardware running special versions of Linux distributions. Our goal is to provide automatic answers including root-cause analysis of performance degradation across all these systems and environments.

Availability

Availability Hardware Java Tuning

Introducing Netflix’s Key-Value Data Abstraction Layer

The Netflix TechBlog

SEPTEMBER 18, 2024

message Item (Bytes key, Bytes value, Metadata metadata, Integer chunk). Database Agnostic Abstraction: The KV abstraction is designed to hide the implementation details of the underlying database, offering a consistent interface to application developers regardless of the optimal storage system for that use case.

Latency

Latency Storage Cache Servers

DevOps automation: From event-driven automation to answer-driven automation [with causal AI]

Dynatrace

JULY 24, 2023

It enables them to adapt to user feedback swiftly, fine-tune feature releases, and deliver exceptional user experiences, all while maintaining control and minimizing disruption. Consider an event-driven automation system designed for incident management. But it doesn’t stop there. All these actions aim to avert future incidents.

DevOps

DevOps Traffic Efficiency Servers

Hyper Scale VPC Flow Logs enrichment to provide Network Insight

The Netflix TechBlog

MAY 26, 2020

VPC Flow Logs: VPC Flow Logs is an AWS feature that captures information about the IP traffic going to and from network interfaces in a VPC. By default, each record captures a network internet protocol (IP) traffic flow (characterized by a 5-tuple on a per network interface basis) that occurs within an aggregation interval.

Network

Network Tuning AWS Traffic

Python at Netflix

The Netflix TechBlog

APRIL 29, 2019

Various software systems are needed to design, build, and operate this CDN infrastructure, and a significant number of them are written in Python. The configuration of these devices is controlled by several other systems including source of truth, application of configurations to devices, and back up.

Open Source

Open Source Network Infrastructure Big Data

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

OCTOBER 19, 2020

which is difficult when troubleshooting distributed systems. Troubleshooting a session in Edgar. When we started building Edgar four years ago, there were very few open-source distributed tracing systems that satisfied our needs. Investigating a video streaming failure consists of inspecting all aspects of a member account.

Infrastructure

Infrastructure Transportation Storage Open Source

How digital experience monitoring helps deliver business observability

Dynatrace

APRIL 26, 2022

STM generates traffic that replicates the typical path or behavior of a user on a network to measure performance (for example, response times, availability, packet loss, latency, jitter, and other variables). Endpoint monitoring (EM). Endpoints can be physical (i.e., PC, smartphone, server) or virtual (virtual machines, cloud gateways).

Monitoring

Monitoring Social Media IoT Metrics

Achieving observability in async workflows

The Netflix TechBlog

MAY 14, 2021

However, they are scattered across multiple systems, and there isn’t an easy way to tie related messages together. You’re joining tables, resolving status types, cross-referencing data manually with other systems, and by the end of it all you ask yourself why? Things got hairy. We wanted a scalable service that was near real-time, 2.

Traffic

Traffic Java Latency Google

Zero Configuration Service Mesh with On-Demand Cluster Discovery

The Netflix TechBlog

AUGUST 29, 2023

To improve availability, we designed systems where components could fail separately and avoid single points of failure. In order for a service to talk to another, it needs to know two things: the name of the destination service, and whether or not the traffic should be secure. First, we’ve grown the number of different IPC clients.

Traffic

Traffic Latency Cloud C++

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Dynatrace

MAY 17, 2023

Think about items such as general system metrics (for example, CPU utilization, free memory, number of services), the connectivity status, details of our web server, or even more granular in-application tasks like database queries. DNS query time indicates the average response times of DNS requests across the system.

Metrics

Metrics Database Monitoring Network

Best practices for alerting

Dynatrace

JULY 22, 2019

Self-service content management systems, for instance, allow non-IT staff to make content changes on production systems. For instance, when there isn’t enough traffic (late at night), the AI will not act to avoid alert spamming. If you want to understand how Dynatrace detects errors, read my other blog on how to fine-tune it!

Best Practices

Best Practices Artificial Intelligence Monitoring Tuning

Towards a Reliable Device Management Platform

The Netflix TechBlog

AUGUST 30, 2021

System Setup Architecture The following diagram summarizes the architecture description: Figure 1: Event-sourcing architecture of the Device Management Platform. As such, we can see that the traffic load on the Device Management Platform’s control plane is very dynamic over time.

Latency

Latency Traffic Transportation Cloud

OneAgent for Linux on IBM Z now available in Early Adopter Release

Dynatrace

AUGUST 8, 2019

Typically, these shops run the z/OS operating system, but more recently, it’s not uncommon to see the Z hardware running special versions of Linux distributions. Our goal is to provide automatic answers including root-cause analysis of performance degradation across all these systems and environments.

Availability

Availability Hardware Java Tuning

Why PostgreSQL Is a Top Choice for Enterprise-level Databases

Percona

MARCH 23, 2023

PostgreSQL is a free and open source object-relational database management system (ORDBMS) that has existed since the mid-1990s. Over the years, it has evolved into a robust and feature-rich database that offers several advantages over other database management systems. Reliability PostgreSQL is known for its reliability and stability.

Database

Database Open Source Traffic Small Business

DBLog: A Generic Change-Data-Capture Framework

The Netflix TechBlog

DECEMBER 17, 2019

Nonetheless, we found a number of limitations that could not satisfy our requirements, e.g., stalling the processing of log events until a dump is complete, missing ability to trigger dumps on demand, or implementations that block write traffic by using table locks. Blocking write traffic by locking tables. Writing events to any output.

Database

Database Traffic Transportation Open Source

Get out-of-the-box visibility into your ARM platform (Early Adopter)

Dynatrace

MAY 1, 2020

Our mission is to provide automatic answers, including root cause analysis, for performance degradation across all these systems and environments, regardless of the underlying hardware architecture. Stay tuned for more announcements on this topic. Stay tuned for more details. The plugin module is not available at this time.

Java

Java Hardware Metrics Tuning

Dynatrace Cloud Automation Module provides observability-driven automation across the full lifecycle

Dynatrace

FEBRUARY 10, 2021

Resilience : Critical production business systems must not fail. This capability provides version information along with an additional insight into traffic and problems per version. Dynatrace Cloud Automation allows easy analysis of the status and impact a release has on your business or on test results in any environment. What’s next.

Cloud

Cloud DevOps Speed Metrics

Rebuilding Netflix Video Processing Pipeline with Microservices

The Netflix TechBlog

JANUARY 10, 2024

This architecture shift greatly reduced the processing latency and increased system resiliency. By integrating with studio content systems, we enabled the pipeline to leverage rich metadata from the creative side and create more engaging member experiences like interactive storytelling.

Processing

Processing Media Latency Innovation

Powering the Web: Two Decades of Open Source Publishing With WordPress and MySQL

Percona

JUNE 2, 2023

While not the first open source content management system (CMS), WordPress caught on like nothing before and helped spread open source to millions. And if your blog got Slashdotted or just a high level of traffic in general? Then you might need to delve into MySQL tuning and replicas. But mainstream users? Not so much.

Open Source

Open Source Traffic Tuning Database

Dynatrace Application Security protects your applications in complex cloud environments

Dynatrace

DECEMBER 8, 2020

Quickly understand the urgency of a vulnerability with answers to questions like “What is the Common Vulnerability Scoring System (CVSS) score?” But exclusively relying on the Common Vulnerability Scoring System (CVSS) rating will keep your team busy chasing false positives and make it hard to prioritize effectively.

Cloud

Cloud Open Source Internet

Netflix Video Quality at Scale with Cosmos Microservices

The Netflix TechBlog

NOVEMBER 2, 2021

The coupling problem Until recently, video quality measurements were generated as part of our Reloaded production system. This system is responsible for processing incoming media files, such as video, audio and subtitles, and making them playable on the streaming service. We call this system Cosmos.

Media

Media Innovation Metrics Latency

AWS re:Invent 2017: How Netflix Tunes EC2

Brendan Gregg

DECEMBER 31, 2017

My last talk for 2017 was at AWS re:Invent, on "How Netflix Tunes EC2 Instances for Performance," an updated version of my [2014] talk. A video of the talk is on YouTube, and the slides are on SlideShare. I love this talk as I get to share more about what the Performance and Operating Systems team at Netflix does, rather than just my work.

Tuning

Tuning AWS Best Practices Network
