Event, Processing and Traffic - Technology Performance Pulse

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

To this end, we developed a Rapid Event Notification System (RENO) to support use cases that require server initiated communication with devices in a scalable and extensible manner. In this blog post, we will give an overview of the Rapid Event Notification System at Netflix and share some of the learnings we gained along the way.

Systems

Systems Traffic Architecture Mobile

Migrating Critical Traffic At Scale with No Downtime - Part 1

The Netflix TechBlog

MAY 4, 2023

By Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. This approach has a handful of benefits.

Traffic

Traffic Latency Tuning Systems

Better dashboarding with Dynatrace Davis AI: Instant meaningful insights

Dynatrace

JANUARY 21, 2025

Ensuring smooth operations is no small feat, whether you’re in charge of application performance, IT infrastructure, or business processes. For example, if you’re monitoring network traffic and the average over the past 7 days is 500 Mbps, the threshold will adapt to this baseline.
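The baseline idea in this excerpt can be illustrated with a minimal sketch. It assumes the simplest possible scheme, a rolling mean plus a fixed tolerance band; it is not the Davis AI model, whose internals the excerpt does not describe. The function name, the 20% tolerance, and the sample values are all hypothetical.

```python
from statistics import mean


def adaptive_threshold(samples_mbps, tolerance=0.2):
    """Return (baseline, upper_threshold) for a window of recent samples.

    The baseline is the plain average of the window; the threshold sits a
    fixed fraction (tolerance) above it. Both choices are assumptions made
    for illustration only.
    """
    if not samples_mbps:
        raise ValueError("need at least one sample")
    baseline = mean(samples_mbps)
    return baseline, baseline * (1 + tolerance)


# Seven hypothetical daily averages (Mbps) whose mean is 500.
week = [480, 510, 495, 505, 500, 520, 490]
baseline, upper = adaptive_threshold(week)
print(f"baseline={baseline} Mbps, alert above {upper:.0f} Mbps")
```

As new daily values replace old ones in the window, the baseline and the alert level move with them, which is the behavior the excerpt refers to as adapting to the baseline.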

Traffic

Traffic Metrics Analytics Monitoring

Black Friday traffic exposes gaps in observability strategies

Dynatrace

SEPTEMBER 2, 2022

What’s the problem with Black Friday traffic? But that’s difficult when Black Friday traffic brings overwhelming and unpredictable peak loads to retailer websites and exposes the weakest points in a company’s infrastructure, threatening application performance and user experience. Why Black Friday traffic threatens customer experience.

Traffic

Traffic Strategy Retail Ecommerce

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile's exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.

Tuning

Tuning Latency Efficiency Storage

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

As Netflix expanded globally and the volume of title launches skyrocketed, the operational challenges of maintaining this manual process became undeniable. Metadata and assets must be correctly configured, data must flow seamlessly, microservices must process titles without error, and algorithms must function as intended.

Traffic

Traffic Scalability Strategy Monitoring

Title Launch Observability at Netflix Scale

The Netflix TechBlog

MARCH 4, 2025

Accurately Reflecting Production Behavior: A key part of our solution is insight into production behavior, which requires that our requests to the endpoint result in traffic to the real service functions, following the same pathways the traffic would take if it came from the usual callers. We call this capability TimeTravel.

Traffic

Traffic Strategy Entertainment Innovation

Ensuring the Successful Launch of Ads on Netflix

The Netflix TechBlog

JUNE 1, 2023

To do this, we devised a novel way to simulate the projected traffic weeks ahead of launch by building upon the traffic migration framework described here. New content or national events may drive brief spikes, but, by and large, traffic is usually smoothly increasing or decreasing.

Traffic

Traffic Best Practices Systems Testing

DevOps automation: From event-driven automation to answer-driven automation [with causal AI]

Dynatrace

JULY 24, 2023

They need event-driven automation that not only responds to events and triggers but also analyzes and interprets the context to deliver precise and proactive actions. We will also explore the evolution of DevOps automation and the significance of data-driven answers in unlocking streamlined, automated DevOps and SRE processes.

DevOps

DevOps Traffic Efficiency Servers

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

RabbitMQ is designed for flexible routing and message reliability, while Kafka handles high-throughput event streaming and real-time data processing. Kafka is optimized for high-throughput event streaming, excelling in real-time analytics and large-scale data ingestion. What is Apache Kafka?

Latency

Latency Analytics Architecture Storage

Unlock the observability value of log data with processing at scale

Dynatrace

AUGUST 16, 2022

Even worse, if your service logs record critical events such as errors in a non-standard way, those errors might go unnoticed by your observability team. Whether a web server, mobile app, backend service, or other custom application, log data can provide you with deep insights into your software’s operations and events.

Processing

Processing Metrics Monitoring Java

Process more with less using smarter cluster overload prevention for Dynatrace Managed

Dynatrace

MAY 14, 2020

Turnkey cluster overload protection with adaptive traffic management and control. By vastly increasing the number of PurePaths that are processed by a Dynatrace Managed cluster, your initial sizing considerations for Dynatrace Managed nodes and clusters may however end up being inadequate for supporting such volume.

Processing

Processing Hardware Traffic Storage

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

Scaling RabbitMQ ensures your system can handle growing traffic and maintain high performance. Optimizing RabbitMQ performance through strategies such as keeping queues short, enabling lazy queues, and monitoring health checks is essential for maintaining system efficiency and effectively managing high traffic loads.

Best Practices

Best Practices Traffic Strategy Efficiency

5 powerful use cases beyond debugging for Dynatrace Live Debugger

Dynatrace

MARCH 25, 2025

This powerful tool can be leveraged across various environments, including production, to enhance development processes and ensure robust application performance. Many developers attempt to mitigate this challenge with logs, but that’s a tedious and error-prone process. Load generators simulate traffic.

Benchmarking

Benchmarking Code Open Source Engineering

Automate CI/CD pipelines with Dynatrace: Part 2, Deploy stage

Dynatrace

NOVEMBER 28, 2023

Even when the staging environment closely mirrors the production environment, achieving a complete replication of all potential scenarios, such as simulating extremely high traffic volumes to assess software performance, remains challenging. This can lead to a lack of insight into how the code will behave when exposed to heavy traffic.

Traffic

Traffic Best Practices Strategy Engineering

Six causes of major software outages–And how to avoid them

Dynatrace

AUGUST 8, 2024

As recent events have demonstrated, major software outages are an ever-present threat in our increasingly digital world. They may stem from software bugs, cyberattacks, surges in demand, issues with backup processes, network problems, or human errors. This often occurs during major events, promotions, or unexpected surges in usage.

Software

Software Software Infrastructure Network

COVID-19 and Digital Services: An Action Plan for the Unexpected

Dynatrace

APRIL 22, 2020

While most government agencies and commercial enterprises have digital services in place, the current volume of usage — including traffic to critical employment, health and retail/eCommerce services — has reached levels that many organizations have never seen before or tested against. So how do you know what to prepare for?

Traffic

Traffic Ecommerce Retail Government

Data Reprocessing Pipeline in Asset Management Platform @Netflix

The Netflix TechBlog

MARCH 10, 2023

Hence we built the data pipeline that can be used to extract the existing assets metadata and process it specifically to each new use case. Existing data got updated to be backward compatible without impacting the existing running production traffic. For asynchronous processing, events are sent to Apache Kafka topics to be processed.

Media

Media Traffic Processing Design

2023 Black Friday and Cyber Monday retail and e-commerce IT performance observations

Dynatrace

NOVEMBER 30, 2023

What was once an onslaught of consumer traffic between Black Friday and Cyber Monday has turned into a weeklong event, with most retailers offering deals well ahead of Black Friday. In the past, I tried to understand where in the page-loading process was the majority of time spent. However, logs alone won’t solve everything.

Retail

Retail Social Media Performance Benchmarking

Noisy Neighbor Detection with eBPF

The Netflix TechBlog

SEPTEMBER 10, 2024

One issue that often complicates this process is the "noisy neighbor" problem. The sched_wakeup and sched_wakeup_new hooks are invoked when a process changes state from 'sleeping' to 'runnable.' They let us identify when a process is ready to run and is waiting for CPU time.

Latency

Latency Metrics Programming Monitoring

Service level objectives: 5 SLOs to get started

Dynatrace

JUNE 1, 2023

Response time: Response time refers to the total time it takes for a system to process a request or complete an operation. This ensures that customers can quickly navigate through product listings, add items to their cart, and complete the checkout process without experiencing noticeable delays. or above for the checkout process.
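A response-time SLO of this kind is usually checked as "what share of requests finished within a latency threshold, and does that share meet the target". The sketch below assumes that formulation; the function name, the 300 ms threshold, the 95% target, and the sample durations are hypothetical and are not taken from the article.

```python
def latency_slo_status(durations_ms, threshold_ms, target_pct):
    """Return (attained_pct, is_met) for a response-time SLO.

    attained_pct is the percentage of requests whose duration is at or
    below threshold_ms; the SLO is met when it reaches target_pct.
    """
    if not durations_ms:
        raise ValueError("need at least one request duration")
    fast = sum(1 for d in durations_ms if d <= threshold_ms)
    attained_pct = 100.0 * fast / len(durations_ms)
    return attained_pct, attained_pct >= target_pct


# Ten hypothetical checkout request durations in milliseconds.
samples = [120, 180, 240, 310, 95, 150, 620, 130, 170, 200]
attained, ok = latency_slo_status(samples, threshold_ms=300, target_pct=95.0)
print(f"{attained:.1f}% of requests within 300 ms, SLO met: {ok}")
```

Here two of the ten requests exceed 300 ms, so the attained value is 80% and the hypothetical 95% target is missed.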

Latency

Latency Website Traffic DevOps

Simplify troubleshooting with AI-powered insights into connection pool performance (Early Adopter)

Dynatrace

DECEMBER 9, 2020

In addition to being available as metrics in custom charts, you can view these metrics at the process group instance level in the Dynatrace web UI. Aggregated connection pool metrics are available on the process group overview page. You can even integrate Dynatrace into your CI/CD pipeline using the Events API.

Traffic

Traffic Performance Database Metrics

Towards a Reliable Device Management Platform

The Netflix TechBlog

AUGUST 30, 2021

In the Device Management Platform, this is achieved by having device updates be event-sourced through the control plane to the cloud so that NTS will always have the most up-to-date information about the devices available for testing. The RAE is configured to be effectively a router that devices under test (DUTs) are connected to.

Latency

Latency Traffic Transportation Cloud

DBLog: A Generic Change-Data-Capture Framework

The Netflix TechBlog

DECEMBER 17, 2019

In databases like MySQL and PostgreSQL, transaction logs are the source of CDC events. This motivated the development of DBLog, which offers log and dump processing under a generic framework. Some of DBLog’s features are: Processes captured log events in-order. Interleaves log with dump events, by taking dumps in chunks.

Database

Database Traffic Transportation Open Source

Unlock end-to-end observability insights with Dynatrace PurePath 4 seamless integration of OpenTracing for Java

Dynatrace

DECEMBER 9, 2020

To address potentially high numbers of requests during online shopping events like Singles Day or Black Friday, it’s crucial that this online shop have a memory storage strategy that allows for speed, scaling, and resilience of all microservices, especially the shopping cart service.

Java

Java Traffic Architecture Strategy

Transparent and confident software delivery with Dynatrace Release Analysis

Dynatrace

APRIL 28, 2021

Deployed versions, release events, and release information are all correlated and displayed on a single page. Each entry represents a process group instance. The release inventory highlights releases that include detected problems and shows the throughput of those versions so that you see how much traffic is routed to each release.

Software

Software Software Strategy Metrics

Consistent caching mechanism in Titus Gateway

The Netflix TechBlog

NOVEMBER 3, 2022

We introduce a caching mechanism in the API gateway layer, allowing us to offload processing from singleton leader elected controllers without giving up strict data consistency and guarantees clients observe. cell): Titus Job Coordinator is a leader elected process managing the active state of the system. it will read version E?

Cache

Cache Latency Traffic Systems

Customer expectations for retail: Beyond digital experience

Dynatrace

AUGUST 28, 2023

IT teams spend months preparing for the peak traffic they anticipate will arrive with holiday shopping. Let’s shift our focus to the backend systems and business processes, the behind-the-scenes heroes of end-to-end customer experience. Order processing workflow is triggered by customer orders.

Retail

Retail Logistics Innovation Analytics

What is log management? How to tame distributed cloud system complexities

Dynatrace

SEPTEMBER 8, 2022

In cloud-native environments, there can also be dozens of additional services and functions all generating data from user-driven events. Event logging and software tracing help application developers and operations teams understand what’s happening throughout their application flow and system.

Cloud

Cloud Systems Analytics DevOps

Managing PostgreSQL® High Availability – Part I: PostgreSQL Automatic Failover

Scalegrid

SEPTEMBER 5, 2024

Ensuring high availability in PostgreSQL involves implementing automatic failover, a critical process that maintains database operability and preserves data accessibility when unexpected failures occur. In the event of a primary server failure, standby servers are prepared to assume control, which helps reduce system downtime.

Availability

Availability Servers Database Open Source

Introducing Netflix TimeSeries Data Abstraction Layer

The Netflix TechBlog

OCTOBER 8, 2024

Building on these foundational abstractions, we developed the TimeSeries Abstraction — a versatile and scalable solution designed to efficiently store and query large volumes of temporal event data with low millisecond latencies, all in a cost-effective manner across various use cases. Let’s dive into the various aspects of this abstraction.

Latency

Latency Storage Traffic Tuning

What is application security monitoring?

Dynatrace

MARCH 20, 2024

With the pace of digital transformation continuing to accelerate, organizations are realizing the growing imperative to have a robust application security monitoring process in place. Incident detection and response: In the event of a security incident, there is a well-defined incident response process to investigate and mitigate the issue.

Monitoring

Monitoring Analytics Traffic Best Practices

What is security analytics?

Dynatrace

JUNE 10, 2024

For example, an organization might use security analytics tools to monitor user behavior and network traffic. Improved compliance: A better understanding of data security across multiple applications and environments provides a unified view of events and information. This offers two advantages for compliance.

Analytics

Analytics Network Open Source Hardware

From syslog to AWS Firehose: Dynatrace log management innovations that enhance observability

Dynatrace

SEPTEMBER 5, 2024

It also enhances syslog messages with additional context and optimizes network traffic, improving overall system resilience and security. A $20 billion Germany-based financial services company told us they found the process of pushing Syslog messages to Dynatrace natively to be seamless.

Innovation

Innovation AWS Analytics Storage

Dynatrace adds support for AWS Transit Gateway with VPC Flow Logs

Dynatrace

JULY 25, 2022

VPC Flow Logs is a feature that gives you the capability to capture more robust IP traffic data that traverses your VPCs. When it comes to logs and metrics, the Dynatrace platform provides direct access to the log content of all mission-critical processes.

AWS

AWS Transportation Network Traffic

Dynatrace adds support for VPC Flow Logs to Kinesis Data Firehose

Dynatrace

SEPTEMBER 7, 2022

VPC Flow Logs is an Amazon service that enables IT pros to capture information about the IP traffic that traverses network interfaces in a virtual private cloud, or VPC. By default, each record captures a network internet protocol (IP), a destination, and the source of the traffic flow that occurs within your environment.
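To make the record structure concrete, the sketch below parses one flow log line in the default (version 2) space-separated format, which carries source and destination addresses along with ports, protocol, counters, and the accept/reject action. The function name and the sample record are illustrative; custom log formats use a different field list and would need their own parser.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_record(line):
    """Split one default-format flow log line into a field dictionary.

    Values stay as strings because records with no data use "-" as a
    placeholder in several fields.
    """
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


# Illustrative record: accepted SSH traffic (TCP is protocol 6, port 22).
sample = ("2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
          "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")
rec = parse_flow_record(sample)
print(rec["srcaddr"], "->", rec["dstaddr"], "port", rec["dstport"], rec["action"])
```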

Traffic

Traffic AWS Network Cloud

Network performance monitoring top of mind for CloudOps teams

Dynatrace

MAY 19, 2023

Network traffic growth is the main reason for increasing spending, largely because of the adoption of hybrid and multi-cloud architectures. What are the issues with traffic losses and connectivity drops? “Without the network, nothing will happen,” Ziemianowicz said. What has been happening on the network when apps experience issues?

Network

Network Monitoring Performance Traffic

Simplified observability for your SNMP devices

Dynatrace

MARCH 22, 2021

Events and alerts. Some SNMP-enabled devices are designed to report events on their own with so-called SNMP traps. This allows for almost instant notification as soon as an important event is reported. It’s essential to focus on those events that provide useful information or report potential device problems.

Metrics

Metrics Network Infrastructure Traffic

How Dynatrace boosts production resilience with Site Reliability Guardian

Dynatrace

MAY 17, 2023

To ensure high standards, it’s essential that your organization establish automated validations in an early phase of the software development process—ideally when code is written. While the first guardian validates the traffic, the second guardian checks the business transactions generated during the observation period.

DevOps

DevOps Traffic Latency Best Practices

Kubernetes OOMKilled troubleshooting: Diagnosing out-of-memory issues automatically

Dynatrace

DECEMBER 5, 2022

Each tenant gets its own e-commerce site deployed on a shared Kubernetes cluster, isolated through separate namespaces and additional traffic isolation. There was not much traffic during the weekend, but as Monday came along, Dynatrace started sending alerts about a high HTTP failure rate across almost every tenant on the backend service.

Java

Java Traffic Education Testing

Data Movement in Netflix Studio via Data Mesh

The Netflix TechBlog

JULY 26, 2021

Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities and processes of a business domain. Change Data Capture (CDC) source connector reads from studio applications’ database transaction logs and emits the change events.

Big Data

Big Data Government Processing Analytics

Real user monitoring vs. synthetic monitoring: Understanding best practices

Dynatrace

JUNE 27, 2022

Real user monitoring (RUM) is a performance monitoring process that collects detailed data about users’ interactions with an application. RUM, however, has some limitations, including the following: RUM requires traffic to be useful. Complex transaction and process monitoring that might have deeper dependencies.

Best Practices

Best Practices Monitoring Wireless Traffic

Leverage automated and intelligent observability for OpenTelemetry for Go with Dynatrace PurePath 4

Dynatrace

JANUARY 28, 2021

With Dynatrace OneAgent you also benefit from support for traffic routing and traffic control. OneAgent implements network zones to create traffic routing rules and limit cross-data-center traffic. Our OneAgent OpenTelemetry for Go integration currently focuses on capturing and enrichment of in-process spans.

Traffic

Traffic Open Source Servers Cloud
