Systems and Traffic - Technology Performance Pulse

Migrating Critical Traffic At Scale with No Downtime - Part 1

The Netflix TechBlog

MAY 4, 2023

By Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience.

Traffic

Traffic Latency Tuning Systems

Load Testing Essentials for High-Traffic Applications

DZone

DECEMBER 30, 2024

When you consider marketing campaigns, seasonal spikes, or social media virality episodes, this demand can overshoot projections and bring systems to a grinding halt. Today’s applications must simultaneously serve millions of users, so high performance is a hard requirement for this heavy load.

Traffic

Traffic Social Media Testing Media

Migrating Critical Traffic At Scale with No Downtime - Part 2

The Netflix TechBlog

JUNE 13, 2023

By Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Picture yourself enthralled by the latest episode of your beloved Netflix series, delighting in an uninterrupted, high-definition streaming experience. This is where large-scale system migrations come into play.

Traffic

Traffic Metrics Systems Strategy

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

To this end, we developed a Rapid Event Notification System (RENO) to support use cases that require server initiated communication with devices in a scalable and extensible manner. In this blog post, we will give an overview of the Rapid Event Notification System at Netflix and share some of the learnings we gained along the way.

Systems

Systems Traffic Architecture Mobile

Black Friday traffic exposes gaps in observability strategies

Dynatrace

SEPTEMBER 2, 2022

What’s the problem with Black Friday traffic? But that’s difficult when Black Friday traffic brings overwhelming and unpredictable peak loads to retailer websites and exposes the weakest points in a company’s infrastructure, threatening application performance and user experience. Why Black Friday traffic threatens customer experience.

Traffic

Traffic Strategy Retail Ecommerce

Congestion Control in Cloud Scale Distributed Systems

DZone

DECEMBER 19, 2023

Distributed systems are composed of multiple systems that are wired together to provide a specific functionality. Systems that operate at a cloud scale can get expected or unexpected surges of traffic from one or multiple callers and are expected to perform in a predictable manner.

Systems

Systems Cloud Traffic Performance

How To Design For High-Traffic Events And Prevent Your Website From Crashing

Smashing Magazine

JANUARY 7, 2025

By Saad Khan. This article is sponsored by Cloudways. Product launches and sales typically attract large volumes of traffic.

Traffic

Traffic Website Design Cache

Best Practices for Designing Resilient APIs for Scalability and Reliability

DZone

JANUARY 8, 2025

API resilience is about creating systems that can recover gracefully from disruptions, such as network outages or sudden traffic spikes, ensuring they remain reliable and secure. This has become critical since APIs serve as the backbone of today’s interconnected systems.

Best Practices

Best Practices Design Scalability Architecture

Title Launch Observability at Netflix Scale

The Netflix TechBlog

MARCH 4, 2025

Part 3: System Strategies and Architecture. By Varun Khaitan, with special thanks to my stunning colleagues Mallika Rao, Esmir Mesic, and Hugo Marques. This blog post is a continuation of Part 2, where we cleared the ambiguity around title launch observability at Netflix. The request schema for the observability endpoint.

Traffic

Traffic Strategy Entertainment Innovation

A Comprehensive Guide to Database Sharding: Building Scalable Systems

DZone

OCTOBER 2, 2024

In this article, we’ll dive deep into the concept of database sharding, a critical technique for scaling databases to handle large volumes of data and high levels of traffic. By the end of this guide, you’ll have a comprehensive understanding of database sharding, enabling you to implement it effectively in your systems.

Database

Database Systems Scalability Traffic

Dynatrace Cost & Carbon Optimization certified for accuracy and transparency

Dynatrace

MARCH 5, 2025

Integration with existing systems and processes: Integration with existing IT infrastructure, observability solutions, and workflows often requires significant investment and customization. Network traffic power calculations rely on static power estimations for both public and private networks. Public network traffic uses 1.0

Energy

Energy Analytics Traffic Cloud

Better dashboarding with Dynatrace Davis AI: Instant meaningful insights

Dynatrace

JANUARY 21, 2025

For example, if you’re monitoring network traffic and the average over the past 7 days is 500 Mbps, the threshold will adapt to this baseline. An anomaly will be identified if traffic suddenly drops below 200 Mbps or above 800 Mbps, helping you identify unusual spikes or drops.

Traffic

Traffic Metrics Analytics Monitoring
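
The numbers in the teaser above (a 7-day average of 500 Mbps, with anomalies flagged below 200 Mbps or above 800 Mbps) can be illustrated with a minimal sketch. This is not how Dynatrace Davis AI computes its thresholds; the function names, the sample data, and the fixed 40%/160% ratios are assumptions chosen only so the band matches the figures quoted.

```python
from statistics import mean

def adaptive_band(history_mbps, lower_ratio=0.4, upper_ratio=1.6):
    """Return (low, high) thresholds scaled from the mean of recent samples.

    The ratios are hypothetical; they reproduce the 200/800 Mbps band
    quoted for a 500 Mbps baseline.
    """
    baseline = mean(history_mbps)
    return baseline * lower_ratio, baseline * upper_ratio

def is_anomaly(value_mbps, history_mbps):
    """Flag a reading that falls outside the band derived from history."""
    low, high = adaptive_band(history_mbps)
    return value_mbps < low or value_mbps > high

# Hypothetical daily averages for the past 7 days (mean is 500 Mbps).
week = [480, 510, 495, 520, 505, 490, 500]

print(is_anomaly(150, week))  # reading below the lower threshold
print(is_anomaly(500, week))  # reading at the baseline
print(is_anomaly(850, week))  # reading above the upper threshold
```

Because the band is recomputed from the history passed in, a different baseline moves both thresholds with it, which is the "adapts to this baseline" behavior the teaser describes.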

Chaos Engineering With Litmus: A CNCF Incubating Project

DZone

FEBRUARY 6, 2025

System resilience stands as the key requirement for e-commerce platforms during scaling operations to keep services operational and deliver performance excellence to users. We have developed a microservices architecture platform that encounters sporadic system failures when faced with heavy traffic events.

Engineering

Engineering Traffic Architecture Network

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

To achieve this, we are committed to building robust systems that deliver comprehensive observability, enabling us to take full accountability for every title on our service. Each title represents countless hours of effort and creativity, and our systems need to honor that uniqueness. Yet, these pages couldn’t be more different.

Traffic

Traffic Scalability Strategy Monitoring

Ensuring the Successful Launch of Ads on Netflix

The Netflix TechBlog

JUNE 1, 2023

To do this, we devised a novel way to simulate the projected traffic weeks ahead of launch by building upon the traffic migration framework described here. New content or national events may drive brief spikes, but, by and large, traffic is usually smoothly increasing or decreasing.

Traffic

Traffic Best Practices Systems Testing

The keys to selecting a platform for end-to-end observability

Dynatrace

DECEMBER 2, 2024

Clearly, continuing to depend on siloed systems, disjointed monitoring tools, and manual analytics is no longer sustainable. This enables proactive changes such as resource autoscaling, traffic shifting, or preventative rollbacks of bad code deployment ahead of time.

Artificial Intelligence

Artificial Intelligence DevOps Architecture Cloud

Breaking AWS Lambda: Chaos Engineering for Serverless Devs

DZone

MARCH 24, 2025

Our "serverless" order processing system built on AWS Lambda and API Gateway was humming along, handling 1,000 transactions/minute. A sudden spike in traffic caused Lambda timeouts, API Gateway threw 5xx errors, and customers started tweeting, "Why can’t I check out?!" Then, disaster struck.

Lambda

Lambda Serverless AWS Engineering

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

Scaling RabbitMQ ensures your system can handle growing traffic and maintain high performance. Key Takeaways RabbitMQ improves scalability and fault tolerance in distributed systems by decoupling applications, enabling reliable message exchanges.

Best Practices

Best Practices Traffic Strategy Scalability

What is log management? How to tame distributed cloud system complexities

Dynatrace

SEPTEMBER 8, 2022

Log management is an organization’s rules and policies for managing and enabling the creation, transmission, analysis, storage, and other tasks related to IT systems’ and applications’ log data. Distributed cloud systems are complex, dynamic, and difficult to manage without the proper tools. What is log management?

Cloud

Cloud Systems Analytics DevOps

Five-nines availability: Always-on infrastructure delivers system availability during the holidays’ peak loads

Dynatrace

NOVEMBER 22, 2022

For retail organizations, peak traffic can be a mixed blessing. While high-volume traffic often boosts sales, it can also compromise uptimes. The nirvana state of system uptime at peak loads is known as “five-nines availability.” How can IT teams deliver system availability under peak loads that will satisfy customers?

Infrastructure

Infrastructure Availability Systems Retail

Supporting Diverse ML Systems at Netflix

The Netflix TechBlog

MARCH 7, 2024

The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow, an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems. ETL workflows), as well as downstream (e.g.

Systems

Systems Media Cache Open Source

Migrating Netflix to GraphQL Safely

The Netflix TechBlog

JUNE 14, 2023

The control group’s traffic utilized the legacy Falcor stack, while the experiment population leveraged the new GraphQL client and was directed to the GraphQL Shim. The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. The Replay Tester tool samples raw traffic streams from Mantis.

Traffic

Traffic Latency Metrics Cache

Build automated self-healing systems with xMatters and Dynatrace (Part 3 of 3)

Dynatrace

SEPTEMBER 20, 2019

One is the currently-running production environment receiving all user traffic (let’s say the “blue” one), the other is a clone of it (“green”), but idle. Once the testing results are successful, application traffic is routed from blue to green. Response time for blue/green environment traffic.

Systems

Systems Traffic DevOps Database

COVID-19 and Digital Services: An Action Plan for the Unexpected

Dynatrace

APRIL 22, 2020

All of this puts a lot of pressure on IT systems and applications. Step 1: Understand Traffic Patterns and Potential Spikes; Remove Team Silos. The impact of traffic spikes is illustrated by the load that eCommerce web sites typically see during Black Friday. The next step is to understand when your system is going to break.

Traffic

Traffic Ecommerce Retail Government

Introducing Impressions at Netflix

The Netflix TechBlog

FEBRUARY 14, 2025

It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile’s exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.

Tuning

Tuning Latency Efficiency Storage

5 powerful use cases beyond debugging for Dynatrace Live Debugger

Dynatrace

MARCH 25, 2025

You can verify any system settings that might impact your tests and see them in action. Load generators simulate traffic. Maybe you want to monitor performance under different system loads. Or maybe you want to correlate an event with other events in your system. In many ways, it’s more of an art than a science.

Benchmarking

Benchmarking Code Open Source Engineering

Load Management With Istio Using FluxNinja Aperture

DZone

APRIL 4, 2023

Service meshes are becoming increasingly popular in cloud-native applications as they provide a way to manage network traffic between microservices. It offers several features, including: Prioritized load shedding: Drops traffic that is deemed less important to ensure that the most critical traffic is served.

Traffic

Traffic Network Architecture Monitoring
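
Prioritized load shedding, as the teaser above describes it, drops less important traffic so that the most critical traffic is still served. The sketch below shows only that general idea. It is not FluxNinja Aperture's API or an Istio configuration, and the `Request` type, the priority numbering, and the fixed `capacity` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    priority: int  # assumption: lower number means more important (0 is critical)

def shed_load(requests, capacity):
    """Admit up to `capacity` requests, most important first.

    Returns (admitted, dropped). sorted() is stable, so requests with
    the same priority keep their arrival order.
    """
    ranked = sorted(requests, key=lambda r: r.priority)
    return ranked[:capacity], ranked[capacity:]

# Hypothetical burst of three requests against a capacity of two.
burst = [Request("logging", 2), Request("playback", 0), Request("recommendations", 1)]
admitted, dropped = shed_load(burst, capacity=2)

print([r.name for r in admitted])
print([r.name for r in dropped])
```

A production system would typically shed continuously, based on measured load, rather than rank a fixed batch. The batch form is used here only to keep the admit/drop decision visible.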

Keeping Netflix Reliable Using Prioritized Load Shedding

The Netflix TechBlog

NOVEMBER 2, 2020

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure. By Manuel Correa, Arthur Gonigberg, and Daniel West. Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. CRITICAL: This traffic affects the ability to play.

Traffic

Traffic Metrics Infrastructure Architecture

A Dynatrace champion’s guide to get ahead of digital marketing campaigns

Dynatrace

JULY 1, 2020

In my last blog, I provided an example of this happening, whereby the traffic spiked and quadrupled the usual incoming traffic. These are all interesting metrics from a marketing point of view, and also highly interesting to you as they allow you to engage with the teams that are driving the traffic against your IT system.

Traffic

Traffic Analytics Metrics Servers

Choosing the Appropriate AWS Load Balancer: ALB vs. NLB

DZone

SEPTEMBER 14, 2023

With the advent of cloud computing, managing network traffic and ensuring optimal performance have become critical aspects of system architecture. Amazon Web Services (AWS), a leading cloud service provider, offers a suite of load balancers to manage network traffic effectively for applications running on its platform.

AWS

AWS Traffic Network Architecture

Optimizing Server Management With HAProxy’s Advanced Health Checks

DZone

DECEMBER 11, 2023

HAProxy is one of the cornerstones in complex distributed systems, essential for achieving efficient load balancing and high availability. This open-source software, lauded for its reliability and high performance, is a vital tool in the arsenal of network administrators, adept at managing web traffic across diverse server environments.

Servers

Servers Traffic Open Source Games

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Introduction to Message Brokers Message brokers enable applications, services, and systems to communicate by acting as intermediaries between senders and receivers. This decoupling simplifies system architecture and supports scalability in distributed environments.

Latency

Latency Analytics Architecture Storage

Six causes of major software outages–And how to avoid them

Dynatrace

AUGUST 8, 2024

Possible scenarios A Distributed Denial of Service (DDoS) attack overwhelms servers with traffic, making a website or service unavailable. Ransomware encrypts essential data, locking users out of systems and halting operations until a ransom is paid. This often occurs during major events, promotions, or unexpected surges in usage.

Software

Software Software Infrastructure Network

Architecture Patterns: The Circuit-Breaker

DZone

NOVEMBER 3, 2023

In the world of distributed systems, the likelihood of components failing or becoming unresponsive is higher compared to monolithic systems. Therefore, resilience — the ability of a system to handle and recover from failures — becomes critically important in distributed environments.

Architecture

Architecture Software Engineering Traffic Engineering

9 key DevOps metrics for success

Dynatrace

SEPTEMBER 28, 2021

As we look at today’s applications, microservices, and DevOps teams, we see leaders are tasked with supporting complex distributed applications using new technologies spread across systems in multiple locations. For most systems, an optimum MTTR could be less than one hour while others have an MTTR of less than one day.

DevOps

DevOps Metrics Traffic Efficiency

The new normal of digital experience delivery – lessons learned from monitoring mission-critical websites during COVID-19

Dynatrace

MAY 6, 2020

Over the last two months, we’ve monitored key sites and applications across industries that have been receiving surges in traffic, including government, health insurance, retail, banking, and media. The following day, a normally mundane Wednesday, traffic soared to 128,000 sessions. Media performance.

Website

Website Monitoring Retail Media

Kubernetes vs Docker: What’s the difference?

Dynatrace

SEPTEMBER 29, 2021

Think of containers as the packaging for microservices that separate the content from its environment – the underlying operating system and infrastructure. This opens the door to auto-scalable applications, which effortlessly match the demands of rapidly growing and varying user traffic. What is Docker? Networking.

Open Source

Open Source DevOps Traffic Cloud

Seeing through hardware counters: a journey to threefold performance increase

The Netflix TechBlog

NOVEMBER 9, 2022

A quick canary test was free of errors and showed lower latency, which is expected given that our standard canary setup routes an equal amount of traffic to both the baseline running on 4xl and the canary on 12xl. a contiguous chunk of data (typically 64 bytes on x86 systems) transferred to and from the cache.

Hardware

Hardware Cache Performance Latency

Detecting RegreSSHion with Dynatrace (CVE-2024-6387)

Dynatrace

JULY 2, 2024

The Qualys Threat Research Unit (TRU) has discovered a Remote Unauthenticated Code Execution (RCE) vulnerability in OpenSSH server (sshd) in glibc-based Linux systems. This can result in a complete system takeover, malware installation, data manipulation, and the creation of backdoors for persistent access.

AWS

AWS Network Traffic Servers

What is a service mesh?

Dynatrace

MAY 21, 2021

This becomes even more challenging when the application receives heavy traffic, because a single microservice might become overwhelmed if it receives too many requests too quickly. The Envoy proxies also collect and report telemetry on all traffic among the services in the mesh. Why do you need a service mesh?

Traffic

Traffic DevOps Infrastructure Network

Service Mesh and Management Practices in Microservices

DZone

OCTOBER 27, 2023

In the dynamic world of microservices architecture, efficient service communication is the linchpin that keeps the system running smoothly. It comprises a suite of capabilities, such as managing traffic, enabling service discovery, enhancing security, ensuring observability, and fortifying resilience.

Traffic

Traffic Best Practices Architecture Network

Architected for resiliency: How Dynatrace withstands data center outages

Dynatrace

JUNE 15, 2021

The fact is, Reliability and Resiliency must be rooted in the architecture of a distributed system. The final status update was at 6:54PM PDT with a very detailed description of the temperature rise that caused the shutdown initially, followed by the fire suppression system dispersing some chemicals which prolonged the full recovery process.

AWS

AWS Traffic Architecture Azure

Apollo Router Performance Monitoring with OpenTelemetry and Splunk APM

DZone

APRIL 13, 2023

This self-hosted graph routing solution is highly configurable, making it an ideal choice for developers who require a high-performance routing system. With its ability to handle large amounts of traffic and complex data, the Apollo router is quickly becoming a popular choice among developers seeking a reliable and efficient routing solution.

Monitoring

Monitoring Performance Traffic Efficiency

7 Best Performance Testing Tools to Look Out for in 2021

DZone

DECEMBER 28, 2020

The system could work efficiently with a specific number of concurrent users; however, it may become dysfunctional under extra load during peak traffic. Performance testing helps establish the scalability, stability, and speed of the software application.

Performance Testing

Performance Testing Testing Tools Testing Performance
