Exercise, Latency and Systems - Technology Performance Pulse

Migrating Critical Traffic At Scale with No Downtime - Part 1

The Netflix TechBlog

MAY 4, 2023

Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are consistently being evolved and optimized to meet and exceed customer and product expectations. It provides a good read on the availability and latency ranges under different production conditions.

Traffic

Traffic Latency Tuning Systems

Service level objectives: 5 SLOs to get started

Dynatrace

JUNE 1, 2023

It represents the percentage of time a system or service is expected to be accessible and functioning correctly. Response time: Response time refers to the total time it takes for a system to process a request or complete an operation. This SLO enables a smooth and uninterrupted exercise-tracking experience.

Latency

Latency Website Traffic DevOps

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

The Netflix TechBlog

SEPTEMBER 10, 2024

Sample system diagram for an Alexa voice command. Where aws ends and the internet begins is an exercise left to the reader. The other main use case was RENO, the Rapid Event Notification System mentioned above. Dynomite had great performance, but it required manual scaling as the system grew.

Latency

Latency Cache Tuning Efficiency

Seamlessly Swapping the API backend of the Netflix Android app

The Netflix TechBlog

SEPTEMBER 8, 2020

For each route we migrated, we wanted to make sure we were not introducing any regressions: either in the form of missing (or worse, wrong) data, or by increasing the latency of each endpoint. Being able to canary a new route let us verify latency and error rates were within acceptable limits.

Latency

Latency Cache Java Traffic

Interpreting A/B test results: false negatives and power

The Netflix TechBlog

OCTOBER 26, 2021

We then used simple thought exercises based on flipping coins to build intuition around false positives and related concepts such as statistical significance, p-values, and confidence intervals. As a result, if the test treatment results in a small reduction in the latency metric, it’s hard to successfully identify?

Testing

Testing Metrics Latency Design

Service level objective examples: 5 SLO examples for faster, more reliable apps

Dynatrace

JUNE 1, 2023

It represents the percentage of time a system or service is expected to be accessible and functioning correctly. Response time: Response time refers to the total time it takes for a system to process a request or complete an operation. This SLO enables a smooth and uninterrupted exercise-tracking experience.

Traffic

Traffic Website Latency DevOps

Real user monitoring vs. synthetic monitoring: Understanding best practices

Dynatrace

JUNE 27, 2022

However, not all user monitoring systems are created equal. connectivity, access, user count, latency) of geographic regions. These development and testing practices ensure the performance of critical applications and resources to deliver loyalty-building user experiences. What is real user monitoring? Synthetic monitoring drawbacks.

Best Practices

Best Practices Monitoring Wireless Traffic

Bring Your Own Cloud (BYOC) vs. Dedicated Hosting at ScaleGrid

Scalegrid

APRIL 16, 2020

Deploying your application and database on the same VPC also provides the lowest possible latency path. Use Follower Clusters keep two independent database systems (of the same type) in sync so you can analyze, optimize and test app performance for MySQL, PostgreSQL and MongoDB® database. Expert Tip. Security Groups. No problem.

Cloud

Cloud Azure AWS Database

Amazon EC2 Cluster GPU Instances - All Things Distributed

All Things Distributed

NOVEMBER 15, 2010

Werner Vogels weblog on building scalable and robust distributed systems. For example, the most fundamental abstraction trade-off has always been latency versus throughput. The throughput of this pipeline is more important than the latency of the individual operations. All Things Distributed. Comments ().

AWS

AWS Programming Latency Architecture

Scaling Amazon ElastiCache for Redis with Online Cluster Resizing

All Things Distributed

NOVEMBER 21, 2017

Redis's microsecond latency has made it a de facto choice for caching. Four years ago, as part of our AWS fast data journey, we introduced Amazon ElastiCache for Redis , a fully managed, in-memory data store that operates at microsecond latency. The system is more robust. TB of in-memory capacity in a single cluster.

Games

Games Retail Latency Education

Automating chaos experiments in production

The Morning Paper

JULY 4, 2019

Are you ready to take your system assurance programme to the next level? In all cases we need to be able to carefully monitor the impact on the system, and back out if things start going badly wrong. Netflix’s system is deployed on the public cloud as a complex set of interacting microservices.

Latency

Latency Engineering Metrics Traffic

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications

All Things Distributed

OCTOBER 2, 2017

With these requirements in mind, and a willingness to question the status quo, a small group of distributed systems experts came together and designed a horizontally scalable distributed database that would scale out for both reads and writes to meet the long-term needs of our business. This was the genesis of the Amazon Dynamo database.

Internet

Internet Internet AWS Performance

Evaluating the Evaluation: A Benchmarking Checklist

Brendan Gregg

JUNE 30, 2018

sounds like a homework exercise of purely academic value. If you develop a habit of reading only the operation rate and latency numbers from a lengthy benchmark report (or you have a shell script to do this that feeds a GUI), it's easy to miss other details in the report such as the error rate. ### 4. What's the limiter?" No packets.

Benchmarking

Benchmarking Latency Cache Network

COVID-19 Hazard Analysis using STPA

Adrian Cockcroft

MARCH 17, 2020

Picture taken by Adrian March 17, 2020 A resilient system continues to operate successfully in the presence of failures. There are many possible failure modes, and each exercises a different aspect of resilience. Hence, one way to reduce risk is to make systems more observable. The first technique is the most generally useful.

Healthcare

Healthcare Government Airlines Systems

Failure Modes and Continuous Resilience

Adrian Cockcroft

NOVEMBER 11, 2019

A resilient system continues to operate successfully in the presence of failures. There are many possible failure modes, and each exercises a different aspect of resilience. Hence, one way to reduce risk is to make systems more observable. This discussion focuses on hardware, software and operational failure modes.

Latency

Latency Systems Engineering Hardware

A persistent problem: managing pointers in NVM

The Morning Paper

DECEMBER 8, 2019

At the start of November I was privileged to attend HPTS (the High Performance Transaction Systems) conference in Asilomar. Byte-addressable non-volatile memory (NVM) will fundamentally change the way hardware interacts, the way operating systems are designed, and the way applications operate on data. PLOS’19.

Hardware

Hardware Programming Media Storage

Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part II)

The Morning Paper

JANUARY 23, 2020

1:18pm a key observation was made that an API call to populate the homepage sidebar saw a huge jump in latency. The process tracing exercise included: Examining IRC transcripts from multiple channels. First look for any correlation to the last change made to the system.

Internet

Internet Internet Cache Engineering

Evaluating the Evaluation: A Benchmarking Checklist

Brendan Gregg

JUNE 29, 2018

sounds like a homework exercise of purely academic value. If you develop a habit of reading only the operation rate and latency numbers from a lengthy benchmark report (or you have a shell script to do this that feeds a GUI), it's easy to miss other details in the report such as the error rate. ### 4. What's the limiter?" No packets.

Benchmarking

Benchmarking Latency Cache Network

Taiji: managing global user traffic for large-scale Internet services at the edge

The Morning Paper

NOVEMBER 14, 2019

Taiji’s routing table is a materialized representation of how user traffic at various edge nodes ought to be distributed over available data centers to balance data center utilization and minimize latency. For example, balance utilisation across all data centers, or optimise for network latency. a chance to warm up.

Traffic

Traffic Internet Internet Latency

Why I hate MPI (from a performance analysis perspective)

John McCalpin

AUGUST 1, 2018

This is an intellectually challenging and labor-intensive exercise, requiring detailed review of the published details of each of the components of the system, and usually requiring significant “detective work” (using customized microbenchmarks, hardware performance counter analysis, and creative thinking) to fill in the gaps.

Hardware

Hardware Transportation Performance Latency

Transforming enterprise integration with reactive streams

O'Reilly Software

MARCH 7, 2018

Build a more scalable, composable, and functional architecture for interconnecting systems and applications. Welcome to a new world of data-driven systems. Today, data needs to be available at all times, serving its users—both humans and computer systems—across all time zones, continuously, in close to real time.

Transportation

Transportation Java Programming Architecture

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

John McCalpin

JANUARY 22, 2018

The exercise seemed simple enough — just fix one item in the Colfax code and we should be finished. Each of the two vector units can issue one FMA instruction per cycle, assuming that there are enough independent accumulators to tolerate the 6-cycle dependent-operation latency. Instead, we found puzzle after puzzle. FMAs/cycle.

Latency

Latency Hardware Code Testing

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

John McCalpin

JANUARY 22, 2018

There was no deep goal — just a desire to see the maximum GFLOPS in action. The exercise seemed simple enough — just fix one item in the Colfax code and we should be finished. Using the minimum number of accumulator registers needed to tolerate the pipeline latency (12), the assembly code for the inner loop is: B1.8:

Latency

Latency Hardware Code Testing

Good Management Can Work Miracles

The Agile Manager

AUGUST 23, 2007

When it all comes together, the overall benefit of the resulting system can be incredible, such as an increase in life expectancy. IT reward systems are also often based on individual performance tied to granular and highly focused statements of accomplishment. As Thomas Teal wrote succinctly , “Good management works miracles.”

Innovation

Innovation Technology Technology Latency

MezzFS - Mounting object storage in Netflix’s media processing platform

The Netflix TechBlog

MARCH 6, 2019

Mounting object storage in Netflix’s media processing platform By Barak Alon (on behalf of Netflix’s Media Cloud Engineering team) MezzFS (short for “Mezzanine File System”) is a tool we’ve developed at Netflix that mounts cloud objects as local files via FUSE. MezzFS can be configured to cache objects on the local disk. Regional caching - Netflix

Media

Media Storage Processing Cache
