By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.
What is RabbitMQ? RabbitMQ functions as a message broker, managing message confirmation, routing, storage, and delivery within a queue. Message brokers handle validation, routing, storage, and delivery, ensuring efficient and reliable communication. Message Broker vs. Distributed Event Streaming Platform.
This dual-path approach leverages Kafka’s capability for low-latency streaming and Iceberg’s efficient management of large-scale, immutable datasets, ensuring both real-time responsiveness and comprehensive historical data availability. million impression events globally every second, with each event approximately 1.2KB in size.
At this scale, we can gain a significant amount of performance and cost benefits by optimizing the storage layout (records, objects, partitions) as the data lands into our warehouse. We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing their cost and performance benefits.
These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. It also serves as a central configuration point for access patterns such as consistency or latency targets. This is useful for keeping the “n-newest” records or for prefix path deletion.
The Challenge of Title Launch Observability As engineers, we’re wired to track system metrics like error rates, latencies, and CPU utilization, but what about the metrics that matter to a title’s success? Additionally, the time-sensitive nature of these investigations precludes the use of cold storage, which cannot meet the stringent SLAs required.
Compare Latency. ScaleGrid delivers lower latency compared to DigitalOcean for PostgreSQL. On average, ScaleGrid provides over 30% more storage vs. DigitalOcean for PostgreSQL at the same affordable price. Now, let’s take a look at the throughput and latency performance of our comparison. PostgreSQL DigitalOcean Latency Averages (ms).
Migrating Critical Traffic At Scale with No Downtime — Part 1, by Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. Logging is selective to cases where the old and new responses do not match.
By: Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.
Compare Latency. On average, ScaleGrid achieves almost 30% lower latency over DigitalOcean for the same deployment configurations. ScaleGrid provides 30% more storage on average vs. DigitalOcean for MySQL at the same affordable price. Read-Intensive Latency Benchmark. Balanced Workload Latency Benchmark.
Storage mount points in a system might be larger or smaller, local or remote, with high or low latency, and various speeds. Sometimes these locations landed on mount points which, due to capacity, availability, or access constraints, weren’t well suited for large runtime storage. See details below.
The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly-reliable key-value store. As most key-value storage engines support efficiently deleting a namespace (e.g.
From chunk encoding to assembly and packaging, the result of each previous processing step must be uploaded to cloud storage and then downloaded by the next processing step. Uploading and downloading data always come with a penalty, namely latency.
While clustering across wide-area networks (WANs) is discouraged due to latency issues, leased links can mitigate some connectivity challenges. With 24/7 expert support, ScaleGrid assists with troubleshooting, performance tuning, and migration processes. Keeping queues short maintains a responsive and efficient RabbitMQ setup.
Dynomite is a Netflix open source wrapper around Redis that provides a few additional features like auto-sharding and cross-region replication, and it provided Pushy with low latency and easy record expiry, both of which are critical for Pushy’s workload. As Pushy’s portfolio grew, we experienced some pain points with Dynomite.
If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. Our distributed tracing infrastructure is grouped into three sections: tracer library instrumentation, stream processing, and storage.
Additionally, we’ve added the Philadelphia AWS Local Zone, helping to reduce latency for customers operating in the eastern U.S. This enables ScaleGrid users in Australia and nearby regions to access lower-latency services and improved performance. Stay tuned for more exciting updates in the months to come!
AWS offers a broad set of global, cloud-based services including computing, storage, networking, Internet of Things (IoT), and many others. Amazon Simple Storage Service (S3). The example below visualizes average latency by API name and stage for a specific AWS API Gateway. Stay tuned for updates in Q1 2020. Requirements.
In addition, compute and storage are increasingly being separated causing larger latencies for queries. Alluxio is leveraged as compute-side virtual storage to improve performance. But to get the best performance, like any technology stack, you need to follow the best practices.
By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements. They enable us to further fine-tune and configure the system, ensuring the new changes are integrated smoothly and seamlessly.
STM generates traffic that replicates the typical path or behavior of a user on a network to measure performance (for example, response times, availability, packet loss, latency, jitter, and other variables). One use case for STM is to model the behavior of a customer in the form of a flow of transactions along the buyer’s journey.
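A minimal sketch of that synthetic-transaction idea, assuming a hypothetical endpoint URL and using only the Python standard library: it probes the endpoint repeatedly and reports availability plus median and maximum latency (packet loss and jitter would require lower-level probes than plain HTTP).

```python
import time
import statistics
import urllib.request
import urllib.error

# Hypothetical endpoint standing in for one step of the buyer's journey.
TARGET_URL = "https://example.com/checkout/health"

def probe(url, timeout=5.0):
    """Issue one synthetic request; return (latency_seconds, ok)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        ok = False
    return time.perf_counter() - start, ok

def run_synthetic_monitor(url, samples=10, interval=1.0):
    latencies, successes = [], 0
    for _ in range(samples):
        latency, ok = probe(url)
        latencies.append(latency)
        successes += ok
        time.sleep(interval)
    print(f"availability: {successes / samples:.0%}")
    print(f"p50 latency:  {statistics.median(latencies) * 1000:.1f} ms")
    print(f"max latency:  {max(latencies) * 1000:.1f} ms")

if __name__ == "__main__":
    run_synthetic_monitor(TARGET_URL)
```

A real STM deployment would chain several such probes into one transaction flow and run them from multiple network vantage points.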
Storage The type of storage and disk used for database servers can have a significant impact on performance and reliability. Cloud Different cloud providers offer a range of instance types and sizes, each with varying amounts of CPU, memory, and storage. If you see concurrency issues, you can tune this variable.
We are expected to process 1,000 watermarks for a single distribution in a minute, with non-linear latency growth as the number of watermarks increases. The watermarking functionality, at the start, was a simple offering with various Google Drive integrations for storage and links.
Want to save money on your AWS RDS bill? The settings might not be optimal. I’ll show you some MySQL settings to tune to get better performance, and cost savings, with AWS RDS. This message is normally a side effect of a storage subsystem that is not capable of keeping up with the number of writes (e.g., IOPs) required by MySQL.
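As a rough illustration (not the article’s actual procedure), here is a hedged Python sketch that samples MySQL’s global status counters twice to estimate write pressure; the connection details are placeholders and pymysql is assumed to be available.

```python
import time
import pymysql  # assumed available; any MySQL client works the same way

# Placeholder connection details for an RDS instance.
conn = pymysql.connect(host="my-rds-host", user="admin",
                       password="secret", database="mysql")

def global_status(cursor, name):
    cursor.execute("SHOW GLOBAL STATUS LIKE %s", (name,))
    row = cursor.fetchone()
    return int(row[1]) if row else 0

with conn.cursor() as cur:
    # Sample write-related counters twice to estimate the write rate.
    names = ("Innodb_data_writes", "Innodb_os_log_written")
    before = {n: global_status(cur, n) for n in names}
    time.sleep(10)
    after = {n: global_status(cur, n) for n in names}

for name in names:
    rate = (after[name] - before[name]) / 10.0
    print(f"{name}: ~{rate:.0f} per second")
# A sustained write rate near the volume's provisioned IOPS or throughput
# suggests the storage subsystem is the bottleneck for MySQL's writes.
```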
Tuning Autovacuum in PostgreSQL: How do we identify the tables that need their autovacuum settings tuned? In order to tune autovacuum for tables individually, you must know the number of inserts/deletes/updates on a table over an interval. This may help you tune your table-level autovacuum settings appropriately.
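One hedged way to surface those tables is to read pg_stat_user_tables. The sketch below (connection string is a placeholder, psycopg2 assumed) lists the tables with the most dead tuples along with their insert/update/delete counts and last autovacuum time.

```python
import psycopg2  # assumed available

# Placeholder connection string.
conn = psycopg2.connect("dbname=app user=postgres host=localhost")

QUERY = """
SELECT relname,
       n_tup_ins, n_tup_upd, n_tup_del,
       n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for rel, ins, upd, dele, dead, last_av in cur.fetchall():
        print(f"{rel}: ins={ins} upd={upd} del={dele} "
              f"dead={dead} last_autovacuum={last_av}")
# Tables with many dead tuples and a stale last_autovacuum are candidates
# for per-table settings, e.g.:
#   ALTER TABLE hot_table SET (autovacuum_vacuum_scale_factor = 0.01);
```

Sampling these counters at the start and end of an interval (rather than once) gives the per-interval churn the article refers to.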
If we were given a freshly installed MySQL or Percona Server for MySQL and could only tune a single MySQL variable, which one would it be? To be fair, the same is true of PostgreSQL; a default installation hasn’t been tuned either, and it, too, can perform much better.
Key Takeaways: Critical performance indicators such as latency, CPU usage, memory utilization, hit rate, and the number of connected clients/slaves/evictions must be monitored to maintain Redis’s high-throughput, low-latency capabilities. Similarly, increased throughput signifies a more intensive workload on the server, which generally comes with higher latency.
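For illustration, a small Python sketch using redis-py’s INFO output to derive the hit rate and read a few of those indicators; the connection details are placeholders, and the field names follow the standard INFO sections.

```python
import redis  # redis-py, assumed available

# Placeholder connection details.
r = redis.Redis(host="localhost", port=6379)

info = r.info()  # same data as the INFO command

hits = info.get("keyspace_hits", 0)
misses = info.get("keyspace_misses", 0)
hit_rate = hits / (hits + misses) if (hits + misses) else float("nan")

print(f"hit rate:          {hit_rate:.2%}")
print(f"used memory:       {info.get('used_memory_human')}")
print(f"connected clients: {info.get('connected_clients')}")
print(f"connected slaves:  {info.get('connected_slaves')}")
print(f"evicted keys:      {info.get('evicted_keys')}")
print(f"ops per second:    {info.get('instantaneous_ops_per_sec')}")
```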
This enables us to use our scale to increase throughput and reduce latencies. VQS is called using the measureQuality endpoint, and the workflow is initiated based on the video length, the throughput and latency requirements, available scale, etc. Stay tuned for more details on these algorithmic innovations.
Each ran with the following specs: 8 GB RAM, 2 vCPU, 120 GB SSD. Configuration Details: Each PostgreSQL instance for ScaleGrid and Amazon RDS was set up with default tuning parameters for PostgreSQL versions 13, 14, and 15. Network Latency: We ran both machines in the same region and conducted the tests from within the same box in that region.
xlarge: 4 vCPU, 8 GB RAM. Storage: EBS volume (root), 80 GB gp2 (IOPS 240/3000). As well, high availability will be integrated, guaranteeing cluster viability in the case that one worker node goes down. And now, execute the benchmark on the coordinator node: pgbench -c 20 -j 3 -T 60 -P 3 pgbench. The results are not pretty.
While there is no magic bullet for MySQL performance tuning, there are a few areas that can be focused on upfront that can dramatically improve the performance of your MySQL installation. What are the Benefits of MySQL Performance Tuning? A finely tuned database processes queries more efficiently, leading to swifter results.
The Amazon ML console and API provide data and model visualization tools, as well as wizards to guide you through the process of creating machine learning models, measuring their quality and fine-tuning the predictions to match your application requirements. Details on the AWS Blog. The Amazon Elastic File System. for a while already.
Performant – DynamoDB consistently delivers single-digit millisecond latencies even as your traffic volume increases. DynamoDB automatically re-distributes your data to healthy servers to ensure there are always multiple replicas of your data without you needing to intervene.
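As a hedged illustration of the kind of point read DynamoDB is optimized for, here is a minimal boto3 sketch; the table name and key schema are hypothetical, and the measured time is the client round-trip (network included), not DynamoDB’s server-side latency.

```python
import time
import boto3  # assumed configured with credentials and a default region

dynamodb = boto3.client("dynamodb")

# Hypothetical table and key schema.
TABLE = "user_profiles"
KEY = {"user_id": {"S": "user-123"}}

start = time.perf_counter()
resp = dynamodb.get_item(TableName=TABLE, Key=KEY, ConsistentRead=False)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"item: {resp.get('Item')}")
print(f"round-trip latency: {elapsed_ms:.1f} ms (includes network time)")
```

Eventually consistent reads (ConsistentRead=False) are generally the lower-latency option; strongly consistent reads trade a little latency for freshness.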
The basic tier provides up to 5 DTUs with standard storage. The standard tier supports from 10 up to 3000 DTUs with standard storage, and the premium tier supports from 125 up to 4000 DTUs with premium storage, which is orders of magnitude faster than standard storage. vCore Pricing Tier. GB per vCore. HyperScale Database.
However, in the Skylake microarchitecture (you can see a list of CPUs here), the PAUSE instruction changed; the documentation says “the latency of the PAUSE instruction in prior generation microarchitectures is about 10 cycles, whereas in Skylake microarchitecture it has been extended to as many as 140 cycles.”
Making queries to an inference engine has many of the same throughput, latency, and cost considerations as making queries to a datastore, and more and more applications are coming to depend on such queries. The following figure highlights how just one of these variables, batch size, impacts throughput and latency on ResNet50.
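A toy microbenchmark can make that trade-off concrete. The sketch below substitutes a NumPy matrix multiply for a model forward pass (no real inference engine or ResNet50 involved) and reports latency and throughput at a few assumed batch sizes.

```python
import time
import numpy as np

# A dense layer's weights stand in for a model; shapes are arbitrary.
rng = np.random.default_rng(0)
weights = rng.standard_normal((2048, 2048)).astype(np.float32)

def fake_inference(batch):
    """Stand-in for a forward pass: one matrix multiply per batch."""
    return batch @ weights

for batch_size in (1, 8, 32, 128):
    batch = rng.standard_normal((batch_size, 2048)).astype(np.float32)
    fake_inference(batch)                      # warm-up
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        fake_inference(batch)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / runs * 1000
    throughput = batch_size * runs / elapsed
    print(f"batch={batch_size:4d}  latency={latency_ms:7.2f} ms  "
          f"throughput={throughput:9.0f} samples/s")
```

Real inference engines show a similar curve shape: larger batches amortize fixed overheads and raise throughput, at the cost of per-request latency.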
Optane DIMMs are here! The “slower, persistent DRAM” label (higher latency and lower bandwidth) suggests simple behavior, but we have found the actual behavior of Optane DIMMs to be more complicated and nuanced than that label would suggest. The read latency for Optane is 2x-3x higher than DRAM. Optane memory is also more pattern-dependent than DRAM.
This benchmark can synthetically generate more precise key-value queries that represent the reads and writes of key-value stores to the underlying storage system. The paper examines three different uses of RocksDB at Facebook: UDB, the underlying storage engine for the MySQL databases storing the social graph data. Three workloads.
A co-worker introduced me to Craig Hanson and Pat Crain's performance mantras, which neatly summarize much of what we do in performance analysis and tuning. Networks, PCIe busses, CPU interconnects, memory busses, and storage devices (both throughput and IOPS) all have fixed limits. The performance mantras begin with: Don't do it.
These nodes and edges require a good amount of compute and storage, which is typically distributed across a large number of servers either running in the cloud or in your own data center. If tuned for performance, there is a good chance reliability is compromised, and vice versa. In a nutshell, a data pipeline is a distributed system.
The main objective of this post is to share my experience over the past years tuning MongoDB and to centralize, in a unique place, the diverse sources that I crossed in this journey. The CFQ scheduler works well for many general use cases but lacks latency guarantees. Spoiler alert: this post focuses on MongoDB 3.6.X. Kernel settings covered include net.ipv4.tcp_fin_timeout = 30 and net.ipv4.tcp_keepalive_intvl.
Here's some output from my zfsdist tool, in bcc/BPF, which measures ZFS latency as a histogram on Linux: “# zfsdist — Tracing ZFS operation latency. Hit Ctrl-C to end. ^C”. Both Xen and KVM have had many performance and security improvements, and workloads can now be tuned to run at almost bare metal speeds (say, a 3% loss or less).
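zfsdist itself ships with bcc; as a much smaller sketch of the same log2-histogram idea, the following Python program (assumptions: bcc installed, run as root, and vfs_read used as a stand-in kernel function rather than the ZFS code paths) times a kernel function with kprobes and prints a latency histogram on exit.

```python
import time
from bcc import BPF  # requires the bcc toolkit and root privileges

# Minimal BPF program: time a kernel function and bucket latencies into a
# power-of-two histogram, the same idea zfsdist applies to ZFS operations.
prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u64);
BPF_HISTOGRAM(dist);

int trace_entry(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&id, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&id);
    if (tsp == 0)
        return 0;
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    dist.increment(bpf_log2l(delta_us));
    start.delete(&id);
    return 0;
}
"""

b = BPF(text=prog)
# vfs_read is only a stand-in target; zfsdist instruments the ZFS
# read/write/open/fsync paths instead.
b.attach_kprobe(event="vfs_read", fn_name="trace_entry")
b.attach_kretprobe(event="vfs_read", fn_name="trace_return")

print("Tracing vfs_read latency... Hit Ctrl-C to end.")
try:
    time.sleep(3600)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")
```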