These releases often assumed ideal conditions such as zero latency, infinite bandwidth, and no network loss, as highlighted in Peter Deutsch's eight fallacies of distributed computing. Chaos engineering is a practice that extends beyond traditional failure testing by deliberately injecting faults to surface issues that are hard to predict.
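As a toy illustration of that practice (not code from the excerpted post; the decorator, probabilities, and service call below are all hypothetical), a latency-injection wrapper can exercise the "zero latency" fallacy in a controlled way:

```python
import random
import time
from functools import wraps

def inject_latency(p=0.1, delay_s=0.5):
    """Chaos wrapper: with probability p, add artificial delay before the call."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < p:
                time.sleep(delay_s)  # simulate the "zero latency" assumption breaking
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(p=0.2, delay_s=1.0)
def fetch_profile(user_id):
    return {"user": user_id}  # stand-in for a remote call
```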
At this scale, we can gain significant performance and cost benefits by optimizing the storage layout (records, objects, partitions) as the data lands in our warehouse. We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing the cost and performance benefits.
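The excerpt doesn't show AutoOptimize's internals, but one core layout optimization it alludes to is compacting many small files into fewer near-target-size ones. A minimal sketch of that idea, with made-up sizes and a greedy first-fit-decreasing heuristic:

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedy first-fit-decreasing: group small files into ~target-size merges."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:
            if b[0] + size <= target_mb:  # fits in an existing output file
                b[0] += size
                b[1].append(size)
                break
        else:
            bins.append([size, [size]])   # start a new output file
    return [files for _, files in bins]

print(plan_compaction([100, 70, 60, 30, 5, 2]))
# [[100, 5, 2], [70, 30], [60]] -> three merged files instead of six small ones
```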
Every image you hover over isn't just a visual placeholder; it's a critical data point that fuels our sophisticated personalization engine. The enriched data is seamlessly accessible for both real-time applications via Kafka and historical analysis through storage in an Apache Iceberg table.
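A hedged sketch of the real-time half of that dual path, using the kafka-python client; the topic name and event fields are invented, and the Iceberg batch sink is out of scope here:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical enriched impression event; a separate batch job (not shown)
# would land the same events in the Apache Iceberg table.
event = {"profile_id": "p123", "image_id": "art-456", "action": "hover", "ts_ms": 1700000000000}
producer.send("impression-events", value=event)
producer.flush()  # block until the event is acknowledged
```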
MongoDB offers several storage engines that cater to various use cases. The default storage engine in earlier versions was MMAPv1, which utilized memory-mapped files and collection-level locking; WiredTiger has been the default since MongoDB 3.2. Choosing the appropriate storage engine can have a significant impact on application performance.
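Before tuning around a storage engine, it helps to confirm which one a deployment is actually running. A minimal pymongo check, assuming a local mongod:

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")
print(status["storageEngine"]["name"])  # e.g., "wiredTiger"
```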
These include challenges with tail latency and idempotency, managing "wide" partitions with many rows, handling single large "fat" columns, and slow response pagination. It also serves as a central configuration point for access patterns such as consistency or latency targets, and is useful for keeping the "n-newest" records or performing prefix-path deletion.
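To make the pagination point concrete, here is a toy token-based scan over a wide partition; the store interface is invented for illustration, not the abstraction from the post:

```python
class InMemoryStore:
    """Toy stand-in for a key-value abstraction that paginates wide partitions."""
    def __init__(self, data):
        self.data = data  # partition_key -> list of rows

    def get_page(self, pk, limit, next_token=None):
        start = next_token or 0
        rows = self.data[pk][start:start + limit]
        new_token = start + limit if start + limit < len(self.data[pk]) else None
        return rows, new_token

def scan_partition(store, pk, page_size=100):
    """Yield all rows of a wide partition, one bounded page at a time."""
    token = None
    while True:
        rows, token = store.get_page(pk, limit=page_size, next_token=token)
        yield from rows
        if token is None:  # no more pages
            break

store = InMemoryStore({"user:42": list(range(250))})
print(sum(1 for _ in scan_partition(store, "user:42")))  # 250 rows across 3 pages
```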
This is a guest post by Ankit Sirmorya, a Machine Learning Engineer at Amazon who has led several machine-learning initiatives across the Amazon ecosystem. FUN FACT: In this talk, Rodrigo Schmidt, director of engineering at Instagram, talks about the different challenges they have faced in scaling Instagram's data infrastructure.
The Challenge of Title Launch Observability: As engineers, we're wired to track system metrics like error rates, latencies, and CPU utilization, but what about metrics that matter to a title's success? While reengineering these systems to accommodate this additional axis is possible, it would entail increased costs.
Growth Engineering at Netflix: Automated. In the Growth Engineering team, we refer to this as the top of the signup funnel. For more background on the signup funnel and Growth Engineering's role in the signup funnel, please read our initial post on the topic: Growth Engineering at Netflix: Accelerating Innovation.
That's because it does not require any pre-prepared schemas, and access to cold/hot storage is fully automatic, with zero latency. Insights are therefore dispersed in a multitude of data lakes, storage systems, and reporting platforms. Moreover, it is fast, powered by its massively parallel processing data lakehouse.
For engineers, instead of whodunit, the question is often “what failed and why?” An engineer can find herself digging through logs, poring over traces, and staring at dozens of dashboards. Edgar provides a powerful and consumable user experience to both engineers and non-engineers alike.
by Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, and Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data, often reaching petabytes, with millisecond access latency has become increasingly vital.
From chunk encoding to assembly and packaging, the result of each previous processing step must be uploaded to cloud storage and then downloaded by the next processing step. Uploading and downloading data always come with a penalty, namely latency.
Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behaviour with the goal of maximizing user joy. The data warehouse is not designed to serve point requests from microservices with low latency.
a Netflix member via Twitter. This is an example of a question our on-call engineers need to answer to help resolve a member issue, which is why we needed to increase engineering productivity via distributed request tracing. That is the first question our engineering teams asked us when integrating the tracer library.
Engineers want their alerting system to be real-time, reliable, and actionable. A few years ago, we were paged by our SRE team because our Metrics Alerting System was falling behind: critical application health alerts were reaching engineers 45 minutes late! It opens doors to support more exciting use cases.
They've posted about Anna's new superpowers in Going Fast and Cheap: How We Made Anna Autoscale: Using Anna v0 as an in-memory storage engine, we set out to address the cloud storage problems described above. Each storage server collects statistics about the requests it serves, the data it stores, etc.
To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks … The post Uber’s Big Data Platform: 100+ Petabytes with Minute Latency appeared first on Uber Engineering Blog.
While clustering across wide-area networks (WANs) is discouraged due to latency issues, leased links can mitigate some connectivity challenges. Keeping queues short minimizes latency, enhances the overall efficiency of message delivery, and keeps a RabbitMQ setup responsive.
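One concrete way to enforce short queues is to declare a maximum length, so excess messages are dropped or dead-lettered instead of building unbounded backlog. A sketch with the pika client; the queue name and limit are placeholders:

```python
import pika  # pip install pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(
    queue="orders",
    durable=True,
    arguments={"x-max-length": 10_000},  # cap backlog at 10k messages
)
conn.close()
```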
"The network latency between cluster nodes should be around 10 ms or less." – A Dynatrace customer, Head of Performance Engineering. For Premium HA, this has been extended from 10 ms latency (in the same network region) to around 100 ms network latency due to asynchronous data replication between regions.
Dynomite is a Netflix open source wrapper around Redis that provides a few additional features like auto-sharding and cross-region replication, and it provided Pushy with low latency and easy record expiry, both of which are critical for Pushy’s workload. As Pushy’s portfolio grew, we experienced some pain points with Dynomite.
by Shefali Vyas Dalal AWS re:Invent is a couple weeks away and our engineers & leaders are thrilled to be in attendance yet again this year! Over the years, this platform took on support for both elastic online services and fully featured batch workloads supporting use cases across Netflix engineering.
Note : you might hear the term latency used instead of response time. Both latency and response time are critical to ensure reliability. Latency typically refers to the time it takes for a single request to travel from its source to its destination. Latency primarily focuses on the time spent in transit.
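A back-of-the-envelope example of the distinction, with made-up numbers:

```python
# The response time a caller observes includes network transit (latency)
# plus server processing; the figures below are illustrative only.
network_latency_ms = 40   # one-way transit time from source to destination
processing_ms = 15        # time the server spends handling the request

response_time_ms = 2 * network_latency_ms + processing_ms  # round trip + service time
print(f"response time: {response_time_ms} ms")  # 95 ms, of which 80 ms is pure transit
```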
By bringing computation closer to the data source, edge-based deployments reduce latency, enhance real-time capabilities, and optimize network bandwidth. Data Overload and Storage Limitations: As IoT devices, and especially industrial IoT devices, proliferate, the volume of data generated at the edge has skyrocketed.
A site reliability engineer (SRE) manually reviews cloud-native front-end application warnings. There is no need to think about schema and indexes, re-hydration, or hot/cold storage. OpenPipeline's high-performance filtering and preprocessing provide full ingest and storage control for the Dynatrace platform.
Amazon DynamoDB offers low, predictable latencies at any scale. In response, we began to develop a collection of storage and database technologies to address the demanding scalability and reliability requirements of the Amazon.com ecommerce platform. Customers can typically achieve average service-side latencies in the single-digit milliseconds.
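A minimal boto3 sketch of the low-latency point reads DynamoDB is built for; the table and key names are hypothetical:

```python
import boto3  # pip install boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("sessions")

# Single-item point read: the access pattern DynamoDB serves in
# single-digit milliseconds at the service side.
resp = table.get_item(Key={"session_id": "abc-123"}, ConsistentRead=False)
item = resp.get("Item")  # None if the key does not exist
print(item)
```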
When a new leader is elected, it loads all data from external storage. In that scenario, the system would need to deal with the data propagation latency directly, for example, by use of timeouts or client-originated update-tracking mechanisms. Active data includes jobs and tasks that are currently running.
These workflows also utilize Davis®, the Dynatrace causal AI engine, and all your observability and security data across all platforms, in context, at scale, and in real time. Storing frequently accessed data in faster storage, usually in-memory caching, improves data retrieval speed and overall system performance.
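As a tiny illustration of that caching point, memoizing a hot lookup in process memory stands in for a real cache tier such as Redis or Memcached; the function and timings are invented:

```python
from functools import lru_cache
import time

@lru_cache(maxsize=1024)
def load_config(tenant_id: str) -> dict:
    time.sleep(0.2)  # simulate a slow backing-store read
    return {"tenant": tenant_id, "feature_flags": ["a", "b"]}

load_config("acme")  # slow: misses the cache, hits the backing store
load_config("acme")  # fast: served from memory
```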
Besides the traditional system hardware, storage, routers, and software, ITOps also includes virtual components of the network and cloud infrastructure. This includes response time, accuracy, speed, throughput, uptime, CPU utilization, and latency. The primary goal of ITOps is to provide a high-performing, consistent IT environment.
Compression in any database is valuable, as it has many advantages, like reduced storage footprint, faster data transmission, etc. Storage reduction alone results in significant cost savings, and we can store more data in the same space. By default, MongoDB provides the Snappy block compression method for storage and network communication.
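If the default Snappy compressor isn't the right trade-off, the block compressor can be overridden per collection. A hedged pymongo sketch, assuming a WiredTiger-backed mongod with zstd support; the collection name is hypothetical:

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
db = client.appdata

# Create a collection whose blocks are compressed with zstd instead of Snappy.
db.create_collection(
    "events",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zstd"}},
)
```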
Netflix Drive relies on a data store that will be the persistent storage layer for assets, and a metadata store which will provide a relevant mapping from the file system hierarchy to the data store entities. Finally, once the encoded copy is prepared, this copy can be persisted by Netflix Drive to a persistent storage tier in the cloud.
Prodicle is one of the many applications that is at the exciting intersection of connecting the world of content productions to Netflix Studio Engineering. We are expected to process 1,000 watermarks for a single distribution in a minute, with non-linear latency growth as the number of watermarks increases.
The first version of our logger library optimized for storage by deduplicating facts and optimized for network I/O using different compression methods for each fact. Since we were optimizing at the logging level for storage and performance, we had less data and metadata to play with when optimizing the query performance.
By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements. One can perform this comparison live on the request path or offline based on the latency requirements of the particular use case.
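A hypothetical sketch of the offline flavor of that comparison: replay recorded requests against both implementations and diff the responses, ignoring volatile fields. The records and compare rules here are invented:

```python
def compare_responses(baseline, candidate, ignore_fields=("timestamp",)):
    """Return the set of fields whose values differ, ignoring volatile ones."""
    keys = set(baseline) | set(candidate)
    return {
        k for k in keys
        if k not in ignore_fields and baseline.get(k) != candidate.get(k)
    }

old = {"title": "Stranger Things", "rank": 1, "timestamp": 1700000000}
new = {"title": "Stranger Things", "rank": 2, "timestamp": 1700000042}
print(compare_responses(old, new))  # {'rank'} -> flag for investigation
```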
With the evolution of storage formats like Apache Parquet and Apache ORC and query engines like Presto and Apache Impala , the Hadoop ecosystem has the potential to become a general-purpose, unified serving layer for workloads that can tolerate latencies … The post Hudi: Uber Engineering’s Incremental Processing Framework on Apache Hadoop appeared (..)
Today, we are releasing a plugin that allows customers to use the Titan graph engine with Amazon DynamoDB as the backend storage layer. It opens up the possibility to enjoy the value that graph databases bring to relationship-centric use cases, without worrying about managing the underlying storage. Enter graph databases.
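To make the "graph on a key-value store" idea concrete, edges can be stored as items keyed by source vertex so that neighbors are a single query away. A hedged boto3 sketch; the table schema below is illustrative, not the plugin's actual layout:

```python
import boto3  # pip install boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
edges = dynamodb.Table("graph-edges")  # PK: from_vertex, SK: to_vertex (hypothetical)

# Store one edge, then fetch all outgoing edges of a vertex in one query.
edges.put_item(Item={"from_vertex": "user#1", "to_vertex": "user#2", "label": "follows"})
resp = edges.query(KeyConditionExpression=Key("from_vertex").eq("user#1"))
print([e["to_vertex"] for e in resp["Items"]])  # neighbors of user#1
```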
STM generates traffic that replicates the typical path or behavior of a user on a network to measure performance (for example, response times, availability, packet loss, latency, jitter, and other variables). One use case for STM is to model the behavior of a customer in the form of a flow of transactions along the buyer's journey.
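A minimal synthetic-probe sketch in that spirit: replay one canonical request on a schedule and record availability and response time. The URL is a placeholder:

```python
import time
import urllib.request

def probe(url="https://example.com/checkout/health", timeout=5):
    """Issue one synthetic request; return (available, response_time_ms)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except OSError:  # covers URLError and timeouts
        ok = False
    elapsed_ms = (time.perf_counter() - start) * 1000
    return ok, elapsed_ms

available, rt = probe()
print(f"available={available} response_time={rt:.1f} ms")
```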
Identifying key Redis metrics such as latency, CPU usage, and memory metrics is crucial for effective Redis monitoring. To monitor Redis instances effectively, collect Redis metrics focusing on cache hit ratio, memory allocated, and latency threshold.
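A small redis-py sketch of collecting those metrics; the host is a placeholder, and hit ratio is derived from keyspace hits and misses:

```python
import time
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)
info = r.info()  # server-side counters from the INFO command

hits, misses = info["keyspace_hits"], info["keyspace_misses"]
hit_ratio = hits / (hits + misses) if hits + misses else 0.0

start = time.perf_counter()
r.ping()  # round trip as a crude latency check
ping_ms = (time.perf_counter() - start) * 1000

print(f"hit ratio: {hit_ratio:.2%}")
print(f"used memory: {info['used_memory_human']}")
print(f"round-trip latency: {ping_ms:.2f} ms")
```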
AWS Graviton2); for memory with the arrival of DDR5 and High Bandwidth Memory (HBM) on-processor; for storage including new uses for 3D Xpoint as a 3D NAND accelerator; for networking with the rise of QUIC and eXpress Data Path (XDP); and so on. Ford, et al., “TCP
Use cases such as gaming, ad tech, and IoT lend themselves particularly well to the key-value data model where the access patterns require low-latency Gets/Puts for known key values. The purpose of DynamoDB is to provide consistent single-digit millisecond latency for any scale of workloads.
Storage: The type of storage and disk used for database servers can have a significant impact on performance and reliability. Cloud: Different cloud providers offer a range of instance types and sizes, each with varying amounts of CPU, memory, and storage. Setting oom_score_adj to -800 makes the Linux out-of-memory killer far less likely to target the database process. Try Percona Distribution for MySQL today!
Such coupling problems abound with our Reloaded architecture, and hence the Media Cloud Engineering and Encoding Technologies teams have been working together to develop a solution that addresses many of the concerns with our previous architecture. This enables us to use our scale to increase throughput and reduce latencies.
For example, the most fundamental abstraction trade-off has always been latency versus throughput. Modern CPUs strongly favor lower latency of operations with clock cycles in the nanoseconds and we have built general purpose software architectures that can exploit these low latencies very well. Where to go from here?
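A toy model of that trade-off: batching amortizes a fixed per-call overhead, which raises throughput while every item's latency grows to the full batch time. All numbers are illustrative:

```python
PER_CALL_OVERHEAD_MS = 1.0   # fixed cost paid once per batch
PER_ITEM_COST_MS = 0.01      # marginal cost per item

def stats(batch_size):
    batch_ms = PER_CALL_OVERHEAD_MS + PER_ITEM_COST_MS * batch_size
    throughput = batch_size / batch_ms * 1000  # items per second
    avg_latency = batch_ms                     # every item waits for the whole batch
    return throughput, avg_latency

for n in (1, 10, 100, 1000):
    tput, lat = stats(n)
    print(f"batch={n:5d}  throughput={tput:10.0f}/s  latency={lat:6.2f} ms")
# Larger batches push throughput up by orders of magnitude while latency climbs.
```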