Code, Latency and Tuning - Technology Performance Pulse

Netflix’s Distributed Counter Abstraction

The Netflix TechBlog

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Latency

Latency Cache Infrastructure Strategy

Migrating Critical Traffic At Scale with No Downtime — Part 1

The Netflix TechBlog

MAY 4, 2023

Migrating Critical Traffic At Scale with No Downtime — Part 1. Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. There is also a risk of impact on device QoE, especially on low-resource devices.

Traffic

Traffic Latency Tuning Systems

Performance Tuning Java Applications in Linux

DZone

DECEMBER 4, 2019

You may also like: How to Properly Plan JVM Performance Tuning. When performance tuning an application, both the code and the hardware running the code should be accounted for. Reduce the amount of code in critical sections. For low latency, applications use the Concurrent Mark and Sweep (CMS) algorithm or G1 GC.

Tuning

Tuning Java Performance Hardware

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

The Netflix TechBlog

SEPTEMBER 10, 2024

As the scale of the messages being processed increased and we were making more code changes in the message processor, we found ourselves looking for something more flexible. In our case, we value low latency — the faster we can read from KeyValue, the faster these messages can get delivered. It served Pushy’s needs well for many years.

Latency

Latency Cache Tuning Efficiency

How Dynatrace boosts production resilience with Site Reliability Guardian

Dynatrace

MAY 17, 2023

To ensure high standards, it’s essential that your organization establish automated validations in an early phase of the software development process—ideally when code is written. In this case, the four golden signals (latency, traffic, errors, and saturation) are derived from span attributes and DQL metric queries via Dynatrace Grail™.

DevOps

DevOps Traffic Latency Best Practices

OpenTelemetry 101: A nontechnical guide for IT leaders and enthusiasts

Dynatrace

JULY 22, 2024

Traces are used for performance analysis, latency optimization, and root cause analysis. Instrumentation involves adding code to your application to collect this tracking information, akin to installing security cameras in a store to monitor customer movement and behavior. Contextualize data. Employ efficient sampling.

Latency

Latency Best Practices Metrics Open Source

Automated observability, security, and reliability at scale

Dynatrace

JULY 18, 2023

Dynatrace Configuration as Code enables complete automation of the Dynatrace platform’s configuration, ensuring that software is secure and reliable. With Configuration as Code, developers can manage their observability and security tasks with config files that can be developed alongside source code conveniently and at scale.

Best Practices

Best Practices Code Infrastructure Latency

The Netflix Cosmos Platform

The Netflix TechBlog

MARCH 1, 2021

It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. Local development tools including specialized test runners, code generators, and a command line interface. Productivity — Local Delivery — A

Serverless

Serverless Media Latency Social Media

Faster time to value with enhanced handling of OneAgent runtime data

Dynatrace

SEPTEMBER 23, 2020

Storage mount points in a system might be larger or smaller, local or remote, with high or low latency, and various speeds. Until now, all OneAgent runtime files were stored in a fixed, hard-coded location. Improved code module injection resiliency. Stay tuned for upcoming news about these changes. See details below.

Storage

Storage Latency Operating System Network

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

The Netflix TechBlog

OCTOBER 27, 2020

The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly-reliable key-value store. Figure 1 shows how we use Bulldozer to move data at Netflix. Moving data with Bulldozer at Netflix.

Latency

Latency Storage Big Data Tuning

Applying Netflix DevOps Patterns to Windows

The Netflix TechBlog

AUGUST 22, 2019

We had several goals in mind when trying to improve the baking methodology: configuration as code, leveraging Spinnaker for continuous delivery, and eliminating toil. Configuration as Code: The first part of our new Windows baking solution is Packer. We now have the software and instance configuration as code.

DevOps

DevOps AWS Tuning Infrastructure

Rebuilding Netflix Video Processing Pipeline with Microservices

The Netflix TechBlog

JANUARY 10, 2024

This architecture shift greatly reduced the processing latency and increased system resiliency. We expanded pipeline support to serve our studio/content-development use cases, which had different latency and resiliency requirements as compared to the traditional streaming use case. This testing stage took about two weeks.

Processing

Processing Media Latency Innovation

What is serverless computing? Driving efficiency without sacrificing observability

Dynatrace

JANUARY 26, 2021

When an application is triggered, it can cause latency as the application starts. This creates latency when they need to restart. Customizable, no-code dashboards in Dynatrace give you direct insight into every service without scanning through the countless logs generated across your applications.

Serverless

Serverless Efficiency Lambda AWS

Migrating Critical Traffic At Scale with No Downtime — Part 2

The Netflix TechBlog

JUNE 13, 2023

By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements. They enable us to further fine-tune and configure the system, ensuring the new changes are integrated smoothly and seamlessly.

Traffic

Traffic Metrics Systems Strategy

Build and operate multicloud FaaS with enhanced, intelligent end-to-end observability

Dynatrace

APRIL 25, 2023

Higher latency and cold start issues due to the initialization time of the functions. Instrument your functions using either our cloud-native integrations, which give you automatic instrumentation simply by adding the Dynatrace AWS Lambda Layer or use a monitoring-as-code approach utilizing OpenTelemetry to add distributed tracing.

Serverless

Serverless Lambda Azure AWS

Telltale: Netflix Application Monitoring Simplified

The Netflix TechBlog

AUGUST 13, 2020

You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning?” Telltale learns what constitutes typical health for an application, no alert tuning required. For example, a latency increase is less critical than an error rate increase, and some error codes are less critical than others.

Monitoring

Monitoring Tuning Traffic Metrics

Towards a Reliable Device Management Platform

The Netflix TechBlog

AUGUST 30, 2021

MQTT is an OASIS standard messaging protocol for the Internet of Things (IoT) and was designed as a highly lightweight yet reliable publish/subscribe messaging transport that is ideal for connecting remote devices with a small code footprint and minimal network bandwidth. million elements.

Latency

Latency Traffic Transportation Cloud

DevOps automation: From event-driven automation to answer-driven automation [with causal AI]

Dynatrace

JULY 24, 2023

For example, consider an e-commerce website that automatically sends personalized discount codes to customers who abandon their shopping carts. This event-driven automation triggers the action of sending the discount code only when the customer abandons the cart, minimizing revenue loss and increasing conversion rates.

DevOps

DevOps Traffic Efficiency Servers

The road to observability demo part 3: Collect, instrument, and analyze telemetry data automatically with Dynatrace

Dynatrace

MAY 17, 2023

However, this method limited us to instrumenting the code manually and collecting specific sets of data we defined upfront. The other sections on that page (such as Disk analysis) provide further information and charts on topics such as available disk space, latency, dropped network packets, refused connections, and more.

Metrics

Metrics Database Monitoring Network

Streaming SQL in Data Mesh

The Netflix TechBlog

NOVEMBER 3, 2023

Additionally, instead of implementing business logic by composing multiple individual Processors together, users could express their logic in a single SQL query, avoiding the additional resource and latency overhead that came from multiple Flink jobs and Kafka topics. Stay tuned for more updates!

Processing

Processing Engineering Infrastructure Latency

Tuning SQL Server Reporting Services

SQL Performance

SEPTEMBER 17, 2019

This article will cover many areas that database administrators need to be aware of in order to properly license, recover, and tune a Reporting Services installation. Tuning Options. Tuning SSRS is much like tuning any other application. Disk latency for ReportServer and ReportServerTempDB is very important. General Tuning.

Tuning

Tuning Servers Database Best Practices

What are SLOs? How service-level objectives work with SLIs to deliver on SLAs

Dynatrace

DECEMBER 2, 2021

Organizations commonly use SLOs in production environments to ensure released code stays within error budgets. You can set SLOs based on individual indicators, such as batch throughput, request latency, and failures-per-second. Join us for the on-demand performance clinic, Automating SLOs as code–from Ops to Dev with Dynatrace.
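The error-budget idea in this excerpt can be illustrated with a minimal sketch. The SLO target, request count, and failure count below are hypothetical values chosen for illustration, and the function name `error_budget` is an assumption rather than part of any Dynatrace API.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute how much of an error budget has been used.

    slo_target is the fraction of requests that must succeed (e.g. 0.999).
    All inputs here are illustrative; a real setup would read them from
    a monitoring system.
    """
    # Number of failures the SLO tolerates over this window.
    allowed_failures = round(total_requests * (1 - slo_target))
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "within_budget": remaining >= 0,
    }


# Hypothetical window: 99.9% target, one million requests, 400 failures.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(budget)
```

With these assumed numbers the SLO tolerates 1,000 failures, so 400 failures leave 600 in the budget and the release stays within it.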

Metrics

Metrics Best Practices DevOps Infrastructure

Crucial Redis Monitoring Metrics You Must Watch

Scalegrid

JANUARY 25, 2024

Key Takeaways: Critical performance indicators such as latency, CPU usage, memory utilization, hit rate, and number of connected clients/slaves/evictions must be monitored to maintain Redis’s high throughput and low latency capabilities. 127.0.0.1:6379> cmdstat_append:calls=797,usec=4480,usec_per_call=5.62
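The `cmdstat_append` line quoted in this excerpt can be read programmatically. The following is a minimal sketch, assuming the line has the `name:key=value,...` shape shown above; the helper name `parse_cmdstat` is hypothetical and not part of any Redis client library.

```python
def parse_cmdstat(line: str) -> dict:
    """Split one command-statistics line into a command name and numeric fields."""
    name, _, fields = line.strip().partition(":")
    prefix = "cmdstat_"
    command = name[len(prefix):] if name.startswith(prefix) else name
    stats = {}
    for pair in fields.split(","):
        key, _, value = pair.partition("=")
        stats[key] = float(value)
    return {"command": command, **stats}


# The sample line is taken from the excerpt above.
sample = "cmdstat_append:calls=797,usec=4480,usec_per_call=5.62"
parsed = parse_cmdstat(sample)

# usec_per_call is the total time in microseconds divided by the call count.
derived = parsed["usec"] / parsed["calls"]
print(parsed["command"], round(derived, 2))  # prints "append 5.62"
```

Recomputing the average from `usec` and `calls` matches the reported `usec_per_call`, which is a quick sanity check when tracking per-command latency over time.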

Metrics

Metrics Monitoring Latency Cache

AI Essentials for Tech Executives

O'Reilly

FEBRUARY 18, 2025

On April 24, O'Reilly Media will be hosting Coding with AI: The End of Software Development as We Know It, a live virtual tech conference spotlighting how AI is already supercharging developers, boosting productivity, and providing real value to their organizations. We're experiencing high latency in responses.

Latency

Latency Tuning Metrics Testing

Zero Configuration Service Mesh with On-Demand Cluster Discovery

The Netflix TechBlog

AUGUST 29, 2023

IPC clients are instantiated targeting that VIP or SVIP, and the Eureka client code handles the translation of that VIP to a set of IP and port pairs by fetching them from the Eureka server. There is a downside to fetching this data on-demand: this adds latency to the first request to a cluster.

Traffic

Traffic Latency Cloud C++

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

O'Reilly

MARCH 25, 2025

What we see here, though, is the emergence of the first iterations of the LLM SDLC: We're not yet changing our embeddings, fine-tuning, or business logic; we're not using unit tests, CI/CD, or even a serious evaluation framework, but we're building, deploying, monitoring, evaluating, and iterating! We tested both retrieval quality (e.g.,

Systems

Systems Development Tuning Monitoring

Snap: a microkernel approach to host networking

The Morning Paper

NOVEMBER 10, 2019

Here are the bombshell paragraphs: Our datacenter applications seek ever more CPU-efficient and lower-latency communication, which Pony Express delivers. The desire for CPU efficiency and lower latencies is easy to understand. Once the whole fleet has turned over, the code for the now unused version(s) can be removed.

Network

Network Transportation Latency Entertainment

The evolution of single-core bandwidth in multicore processors

John McCalpin

APRIL 25, 2023

To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses. Stay tuned! I don’t expect all of that, but the core can clearly make use of more than 20 GB/s. Why is the single-core bandwidth increasing so slowly?
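The interaction between bandwidth, latency, and concurrency mentioned in this excerpt is commonly expressed with Little's Law: concurrency = latency × bandwidth. The sketch below works one example; the 80 ns latency, the 64-byte cache line, and the 10 outstanding misses are illustrative assumptions, not figures from the article, while the 20 GB/s target echoes the number quoted above.

```python
CACHE_LINE_BYTES = 64  # assumed cache-line size


def sustained_bandwidth_gbs(outstanding_lines: int, latency_ns: float) -> float:
    """Bandwidth sustained with a given number of cache lines in flight.

    Bytes per nanosecond is numerically equal to GB/s (1e9 bytes per 1e9 ns).
    """
    return outstanding_lines * CACHE_LINE_BYTES / latency_ns


def required_concurrency(bandwidth_gbs: float, latency_ns: float) -> float:
    """Cache lines that must be in flight to sustain the target bandwidth."""
    return bandwidth_gbs * latency_ns / CACHE_LINE_BYTES


# Hypothetical core: 10 outstanding misses at 80 ns memory latency.
print(sustained_bandwidth_gbs(10, 80.0))   # 8.0 GB/s
# Concurrency needed to reach 20 GB/s at the same assumed latency.
print(required_concurrency(20.0, 80.0))    # 25.0 cache lines
```

Under these assumptions, a core limited to 10 concurrent misses cannot exceed 8 GB/s no matter how fast the memory channels are, which is one way to see why single-core bandwidth depends on latency and concurrency together.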

Benchmarking

Benchmarking Cache Latency Tuning

Incremental Processing using Netflix Maestro and Apache Iceberg

The Netflix TechBlog

NOVEMBER 20, 2023

As our business scales globally, the demand for data is growing and the needs for scalable low latency incremental processing begin to emerge. So we don’t need to hard code the lookback window in the business logic. There are three common issues that the dataset owners usually face.

Processing

Processing Big Data Efficiency Engineering

Software engineering for machine learning: a case study

The Morning Paper

JULY 7, 2019

In addition to availability, our respondents focus most heavily on supporting the following data attributes: “accessibility, accuracy, authoritativeness, freshness, latency, structuredness, ontological typing, connectedness, and semantic joinability.” To address this, rigorous rollout processes are required.

Software Engineering

Software Engineering Engineering Software Software

Growth Engineering at Netflix- Creating a Scalable Offers Platform

The Netflix TechBlog

FEBRUARY 9, 2021

This platform unlocks tremendous business value since product-oriented teams are now free to use the platform to experiment with different product offerings for our global audience, with little to no code changes required. This shape facilitates code reuse at the UI layer as well as the service layers. The world is constantly changing.

Engineering

Engineering Scalability Architecture Innovation

DBLog: A Generic Change-Data-Capture Framework

The Netflix TechBlog

DECEMBER 17, 2019

Preventing code reuse across databases. The same code is used for MySQL and PostgreSQL and can be used for other similar databases as well. Passive instances across regions are also possible, though it is recommended to operate in the same region as the database host in order to keep the change capture latencies low.

Database

Database Traffic Transportation Open Source

The Speed of Time

Brendan Gregg

SEPTEMBER 25, 2021

A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. The broken Java stacks turned out to be beneficial: They helped group together the os::javaTimeMillis() calls which otherwise might have been scattered on top of different Java code paths, appearing as thin stacks everywhere.

Speed

Speed Java AWS Virtualization

Expanding the Cloud: Amazon Machine Learning Service, the Amazon Elastic Filesystem and more

All Things Distributed

APRIL 9, 2015

The Amazon ML console and API provide data and model visualization tools, as well as wizards to guide you through the process of creating machine learning models, measuring their quality and fine-tuning the predictions to match your application requirements.

Lambda

Lambda Cloud IoT AWS

bpftrace (DTrace 2.0) for Linux 2018

Brendan Gregg

OCTOBER 8, 2018

Screenshot: tracing read latency for PID 181: # bpftrace -e 'kprobe:vfs_read /pid == 30153/ { @start[tid] = nsecs; } kretprobe:vfs_read /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'. Since I helped develop bpftrace, I'm aware of how fresh my own code is and how likely I introduced bugs.

C++

C++ Virtualization Programming Latency

Monitoring Serverless Applications

Dotcom-Monitor

NOVEMBER 11, 2020

Serverless computing can be a huge benefit to organizations that don’t have the necessary resources or teams to manage physical resources, like servers/hardware, and all the maintenance and licensing that goes along with that, allowing them to focus on developing their code and applications. Benefits of a Serverless Model. Scalability.

Serverless

Serverless Monitoring Lambda Latency

Using Modern Image Formats: AVIF And WebP

Smashing Magazine

SEPTEMBER 29, 2021

Tip: When evaluating quality, compression and fine-tuning of modern formats, Squoosh.app’s ability to perform a visual side-by-side comparison is helpful. The goal was to develop a new open-source video coding format that is both state-of-the-art and royalty-free. It calls Rust code in the browser using a WebWorker.

Open Source

Open Source Speed Website Google

Testing MySQL 8.0.16 on Skylake with innodb_spin_wait_pause_multiplier

HammerDB

MAY 5, 2019

However, in the Skylake microarchitecture (you can see a list of CPUs here) the PAUSE instruction changed and in the documentation it says “the latency of the PAUSE instruction in prior generation microarchitectures is about 10 cycles, whereas in Skylake microarchitecture it has been extended to as many as 140 cycles.”

Testing

Testing Tuning Latency Storage

Plan Your Multi Cloud Strategy

Scalegrid

MARCH 22, 2024

They can also bolster uptime and limit latency issues or potential downtimes. Adopting Infrastructure as Code (IaC) makes transitioning to a multi-cloud architecture more efficient, allowing streamlined setup processes.

Strategy

Strategy Cloud Government Innovation

Friends don't let friends build data pipelines

Abhishek Tiwari

JULY 12, 2018

Here are 8 fallacies of data pipelines: the pipeline is reliable; topology is stateless; the pipeline is infinitely scalable; processing latency is minimum; everything is observable; there is no domino effect; the pipeline is cost-effective; data is homogeneous. The pipeline is reliable: The inconvenient truth is that the pipeline is not reliable.

Latency

Latency Analytics Scalability Engineering

Solaris to Linux Migration 2017

Brendan Gregg

SEPTEMBER 5, 2017

It uses a Solaris Porting Layer (SPL) to provide a Solaris-kernel interface on Linux, so that unmodified ZFS code can execute. There's also a ZFS send/recv code path that should try to use the TASK_INTERRUPTIBLE flag (as suggested by a coworker), to avoid a kernel hang (can't kill -9 the process). Tracing ZFS operation latency.

Virtualization

Virtualization AWS Engineering Hardware

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

John McCalpin

JANUARY 22, 2018

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing) Introduction: In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report ( [link] ) to the Xeon Phi x200 (Knights Landing) processors here at TACC.

Latency

Latency Hardware Code Testing

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

John McCalpin

JANUARY 22, 2018

Introduction: In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report ( [link] ) to the Xeon Phi x200 (Knights Landing) processors here at TACC. of the “adjusted peak performance”, there is no longer a significant upside to performance tuning.

Latency

Latency Hardware Code Testing
