A quick canary test was free of errors and showed lower latency, which is expected given that our standard canary setup routes an equal amount of traffic to both the baseline running on 4xl and the canary on 12xl. Yet average latency degraded by more than 50%, with both CPU and latency patterns becoming more “choppy.”
Sustainable memory bandwidth using multi-threaded code has closely followed the peak DRAM bandwidth, typically delivering best-case throughput of 75%-85% of the peak DRAM bandwidth in each generation. The example below is for a 2005-era processor with 60 ns memory latency and 6.4 GB/s of peak DRAM bandwidth.
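The arithmetic behind such examples is Little's law: the data that must be kept in flight equals latency times bandwidth. A quick sketch with the figures quoted above (the 64-byte cache line size is an assumption):

latency_s = 60e-9        # 60 ns memory latency, from the example
bandwidth_bps = 6.4e9    # 6.4 GB/s peak DRAM bandwidth
line_bytes = 64          # assumed cache line size

in_flight = latency_s * bandwidth_bps   # Little's law: bytes in flight to saturate DRAM
print(in_flight)                        # ~384 bytes
print(in_flight / line_bytes)           # ~6 cache lines of outstanding misses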
It enables multiple operating systems to run simultaneously on the same physical hardware and integrates closely with Windows-hosted services. Teams therefore see how the application code functions and how application operations depend on the underlying hardware resources and the operating system managed by Hyper-V.
What is AWS Lambda? Where does Lambda fit in the AWS ecosystem? AWS Lambda is a serverless compute service that can run code in response to predetermined events or conditions and automatically manage all the computing resources required for those processes. Customizing and connecting these services requires code.
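As a minimal sketch of the programming model (a generic example, not code from the article), a Python Lambda handler receives the triggering event and returns a response, while AWS provisions and scales the compute underneath:

import json

def handler(event, context):
    # 'event' carries the trigger payload (for example, an API Gateway request);
    # 'context' exposes runtime metadata such as the remaining execution time.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }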
This allows teams to sidestep much of the cost and time associated with managing hardware, platforms, and operating systems on-premises, while also gaining the flexibility to scale rapidly and efficiently. One trade-off: when an idle application is triggered, the cold start adds latency, and the same delay recurs whenever the application needs to restart.
The first—and often most surprising for people to learn—thing that I want to draw your attention to is that TTFB counts one whole round trip of latency. The reason is that mobile networks are, as a rule, high-latency connections. Armed with this knowledge, we can soon understand why TTFB can often increase so dramatically on mobile.
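To make that round trip visible, here is a rough standard-library sketch that times a request up to the first byte of the response body (the host is a placeholder; connection setup is included in the measurement):

import http.client
import time

def time_to_first_byte(host: str, path: str = "/") -> float:
    conn = http.client.HTTPSConnection(host, timeout=10)
    start = time.perf_counter()
    conn.request("GET", path)   # the request travels out...
    resp = conn.getresponse()   # ...and the response headers travel back
    resp.read(1)                # first body byte: at least one full round trip has elapsed
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

print(f"TTFB: {time_to_first_byte('example.com'):.3f}s")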
Complementing the hardware is the software on the RAE and in the cloud, and bridging the software on both ends is a bi-directional control plane. When a new hardware device is connected, the Local Registry detects and collects a set of information about it, such as networking information and ESN.
In these modern environments, every hardware, software, and cloud infrastructure component and every container, open-source tool, and microservice generates records of every activity. Observability relies on telemetry derived from instrumentation that comes from the endpoints and services in your multi-cloud computing environments.
It requires purchasing, powering, and configuring physical hardware, training and retaining the staff capable of servicing and securing the machines, operating a data center, and so on. Organizations need enough hardware to serve their anticipated volume and keep things running smoothly without buying too much or too little. Reduced cost.
When performance tuning an application, both the code and the hardware running it should be accounted for. Reduce the amount of code in critical sections. For low latency, applications use the Concurrent Mark Sweep (CMS) or G1 garbage collectors. Thread contention: prefer synchronized blocks over synchronized methods.
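The same advice translates to any language: hold the lock only for the shared-state update, not for the whole function. A minimal sketch in Python (the expensive helper is a hypothetical stand-in):

import threading

_lock = threading.Lock()
_shared_counts: dict[str, int] = {}

def _expensive_transform(name: str) -> int:
    # Hypothetical slow work that touches no shared state.
    return len(name) * 1000

def record_event(name: str) -> None:
    value = _expensive_transform(name)   # do the slow part outside the critical section
    with _lock:                          # lock only the shared-state update
        _shared_counts[name] = _shared_counts.get(name, 0) + value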
An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems, Gan et al. The paper examines the implications of microservices at the hardware, OS and networking stack, cluster management, and application framework levels, as well as the impact of tail latency.
Key takeaways: critical performance indicators such as latency, CPU usage, memory utilization, hit rate, and number of connected clients/slaves/evictions must be monitored to maintain Redis’s high throughput and low latency capabilities. For example: 127.0.0.1:6379> cmdstat_append:calls=797,usec=4480,usec_per_call=5.62
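Counters like cmdstat_append come from Redis's INFO commandstats section. A minimal sketch that pulls them with the redis-py client (assuming a local Redis on the default port):

import redis  # assumes the redis-py package is installed

r = redis.Redis(host="localhost", port=6379)

# redis-py parses INFO commandstats into nested dicts, e.g.
# {'cmdstat_append': {'calls': 797, 'usec': 4480, 'usec_per_call': 5.62}, ...}
stats = r.info("commandstats")
for cmd, fields in sorted(stats.items(), key=lambda kv: kv[1]["usec_per_call"], reverse=True):
    print(f"{cmd}: {fields['calls']} calls, {fields['usec_per_call']} us/call")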
Tue-Thu Apr 25-27: High-Performance and Low-Latency C++ (Stockholm). On April 25-27, I’ll be in Stockholm (Kista) giving a three-day seminar on “High-Performance and Low-Latency C++.”
Assets are server-generated, since client-side generation would require the retrieval of many individual images, which would increase latency and time-to-render. To reduce latency, assets should be generated in an offline fashion and not in real time. First, the fields can be coded by hand.
There is no code or configuration change necessary to capture data and detect existing services. Lift & Shift is where you basically just move physical or virtual hosts to the cloud – essentially you just run your host on somebody else’s hardware. We let the OneAgent run and then leverage the data for the following key use cases.
In traditional database architectures, database engines often run a small search engine or data warehouse engine on the same hardware as the database. In the past, however, you had to write code to manage the data changes and keep the search engine and data warehousing engines in sync. DynamoDB Cross-region Replication.
Here are the bombshell paragraphs: Our datacenter applications seek ever more CPU-efficient and lower-latency communication, which Pony Express delivers. The desire for CPU efficiency and lower latencies is easy to understand. Once the whole fleet has turned over, the code for the now unused version(s) can be removed.
Edge servers are the middle ground – more compute power than a mobile device, but with latency of just a few ms. Wasm functions contain native code compiled at runtime, so they should not be directly migrated as normal JavaScript objects. Why would we want to live migrate web workers? Is the migration worth it, though?
The paper also provides std::observable() as a manual way of adding such a checkpoint in code. Importantly, user code gets this benefit just by building with a hardened C++26 standard library, without any code changes. The hardened standard library is the second big step for language and library safety in C++26.
Shredder is " a low-latency multi-tenant cloud store that allows small units of computation to be performed directly within storage nodes. " A tenant should not be able to see the code or data of other tenants (isolation). " Running end-user compute inside the datastore is not without its challenges of course.
Nowadays, the source code to old operating systems can also be found online. Linux also hard-codes the 1, 5, and 15 minute constants. The TASK_UNINTERRUPTIBLE state is used by code paths that want to avoid interruption by signals, which includes tasks blocked on disk I/O and some locks. This, too, was a dead end. They aren't idle.
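For illustration only, here is a floating-point sketch (the kernel uses fixed-point arithmetic) of the exponentially damped averages behind those hard-coded 1, 5, and 15 minute constants, sampled every 5 seconds:

import math

SAMPLE_INTERVAL_S = 5  # the kernel samples the active task count about every 5 seconds
DECAY = {m: math.exp(-SAMPLE_INTERVAL_S / (m * 60.0)) for m in (1, 5, 15)}

loadavg = {m: 0.0 for m in (1, 5, 15)}

def update(active_tasks: int) -> None:
    # Exponentially damped moving average, one term per hard-coded window.
    for m, e in DECAY.items():
        loadavg[m] = loadavg[m] * e + active_tasks * (1.0 - e)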
A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. The broken Java stacks turned out to be beneficial: they helped group together the os::javaTimeMillis() calls, which otherwise might have been scattered on top of different Java code paths, appearing as thin stacks everywhere.
Different browsers running on different platforms and hardware, respecting our user preferences and browsing modes (Safari Reader / assistive technologies), and being served to geo-locations with varying latency and intermittency all increase the likelihood of something not working as intended.
Applications are packaged into a single, lightweight container with their dependencies, typically including the application’s code, customizations, libraries, and runtime environment. Your workloads, encapsulated in containers, can be deployed freely across different clouds or your own hardware.
Page load time, page length, response time, and response code can also be observed with traditional HTTP monitoring. Both network latency and hardware resources affect these measurements. If the monitored resource is available, a positive response is received.
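A minimal sketch of such a check with the Python standard library (the URL is a placeholder); it records the response code, a coarse response time, and the page length:

import time
import urllib.request

def check(url: str, timeout: float = 10.0):
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()
    elapsed = time.perf_counter() - start
    return resp.status, elapsed, len(body)

status, seconds, length = check("https://example.com/")
print(f"status={status} time={seconds:.3f}s bytes={length}")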
This work is latency critical, because volume IO is blocked until it is complete. Larger cells have better tolerance of tail latency. Studies across three decades have found that software, operations, and scale drive downtime in systems designed to tolerate hardware faults. Cells have seven nodes.
Software and hardware components are autonomous and execute tasks concurrently. A distributed system comprises a variety of hardware and software components with different operating systems and technologies, meaning the processors are separate and independent of each other. State is distributed through the system. Concurrency.
To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses. I don’t expect all of that, but the core can clearly make use of more than 20 GB/s. Why is the single-core bandwidth increasing so slowly? On a VE20B (8 cores, 1.6
After 20 years of neck-and-neck competition, often starting from common code lineages, there just isn't that much left to wring out of the system. For heavily latency-sensitive use-cases like WebXR, this is a critical component in delivering a good experience, as is access to hardware devices. Offscreen Canvas. Compression Streams.
It was created by Alastair Robertson, a talented UK-based developer who has previously won various coding competitions. For example, iostat(1), or a monitoring agent, may tell you your average disk latency, but not the distribution of this latency. Hardware counter-based instrumentation.
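To see why the distribution matters, a toy sketch with synthetic (made-up) latency samples: the mean hides a slow second mode that only the percentiles reveal:

import random
import statistics

random.seed(1)
# Synthetic disk latencies in ms: mostly fast, plus a slow second mode.
samples = [random.gauss(1.0, 0.2) for _ in range(950)]
samples += [random.gauss(12.0, 2.0) for _ in range(50)]

samples.sort()
mean = statistics.fmean(samples)
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"mean={mean:.2f}ms  p50={p50:.2f}ms  p99={p99:.2f}ms")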
Let's talk about the elephant in the room: serverless doesn't really mean that there are no software or hardware servers. Performance - serverless functions that are used less frequently may suffer from warm-up response latency, where the infrastructure needs some time to deploy the function. Amazon: AWS Lambda. IBM: OpenWhisk.
Where possible, remove unused JavaScript code, or focus on delivering only the script that the current page will run. This approach is known as code splitting and is extremely effective in improving TTI. The metrics are: Time to Interactive (TTI), Speed Index, and Estimated Input Latency.
Serverless computing can be a huge benefit to organizations that don't have the necessary resources or teams to manage physical resources like servers/hardware, along with all the maintenance and licensing that goes with them, allowing those teams to focus on developing their code and applications. Benefits of a Serverless Model. Scalability.
In a recent project comparing systems for MariaDB performance, a user had originally been using a tool called sysbench-tpcc to compare hardware platforms before migrating to HammerDB. This is a brief post to highlight the metrics to use to do the comparison, using a separate hardware platform for illustration purposes. hammerdbcli auto ./scripts/tcl/maria/tprocc/maria_tprocc_build.tcl
Performance analysis has two recurring themes: How fast should this code (or “simple” variations on this code) run on this hardware? Interacting components in the execution of an MPI job — a brief outline (from memory): the user source code, which contains an ordered set of calls to MPI routines.
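As a toy illustration of that “ordered set of calls to MPI routines,” a sketch using the mpi4py bindings (assumed installed; launch with something like mpirun -n 2 python demo.py):

from mpi4py import MPI

comm = MPI.COMM_WORLD        # communicator spanning all ranks of the job
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": 42}, dest=1, tag=0)   # point-to-point send
elif rank == 1:
    msg = comm.recv(source=0, tag=0)            # matching receive
    print(f"rank 1 got {msg}")

comm.Barrier()               # collective call: every rank synchronizes here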
The paper sets out what we can do in software given today's hardware, and along the way also highlights areas where cooperation from hardware will be needed in the future. Microarchitectural channels. (Side-channels are similar, but the sender does not actively cooperate.) Threat scenarios. Inter-process communication (IPC) input and output channels.
Here are 8 fallacies of data pipelines: (1) the pipeline is reliable; (2) topology is stateless; (3) the pipeline is infinitely scalable; (4) processing latency is minimal; (5) everything is observable; (6) there is no domino effect; (7) the pipeline is cost-effective; (8) data is homogeneous. The pipeline is reliable? The inconvenient truth is that the pipeline is not reliable.
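Because the pipeline is not reliable, every hop needs a defensive wrapper. A generic sketch of retries with exponential backoff and jitter (send and record are hypothetical stand-ins for a pipeline producer and a message):

import random
import time

def send_with_retry(send, record, max_attempts: int = 5):
    # Retry a flaky pipeline send, backing off exponentially with jitter.
    for attempt in range(max_attempts):
        try:
            return send(record)
        except Exception:
            if attempt == max_attempts - 1:
                raise                              # give up and surface the failure
            backoff_s = (2 ** attempt) * 0.1 * (1 + random.random())
            time.sleep(backoff_s)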
The goal is to produce a low-energy hardware classifier for embedded applications doing local processing of sensor data. The resulting system can integrate seamlessly into a scikit-learn based development process, and dramatically reduces the total energy usage required for classification with very low latency. Introducing race logic.
For this page to be done loading, it needs to be responsive to user input — the “interactive” in “Time to Interactive.” Browsers process user input by generating DOM events that application code listens to. Simulated packet loss and variable latency, however, can make benchmarking extremely difficult and slow.
A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing) Introduction: In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report ( [link] ) to the Xeon Phi x200 (Knights Landing) processors here at TACC.
Instead, we found puzzle after puzzle.
It uses a Solaris Porting Layer (SPL) to provide a Solaris-kernel interface on Linux, so that unmodified ZFS code can execute. There's also a ZFS send/recv code path that should try to use the TASK_INTERRUPTIBLE flag (as suggested by a coworker), to avoid a kernel hang (can't kill -9 the process). Tracing ZFS operation latency.