We also see much higher L1 cache activity combined with a 4x higher count of MACHINE_CLEARS: a usage pattern that occurs when two cores read from and write to unrelated variables that happen to share the same L1 cache line, known as false sharing. A cache line is a concept similar to a memory page; when one core writes its variable, the whole line is invalidated in the other core's cache (Thread 0's cache in this example).
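A minimal sketch of the pattern (illustrative, not from the original article; the 64-byte line size and the alignas padding fix are standard practice, assumed here):

    #include <atomic>
    #include <thread>

    // Falsely shared: both counters sit on the same 64-byte cache line, so
    // each core's writes invalidate the line in the other core's L1.
    struct SharedLine {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    // The usual fix: give each counter its own cache line.
    struct PaddedLine {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    int main() {
        PaddedLine counters;  // swap in SharedLine to observe the slowdown
        std::thread t0([&] { for (long i = 0; i < 10000000; ++i) counters.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t1([&] { for (long i = 0; i < 10000000; ++i) counters.b.fetch_add(1, std::memory_order_relaxed); });
        t0.join();
        t1.join();
    }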
Sustainable memory bandwidth using multi-threaded code has closely followed the peak DRAM bandwidth, typically delivering best-case throughput of 75%-85% of the peak DRAM bandwidth in each generation. Sustaining a given peak DRAM bandwidth requires enough memory-level parallelism; in the example, 6 concurrent 64-byte cache line accesses must be pending at all times to maintain full bandwidth.
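The required number of in-flight cache lines follows from Little's Law (concurrency = bandwidth * latency). A minimal sketch with assumed numbers; the 25.6 GB/s and 15 ns figures are illustrative values chosen to reproduce a count of about 6, not taken from the excerpt:

    #include <cstdio>

    int main() {
        const double bandwidth_Bps = 25.6e9; // assumed peak DRAM bandwidth (25.6 GB/s)
        const double latency_s     = 15e-9;  // assumed effective memory latency (15 ns)
        const double line_bytes    = 64.0;   // one cache line per access
        // Little's Law: accesses that must be in flight to saturate the channel.
        double concurrency = bandwidth_Bps * latency_s / line_bytes;
        std::printf("cache lines in flight: %.1f\n", concurrency); // ~6 for these numbers
    }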
Instead of worrying about infrastructure management functions, such as capacity provisioning and hardware maintenance, teams can focus on application design, deployment, and delivery. Using a low-code visual workflow approach, organizations can orchestrate key services, automate critical processes, and create new serverless applications.
Reducing CPU utilization so that the application consumes only 15% of the initially provisioned hardware. Missing cache settings: make sure you cache resources that don’t change often in the browser, or use a CDN. Missing caching layers: e.g., provide a read-only cache for static data. Well, there are many answers to this.
To create a CPU core that can execute a large number of instructions in parallel, it is necessary to improve both the architecture, which includes the overall CPU design and the instruction set architecture (ISA) design, and the microarchitecture, which refers to the hardware design that optimizes instruction execution.
Routing: If you are using a CDN—and you should be!—a request may be routed to a nearby PoP, only to find that the resource it’s requesting isn’t in that PoP’s cache. Application runtime: It’s kind of obvious really, but the time it takes to run your actual application code is going to be a large contributor to your TTFB.
Part of the answer is this: you have a lot of control over the design and code for the pages on your site, plus a decent amount of control over the first and middle mile of the network your pages travel over. For a myriad of reasons, older hardware can't always accommodate faster speeds.
Effective management of memory stores with policies like LRU/LFU, proactive monitoring of the replication process, and advanced metrics such as cache hit ratio and persistence indicators are crucial for ensuring data integrity and optimizing Redis’s performance. For example, one can check the configured eviction policy and the hit/miss counters from the CLI (the two commands shown are illustrative):

    127.0.0.1:6379> CONFIG GET maxmemory-policy
    127.0.0.1:6379> INFO stats
Apart from library code, maybe your application doesn't have frame pointers either, in which case everything is broken. Only in extreme circumstances does the cost (in processor time and I-cache footprint) translate to a tangible benefit, and those circumstances usually call for hand-coded assembly anyway.
The Solution: Distributed Caching. A widely used technology called distributed caching meets this need by storing frequently accessed data in memory on a server farm instead of within a database. It’s not enough simply to lash together a set of servers hosting a collection of in-memory caches.
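A minimal sketch of the core idea, assuming nothing about any particular product's API: keys are partitioned across a farm of in-memory stores by hashing, so each key has one owning node.

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    class DistributedCache {
        // Each "node" stands in for one cache server's in-memory store.
        std::vector<std::unordered_map<std::string, std::string>> nodes_;
    public:
        explicit DistributedCache(std::size_t node_count) : nodes_(node_count) {}

        // Hash the key to pick the owning node (simple modulo partitioning).
        std::size_t owner(const std::string& key) const {
            return std::hash<std::string>{}(key) % nodes_.size();
        }
        void put(const std::string& key, const std::string& value) {
            nodes_[owner(key)][key] = value;
        }
        std::optional<std::string> get(const std::string& key) const {
            const auto& node = nodes_[owner(key)];
            auto it = node.find(key);
            if (it == node.end()) return std::nullopt;
            return it->second;
        }
    };

Real deployments replace the modulo with consistent hashing so that adding or removing a server remaps only a small fraction of keys; that kind of coordination is part of why simply lashing servers together is not enough.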
Streams provide you with the underlying infrastructure to create new applications, such as continuously updated free-text search indexes, caches, or other creative extensions requiring up-to-date table changes. An AWS Lambda function is a simpler option, as it only requires you to code the logic, then set it and forget it.
An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems, Gan et al., ASPLOS’19. The paper examines the implications of microservices at the hardware, OS and networking stack, cluster management, and application framework levels, as well as the impact of tail latency.
First, its origin was in a monoculture (the browser) where there was no need for compatibility with legacy code. Unfortunately, languages like Python have proven resistant to efficient implementation, partly because of their design, and partly because of limitations imposed by the need to interop with C code (Gope et al., MICRO 15).
Hardware considerations. The first thing we have to consider here is the resources that the underlying host provides to the database: global caches like the InnoDB buffer pool and MyISAM key cache, and session-level caches like the sort buffer, join buffer, random read buffer, etc. Do these queries use more caches?
Hardware error. We focus on software so much that we forget about hardware failures. If the hardware gets disconnected or stops working, then we cannot expect correct output from the software. Example: printers and other hardware devices return bits of information indicating that something is not right.
Defining high availability In general terms, high availability refers to the continuous operation of a system with little to no interruption to end users in the event of hardware or software failures, power outages, or other disruptions. If a primary server fails, a backup server can take over and continue to serve requests.
After 20 years of neck-and-neck competition, often starting from common code lineages, there just isn't that much left to wring out of the system. One theme is access to hardware devices, as with the Keyboard Lock API; another is Compression Streams, which enable developers to compress data efficiently without downloading large amounts of code to the browser.
Most Intel microprocessors support “HyperThreading” (Intel’s trademark for their implementation of “simultaneous multithreading”), which allows the hardware to support (typically) two “Logical Processors” for each physical core (so running one thread per physical core leaves half of the Logical Processors idle).
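As a small illustration (standard C++, not from the original article): std::thread::hardware_concurrency() reports Logical Processors, so on a 2-way SMT machine the physical core count is half of it; the 2-way assumption is ours.

    #include <iostream>
    #include <thread>

    int main() {
        unsigned lp = std::thread::hardware_concurrency(); // Logical Processors
        std::cout << "logical processors: " << lp << "\n";
        // Assuming 2-way SMT (HyperThreading): one thread per core uses half.
        std::cout << "physical cores (assumed 2-way SMT): " << lp / 2 << "\n";
    }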
For most high-end processors these values have remained in the range of 75% to 85% of the peak DRAM bandwidth of the system over the past 15-20 years — an amazing accomplishment given the increase in core count (with its associated cache coherence issues), number of DRAM channels, and ever-increasing pipelining of the DRAMs themselves.
On April 25-27, I’ll be in Stockholm (Kista) giving a three-day seminar on “High-Performance and Low-Latency C++.” It contains updated and new material that reflects the latest C++ standards and compilers, with a focus on using modern C++11/14/17 effectively on modern hardware and memory architectures.
Cache headers missing? Where possible, remove unused JavaScript code, or focus on delivering only the script that will be run by the current page. This approach is known as code splitting and is extremely effective in improving TTI. Service workers can also cache the bytecode result of a parsed and compiled script.
In this particular investigation, which spanned twenty months, we suspected hardware failure, compiler bugs, linker bugs, and other possibilities. Jumping too quickly to blaming hardware or build tools is a classic mistake, but in this case the mistake was that we weren’t thinking big enough.
An INSERT INTO model statement adds the source code for the model pipeline (Python in the example) to the database. A runtime code generator creates a SQL query incorporating all of these optimisations (e.g., categorical encoding). For single or very small numbers of predictions, Raven is faster due to SQL Server’s caching.
The paper sets out what we can do in software given today’s hardware, and along the way also highlights areas where cooperation from hardware will be needed in the future. Microarchitectural channels: there are also stateless interconnects, including buses and on-chip networks. Threat scenarios: inter-process communication (IPC) input and output channels.
It enables the user to measure database performance and make comparative judgements about database hardware and software. These factors meant that, when looking for database performance information, the results for a particular combination of software and hardware were often not available. Cached vs. Scaled Workloads.
In this article, we’re going to see some of those differences and how to overcome them. More specifically, we’re going to talk about storage and UI differences, which are the ones that most often confuse developers writing Flutter code that they want to be cross-platform. Running Different Code on Different Platforms.
Performance analysis has two recurring themes: how fast should this code (or “simple” variations on this code) run on this hardware? Interacting components in the execution of an MPI job — a brief outline (from memory): the user source code, which contains an ordered set of calls to MPI routines.
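A minimal sketch of such an ordered set of MPI calls (the standard C API, compiled as C++; the message contents are arbitrary):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);                  // enter the MPI runtime
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's rank
        if (rank == 0) {
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // to rank 1, tag 0
        } else if (rank == 1) {
            int msg = 0;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d\n", msg);
        }
        MPI_Finalize();                          // leave the runtime
        return 0;
    }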
It was created by Alastair Robertson, a talented UK-based developer who has previously won various coding competitions. hardware: hardware counter-based instrumentation. tracepoint: kernel static instrumentation points. Hence static instrumentation, where event points are hard-coded and become a stable API. bashreadline.bt
A wide range of users with different operating systems, browsers, hardware configurations, and other variables provides a wide sample size that helps developers discover as many issues as possible. Some APM solutions can monitor code during development to ensure performance. What is real user monitoring (RUM)?
It’s one of the things we look at during a health audit, and Kimberly has a great query from her “Plan cache and optimizing for ad hoc workloads” post that’s part of our toolkit. About 1GB of the plan cache is for prepared and procedure plans, and they only take up about 300MB worth of space. DBCC FREEPROCCACHE; GO. EXEC dbo.[
Byte-addressable non-volatile memory (NVM) will fundamentally change the way hardware interacts, the way operating systems are designed, and the way applications operate on data. The beauty of persistent memory is that we can use memory layouts for persistent data (with some considerations for volatile caches, etc.). What about security?
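A minimal sketch of using an ordinary memory layout for persistent data, via plain POSIX mmap/msync rather than any specific NVM library (the file path and struct are illustrative); the explicit msync stands in for the "considerations for volatile caches":

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct PersistentCounter {
        long value;  // an ordinary in-memory layout, kept in a persistent mapping
    };

    int main() {
        int fd = open("/tmp/counter.pmem", O_CREAT | O_RDWR, 0644); // illustrative path
        if (fd < 0) return 1;
        if (ftruncate(fd, sizeof(PersistentCounter)) != 0) return 1;
        void* p = mmap(nullptr, sizeof(PersistentCounter),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;
        auto* counter = static_cast<PersistentCounter*>(p);
        counter->value += 1;                           // plain load/store access
        msync(p, sizeof(PersistentCounter), MS_SYNC);  // flush before claiming durability
        munmap(p, sizeof(PersistentCounter));
        close(fd);
    }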
For this page to be done loading, it needs to be responsive to user input — the “interactive” in “Time to Interactive.” Browsers process user input by generating DOM events that application code listens to. The gzip compression factor for JS code is between 5x and 7x. Execute the script.
According to the article “When Women Stopped Coding”, the percentage of women in Computer Science 30 years ago was nearly twice what it is now. Hardware engineers design and implement solutions in RTL, while software engineers attempt to solve the problem either at the OS or application level.
Stable media is commonly physical disk storage, but other devices and certain caching facilities qualify as well. Many high-end disk subsystems provide high-speed cache facilities to reduce the latency of read and write operations. This cache is often supported by a battery-powered backup facility.
The cost is how much data the browser has to download to display your website, and the resource usage of the hardware serving and receiving the website. This represents a relatively meager saving, but by establishing a philosophy of pruning unwanted code and requests from our pages, we can make much more significant performance improvements.
So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            for (k=0; k<N; k++)
                C[i][j] += A[i][k] * B[k][j];
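A hedged follow-on sketch (our addition, not from the excerpt): the same triple loop with cache blocking, the standard restructuring that keeps tiles of A and B resident in cache so the loop is limited less by DRAM bandwidth; the block size BS is an assumed tuning parameter.

    #include <algorithm>

    const int BS = 64;  // assumed tile size; tune to the cache level being targeted

    void matmul_blocked(int N, double** A, double** B, double** C) {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    // multiply one BS x BS tile; std::min handles edge tiles
                    for (int i = ii; i < std::min(ii + BS, N); ++i)
                        for (int k = kk; k < std::min(kk + BS, N); ++k)
                            for (int j = jj; j < std::min(jj + BS, N); ++j)
                                C[i][j] += A[i][k] * B[k][j];
    }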
GHz 4th Generation Intel Xeon Scalable processors (code-named Sapphire Rapids). Up to 20% higher compute performance than z1d instances. Up to 50 Gbps of networking speed. Up to 40 Gbps of bandwidth to the Amazon Elastic Block Store (EBS). We can also verify these capabilities by running some simple benchmarks on the different subsystems.
My VirtualScan program just called NtQueryVirtualMemory in the obvious loop to scan the address space of a specified process. The code worked, it took a really long time to scan the gmail process (10-15 seconds), and it triggered the hang. The CFG memory block is best thought of as a cache with bounded size.
My development colleagues and I are starting a regular blog series outlining the vast range of scalability improvements that allow SQL Server 2016 to run across a wide array of hardware configurations, faster and better than previous releases of SQL Server. The following table is taken from an ASP.NET session state cache stress test.
An example of a specification is the correct operation of the hardware of a microprocessor. An SDC (silent data corruption) is the worst possible outcome of a fault, as it can have an arbitrary impact on the correctness of software running on the hardware. An example protection technique that can guard against such faults is an Error Correcting Code (ECC).
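As a hedged illustration of the ECC idea (a generic Hamming(7,4) code that corrects any single-bit error, not the specific scheme any given processor uses):

    #include <cstdint>
    #include <cstdio>

    // Encode 4 data bits (low nibble) into 7 bits with 3 parity bits.
    // 1-indexed bit positions: p1 p2 d0 p4 d1 d2 d3
    uint8_t encode(uint8_t nibble) {
        int d0 = (nibble >> 0) & 1, d1 = (nibble >> 1) & 1;
        int d2 = (nibble >> 2) & 1, d3 = (nibble >> 3) & 1;
        int p1 = d0 ^ d1 ^ d3;  // covers positions 1,3,5,7
        int p2 = d0 ^ d2 ^ d3;  // covers positions 2,3,6,7
        int p4 = d1 ^ d2 ^ d3;  // covers positions 4,5,6,7
        return p1 << 0 | p2 << 1 | d0 << 2 | p4 << 3 | d1 << 4 | d2 << 5 | d3 << 6;
    }

    // Recompute the parities; the syndrome is the 1-indexed position of a
    // single flipped bit (0 means no error). Correct it, then extract data.
    uint8_t decode(uint8_t cw) {
        auto bit = [&](int pos) { return (cw >> (pos - 1)) & 1; };
        int syndrome = (bit(1) ^ bit(3) ^ bit(5) ^ bit(7))
                     | (bit(2) ^ bit(3) ^ bit(6) ^ bit(7)) << 1
                     | (bit(4) ^ bit(5) ^ bit(6) ^ bit(7)) << 2;
        if (syndrome) cw ^= 1 << (syndrome - 1);  // flip the faulty bit back
        return bit(3) | bit(5) << 1 | bit(6) << 2 | bit(7) << 3;
    }

    int main() {
        uint8_t cw = encode(0xB);
        cw ^= 1 << 4;  // inject a single-bit fault at position 5
        std::printf("recovered nibble: %#x\n", decode(cw));  // prints 0xb
    }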