As organizations continue to modernize their technology stacks, many turn to Kubernetes, an open source container orchestration system for automating software deployment, scaling, and management. Adoption brings its own recurring challenges; five of the most common are cluster instability, resource and cost management, security, observability, and stress on engineering teams.
The Akamas vision is that only an autonomous optimization approach powered by AI can effectively enable performance engineers, SREs, and architects to identify the best configurations that ensure maximum service performance and resilience, at the lowest possible cost and at business speed, while still meeting targets on response times (e.g., below 500 ms) and error rates (e.g., lower than 2%).
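To make such targets concrete, here is a minimal sketch (with made-up numbers and a hypothetical meets_slo helper) of checking metrics against thresholds like the ones above, response time below 500 ms and error rate below 2%:

```python
# Illustrative sketch only: compare observed metrics against SLO-style targets.
def meets_slo(p95_latency_ms: float, error_rate: float,
              latency_target_ms: float = 500.0, error_target: float = 0.02) -> bool:
    """Return True when both the latency and error-rate objectives are met."""
    return p95_latency_ms < latency_target_ms and error_rate < error_target

print(meets_slo(p95_latency_ms=420.0, error_rate=0.011))  # True
print(meets_slo(p95_latency_ms=630.0, error_rate=0.011))  # False: latency breach
```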
MongoDB offers several storage engines that cater to various use cases. The default storage engine in earlier versions was MMAPv1, which utilized memory-mapped files and collection-level locking. The newer, pluggable storage engine, WiredTiger, addresses this with document-level concurrency, prefix compression, and row-based storage.
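As a quick way to see which engine a deployment is actually running, a small sketch using the pymongo driver against a local mongod (the connection string is an assumption):

```python
# Sketch, assuming a locally running mongod and the pymongo driver installed:
# report which storage engine the server is using (WiredTiger is the default
# on modern versions; MMAPv1 was removed in MongoDB 4.2).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")
print("storage engine:", status["storageEngine"]["name"])
```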
According to the Google Site Reliability Engineering (SRE) handbook, monitoring the four golden signals is crucial to delivering high-performing software solutions. These signals (latency, traffic, errors, and saturation) provide a solid means of proactively monitoring production systems via SLOs and tracking business success.
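A hedged illustration of tracking the four signals against example targets; the numbers and thresholds below are made-up stand-ins for values you would pull from your monitoring backend:

```python
# Hypothetical snapshot of the four golden signals for one service.
golden_signals = {
    "latency_ms_p95": 180.0,   # how long requests take
    "traffic_rps": 240.0,      # demand placed on the system
    "error_rate": 0.004,       # fraction of failing requests
    "saturation_cpu": 0.62,    # how "full" the service is
}

# Example SLO-style targets (assumed values, not from the handbook).
targets = {"latency_ms_p95": 500.0, "error_rate": 0.01, "saturation_cpu": 0.80}

for signal, target in targets.items():
    value = golden_signals[signal]
    status = "OK" if value <= target else "BREACH"
    print(f"{signal}: {value} (target <= {target}) -> {status}")
```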
Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.
This means that Dynatrace continues full operation when a majority of nodes are up and a maximum of two nodes are down at a time. The network latency between cluster nodes should be around 10 ms or less. “Dynatrace is a Tier 0 application for us.” – A Dynatrace customer, Head of Performance Engineering. What’s next?
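As a toy illustration of the quorum idea (not Dynatrace's actual failover logic), a cluster stays fully operational while a strict majority of its nodes is still up:

```python
# Toy quorum check: a strict majority of nodes must remain up.
def has_quorum(total_nodes: int, nodes_down: int) -> bool:
    nodes_up = total_nodes - nodes_down
    return nodes_up > total_nodes // 2

# A five-node cluster tolerates up to two nodes being down at a time:
for down in range(4):
    print(f"5 nodes, {down} down -> quorum: {has_quorum(5, down)}")
```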
Every organization’s goal is to keep its systems available and resilient to support business demands. A service-level objective (SLO) is the new contract between business, DevOps, and site reliability engineers (SREs). In their new dashboard, they added dimensions for load, latency, and open problems for each component.
As a bonus, operations staff never needs to update operating systems or hardware, because AWS manages servers with no stoppage of application functionality. AWS continues to improve how it handles latency issues. One factor that dissuades many from using Lambda is the need to restart containers.
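For context, a minimal sketch of a Python Lambda handler: module-level code runs once per cold start (when a new execution environment, i.e. container, is created), while the handler runs on every invocation. HEAVY_CLIENT is a hypothetical stand-in for expensive initialization:

```python
import json

# Runs once per cold start; reused across warm invocations of the same environment.
HEAVY_CLIENT = {"initialized": True}  # hypothetical stand-in for expensive setup

def lambda_handler(event, context):
    # Runs on every invocation.
    return {
        "statusCode": 200,
        "body": json.dumps({"ok": True, "warm": HEAVY_CLIENT["initialized"]}),
    }
```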
Uploading and downloading data always come with a penalty, namely latency.
Figure 3: Video Processing with Index and Virtual Assembly
Using virtual assembly greatly improves the latency performance of the ProRes 422 HQ proxy generation by removing one round trip of cloud downloading and cloud uploading by the physical assembler.
Caches are very useful software components that all engineers must know. They are a cross-cutting concern that applies to all tech areas and architecture layers, such as operating systems, data platforms, backend, frontend, and other components.
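A small sketch of an in-process cache using Python's standard library; expensive_lookup is a hypothetical stand-in for a slow backend call:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_lookup(key: str) -> str:
    # Imagine a slow database query or remote API call here; results are
    # memoized per key, so repeated calls with the same key hit the cache.
    return key.upper()

expensive_lookup("user:42")   # miss: computed and stored
expensive_lookup("user:42")   # hit: served from the cache
print(expensive_lookup.cache_info())  # e.g. hits=1, misses=1, maxsize=1024
```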
STM generates traffic that replicates the typical path or behavior of a user on a network to measure performance (for example, response times, availability, packet loss, latency, jitter, and other variables). One use case for STM is to model the behavior of a customer in the form of a flow of transactions along the buyer’s journey.
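A minimal synthetic check along those lines, sketched with the requests library and an example URL:

```python
# Sketch of a synthetic transaction: issue the same request a real user would
# and record availability, status, and client-observed latency.
import requests

def synthetic_check(url: str, timeout_s: float = 5.0) -> dict:
    try:
        response = requests.get(url, timeout=timeout_s)
        return {
            "url": url,
            "available": response.ok,
            "status": response.status_code,
            "latency_ms": response.elapsed.total_seconds() * 1000,
        }
    except requests.RequestException as exc:
        return {"url": url, "available": False, "error": str(exc)}

# Example URL is hypothetical.
print(synthetic_check("https://example.com/checkout"))
```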
Identifying key Redis metrics such as latency, CPU usage, and memory metrics is crucial for effective Redis monitoring. With these essential support systems in place, you can effectively monitor your databases with up-to-date data about their health and functioning status at all times.
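A short sketch, assuming a local Redis instance and the redis-py client, that pulls a few of those metrics from the INFO command and measures a client-side ping latency:

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)
info = r.info()  # server-reported metrics

print("used_memory_human:", info["used_memory_human"])
print("connected_clients:", info["connected_clients"])
print("instantaneous_ops_per_sec:", info["instantaneous_ops_per_sec"])

# Round-trip latency of a single PING, measured from the client side.
start = time.perf_counter()
r.ping()
print(f"ping latency: {(time.perf_counter() - start) * 1000:.2f} ms")
```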
With the advent of generative AI, there’ll be significant opportunities for product managers, designers, executives, and more traditional software engineers to contribute to and build AI-powered software. Evaluation is the engine, not the afterthought. An easy fix for this involved engineering the system prompt.
In the back-to-basics readings this week I am re-reading a paper from 1995 about the work I did together with Thorsten on solving the problem of end-to-end low-latency communication on high-speed networks. The lack of low latency meant that distributed systems (e.g.
Nowadays, solid-state drives (SSDs) or non-volatile memory express (NVMe) drives are preferred over traditional hard disk drives (HDDs) for database servers due to their faster read and write speeds, lower latency, and improved reliability. Operating system: Linux is the most common operating system for high-performance MySQL servers.
This metric is interesting because we don’t always have the luxury of parallelizing every application we run, and our operating systems almost always process each call (e.g., The NEC Vector Engine processors provide a demonstration of very high single-core bandwidth. Why is the single-core bandwidth increasing so slowly?
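As a rough, single-core illustration (not a proper STREAM benchmark), a streaming read over a large NumPy array gives a ballpark bandwidth figure:

```python
# Crude single-core memory-bandwidth probe; numbers are only indicative.
import time
import numpy as np

a = np.ones(50_000_000, dtype=np.float64)  # ~400 MB to stream through memory
start = time.perf_counter()
total = a.sum()                            # single-threaded streaming read
elapsed = time.perf_counter() - start
print(f"~{a.nbytes / elapsed / 1e9:.1f} GB/s read on one core (sum={total:.0f})")
```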
Marvin Theimer, Amazon Distinguished Engineer, once jokingly said that the evolution of Amazon S3 could best be described as starting off as a single engine Cessna plane, but over time the plane was upgraded to a 737, then a group of 747s, all the way to the large fleet of Airbus 380s that it is now. Expect the unexpected.
Nowadays, the source code to old operating systems can also be found online. For everyone familiar with other operating systems and their CPU load averages, including this state is at first deeply confusing. Why? One system had a ratio of 1.5; latency was acceptable and no one complained.
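A small sketch for a Linux host: read the 1/5/15-minute load averages, which on Linux include tasks in uninterruptible sleep (typically waiting on disk I/O) as well as runnable tasks:

```python
# Read system load averages and express the 1-minute value per CPU.
import os

one, five, fifteen = os.getloadavg()
cpus = os.cpu_count() or 1
print(f"load averages: {one:.2f} {five:.2f} {fifteen:.2f}")
print(f"1-minute load per CPU: {one / cpus:.2f}")
```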
The success of our early results with the Dynamo database encouraged us to write Amazon's Dynamo whitepaper and share it at the 2007 ACM Symposium on Operating Systems Principles (SOSP conference), so that others in the industry could benefit. This was the genesis of the Amazon Dynamo database.
The output expectations will assist in the choice of processing engine, while the process tolerance will add restrictions in terms of processing semantics and error handling. In 2016, Apache Spark introduced Structured Streaming, a new streaming engine based on the Spark SQL abstractions and runtime optimizations.
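A minimal word-count sketch in the spirit of the standard Structured Streaming quick example; the socket source on localhost:9999 is an assumption for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Read a stream of text lines from a socket source (assumes a server on :9999).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split lines into words and maintain a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Write the complete result table to the console after each micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```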
Here’s the set-up as relayed to me by Pat (with permission): At work, I am part of a good sized team working on a large system implementation. One of the very senior engineers with 25+ years experience mentioned a problem with the system. The system just crawled forever and never seemed to get out of this state.
These vendors serve data center players and offer advanced options, such as ScaleGrid’s engine, which ensures that different elements work well together automatically, eliminating the need for manual effort in managing heterogeneous environments.
Within an organization, the responsibility of monitoring these large distributed systems typically falls on site reliability engineering (SRE) teams. Types of Distributed Systems. Concurrency refers to the system’s ability to carry out multiple tasks in parallel and manage the access and usage of shared resources.
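A toy sketch of that last point, managing a shared resource under concurrency: several threads increment a shared counter, and a lock serializes access so no updates are lost:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments: int) -> None:
    global counter
    for _ in range(increments):
        with lock:          # guard the shared resource
            counter += 1    # not atomic on its own, hence the lock

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # always 40000 thanks to the lock
```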
Browsers work differently because of their different underlying engines. Beyond browsers, the website may also run into trouble across different resolutions, operating systems, and browser versions. One workaround is changing the code according to browsers and operating systems. Regular Browser Updates.
This boils down to a single-digit µs latency tolerance in the tail for far memory, which, in addition to security and privacy concerns, rules out remote memory solutions. Thus we’re fundamentally trading (de)compression latency at access time for the ability to pack more data in memory.
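As an illustration of that trade-off (using zlib here purely as a stand-in, not the mechanism from the paper), compressing a cold buffer costs some CPU time but shrinks its memory footprint:

```python
import time
import zlib

# Hypothetical "cold" data that compresses well.
data = b"some cold page contents " * 4096  # roughly 100 KB

start = time.perf_counter()
compressed = zlib.compress(data, level=1)  # fast level, lower ratio
compress_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
restored = zlib.decompress(compressed)
decompress_ms = (time.perf_counter() - start) * 1000

assert restored == data
print(f"ratio: {len(data) / len(compressed):.1f}x, "
      f"compress: {compress_ms:.2f} ms, decompress: {decompress_ms:.2f} ms")
```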
This story starts over twenty years ago, when I was a Distinguished Engineer at Sun Microsystems and Shahin Khan asked me to be the Chief Architect for the High Performance Technical Computing team he was running. To me this positions Fugaku as the first of a new mainstream, rather than a special purpose system.
However, in the Skylake microarchitecture (you can see a list of CPUs here) the PAUSE instruction changed, and the documentation says “the latency of the PAUSE instruction in prior generation microarchitectures is about 10 cycles, whereas in Skylake microarchitecture it has been extended to as many as 140 cycles.”
AWS Developer Relations on how the shift from Robot Operating System (ROS) 1 to ROS 2 will change the landscape for all robot lovers. Join Lee Packham, AWS Solutions Architect, and Enrico Huijbers, AWS Software Development Engineer, to find out how easy it is.
The system needs to maintain a safety margin that is capable of absorbing failure via defense in depth, and failure modes need to be prioritized to take care of the most likely and highest impact risks. In addition to the common financial calculation of risk as the product of probability and severity, engineering risk includes detectability.
In this blog post, we will discuss best practices for the MongoDB ecosystem applied at the operating system (OS) and MongoDB levels. Under OS settings, swappiness is a Linux kernel setting that influences the behavior of the Virtual Memory manager when it needs to allocate swap, ranging from 0 to 100.
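A minimal sketch for inspecting that setting on a Linux host; the value of 1 mentioned below is common guidance for dedicated database servers, not a quote from the post:

```python
# Inspect the kernel's swappiness setting on a Linux host. A low value tells
# the VM manager to prefer dropping page cache over swapping out application
# memory, which matters for memory-hungry databases like MongoDB.
from pathlib import Path

swappiness = int(Path("/proc/sys/vm/swappiness").read_text().strip())
print(f"vm.swappiness = {swappiness}")

if swappiness > 10:
    print("Consider lowering it, e.g. `sudo sysctl -w vm.swappiness=1`")
```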
There are also cases where, although the workload and operational needs seem to fit one solution best, there are other limiting factors that may be blockers (or at least need special handling). What we should really compare are the MySQL and Aurora database engines provided by Amazon RDS. RDS MySQL is 5.5,
Subsystem / Path: The I/O subsystem or path includes those components that are used to support an I/O operation. SQL Server copy-on-write actions are used to maintain snapshot databases in SQL Server 2005.
According to Gartner, the greatest technological developments in 2021 will influence the future, from technology affecting how people operate to AI engineering and hyperautomation. This obligated QA engineers, in particular, to pay more attention to the user interface. According to Statista, approximately 2.87
Deviation metrics: As noted by Wikipedia engineers, data on how much variance exists in your results can tell you how reliable your instruments are, and how much attention you should pay to deviations and outliers. Estimated Input Latency tells us if we are hitting that threshold, and ideally, it should be below 50ms.
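A short sketch of such a deviation metric over hypothetical Estimated Input Latency samples:

```python
# High variance across repeated runs means the measurements themselves are
# less reliable; the sample values below are made up for illustration.
from statistics import mean, stdev

samples_ms = [38.0, 41.5, 39.2, 55.8, 40.1, 37.6, 62.3, 39.9]

print(f"mean = {mean(samples_ms):.1f} ms, stdev = {stdev(samples_ms):.1f} ms")
print(f"runs above the 50 ms threshold: {sum(s > 50 for s in samples_ms)}")
```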