You’re half awake and wondering, “Is there really a problem, or is this just an alert that needs tuning?” Over the years we’ve learned from on-call engineers about the pain points of application monitoring: too many alerts, too many dashboards to scroll through, and too much configuration and maintenance. By Andrei U.
Migrating Critical Traffic At Scale with No Downtime — Part 1, by Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. Logging is selective to cases where the old and new responses do not match.
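The selective-logging idea can be sketched in a few lines of Python (the function and field names below are hypothetical, not Netflix's actual replay tooling): compare the responses from the old and new paths and emit a log record only when they differ.

```python
import json
import logging

logger = logging.getLogger("replay-mismatch")

def compare_and_log(request_id: str, legacy_response: dict, new_response: dict) -> None:
    """Compare responses from the legacy and new code paths; log only on mismatch."""
    if legacy_response == new_response:
        return  # matching responses are not logged, which keeps log volume low
    logger.warning(
        "response mismatch for request %s: %s",
        request_id,
        json.dumps({"legacy": legacy_response, "new": new_response}, default=str),
    )
```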
Digital experience monitoring (DEM) allows an organization to optimize customer experiences by taking into account the context surrounding digital experience metrics. What is digital experience monitoring? Primary digital experience monitoring tools.
You will need to know which Redis metrics to watch, and a tool to monitor these critical server metrics to ensure the server's health. This blog post lists the important database metrics to monitor. Effective monitoring of key performance indicators plays a crucial role in maintaining this optimal speed of operation.
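As an illustration of collecting such metrics, here is a minimal sketch using the redis-py client and the INFO command; the metric selection and hit-ratio calculation are our own example, not the article's list.

```python
import redis  # redis-py client

# Connection details are placeholders; adjust host/port for your deployment.
r = redis.Redis(host="localhost", port=6379)

info = r.info()  # the INFO command returns a dict of server statistics

# A few commonly watched health metrics (a non-exhaustive selection):
metrics = {
    "used_memory": info["used_memory"],
    "connected_clients": info["connected_clients"],
    "instantaneous_ops_per_sec": info["instantaneous_ops_per_sec"],
    "keyspace_hits": info["keyspace_hits"],
    "keyspace_misses": info["keyspace_misses"],
}

hits, misses = metrics["keyspace_hits"], metrics["keyspace_misses"]
hit_ratio = hits / (hits + misses) if (hits + misses) else 1.0
print(metrics, f"hit_ratio={hit_ratio:.2%}")
```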
To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render. The control group’s traffic utilized the legacy Falcor stack, while the experiment population leveraged the new GraphQL client and was directed to the GraphQL Shim. We then collect and analyze the performance of the two clusters.
Unlike traditional monitoring, which focuses on watching individual metrics for system health indicators with no overall context, observability goes deeper, analyzing telemetry data for a comprehensive view of the system’s internal state in the context of the wider system. There are three main types of telemetry data: metrics, logs, and traces.
When a serverless application is triggered after sitting idle, the cold start adds latency while the application initializes, and the same delay recurs whenever the functions need to restart. Monitoring serverless applications. Because serverless applications typically run in specialized environments, administrators worry about having adequate monitoring and observability capabilities.
Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.
By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.
It supports both high-throughput services that consume hundreds of thousands of CPUs at a time and latency-sensitive workloads where humans are waiting for the results of a computation, with observability via built-in logging, tracing, monitoring, alerting, and error classification. Capacity (e.g., containers) is provisioned in advance of demand to reduce startup latencies in Stratum.
A small percentage of production traffic is redirected to the two new clusters, allowing us to monitor the new version’s performance and compare it against the current version. They enable us to further fine-tune and configure the system, ensuring the new changes are integrated smoothly and seamlessly.
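A weighted routing decision of the kind described can be sketched in a few lines; the 2% canary share and cluster names below are illustrative assumptions, not Netflix's actual configuration.

```python
import random

CANARY_WEIGHT = 0.02  # fraction of production traffic sent to the new clusters

def choose_cluster() -> str:
    """Pick which cluster should serve the next request."""
    if random.random() < CANARY_WEIGHT:
        return "canary"    # new version being monitored and compared
    return "baseline"      # current production version

# Example: tally where 10,000 simulated requests would be routed.
counts = {"canary": 0, "baseline": 0}
for _ in range(10_000):
    counts[choose_cluster()] += 1
print(counts)
```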
Higher latency and cold start issues due to the initialization time of the functions. Connect Dynatrace to your cloud vendor to gather relevant infrastructure monitoring data, which gives you essential health insights. Enable faster development and deployment cycles by abstracting away the infrastructure complexity.
“And as the cost is going down, we’re also monitoring to see what’s happening to application performance. You can ask for the best configuration to reduce latency or improve the user experience.” For Doni, it’s all about balance. It’s not just a cost-reduction tool.
The Challenge of Title Launch Observability: As engineers, we’re wired to track system metrics like error rates, latencies, and CPU utilization, but what about the metrics that matter to a title’s success? Option 1: Log Processing. Log processing offers a straightforward solution for monitoring and analyzing title launches.
This architecture shift greatly reduced the processing latency and increased system resiliency. We expanded pipeline support to serve our studio/content-development use cases, which had different latency and resiliency requirements as compared to the traditional streaming use case. Step 1: divide the input video into small chunks.
This separation allows us to tune system configuration and scaling policies independently for different event priorities and traffic patterns. Observability At Netflix, we put a strong emphasis on building robust monitoring into our systems to provide a clear view of system health.
We use monitored demo applications to deliver constant load and a defined set of business transactions. In this case, the four golden signals (latency, traffic, errors, and saturation) are derived from span attributes and DQL metric queries via Dynatrace Grail™. The queries are depicted below (sensitive data has been removed).
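The article derives the signals with DQL queries in Dynatrace; as a language-agnostic illustration of the same idea, the sketch below computes the four golden signals from a window of span records (the Span shape, capacity figure, and percentile choice are assumptions for the example, not the article's queries).

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Span:
    duration_ms: float
    is_error: bool

def golden_signals(spans: Iterable[Span], window_seconds: float, capacity_rps: float) -> dict:
    """Derive latency, traffic, errors, and saturation from one window of spans."""
    spans = list(spans)
    durations = sorted(s.duration_ms for s in spans)
    p95_latency = durations[int(0.95 * (len(durations) - 1))] if durations else 0.0
    traffic_rps = len(spans) / window_seconds
    error_rate = sum(s.is_error for s in spans) / len(spans) if spans else 0.0
    saturation = traffic_rps / capacity_rps
    return {
        "latency_p95_ms": p95_latency,
        "traffic_rps": traffic_rps,
        "error_rate": error_rate,
        "saturation": saturation,
    }

print(golden_signals([Span(12.5, False), Span(240.0, True), Span(18.0, False)],
                     window_seconds=60.0, capacity_rps=5.0))
```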
If we had an ID for each streaming session, then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. Additionally, it became easy to provide deep links to different monitoring and deployment systems in Edgar due to consistent tagging.
In parallel to the continuous stream of new improvements related to Dynatrace monitoring capabilities, we’re also continuously improving our internal mechanisms. Storage mount points in a system might be larger or smaller, local or remote, with high or low latency, and various speeds. Customizable location of large runtime files.
This article will cover many areas that database administrators need to be aware of in order to properly license, recover, and tune a Reporting Services installation. Tuning Options: tuning SSRS is much like tuning any other application. Disk latency for ReportServer and ReportServerTempDB is very important. General Tuning.
If we had to select the most important MySQL setting, given a freshly installed MySQL or Percona Server for MySQL and the ability to tune only a single variable, which one would it be? To be fair, the same is true of PostgreSQL: it hasn’t been tuned either, and it, too, can perform much better.
For example, improving latency by as little as 0.1… Meanwhile, in the U.S., latency is the number one reason consumers abandon mobile sites. Monitoring and an increasing level of intelligence will mix business and development in meaningful ways, adding more value to the BizDevOps flow. How Intuit puts Dynatrace to work.
Configuration files allow for the automatic creation, update, and management of configurations for dashboards, synthetic monitors, alerts, SLOs, and security settings across multiple environments. Stay tuned for more examples and easy-to-adopt automations provided in our public GitHub project.
At Dynatrace, we’re constantly improving our AWS monitoring capabilities. Monitor and understand additional AWS services. Supporting services include every service that isn’t available with out-of-the-box Dynatrace monitoring. The additional services you can now monitor out of the box with Dynatrace are listed below.
Service throttling: Zuul can sense when a back-end service is in trouble by monitoring the error rates and concurrent requests to that service. Those two metrics are approximate indicators of failures and latency. If you’re interested in helping Netflix stay up in the face of shifting systems and unexpected failures, reach out to us.
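A throttling check in the spirit of the excerpt can be sketched as follows; the thresholds and function names are illustrative assumptions, not Zuul's actual implementation.

```python
# Shed load when a backend's error rate or concurrent-request count (rough proxies
# for failures and latency) crosses a threshold.
ERROR_RATE_LIMIT = 0.5     # throttle when half of recent requests fail
CONCURRENCY_LIMIT = 100    # maximum in-flight requests tolerated for one backend

def should_throttle(recent_errors: int, recent_requests: int, in_flight: int) -> bool:
    """Decide whether new requests to a backend should be rejected or degraded."""
    error_rate = recent_errors / recent_requests if recent_requests else 0.0
    return error_rate > ERROR_RATE_LIMIT or in_flight > CONCURRENCY_LIMIT

print(should_throttle(recent_errors=60, recent_requests=100, in_flight=20))  # True
print(should_throttle(recent_errors=2, recent_requests=100, in_flight=20))   # False
```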
With that, we could make use of the full set of OpenTelemetry’s features to instrument and monitor our applications in the Dynatrace back end, including traces with spans and metrics. OneAgent is the native telemetry data collector and monitoring solution of Dynatrace.
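For readers unfamiliar with the setup, here is a minimal OpenTelemetry tracing configuration in Python; the service name and OTLP endpoint are placeholders, and exporting to a Dynatrace endpoint would additionally require an API token header, which is omitted here.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Register a tracer provider that batches spans and ships them over OTLP/HTTP.
provider = TracerProvider(resource=Resource.create({"service.name": "demo-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.items", 3)  # attach a custom attribute to the span
```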
…the former for access to the Kafka clusters and the latter for service monitoring and alerts. By the following morning, alerts were received regarding high memory consumption and GC latencies, to the point where the service was unresponsive to HTTP requests. …million elements. This is configurable through enable.auto.commit.
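Disabling automatic offset commits, as the excerpt's enable.auto.commit mention suggests, looks roughly like this with the kafka-python client; the topic, group, and bootstrap address are placeholder values.

```python
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    """Placeholder processing step for each consumed record."""
    print(len(payload), "bytes received")

consumer = KafkaConsumer(
    "events",                           # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="demo-consumer-group",
    enable_auto_commit=False,           # commit offsets explicitly instead
    max_poll_records=500,               # bound how many records one poll returns
)

for record in consumer:
    process(record.value)
    consumer.commit()                   # commit only after the record is handled
```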
In PostgreSQL, replication lag can occur for various reasons, such as network latency, slow disk I/O, and long-running transactions. Network latency is the delay caused by the time it takes for data to travel between the primary and standby databases.
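One quick way to inspect lag from the primary is the pg_stat_replication view, shown here with psycopg2; the connection string is a placeholder, and the view requires appropriate privileges.

```python
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
with conn, conn.cursor() as cur:
    # write_lag / flush_lag / replay_lag are reported per connected standby.
    cur.execute(
        """
        SELECT application_name, write_lag, flush_lag, replay_lag
        FROM pg_stat_replication;
        """
    )
    for name, write_lag, flush_lag, replay_lag in cur.fetchall():
        print(name, write_lag, flush_lag, replay_lag)
```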
Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities and processes of a business domain. Most of the business views created on top of the Iceberg tables can tolerate a few minutes of latency. Please stay tuned!
Developers don’t have to put in additional time to fine-tune the system, or rely on other teams for support, as it’s done automatically with the cloud provider. However, when the time comes for resources to be requested, there can be latency in the time it takes for that code to start back up. Monitoring.
While there is no magic bullet for MySQL performance tuning, there are a few areas that can be focused on upfront that can dramatically improve the performance of your MySQL installation. What are the Benefits of MySQL Performance Tuning? A finely tuned database processes queries more efficiently, leading to swifter results.
With reliable SLOs, you can set up automation to monitor and measure SLIs and set alerts if certain indicators are trending toward violation. You can set SLOs based on individual indicators, such as batch throughput, request latency, and failures-per-second. These trends also help you adjust business objectives and SLAs.
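The kind of alerting described can be sketched as an error-budget burn-rate check; the 99.9% target and the 14.4x fast-burn threshold below are common illustrative choices, not values from the article.

```python
SLO_TARGET = 0.999        # e.g. 99.9% of requests must succeed
BURN_RATE_ALERT = 14.4    # a commonly used fast-burn alerting threshold

def error_budget_burn_rate(failed: int, total: int) -> float:
    """How fast the error budget is being consumed in the observed window."""
    if total == 0:
        return 0.0
    observed_error_rate = failed / total
    allowed_error_rate = 1.0 - SLO_TARGET
    return observed_error_rate / allowed_error_rate

burn = error_budget_burn_rate(failed=42, total=10_000)
if burn > BURN_RATE_ALERT:
    print(f"ALERT: error budget burning at {burn:.1f}x the allowed rate")
```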
As our business scales globally, the demand for data is growing, and the need for scalable, low-latency incremental processing begins to emerge. We will also add managed backfill support into IPS to help users build, monitor, and validate the backfill. There are three common issues that dataset owners usually face.
The software also lets you fine-tune consumption parameters through QoS (Quality of Service) prefetch limits, which balance load among numerous consumers and prevent any single consumer from being overwhelmed. Take Softonic’s platform as an example.
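Setting such a prefetch limit looks roughly like this with the pika client (assuming RabbitMQ, which the excerpt appears to describe); the queue name and limit are placeholders.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="work", durable=True)

# Deliver at most 10 unacknowledged messages to this consumer at a time,
# so slow consumers are not buried under a backlog.
channel.basic_qos(prefetch_count=10)

def handle(ch, method, properties, body):
    print("processing", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack frees a prefetch slot

channel.basic_consume(queue="work", on_message_callback=handle)
channel.start_consuming()
```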
Improved performance: MongoDB continually fine-tunes its database engine, resulting in faster query execution and reduced latency. Regulatory compliance: Upgrading your database is vital for compliance with various legal and regulatory standards, where data management and security play a pivotal role.
Integration with AWS CloudWatch, AWS CloudTrail, and AWS Config enables support for monitoring, audit, and configuration management. Performant – DynamoDB consistently delivers single-digit millisecond latencies even as your traffic volume increases.
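Pulling one of those CloudWatch metrics with boto3 looks roughly like this; the table name, operation, region, and time window are placeholder values.

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="SuccessfulRequestLatency",
    Dimensions=[
        {"Name": "TableName", "Value": "Users"},
        {"Name": "Operation", "Value": "GetItem"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute buckets
    Statistics=["Average"],
    Unit="Milliseconds",
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```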
In addition to availability, our respondents focus most heavily on supporting the following data attributes: “accessibility, accuracy, authoritativeness, freshness, latency, structuredness, ontological typing, connectedness, and semantic joinability.” To address this, rigorous rollout processes are required.
They can also bolster uptime and limit latency issues or potential downtimes. They’re your roadmap to linking cloud moves with real business outcomes, helping you monitor progress. You manage cost optimization in a multi-cloud world by monitoring costs, using the right tools, and constantly adjusting.
This boils down to a single-digit µs latency tolerance in the tail for far memory, which, in addition to security and privacy concerns, rules out remote memory solutions. Thus we’re fundamentally trading (de)compression latency at access time for the ability to pack more data in memory. ML-based auto-tuning.
After this, there is often a long process of training that includes tuning the knobs and levers, called hyperparameters, that control the different aspects of the training algorithm. Built-in model tuning (hyperparameter optimization) that can automatically adjust hundreds of different combinations of algorithm parameters.
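As a generic illustration of the "knobs and levers" idea, here is a small hyperparameter search with scikit-learn's GridSearchCV; this is not the managed tuning service the excerpt refers to, just the same concept in miniature.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# The hyperparameters (knobs) to search over; values are illustrative.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best cross-validated score:", round(search.best_score_, 3))
```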
Moreover, a GSI’s performance is designed to meet DynamoDB’s single-digit millisecond latency - you can add items to a Users table for a gaming app with tens of millions of users with UserId as the primary key, but retrieve them based on their home city, with no reduction in query performance.
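Querying by home city through such an index looks roughly like this with boto3; the table, index, and attribute names follow the excerpt's example but are assumptions about how the schema might be defined.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
users = dynamodb.Table("Users")

response = users.query(
    IndexName="HomeCity-index",                      # hypothetical GSI name
    KeyConditionExpression=Key("HomeCity").eq("Seattle"),
)

for item in response["Items"]:
    print(item["UserId"], item.get("HomeCity"))
```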
Making queries to an inference engine has many of the same throughput, latency, and cost considerations as making queries to a datastore, and more and more applications are coming to depend on such queries. The following figure highlights how just one of these variables, batch size, impacts throughput and latency on ResNet50.
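The batch-size trade-off can be demonstrated even with a toy stand-in for a model; the matrix multiply below is a placeholder for a real forward pass such as ResNet50, and the batch sizes are arbitrary.

```python
import time
import numpy as np

weights = np.random.rand(512, 512)

def infer(batch: np.ndarray) -> np.ndarray:
    """Placeholder 'inference' step: a single matrix multiply."""
    return batch @ weights

for batch_size in (1, 8, 32, 128):
    batch = np.random.rand(batch_size, 512)
    start = time.perf_counter()
    for _ in range(100):
        infer(batch)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / 100 * 1000            # time per request (one batch)
    throughput = batch_size * 100 / elapsed      # items processed per second
    print(f"batch={batch_size:4d}  latency={latency_ms:7.3f} ms  "
          f"throughput={throughput:10.1f} items/s")
```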
However, in the Skylake microarchitecture (you can see a list of CPUs here) the PAUSE instruction changed, and the documentation says: “the latency of the PAUSE instruction in prior generation microarchitectures is about 10 cycles, whereas in Skylake microarchitecture it has been extended to as many as 140 cycles.”