One such open-source, distributed search and analytics engine is Elasticsearch, which is very efficient at handling data in large sets and high-velocity queries. However, the process for effectively scaling Elasticsearch can be nuanced, since one needs a proper understanding of the architecture behind it and of performance tradeoffs.
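As a hedged illustration of the knobs involved in scaling, the sketch below uses the official elasticsearch-py client (8.x-style keyword arguments) to create an index with explicit shard and replica settings and to read cluster health; the URL, index name, and counts are illustrative assumptions, not a production recipe.

```python
# A minimal sketch, assuming a local single-node cluster.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Shard and replica counts drive horizontal scaling: more primary shards
# spread writes across nodes, replicas add read capacity and resilience,
# but both consume heap and file handles.
es.indices.create(
    index="app-logs",  # hypothetical index name
    settings={"number_of_shards": 3, "number_of_replicas": 1},
)

# Cluster health exposes the basic signals (status, relocating shards)
# you would watch while scaling out.
print(es.cluster.health())
```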
Stream processing. One approach to such a challenging scenario is stream processing, a computing paradigm and software architectural style for data-intensive systems that emerged to cope with requirements for near-real-time processing of massive amounts of data. We designed experimental scenarios inspired by chaos engineering.
What is site reliability engineering? Site reliability engineering (SRE) is the practice of applying software engineering principles to operations and infrastructure processes to help organizations create highly reliable and scalable software systems. SRE focuses on automation. Organizations can then integrate these skilled engineers at key points in the DevOps life cycle.
Every image you hover over isn't just a visual placeholder; it's a critical data point that fuels our sophisticated personalization engine. Architecture overview: the first pivotal step in managing impressions begins with the creation of a Source-of-Truth (SOT) dataset.
Five of the most common include cluster instability, resource and cost management, security, observability, and stress on engineering teams. "Engineering teams are overwhelmed with stuff to do." "You can ask for the best configuration to reduce latency or improve the user experience."
Timestone: Netflix's High-Throughput, Low-Latency Priority Queueing System with Built-in Support for Non-Parallelizable Workloads, by Kostas Christidis. Timestone is a high-throughput, low-latency priority queueing system we built in-house to support the needs of Cosmos, our media encoding platform.
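To make "non-parallelizable workloads" concrete, here is a toy sketch of the core idea (not Timestone's actual design): a priority queue where items sharing an "exclusivity key" must never run in parallel, so at most one item per key may be in flight at a time.

```python
# A minimal sketch, assuming in-process use; Timestone itself is a
# distributed service with very different internals.
import heapq

class ExclusivePriorityQueue:
    """Toy queue: items sharing an exclusivity key never run in parallel."""

    def __init__(self):
        self._heap = []          # entries: (priority, seq, key, payload)
        self._in_flight = set()  # keys currently being worked on
        self._seq = 0            # insertion counter to break priority ties

    def enqueue(self, priority, key, payload):
        heapq.heappush(self._heap, (priority, self._seq, key, payload))
        self._seq += 1

    def dequeue(self):
        # Pop the best-priority item whose key is not already in flight;
        # anything skipped over is pushed back afterwards.
        skipped, result = [], None
        while self._heap:
            prio, seq, key, payload = heapq.heappop(self._heap)
            if key not in self._in_flight:
                self._in_flight.add(key)
                result = (key, payload)
                break
            skipped.append((prio, seq, key, payload))
        for entry in skipped:
            heapq.heappush(self._heap, entry)
        return result  # None means nothing is eligible right now

    def ack(self, key):
        # Completing an item unblocks the next one with the same key.
        self._in_flight.discard(key)
```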
This scenario underscored the need for a new recommender system architecture where member preference learning is centralized, enhancing accessibility and utility across different models. Yet, many are confined to a brief temporal window due to constraints in serving latency or training costs.
These include challenges with tail latency and idempotency, managing "wide" partitions with many rows, handling single large "fat" columns, and slow response pagination. Data model: at its core, the KV abstraction is built around a two-level map architecture, which is useful for keeping the "n-newest" entries or for prefix-path deletion.
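A minimal sketch of a two-level map in that spirit follows (an illustration, not Netflix's implementation); it assumes the sortedcontainers package for the ordered inner map, and the method names are hypothetical.

```python
from collections import defaultdict
from sortedcontainers import SortedDict  # assumed dependency

class TwoLevelKV:
    """Outer primary key -> ordered inner map of sort key -> value."""

    def __init__(self):
        self._data = defaultdict(SortedDict)

    def put(self, pkey, skey, value):
        self._data[pkey][skey] = value

    def scan(self, pkey, limit=None):
        # Ordered iteration over one logical "row"; a limit supports
        # paginated responses over wide partitions.
        items = list(self._data[pkey].items())
        return items[:limit] if limit else items

    def n_newest(self, pkey, n):
        # With timestamp sort keys, the last n entries are the newest.
        return list(self._data[pkey].items())[-n:]

    def delete_prefix(self, pkey, prefix):
        # Range-style delete of every inner key sharing a path prefix.
        doomed = [k for k in self._data[pkey] if str(k).startswith(prefix)]
        for k in doomed:
            del self._data[pkey][k]
```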
The new Amazon capability enables customers to improve the startup latency of their functions from several seconds to as low as sub-second (up to 10 times faster) at P99 (the 99th latency percentile). Slow starts can cause latency outliers and may lead to a poor end-user experience for latency-sensitive applications.
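For readers unfamiliar with the notation, a tiny worked example of P99 (the value below which 99% of sampled latencies fall) shows why a handful of cold starts dominates the tail; the sample values are made up.

```python
# Nearest-rank percentile over a made-up latency sample.
def percentile(samples, p):
    s = sorted(samples)
    idx = max(0, int(round(p / 100.0 * len(s))) - 1)
    return s[idx]

latencies_ms = [120, 95, 110, 105, 4800, 98, 102, 97, 101, 99]
# One cold-start outlier (4800 ms) becomes the P99 of the whole sample,
# even though the median is around 100 ms.
print(percentile(latencies_ms, 99))
```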
This decoupling is crucial in modern architectures where scalability and fault tolerance are paramount. The architecture of RabbitMQ is meticulously designed for complex message routing, enabling dynamic and flexible interactions between producers and consumers. Keeping queues short maintains a responsive and efficient RabbitMQ setup.
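As a concrete illustration of that routing flexibility, here is a minimal sketch using the pika client: a topic exchange decouples producers from consumers, and the binding pattern decides which queues receive which messages. The host, exchange, queue, and routing keys are illustrative assumptions.

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Producers publish to an exchange, never to a queue directly.
ch.exchange_declare(exchange="events", exchange_type="topic")
ch.queue_declare(queue="billing")
ch.queue_bind(queue="billing", exchange="events", routing_key="order.*")

ch.basic_publish(
    exchange="events",
    routing_key="order.created",   # matches "order.*", lands in billing
    body=b'{"order_id": 42}',
)
conn.close()
```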
The service should serve real-time (i.e., UI) applications, so CRUD and search operations must be achieved with low latency; it will be used by many internal UI applications. Teams should be able to define their own data model for annotation.
Growth Engineering at Netflix - Automated… In the Growth Engineering team, we refer to this as the top of the signup funnel. For more background on the signup funnel and Growth Engineering's role in it, please read our initial post on the topic: Growth Engineering at Netflix - Accelerating Innovation.
As more organizations embrace microservices-based architecture to deliver goods and services digitally, maintaining customer satisfaction has become exponentially more challenging. Latency is the time that it takes a request to be served. SLOs aid decision making. SLOs promote automation. Define SLOs for each service. Reliability.
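To make "SLOs aid decision making" concrete, here is a back-of-envelope sketch of the error-budget math behind a latency SLO; the target and request counts are made-up numbers.

```python
SLO_TARGET = 0.999           # e.g. 99.9% of requests within the latency bound

total_requests = 1_000_000
good_requests = 998_900      # served within the latency SLO

availability = good_requests / total_requests
budget_total = total_requests * (1 - SLO_TARGET)   # allowed bad requests
budget_spent = total_requests - good_requests

print(f"measured: {availability:.4%}, SLO target: {SLO_TARGET:.1%}")
print(f"error budget used: {budget_spent / budget_total:.0%}")
# 110% here means the budget is blown; this is the signal automation can
# use to freeze risky deploys until reliability recovers.
```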
Motivation: with the rapid growth in the Netflix member base and the increasing complexity of our systems, our architecture has evolved into an asynchronous one that enables both online and offline computation. Personalized experience refresh: the Netflix recommendation engine continuously refreshes recommendations for every member.
The Akamas vision is that only an autonomous optimization approach powered by AI can effectively enable performance engineers, SREs, and architects to identify the best configurations that ensure maximum service performance and resilience, at the lowest possible cost and at business speed, while meeting targets such as response times (e.g., below 500 ms) and error rates (e.g., lower than 2%).
He is a machine learning engineer at Amazon and has led several machine-learning initiatives across the Amazon ecosystem. Fun fact: in this talk, Rodrigo Schmidt, director of engineering at Instagram, talks about the different challenges they have faced in scaling the data infrastructure at Instagram.
Cloud-based application architectures commonly leverage microservices. By leveraging the Dynatrace Davis AI causation engine to watch for unforeseen changes in underlying API responsiveness, Dynatrace automatically identifies slowdowns in the performance of your API manager and points you to their root cause.
A service-level objective (SLO) is the new contract between business, DevOps, and site reliability engineers (SREs). Example 1: architecture boundaries. First, they took a big step back and looked at their end-to-end architecture (Figure 2: an SLO dashboard defined by architectural boundary). So, what did they do?
It supports both high throughput services that consume hundreds of thousands of CPUs at a time, and latency-sensitive workloads where humans are waiting for the results of a computation. The subsystems all communicate with each other asynchronously via Timestone, a high-scale, low-latency priority queuing system.
For engineers, instead of whodunit, the question is often “what failed and why?” An engineer can find herself digging through logs, poring over traces, and staring at dozens of dashboards. While this abundance of dashboards and information is by no means unique to Netflix, it certainly holds true within our microservices architecture.
Our approach to NN-based video downscaling: the deep downscaler is a neural network architecture designed to improve end-to-end video quality by learning a higher-quality video downscaler. Figure: architecture of the deep downscaler model, consisting of a preprocessing block followed by a resizing block.
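A hedged PyTorch sketch of that two-block shape follows; the channel counts, kernel sizes, and class name are assumptions for illustration, not the actual Netflix model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepDownscaler(nn.Module):
    """Toy two-block downscaler: learned preprocessing, then a resize."""

    def __init__(self, scale=0.5):
        super().__init__()
        self.scale = scale
        # Preprocessing block: learn features that survive downscaling.
        self.pre = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Resizing block: a standard bilinear resize of the learned image.
        features = self.pre(x)
        return F.interpolate(features, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)

frame = torch.rand(1, 3, 1080, 1920)          # one full-HD RGB frame
print(DeepDownscaler()(frame).shape)          # -> (1, 3, 540, 960)
```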
The Growth Engineering team is responsible for executing growth initiatives that help us anticipate and adapt to this change. For more background on Growth Engineering and the signup funnel, please have a look at our previous blog post that covers the basics. We need to be constantly adapting and innovating as a result of this change.
Reduced latency: serverless architecture makes it possible to host code anywhere, rather than relying on an origin server, and by using cloud providers with multiple server sites, organizations can reduce function latency for end users and optimize resource use. The tradeoffs are architectural complexity and systems that are more difficult to monitor.
Initial architecture: a simplified view of our initial cloud video processing pipeline is illustrated in Figure 1 (Table 1 gives movie and file size examples). With this architecture, chunk encoding is very efficient and processed on distributed cloud computing instances.
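An illustrative sketch of the chunking idea (not Netflix's pipeline): split the source by time range and encode the chunks in parallel, which is what makes distributed chunk encoding efficient. The ffmpeg flags are standard but simplified; a real pipeline also handles audio, keyframe-aligned boundaries, and reassembly.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def encode_chunk(src, start_s, dur_s, out_path):
    # -ss/-t select a time window; -t clamps the final, shorter chunk.
    subprocess.run(
        ["ffmpeg", "-ss", str(start_s), "-t", str(dur_s), "-i", src,
         "-c:v", "libx264", out_path],
        check=True,
    )
    return out_path

def encode_movie(src, total_s, chunk_s=60):
    # The process pool is a local stand-in for distributed cloud
    # instances; call this under `if __name__ == "__main__":`.
    starts = range(0, total_s, chunk_s)
    with ProcessPoolExecutor() as pool:
        jobs = [pool.submit(encode_chunk, src, s, chunk_s, f"chunk_{s}.mp4")
                for s in starts]
        return [j.result() for j in jobs]
```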
Site reliability engineering (SRE) has become a critical discipline in recent years as the world has shifted in favor of web-based interactions. This shift is leading more organizations to hire site reliability engineers to guarantee the reliability and resiliency of their services.
To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks … The post Uber’s Big Data Platform: 100+ Petabytes with Minute Latency appeared first on Uber Engineering Blog.
Customers can use AWS Lambda Response Streaming to improve performance for latency-sensitive applications and return larger payload sizes. Customers can use response streaming to achieve the following: Improve Time to First Byte (TTFB) performance for latency-sensitive applications. Return larger payload sizes.
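Below is a hedged sketch of consuming a streamed response from the client side with boto3's invoke_with_response_stream operation; the function name and payload are illustrative, and the event-stream keys follow the boto3 documentation as I understand it, so verify against the current SDK.

```python
import boto3

client = boto3.client("lambda")
resp = client.invoke_with_response_stream(
    FunctionName="my-streaming-fn",   # hypothetical function
    Payload=b'{"query": "report"}',
)

# Chunks arrive as the function produces them, improving time to first
# byte: the caller can start rendering before the full payload exists.
for event in resp["EventStream"]:
    if "PayloadChunk" in event:
        print(event["PayloadChunk"]["Payload"].decode(), end="")
```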
This architecture shift greatly reduced processing latency and increased system resiliency. We expanded pipeline support to serve our studio/content-development use cases, which had different latency and resiliency requirements compared to the traditional streaming use case. The pipeline first divides the input video into small chunks.
by Jason Koch, with Martin Spier, Brendan Gregg, and Ed Hunter. Improving the tools available to our engineers to help them diagnose, triage, and work through software performance challenges in the cloud is a key goal for the cloud performance engineering team at Netflix. Vector is open source and in use by multiple companies.
Because of its scalability and distributed architecture, thousands of companies trust it to run their cloud and hybrid-based workloads at high availability without compromising performance. With the Dynatrace Data Explorer, you can easily analyze metrics, such as client read/write latency by Cassandra nodes and disk space usage by keyspaces.
Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.
Retrieval-augmented generation emerges as the standard architecture for LLM-based applications Given that LLMs can generate factually incorrect or nonsensical responses, retrieval-augmented generation (RAG) has emerged as an industry standard for building GenAI applications.
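Here is a minimal, dependency-free sketch of the RAG pattern itself: retrieve the passages most relevant to a query, then ground the model's prompt in them. The keyword-overlap retriever and the generate() stub are toy stand-ins (real systems use embeddings, a vector index, and an actual LLM call).

```python
def score(query: str, doc: str) -> int:
    # Toy retriever: count query terms appearing in the document.
    terms = set(query.lower().split())
    return sum(1 for w in doc.lower().split() if w in terms)

def generate(prompt: str) -> str:
    # Placeholder for a real LLM call (an assumption in this sketch).
    return f"[LLM response to {len(prompt)} chars of grounded prompt]"

def rag_answer(query: str, documents: list[str], k: int = 2) -> str:
    top = sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n\n".join(top)
    # Grounding the prompt in retrieved text is what curbs hallucination.
    return generate(f"Answer using only this context:\n{context}\n\nQ: {query}")

docs = ["Elasticsearch scales via shards.", "RabbitMQ routes messages."]
print(rag_answer("how does elasticsearch scale", docs))
```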
Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behaviour with the goal of maximizing user joy. At Netflix, we also heavily embrace a microservice architecture that emphasizes separation of concerns.
Engineers want their alerting system to be real-time, reliable, and actionable. A few years ago, we were paged by our SRE team because our metrics alerting system had fallen behind: critical application health alerts reached engineers 45 minutes late! It also opens the door to supporting more exciting use cases.
The original assumptions and architectural choices were no longer viable: we started seeing increased response latencies and leader servers running at dangerously high utilization. Overview: the figure below depicts a simplified high-level architecture of a single Titus cluster.
This allowed Android engineers to have much more control and observability over how we get our data. We tried a few iterations of what this new service should look like, and eventually settled on a modern architecture that aimed to give more control of the API experience to the client teams. It was a Node.js…
This means a system that is not merely available but is also engineered with extensive redundant measures to continue to work as its users expect. In high-reliability situations, where continuity of service is essential, redundant elements are continuously in service, as with airplane engines. This ensures reliability.
Rajiv Shringi Vinay Chella Kaidan Fullerton Oleksii Tkachuk Joey Lynch Introduction As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming , the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.
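One common technique for bounding such temporal stores (a generic pattern, not necessarily Netflix's design) is to fold a time bucket into the partition key, so no single partition grows without bound while point reads stay cheap lookups by (id, bucket). A small sketch, with the bucket size as a tuning assumption:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # one-hour buckets; an illustrative choice

def partition_key(entity_id: str, ts: datetime) -> tuple[str, int]:
    # The bucket index becomes part of the partition key.
    bucket = int(ts.timestamp()) // BUCKET_SECONDS
    return (entity_id, bucket)

def buckets_for_range(entity_id, start: datetime, end: datetime):
    # A range scan only touches the buckets overlapping the window.
    first = int(start.timestamp()) // BUCKET_SECONDS
    last = int(end.timestamp()) // BUCKET_SECONDS
    return [(entity_id, b) for b in range(first, last + 1)]

now = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
print(partition_key("member:42", now))
```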
Plus, the architecture of the Edge tier was evolving to a PaaS (platform-as-a-service) model, and we had some tough decisions to make about how, and where, to handle identity tokens. The system architecture now takes the form shown below: notice that tokens never traverse past the Edge gateway / EAS boundary.
Lambda’s highly efficient, on-demand computing environment aligns with today’s microservices-centric architectures, and readily integrates with other popular AWS offerings that an organization may already be using. AWS continues to improve how it handles latency issues. It helps SRE teams automate responses.
Today, I want to explore the Amazon ECS architecture and what this architecture enables. This architecture affords Amazon ECS high availability, low latency, and high throughput because the data store is never pessimistically locked. Below is a diagram of the basic components of Amazon ECS: How we coordinate the cluster.
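The phrase "never pessimistically locked" implies optimistic concurrency control: writers submit a record along with the version they read, and the store rejects stale writes instead of holding locks. A small sketch of the idea (an illustration, not the ECS data store itself):

```python
class OptimisticStore:
    """Toy version-checked store: conflicts are rejected, never blocked."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        return self._rows.get(key, (0, None))

    def write(self, key, expected_version, value):
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            return False  # conflict: the caller re-reads and retries
        self._rows[key] = (current_version + 1, value)
        return True

store = OptimisticStore()
v, _ = store.read("task-1")
assert store.write("task-1", v, "RUNNING")       # succeeds
assert not store.write("task-1", v, "STOPPED")   # stale version, rejected
```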
Organizations are depending more and more on distributed architectures to provide application services, and this trend is prompting advances in both observability and monitoring. For example, when monitoring a database, you'll want to know about any latency when writing data to disk, or the average query response time.
RISELab, those wonderfully innovative folks over at Berkeley, have uplifted their Anna database—a shared-nothing, thread-per-core architecture that achieves lightning-fast speeds by avoiding all coordination mechanisms—to become cloud-aware. Our monitoring engine automatically moves data between tiers based on access patterns.
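A toy sketch of that access-pattern-driven tiering follows: hot keys live in a fast tier and cold keys are demoted to a cheaper one. The threshold and rebalance cadence are arbitrary assumptions, not Anna's actual policy.

```python
HOT_THRESHOLD = 5  # accesses per window to stay in the fast tier

class TieredStore:
    """Toy two-tier store with periodic promotion and demotion."""

    def __init__(self):
        self.fast, self.cold, self.hits = {}, {}, {}

    def get(self, key):
        self.hits[key] = self.hits.get(key, 0) + 1
        return self.fast.get(key) or self.cold.get(key)

    def put(self, key, value):
        self.cold[key] = value  # new data starts in the cheap tier

    def rebalance(self):
        # Run periodically: promote hot keys, demote keys that cooled off.
        for key, count in self.hits.items():
            if count >= HOT_THRESHOLD and key in self.cold:
                self.fast[key] = self.cold.pop(key)
            elif count < HOT_THRESHOLD and key in self.fast:
                self.cold[key] = self.fast.pop(key)
        self.hits.clear()
```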