Engineering, Scalability and Systems - Technology Performance Pulse

Low-Maintenance Backend Architectures for Scalable Applications

DZone

JANUARY 10, 2025

After years of working in the intricate world of software engineering, I learned that the most beautiful solutions are often those unseen: backends that hum along, scaling with grace and requiring very little attention. Developers could understand and manage the entire system's intricacies.

Architecture

Architecture Scalability Software Engineering Cloud

A Step-by-Step Guide to Write a System Design Document

DZone

FEBRUARY 26, 2025

Have you ever wondered how large-scale systems handle millions of requests seamlessly while ensuring speed, reliability, and scalability? Behind every high-performing application, whether it's a search engine, an e-commerce platform, or a real-time messaging service, lies a well-thought-out system design.

Design

Design Systems Scalability Speed

Rapid Event Notification System at Netflix

The Netflix TechBlog

FEBRUARY 18, 2022

To this end, we developed a Rapid Event Notification System (RENO) to support use cases that require server-initiated communication with devices in a scalable and extensible manner. In this blog post, we will give an overview of the Rapid Event Notification System at Netflix and share some of the lessons we learned along the way.

Systems

Systems Traffic Architecture Mobile

Title Launch Observability at Netflix Scale

The Netflix TechBlog

JANUARY 6, 2025

This thoughtful approach doesn't just address immediate hurdles; it builds the resilience and scalability needed for the future. In this case, the main stakeholders are: Title Launch Operators (role: responsible for setting up the title and its metadata in our systems). Let's explore how this mindset drives results.

Scalability

Scalability Cache Engineering Systems

Title Launch Observability at Netflix Scale

The Netflix TechBlog

MARCH 4, 2025

Part 3: System Strategies and Architecture. By: Varun Khaitan. With special thanks to my stunning colleagues: Mallika Rao, Esmir Mesic, Hugo Marques. This blog post is a continuation of Part 2, where we cleared the ambiguity around title launch observability at Netflix. The request schema for the observability endpoint.

Traffic

Traffic Strategy Entertainment Innovation

Evolution of search engines architecture - Algolia New Search Architecture Part 1

High Scalability

AUGUST 2, 2021

What would a totally new search engine architecture look like? Search engines, and more generally, information retrieval systems, play a central role in almost all of today’s technical stacks. After more than 30 years of evolution since TREC, search engines continue to grow and evolve, leading to new challenges.

Architecture

Architecture Engineering Systems

Key Elements of Site Reliability Engineering (SRE)

DZone

MARCH 14, 2023

Site Reliability Engineering (SRE) is a systematic and data-driven approach to improving the reliability, scalability, and efficiency of systems. It combines principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives.

Engineering

Engineering Software Engineering Scalability Efficiency

Unlock the Power of DevSecOps with Newly Released Kubernetes Experience for Platform Engineering

Dynatrace

NOVEMBER 7, 2023

Platform engineering is on the rise. According to leading analyst firm Gartner, “80% of software engineering organizations will establish platform teams as internal providers of reusable services, components, and tools for application delivery…” by 2026.

Engineering

Engineering DevOps Best Practices Infrastructure

A Kubernetes platform engineering strategy tames Kubernetes complexity

Dynatrace

JULY 25, 2024

I spoke with Martin Spier, PicPay’s VP of Engineering, about the challenges PicPay experienced and the Kubernetes platform engineering strategy his team adopted in response. “Our development teams relied heavily on logs to understand what was going on with our systems,” he said.

Strategy

Strategy Engineering Open Source Java

Site reliability engineering: 5 things you need to know

Dynatrace

FEBRUARY 4, 2021

What is site reliability engineering? Site reliability engineering (SRE) is the practice of applying software engineering principles to operations and infrastructure processes to help organizations create highly reliable and scalable software systems. SRE bridges the gap between Dev and Ops teams.

Engineering

Engineering DevOps Government Latency

How observability, application security, and AI enhance DevOps and platform engineering maturity

Dynatrace

APRIL 18, 2024

DevOps and platform engineering are essential disciplines that provide immense value in the realm of cloud-native technology and software delivery. Observability of applications and infrastructure serves as a critical foundation for DevOps and platform engineering, offering a comprehensive view into system performance and behavior.

DevOps

DevOps Engineering Artificial Intelligence Infrastructure

DevOps engineer tools: Deploy, test, evaluate, repeat

Dynatrace

DECEMBER 8, 2022

As cloud-native, distributed architectures proliferate, the need for DevOps technologies and DevOps platform engineers has increased as well. DevOps engineer tools can help ease the pressure as environment complexity grows. What does a DevOps platform engineer do? What are DevOps engineer tools and platforms?

DevOps

DevOps Engineering Testing Open Source

Accelerate and empower Site Reliability Engineering with Dynatrace observability

Dynatrace

OCTOBER 10, 2023

Planned effort: Site Reliability Engineering (SRE) effort and time allocation planning typically fall into two domains. Operations Management (50%): Operations Management includes on-call responsibilities, post-mortem assessments, addressing other interruptions, and buffer time. These practices are commonly known as “chaos engineering.”

Engineering

Engineering DevOps Innovation Strategy

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

To achieve this, we are committed to building robust systems that deliver comprehensive observability, enabling us to take full accountability for every title on our service. Each title represents countless hours of effort and creativity, and our systems need to honor that uniqueness. Yet, these pages couldn't be more different.

Traffic

Traffic Scalability Strategy Monitoring

Supporting Diverse ML Systems at Netflix

The Netflix TechBlog

MARCH 7, 2024

The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow, an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems.

Systems

Systems Media Cache Open Source

Why applying chaos engineering to data-intensive applications matters

Dynatrace

MAY 23, 2024

Stream processing: One approach to such a challenging scenario is stream processing, a computing paradigm and software architectural style for data-intensive software systems that emerged to cope with requirements for near real-time processing of massive amounts of data. We designed experimental scenarios inspired by chaos engineering.

Engineering

Engineering Tuning Latency Open Source

Engineering dependability and fault tolerance in a distributed system

High Scalability

FEBRUARY 19, 2021

This means a system that is not merely available but is also engineered with extensive redundant measures to continue to work as its users expect. Fault tolerance: the ability of a system to continue to be dependable (both available and reliable) in the presence of certain component or subsystem failures.

Engineering

Engineering Systems Availability Scalability

Demystifying Interviewing for Backend Engineers @ Netflix

The Netflix TechBlog

FEBRUARY 1, 2022

By Karen Casella, Director of Engineering, Access & Identity Management. Have you ever experienced one of the following scenarios while looking for your next role? Most backend engineering teams follow a process very similar to what is shown below. If so, we invite you to begin the interview process.

Engineering

Engineering Games Entertainment Innovation

New analytics capabilities for messaging system-related anomalies

Dynatrace

JANUARY 12, 2022

Messaging systems can significantly improve the reliability, performance, and scalability of the communication processes between applications and services. In serverless and microservices architectures, messaging systems are often used to build asynchronous service-to-service communication. This is great!

Analytics

Analytics Systems DevOps Healthcare

Site reliability engineering: 5 things you need to know

Dynatrace

FEBRUARY 4, 2021

Site reliability engineering (SRE) is the practice of applying software engineering principles to operations and infrastructure processes to help organizations create highly reliable and scalable software systems. SRE bridges the gap between Dev and Ops teams. SRE focuses on automation.

Engineering

Engineering DevOps Government Latency

Flexible, scalable, self-service Kubernetes native observability now in General Availability

Dynatrace

MAY 17, 2022

For years, enterprises managed observability data on a team-by-team basis, using a combination of ticketing systems and configuration management tools. This approach is costly and error prone.

Availability

Availability Scalability Cloud Metrics

Site Reliability Engineering

DZone

JANUARY 19, 2024

In the dynamic world of online services, the concept of site reliability engineering (SRE) has risen as a pivotal discipline, ensuring that large-scale systems maintain their performance and reliability.

Engineering

Engineering Tuning Software Engineering Internet

Mastering System Design: A Comprehensive Guide to System Scaling for Millions (Part 1)

DZone

JANUARY 19, 2024

A transformative journey into the realm of system design with our tutorial, tailored for software engineers aspiring to architect solutions that seamlessly scale to serve millions of users.

Systems

Systems Design Software Engineering Scalability

Reliability indicators that matter to your business: SLOs for all data types

Dynatrace

OCTOBER 31, 2024

It doesn’t matter if you need typically used failure-rate or response-time metrics to ensure your system’s availability and performance or if you need to rely on abnormal log drops to gain insights into raising problems—SLOs leveraged with Grail provide all the information you need.

Metrics

Metrics Availability Monitoring Scalability

A Recap of the Data Engineering Open Forum at Netflix

The Netflix TechBlog

JUNE 20, 2024

A summary of sessions at the first Data Engineering Open Forum at Netflix on April 18th, 2024. At Netflix, we aspire to entertain the world, and our data engineering teams play a crucial role in this mission by enabling data-driven decision-making at scale.

Data Engineering

Data Engineering Engineering Entertainment Software Engineering

Performance Engineering: The What, The Why, and The How Explained

DZone

JANUARY 24, 2020

Everything you need to know about performance engineering. As highly distributed apps become more complex, developers need to ensure their systems are as user-friendly, secure, and scalable as possible.

Engineering

Engineering Performance DevOps Scalability

Chaos Mesh — A Solution for System Resiliency on Kubernetes

DZone

APRIL 22, 2020

Traditionally, we use unit tests and integration tests to guarantee that a system is production-ready. To better identify system vulnerabilities and improve resilience, Netflix invented Chaos Monkey, which injects various types of faults into the infrastructure and business systems. This is how Chaos Engineering began.

Systems

Systems Infrastructure Engineering Testing

It’s time to upgrade the PTC System Monitor (PSM)!

Dynatrace

OCTOBER 28, 2020

As a PSM system administrator, you’ve relied on AppMon as a preconfigured APM tool for detecting, diagnosing, and repairing problems that impact the operational health of your Windchill application suite.

Monitoring

Monitoring Systems Infrastructure Cloud

Growth Engineering at Netflix- Creating a Scalable Offers Platform

The Netflix TechBlog

FEBRUARY 9, 2021

The Growth Engineering team is responsible for executing growth initiatives that help us anticipate and adapt to this change. In particular, it’s our job to design and build the systems and protocols that enable customers from all over the world to sign up for Netflix with the plan features and incentives that best suit their needs.

Engineering

Engineering Scalability Architecture Innovation

How Is Platform Engineering Different From DevOps and SRE?

DZone

DECEMBER 12, 2023

In the dynamic realm of modern software development and operations, terms such as Platform Engineering, DevOps, and Site Reliability Engineering (SRE) are frequently used, sometimes interchangeably, often causing confusion among professionals entering or navigating these domains.

DevOps

DevOps Engineering Scalability Efficiency

Foundation Model for Personalized Recommendation

The Netflix TechBlog

MARCH 28, 2025

By Ko-Jen Hsiao, Yesu Feng and Sudarshan Lamkhede. Motivation: Netflix's personalized recommender system is a complex system, boasting a variety of specialized machine-learned models, each catering to distinct needs, including Continue Watching and Today's Top Picks for You. (Refer to our recent overview for more details.)

Tuning

Tuning Efficiency Latency Strategy

Observability engineering: Getting Prometheus metrics right for Kubernetes with Dynatrace and Kepler

Dynatrace

DECEMBER 18, 2023

For busy site reliability engineers, ensuring system reliability, scalability, and overall health is an imperative that’s getting harder to achieve in ever-expanding, cloud-native, container-based environments. Because of its adaptability, Prometheus has become an essential tool for observability engineering.

Metrics

Metrics Engineering Energy Tuning

How automated workflows and multicloud automation can reduce engineering toil

Dynatrace

JUNE 5, 2023

Organizations are increasingly moving to multicloud environments and adopting microservices to increase the efficiency, reliability, and scalability of their applications and services. What happened and its significance on other systems and the business. Modern multicloud environments are powerful and agile, yet highly complex.

Engineering

Engineering Speed Monitoring Efficiency

Scaling Is Not Just About Products – It’s About Teams, Too

DZone

SEPTEMBER 19, 2021

We are well aware of what is meant by system scalability. System scalability is about maintaining the SLA of the system as the user base continues to grow and as the user activity continues to rise. However, to build highly successful products, this is not the only type of scalability that we should worry about.

Scalability

Scalability Software Engineering Engineering Systems

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

Scaling RabbitMQ ensures your system can handle growing traffic and maintain high performance. Key Takeaways RabbitMQ improves scalability and fault tolerance in distributed systems by decoupling applications, enabling reliable message exchanges.

Best Practices

Best Practices Traffic Strategy Scalability

Better dashboarding with Dynatrace Davis AI: Instant meaningful insights

Dynatrace

JANUARY 21, 2025

Maintaining reliability and scalability requires a good grasp of resource management; predicting future demands helps prevent resource shortages, avoid over-provisioning, and maintain cost efficiency. Forecasting can identify potential anomalies in node performance, helping to prevent issues before they impact the system.

Traffic

Traffic Metrics Analytics Monitoring

Starting an SRE Team? Stay Away From Uptime.

DZone

DECEMBER 8, 2021

A good SRE engineer will tell you your service is never down. A great SRE engineer will tell you that’s not what you should be measuring. In fact, they’ll tell you their job is customer service.

Engineering

Engineering Scalability Systems Traffic

Achieving High Availability in CI/CD With Observability

DZone

MARCH 5, 2024

Since most application releases depend on cloud infrastructure, having good continuous integration and continuous delivery (CI/CD) pipelines and end-to-end observability becomes essential for ensuring highly available systems.

Availability

Availability DevOps Infrastructure Scalability

How Data Inspires Building a Scalable, Resilient and Secure Cloud Infrastructure At Netflix

The Netflix TechBlog

MARCH 5, 2019

Netflix’s engineering culture is predicated on Freedom & Responsibility, the idea that everyone (and every team) at Netflix is entrusted with a core responsibility and they are free to operate with freedom to satisfy their mission.

Infrastructure

Infrastructure Cloud Scalability AWS

Free Google Book: Building Secure and Reliable Systems

High Scalability

APRIL 9, 2020

Google added another book into their excellent SRE series: Building Secure and Reliable Systems. Copy/pasting a few paragraphs: "In this book we talk generally about systems, which is a conceptual way of thinking about the groups of components that cooperate to perform some function." It's free to download, so don't be shy.

Google

Google Systems Best Practices Strategy

Microservices vs. Monolith at a Startup: Making the Choice

DZone

JANUARY 31, 2024

The reality of the startup is that engineering teams are often at a crossroads when it comes to choosing the foundational architecture for their software applications. The allure of a microservice architecture is understandable in today's tech state of affairs, where scalability, flexibility, and independence are highly valued.

Architecture

Architecture Scalability Design Engineering

Up your quality and agility factor – using automation to build “performance-as-a-self-service”

Dynatrace

MARCH 3, 2020

For software engineering teams, this demand means not only delivering new features faster but ensuring quality, performance, and scalability too. One way to apply improvements is transforming the way application performance engineering and testing is done. Try it today using Keptn.

Performance

Performance Education Innovation Software Architecture

Mastering Prometheus: Unlocking Actionable Insights and Enhanced Monitoring in Kubernetes Environments

DZone

FEBRUARY 15, 2024

Kubernetes, the de-facto orchestration platform, offers scalability and agility. Prometheus, a powerful open-source monitoring system, emerges as a perfect fit for this role, especially when integrated with Kubernetes. In the dynamic world of cloud-native technologies, monitoring and observability have become indispensable.

Monitoring

Monitoring Open Source Metrics Scalability

Designing Instagram

High Scalability

JANUARY 11, 2022

Machine Learning Engineer at Amazon and has led several machine-learning initiatives across the Amazon ecosystem. The streaming data store makes the system extensible to support other use-cases (e.g. System Components. The system will comprise several microservices, each performing a separate task.

Design

Design Media Storage Logistics
