Infrastructure, Scalability and Systems - Technology Performance Pulse

Implementing a Self-Healing Infrastructure With Kubernetes and Prometheus

DZone

MAY 3, 2023

In today's world, the need for highly available and fault-tolerant systems is more important than ever. Furthermore, with the increased adoption of microservices and containerization, the need for a reliable infrastructure that can automatically detect and recover from failures has become critical.

Infrastructure

Infrastructure Open Source Scalability Monitoring

Dynatrace elevates data security with separated storage and unique encryption keys for each tenant

Dynatrace

NOVEMBER 29, 2024

Protect data in multi-tenant architectures: To bring you the most value by unifying observability and security in one analytics and automation platform powered by AI, Dynatrace SaaS leverages a multitenancy architecture, enabling efficient and scalable data ingestion, querying, and processing on shared infrastructure.

Storage

Storage AWS Azure Architecture

Tips for Building a Scalable Payment Architecture

DZone

JANUARY 31, 2024

Having a diverse payment system is crucial to avoid vendor lock-in, to leverage local payment methods, and to maintain control over costs. But as your application's market reach grows, so does the necessity for flexible and diverse payment options.

Scalability

Scalability Architecture Strategy Infrastructure

What is hyperconverged infrastructure? Realizing the benefits of HCI

Dynatrace

NOVEMBER 11, 2022

Therefore, they need an environment that offers scalable computing, storage, and networking. That’s where hyperconverged infrastructure, or HCI, comes in. What is hyperconverged infrastructure? For organizations managing a hybrid cloud infrastructure, HCI has become a go-to strategy. Realizing the benefits of HCI.

Infrastructure

Infrastructure Storage Virtualization Network

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

OCTOBER 19, 2020

which is difficult when troubleshooting distributed systems. Now let’s look at how we designed the tracing infrastructure that powers Edgar. This insight led us to build Edgar: a distributed tracing infrastructure and user experience. Investigating a video streaming failure consists of inspecting all aspects of a member account.

Infrastructure

Infrastructure Transportation Storage Open Source

Helping customers unlock the Power of Possible

Dynatrace

OCTOBER 29, 2024

And it enables executives to have unprecedented insight into how user experiences, applications and underlying infrastructure health can power their business. By automating root-cause analysis, TD Bank reduced incidents, speeding up resolution times and maintaining system reliability. The result?

Innovation

Innovation Cloud Strategy AWS

Netflix’s Distributed Counter Abstraction

The Netflix TechBlog

NOVEMBER 12, 2024

However, this category requires near-immediate access to the current count at low latencies, all while keeping infrastructure costs to a minimum. Eventually Consistent: This category needs accurate and durable counts, and is willing to tolerate a slight delay in accuracy and a slightly higher infrastructure cost as a trade-off.

Latency

Latency Cache Infrastructure Strategy

RabbitMQ vs. Kafka: Key Differences

Scalegrid

FEBRUARY 6, 2025

Introduction to Message Brokers: Message brokers enable applications, services, and systems to communicate by acting as intermediaries between senders and receivers. This decoupling simplifies system architecture and supports scalability in distributed environments.

Latency

Latency Analytics Architecture Storage

Flexible, scalable, self-service Kubernetes native observability now in General Availability

Dynatrace

MAY 17, 2022

For years, enterprises managed observability data on a team-by-team basis, using a combination of ticketing systems and configuration management tools. None of this complexity is exposed to application and infrastructure teams. Flexible automation for Kubernetes observability. This approach is costly and error prone.

Availability

Availability Scalability Cloud Metrics

Monitoring of Kubernetes Infrastructure for day 2 operations

Dynatrace

JULY 8, 2020

Kubernetes has taken over the container management world and beyond, to become what some call the operating system, or the new Linux, of the cloud. Monitoring the infrastructure: no matter the number of layers of abstraction that Kubernetes and containers provide, they still run on infrastructure, virtual and physical.

Infrastructure

Infrastructure Monitoring Cloud Metrics

How Data Inspires Building a Scalable, Resilient and Secure Cloud Infrastructure At Netflix

The Netflix TechBlog

MARCH 5, 2019

Central engineering teams enable this operational model by reducing the cognitive burden on innovation teams through solutions related to securing, scaling and strengthening (resilience) the infrastructure. All these micro-services are currently operated in AWS cloud infrastructure.

Infrastructure

Infrastructure Cloud Scalability AWS

Supporting Diverse ML Systems at Netflix

The Netflix TechBlog

MARCH 7, 2024

Berg, Romain Cledat, Kayla Seeley, Shashank Srikanth, Chaoying Wang, Darin Yu. Netflix uses data science and machine learning across all facets of the company, powering a wide range of business applications from our internal infrastructure and content demand modeling to media understanding.

Systems

Systems Media Cache Open Source

Title Launch Observability at Netflix Scale

The Netflix TechBlog

DECEMBER 17, 2024

To achieve this, we are committed to building robust systems that deliver comprehensive observability, enabling us to take full accountability for every title on our service. Each title represents countless hours of effort and creativity, and our systems need to honor that uniqueness. Yet, these pages couldn't be more different.

Traffic

Traffic Scalability Strategy Monitoring

What is infrastructure monitoring and why is it mission-critical in the new normal?

Dynatrace

NOVEMBER 2, 2020

IT infrastructure is the heart of your digital business and connects every area – physical and virtual servers, storage, databases, networks, cloud services. We’ve seen the IT infrastructure landscape evolve rapidly over the past few years. What is infrastructure monitoring? Dynatrace news.

Infrastructure

Infrastructure Monitoring Virtualization Serverless

How to observe logs with Journald and Dynatrace

Dynatrace

APRIL 4, 2025

In this blog post, you'll learn how Dynatrace OneAgent automatically identifies Journald and ingests structured logs into Dynatrace while enriching them with topology and infrastructure context. Journald provides unified structured logging for systems, services, and applications, eliminating the need for custom parsing for severity or details.

Analytics

Analytics Operating System Scalability Infrastructure

New analytics capabilities for messaging system-related anomalies

Dynatrace

JANUARY 12, 2022

Messaging systems can significantly improve the reliability, performance, and scalability of the communication processes between applications and services. In serverless and microservices architectures, messaging systems are often used to build asynchronous service-to-service communication. Dynatrace news. This is great!

Analytics

Analytics Systems DevOps Healthcare

In-House Model Serving Infrastructure for GPU Flexibility

DZone

OCTOBER 7, 2024

Many organizations rely on cloud services like AWS, Azure, or GCP for these GPU-powered workloads, but a growing number of businesses are opting to build their own in-house model serving infrastructure. This shift is driven by the need for greater control over costs, data privacy, and system customization.

Infrastructure

Infrastructure Azure Hardware AWS

Better dashboarding with Dynatrace Davis AI: Instant meaningful insights

Dynatrace

JANUARY 21, 2025

Ensuring smooth operations is no small feat, whether you’re in charge of application performance, IT infrastructure, or business processes. Forecasting can identify potential anomalies in node performance, helping to prevent issues before they impact the system. This ensures optimal resource utilization and cost efficiency.

Traffic

Traffic Metrics Analytics Monitoring

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

The Netflix TechBlog

MARCH 25, 2019

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and Efficiency. By: Di Lin, Girish Lingappa, Jitender Aswani. Imagine yourself in the role of a data-inspired decision maker staring at a metric on a dashboard about to make a critical business decision but pausing to ask a question — “Can

Infrastructure

Infrastructure Big Data Transportation Architecture

Foundation Model for Personalized Recommendation

The Netflix TechBlog

MARCH 28, 2025

By Ko-Jen Hsiao, Yesu Feng and Sudarshan Lamkhede. Motivation: Netflix's personalized recommender system is a complex system, boasting a variety of specialized machine-learned models, each catering to distinct needs, including Continue Watching and Today's Top Picks for You. (Refer to our recent overview for more details.)

Tuning

Tuning Efficiency Latency Strategy

Path to NoOps part 2: How infrastructure as code makes cloud automation attainable—and repeatable—at scale

Dynatrace

NOVEMBER 29, 2022

Infrastructure as code is a way to automate infrastructure provisioning and management. In this blog, I explore how Dynatrace has made cloud automation attainable—and repeatable—at scale by embracing the principles of infrastructure as code. Transparency and scalability. Infrastructure-as-code.

Infrastructure

Infrastructure Code Cloud DevOps

Dynatrace delivers flexible and scalable Kubernetes native synthetic private locations

Dynatrace

MAY 24, 2023

Global corporations with offices in multiple countries need to ensure that their internal systems are accessible to all employees, regardless of their location. A prominent solution is virtual machines; however, this is inadequate for customers who deploy their systems with Kubernetes.

Scalability

Scalability Virtualization Monitoring Open Source

Dynatrace extends Synthetic Monitoring capabilities with Network Availability Monitors to validate the availability of infrastructure and services

Dynatrace

JULY 15, 2024

Combined with Dynatrace OneAgent®, you gain a precise view of the status of your systems at a glance. Whether necessary as part of deep root-cause analyses of issues faced by your users that impact your business or if you’re an engineer responsible for the infrastructure hosting your applications and network paths.

Availability

Availability Network Monitoring Infrastructure

Reliability indicators that matter to your business: SLOs for all data types

Dynatrace

OCTOBER 31, 2024

It doesn’t matter if you need typically used failure-rate or response-time metrics to ensure your system’s availability and performance or if you need to rely on abnormal log drops to gain insights into emerging problems—SLOs leveraged with Grail provide all the information you need.

Metrics

Metrics Availability Monitoring Scalability

Globalizing Productions with Netflix’s Media Production Suite

The Netflix TechBlog

MARCH 31, 2025

Our Answer: Content Hub's Media Production Suite (MPS). Building a global scalable solution that could be utilized in a diversity of markets has been an exciting challenge. This infrastructure is available for Netflix shows and is foundational under Content Hub's Media Production Suite tooling. So what is it?

Media

Media Logistics Innovation Cloud

What is log management? How to tame distributed cloud system complexities

Dynatrace

SEPTEMBER 8, 2022

Log management is an organization’s rules and policies for managing and enabling the creation, transmission, analysis, storage, and other tasks related to IT systems’ and applications’ log data. Most infrastructure and applications generate logs. How log management systems optimize performance and security.

Cloud

Cloud Systems Analytics DevOps

It’s time to upgrade the PTC System Monitor (PSM)!

Dynatrace

OCTOBER 28, 2020

As a PSM system administrator, you’ve relied on AppMon as a preconfigured APM tool for detecting, diagnosing, and repairing problems that impact the operational health of your Windchill application suite. This means that your entire IT infrastructure can be monitored within minutes. Dynatrace news. You name it, and we have it!

Monitoring

Monitoring Systems Infrastructure Cloud

Cost-Aware Resilience: Implementing Chaos Engineering Without Breaking the Budget

DZone

APRIL 1, 2025

Modern distributed systems, like microservices and cloud-native architectures, are built to be scalable and reliable. Chaos engineering is a useful way to test and improve system resilience by intentionally creating controlled failures. However, their complexity can lead to unexpected failures.

Engineering

Engineering Virtualization Scalability Architecture

Chaos Mesh — A Solution for System Resiliency on Kubernetes

DZone

APRIL 22, 2020

Traditionally we use unit tests and integration tests that guarantee a system is production-ready. To better identify system vulnerabilities and improve resilience, Netflix invented Chaos Monkey, which injects various types of faults into the infrastructure and business systems. This is how Chaos Engineering began.

Systems

Systems Infrastructure Engineering Testing

Redefining Artifact Storage: Preparing for Tomorrow's Binary Management Needs

DZone

SEPTEMBER 23, 2024

As software pipelines evolve, so do the demands on binary and artifact storage systems. While solutions like Nexus, JFrog Artifactory, and other package managers have served well, they are increasingly showing limitations in scalability, security, flexibility, and vendor lock-in. Let’s explore the key players:

Storage

Storage Innovation Scalability Infrastructure

Achieving High Availability in CI/CD With Observability

DZone

MARCH 5, 2024

Forbes estimates that cloud budgets will break all previous records as businesses will spend over $1 trillion on cloud computing infrastructure in 2024. By integrating observability tools in CI/CD pipelines, organizations can increase deployment frequency, minimize risks, and build highly available systems.

Availability

Availability DevOps Infrastructure Scalability

Best Practices for Scaling RabbitMQ

Scalegrid

FEBRUARY 24, 2025

Scaling RabbitMQ ensures your system can handle growing traffic and maintain high performance. Key Takeaways: RabbitMQ improves scalability and fault tolerance in distributed systems by decoupling applications, enabling reliable message exchanges.

Best Practices

Best Practices Traffic Strategy Scalability

What is observability? Not just logs, metrics and traces

Dynatrace

OCTOBER 1, 2021

As dynamic systems architectures increase in complexity and scale, IT teams face mounting pressure to track and respond to conditions and issues across their multi-cloud environments. How do you make a system observable? Dynatrace news. The architects and developers who create the software must design it to be observed.

Metrics

Metrics Open Source Monitoring Cloud

Simplify log onboarding: From zero to observability in minutes

Dynatrace

MARCH 5, 2025

This blog post explains how Dynatrace simplifies log ingestion, whether you're onboarding logs from your infrastructure using OneAgent, cloud services using log forwarding, or driving open-source standardization leveraging OpenTelemetry (OTel), Fluent Bit, or any other API-based ingestion methods.

Open Source

Open Source IoT Cloud Azure

Software-Defined Networking in Distributed Systems: Transforming Data Centers and Cloud Computing Environments

DZone

FEBRUARY 3, 2024

In the changing world of data centers and cloud computing, the desire for efficient, flexible, and scalable networking solutions has resulted in the broad use of Software-Defined Networking (SDN). Traditional networking models have a tightly integrated control plane and data plane within network devices.

Network

Network Systems Cloud Software

Six causes of major software outages–And how to avoid them

Dynatrace

AUGUST 8, 2024

From business operations to personal communication, the reliance on software and cloud infrastructure is only increasing. Ransomware encrypts essential data, locking users out of systems and halting operations until a ransom is paid. Outages can disrupt services, cause financial losses, and damage brand reputations.

Software

Software Software Network Infrastructure

Kubernetes in the wild report 2023

Dynatrace

JANUARY 16, 2023

Findings provide insights into Kubernetes practitioners’ infrastructure preferences and how they use advanced Kubernetes platform technologies. As Kubernetes adoption increases and it continues to advance technologically, Kubernetes has emerged as the “operating system” of the cloud. Kubernetes moved to the cloud in 2022.

Open Source

Open Source Java Operating System Programming

Enhancing Resiliency: Implementing the Circuit Breaker Pattern for Strong Serverless Architecture on AWS

DZone

JANUARY 16, 2024

Serverless architecture is a way of building and running applications without the need to manage infrastructure. Scalability: Serverless services automatically scale with the application's needs. Resiliency is the ability of a system to handle and recover from faults, and it's vital in a serverless environment for a few reasons:

Serverless

Serverless AWS Architecture Lambda

Deploying Prometheus and Grafana as Applications using ArgoCD — Including Dashboards

DZone

MARCH 30, 2023

If you're tired of managing your infrastructure manually, ArgoCD is the perfect tool to streamline your processes and ensure your services are always in sync with your source code. Say goodbye to the headaches of manual infrastructure management and hello to a more efficient and scalable approach with ArgoCD!

Infrastructure

Infrastructure Scalability Efficiency Code

What is container as a service? How CaaS compares to PaaS, IaaS, and FaaS

Dynatrace

NOVEMBER 3, 2022

The containerization craze has continued for enterprises, with benefits such as portability, efficiency, and scalability. These containers are software packages that include all the relevant dependencies needed to run software on any system. Easy scalability. million in 2020. Process portability. Faster deployment. CaaS vs.

Serverless

Serverless Azure Hardware Transportation

Free Google Book: Building Secure and Reliable Systems

High Scalability

APRIL 9, 2020

Google added another book into their excellent SRE series: Building Secure and Reliable Systems. Copy/pasting a few paragraphs: "In this book we talk generally about systems, which is a conceptual way of thinking about the groups of components that cooperate to perform some function. It's free to download, so don't be shy.

Google

Google Systems Best Practices Strategy

Mastering Prometheus: Unlocking Actionable Insights and Enhanced Monitoring in Kubernetes Environments

DZone

FEBRUARY 15, 2024

Kubernetes, the de-facto orchestration platform, offers scalability and agility. Prometheus, a powerful open-source monitoring system, emerges as a perfect fit for this role, especially when integrated with Kubernetes. In the dynamic world of cloud-native technologies, monitoring and observability have become indispensable.

Monitoring

Monitoring Open Source Metrics Scalability

Key Elements of Site Reliability Engineering (SRE)

DZone

MARCH 14, 2023

Site Reliability Engineering (SRE) is a systematic and data-driven approach to improving the reliability, scalability, and efficiency of systems. It combines principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives.

Engineering

Engineering Software Engineering Scalability Efficiency

What is GitOps?

Dynatrace

JULY 25, 2022

These methods improve the software development lifecycle (SDLC), but what if infrastructure deployment and management could also benefit? Development teams use GitOps to specify their infrastructure requirements in code. Known as infrastructure as code (IaC), it can build out infrastructure automatically to scale.

DevOps

DevOps Infrastructure Speed Cloud
