Its architecture was specially designed to manage large-scale data warehouses and business intelligence workloads by giving you the ability to spread your data across a multitude of servers. This feature-packed database provides powerful and rapid analytics on data that scales up to petabyte volumes.
The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the big data community quite a long time ago. This system has been designed to supplement and succeed the existing Hadoop-based system, whose data-processing latency and maintenance costs were too high.
Until recently, improvements in data center power efficiency compensated almost entirely for the increasing demand for computing resources. However, this trend is now reversing: the rise of big data, cryptocurrencies, and AI means the IT sector contributes significantly to global greenhouse gas emissions.
Driving down the cost of big data analytics: the Amazon Elastic MapReduce (EMR) team announced today the ability to seamlessly use Amazon EC2 Spot Instances with their service, significantly driving down the cost of data analytics in the cloud.
Finally, imagine yourself in the role of a data platform reliability engineer tasked with providing advanced lead time to data pipeline (ETL) owners by proactively identifying issues upstream of their ETL jobs. Design a flexible data model. Enable seamless integration, whether push or pull.
Netflix’s unique work culture and petabyte-scale data problems are what drew me to Netflix. During earlier years of my career, I primarily worked as a backend software engineer, designing and building the backend systems that enable big data analytics.
Big data is like the pollution of the information age. The Big Data Struggle and Performance Reporting: as the big data era brings in multiple options for visualization, it has become apparent that not all solutions are created equal.
His favorite TV shows: Ozark, Breaking Bad, Black Mirror, Barry, and Chernobyl. Since I joined Netflix back in 2011, my favorite project has been designing and building the first version of our entertainment knowledge graph. I was later hired into my first purely data gig where I was able to deepen my knowledge of big data.
Do Not Be Misled: Designing and implementing a scalable graph database system has never been a trivial task. Countless enterprises, particularly Internet giants, have explored ways to make graph data processing scalable.
The introduction of innovative technologies has brought the newest updates to software testing, development, design, and delivery. Nowadays, big data testing mainly consists of data testing, paving the way for the Internet of Things to take center stage. In addition, AI and ML are reaching a new level.
NoOps is an advanced transformation of DevOps in which many of the functions needed to manage, optimize, and secure IT services and applications are automated within the design. This risk leads many to question the practicality of DevOps, which makes the idea of NoOps more attractive. Thus, the concept of NoOps takes DevOps a step further.
While data lakehouses combine the flexibility and cost-efficiency of data lakes with the querying capabilities of data warehouses, it’s important to understand how these storage environments differ. Data warehouses were the original big data storage option.
AIOps combines big data and machine learning to automate key IT operations processes, including anomaly detection and identification, event correlation, and root-cause analysis. Like the development and design phases, these applications generate massive data volumes that offer relevant and actionable insights.
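To make the anomaly-detection piece concrete, here is a deliberately simple sketch, not any vendor's actual algorithm: it flags metric samples that sit far from the mean of a series. The latency values and the threshold are made up for illustration.

```python
from statistics import mean, stdev

def zscore_anomalies(series, threshold=2.0):
    """Flag indices whose value is more than `threshold` standard deviations from the mean.

    A toy stand-in for the anomaly-detection step an AIOps pipeline might run
    over metric streams; real systems use far more sophisticated models.
    """
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]

latency_ms = [120, 118, 125, 122, 119, 121, 410, 123]  # synthetic request latencies
print(zscore_anomalies(latency_ms))  # -> [6], the 410 ms spike
```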
Various software systems are needed to design, build, and operate this CDN infrastructure, and a significant number of them are written in Python. Orchestration: The Big Data Orchestration team is responsible for providing all of the services and tooling to schedule and execute ETL and ad hoc pipelines.
BPAY is in the midst of its digital transformation journey, in which it is discovering the critical importance of developing “contemporary ways of designing, operating, and using” its software. She dispelled the myth that more big data equals better decisions, higher profits, or more customers, no matter how much you collect.
In our experience, optimizing for operational efficiency requires answering one key question: for which tables does the maintenance cost supersede utility? (From “Less is More: Engineering Data Warehouse Efficiency with Minimalist Design” on the Uber Engineering Blog.)
Apache Spark is a leading platform in the field of big data processing, known for its speed, versatility, and ease of use. Understanding Apache Spark: Apache Spark is a unified computing engine designed for large-scale data processing. However, getting the most out of Spark often involves fine-tuning and optimization.
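As an illustration of the kind of fine-tuning the excerpt alludes to, here is a minimal PySpark sketch; the S3 path, partition count, and column names are hypothetical and would need to match the real workload.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    .config("spark.sql.shuffle.partitions", "400")   # match shuffle width to data volume
    .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce small partitions
    .getOrCreate()
)

# Hypothetical dataset location and schema, for illustration only.
events = spark.read.parquet("s3://example-bucket/events/")
daily = events.groupBy("event_date").count()
daily.cache()  # reuse the aggregate across several downstream queries
daily.write.mode("overwrite").parquet("s3://example-bucket/daily_counts/")
```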
Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behaviour with the goal of maximizing user joy. The processed data is typically stored as data warehouse tables in AWS S3.
Backfill: Backfilling datasets is a common operation in big data processing. For example, a job might reprocess aggregates for the past 3 days because it assumes there will be late-arriving data, while data older than 3 days isn’t worth the cost of reprocessing. ETL pipelines keep all the benefits of batch workflows.
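A tiny sketch of that lookback logic, with the three-day window from the example as a default; the function and its parameters are illustrative, not any particular scheduler's API.

```python
from datetime import date, timedelta

def reprocess_window(run_date: date, lookback_days: int = 3):
    """Partitions a daily aggregation job would rewrite on each run.

    Data for the last `lookback_days` days may still be arriving late, so
    those partitions are recomputed; older partitions are left untouched
    because reprocessing them isn't worth the cost.
    """
    return [run_date - timedelta(days=d) for d in range(lookback_days)]

print(reprocess_window(date(2024, 1, 10)))
# [datetime.date(2024, 1, 10), datetime.date(2024, 1, 9), datetime.date(2024, 1, 8)]
```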
ITOps is an IT discipline involving actions and decisions made by the operations team responsible for an organization’s IT infrastructure. It refers to the process of acquiring, designing, deploying, configuring, and maintaining the equipment and services that support an organization’s desired business outcomes.
Creating new development environments is cumbersome: Populating them with data is compute-intensive, and the deployment process is error-prone, leading to higher costs, slower iteration, and unreliable data. To handle errors efficiently, Netflix developed a rule-based classifier for error classification called “Pensive.”
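The actual Pensive rules are not published; the sketch below only illustrates the general idea of a rule-based error classifier, with made-up patterns and categories.

```python
import re

# Hypothetical rule table: each pattern maps a log signature to a category.
RULES = [
    (re.compile(r"OutOfMemoryError|Container killed.*memory", re.I), "MEMORY"),
    (re.compile(r"FileNotFoundException|Path does not exist", re.I), "MISSING_INPUT"),
    (re.compile(r"Connection refused|timed? ?out", re.I), "TRANSIENT_NETWORK"),
]

def classify_error(log_snippet: str) -> str:
    """Return the first matching error category, else UNCLASSIFIED."""
    for pattern, category in RULES:
        if pattern.search(log_snippet):
            return category
    return "UNCLASSIFIED"

print(classify_error("java.lang.OutOfMemoryError: GC overhead limit exceeded"))  # MEMORY
```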
Application discovery, tracing and diagnostics (ADTD): Application discovery, tracing and diagnosis is a set of processes designed to understand the relationships between application servers, map transactions across these nodes, and enable the deep inspection of methods using bytecode instrumentation (BCI) and/or distributed tracing.
Working on my PhD, I used optimization techniques to design radiotherapy fractionation schemes to improve the results of clinical practice. Later I enrolled in a data science program focused on helping academics transition to industry roles, driven by a passion for making informed decisions based on data.
With the launch of the AWS Europe (London) Region, AWS can enable many more UK enterprise, public sector and startup customers to reduce IT costs, address data locality needs, and embark on rapid transformations in critical new areas, such as big data analysis and the Internet of Things. Fraud.net is a good example of this.
Their design emphasizes increasing availability by spreading out files among different nodes or servers — this approach significantly reduces risks associated with losing or corrupting data due to node failure. These distributed storage services also play a pivotal role in big data and analytics operations.
Instead of relying on engineers to productionize scientific contributions, we’ve made a strategic bet to build an architecture that enables data scientists to easily contribute. The two main challenges with this approach are establishing an easy contribution framework and handling Netflix’s scale of data.
AIOps (or “AI for IT operations”) uses artificial intelligence so that big data can help IT teams work faster and more effectively. As healthcare providers consider AIOps solutions, they should evaluate whether traditional AIOps approaches designed for correlation can enable long-term success.
Let us start with a simple example that illustrates the capabilities of probabilistic data structures. Suppose we have a data set that is simply a heap of ten million random integer values, and we know that it contains no more than one million distinct values (there are many duplicates). How can we estimate the number of distinct elements (i.e., what is the cardinality of the data set)?
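A LogLog-style estimator is one standard probabilistic answer to that question. The sketch below is a simplified illustration, not the article's implementation; with 1,024 buckets it should land within a few percent of the true distinct count.

```python
import hashlib
import random

def lowest_bit_rank(x: int) -> int:
    """1-indexed position of the lowest set bit of x (x > 0)."""
    return (x & -x).bit_length()

def estimate_cardinality(values, num_buckets=1024):
    """Rough LogLog-style cardinality estimate via stochastic averaging.

    Each value is hashed; the low bits pick a bucket, and each bucket tracks
    the maximum rank of the lowest set bit seen in the remaining hash bits.
    """
    max_rank = [0] * num_buckets
    for v in values:
        h = int.from_bytes(hashlib.blake2b(str(v).encode(), digest_size=8).digest(), "big")
        bucket, rest = h % num_buckets, h // num_buckets
        if rest:
            max_rank[bucket] = max(max_rank[bucket], lowest_bit_rank(rest))
    # 0.39701 is the asymptotic LogLog bias-correction constant.
    return int(0.39701 * num_buckets * 2 ** (sum(max_rank) / num_buckets))

# Ten million random integers drawn from at most one million distinct values
# (hashing this many values in pure Python takes a little while).
data = (random.randrange(1_000_000) for _ in range(10_000_000))
print(estimate_cardinality(data))  # roughly 1,000,000, within a few percent
```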
We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing their cost and performance benefits. This article will list some of the use cases of AutoOptimize, discuss the design principles that help enhance efficiency, and present the high-level architecture.
At Netflix, the work that data engineers do to produce data in a robust, scalable way is incredibly important to provide the best experience to our members as they interact with our service. Through these cross-functional efforts, I’ve also really gotten to learn and appreciate the nuances of payments.
The following figure depicts an imaginary “evolution” of the major NoSQL system families, namely Key-Value stores, BigTable-style databases, Document databases, Full-Text Search Engines, and Graph databases (NoSQL Data Models). The main design theme is “What answers do I have?”
As with any sustainable engineering design, focusing on simplicity is very important. These characteristics allow for an on-call response time that is relaxed and more in line with traditional big data analytical pipelines. Requirements: There are multiple ways you can solve this problem and many technologies to choose from.
Operational automation (including, but not limited to, auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing) is key to the success of modern data platforms. After a fixed number of iterations is exhausted, the optimizer returns the “best” configuration solution (i.e., the best-performing configuration found within the iteration budget).
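The excerpt doesn't describe the optimizer's internals, but a minimal stand-in for "iterate, then return the best configuration found" is plain random search; everything in this sketch (knob names, cost model, iteration count) is hypothetical.

```python
import random

def random_search(evaluate, search_space, iterations=50, seed=0):
    """Toy iterative tuner: sample configurations, keep the best one seen.

    `evaluate(config)` is assumed to return a cost to minimize (e.g. job
    runtime); `search_space` maps each parameter to its candidate values.
    """
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(iterations):
        config = {name: rng.choice(values) for name, values in search_space.items()}
        cost = evaluate(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    # Once the iteration budget is exhausted, return the best configuration found.
    return best_config, best_cost

# Hypothetical tuning knobs and a fake cost model, for illustration only.
space = {
    "executor_memory_gb": [4, 8, 16],
    "shuffle_partitions": [200, 400, 800],
}
fake_cost = lambda c: abs(c["executor_memory_gb"] - 8) + abs(c["shuffle_partitions"] - 400) / 100
print(random_search(fake_cost, space, iterations=20))
```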
However, with our rapid product innovation speed, the whole approach experienced significant challenges. Business Complexity: The existing SKU management solution was designed years ago when the engagement rules were simple: three plans and one offer applied homogeneously to all regions.
Clinical data was often small enough to fit into memory on an average computer, and only in rare cases would its computation require any technical ingenuity or massive computing power. There was not enough scope to explore the distributed and large-scale computing challenges that usually come with big data processing.
Helios also serves as a reference architecture for how Microsoft envisions its next generation of distributed big data processing systems being built. What follows is a discussion of where big data systems might be heading, heavily inspired by the remarks in this paper, but with several of my own thoughts mixed in.
by Jun He, Akash Dwivedi, Natallia Dzenisenka, Snehal Chennuru, Praneeth Yenugutala, Pawan Dixit. At Netflix, Data and Machine Learning (ML) pipelines are widely used and have become central to the business, representing diverse use cases that go beyond recommendations, predictions, and data transformations.
Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices, Gan et al., ASPLOS’19. Finally, we show that Seer can identify application-level design bugs and provide insights on how to better architect microservices to achieve predictable performance.
It leverages various exchange types to either route messages directly to designated queues following specific routing and binding keys, or disperse them broadly like an indiscriminate town herald. Can RabbitMQ handle the high-throughput needs of big data applications? RabbitMQ’s real adaptability emerges with topic exchanges.
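For illustration, here is a minimal topic-exchange example using the pika client; the exchange, queue, and routing-key names are made up, and a broker is assumed to be running locally.

```python
import pika

# Connect to a local RabbitMQ broker (assumed to be running on localhost).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A topic exchange routes messages whose routing key matches a binding pattern.
channel.exchange_declare(exchange="metrics", exchange_type="topic")
channel.queue_declare(queue="cpu_alerts")
channel.queue_bind(queue="cpu_alerts", exchange="metrics", routing_key="host.*.cpu")

# This routing key matches "host.*.cpu", so the message lands in cpu_alerts.
channel.basic_publish(exchange="metrics", routing_key="host.web01.cpu", body=b"load=0.93")
connection.close()
```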
Keeping the logic of individual Processors simple allowed them to be reusable, so we could centrally manage and operate them at scale. It also allowed them to be composable, so users could combine the different Processors to express the logic they needed. However, this design decision led to a different set of challenges.
Amazon S3 is used by enterprises of all sizes and is designed to handle scaling extremely well; it stores hundreds of billions of objects and easily performs several hundred thousand storage transactions a second.
They keep the features that developers like but can handle much more data, similar to NoSQL systems. Notably, they simplify handling big data flows, offer consistent transactions, and sustain high performance even when they’re used for real-time data analysis and complex queries.