By Abhinaya Shetty, Bharath Mummadisetty. At Netflix, our Membership and Finance Data Engineering team harnesses diverse data related to plans, pricing, membership life cycle, and revenue to fuel analytics, power various dashboards, and make data-informed decisions. What is late-arriving data? Let’s dive in!
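To make "late-arriving" concrete before the full post: a minimal sketch, assuming events carry both an event time and an arrival time (field names and the lateness threshold are illustrative, not Netflix's pipeline code):

```python
from datetime import datetime, timedelta

# Hypothetical events: when they occurred (event_time) vs. when the
# pipeline received them (arrival_time).
events = [
    {"id": 1, "event_time": datetime(2024, 4, 1, 23, 50), "arrival_time": datetime(2024, 4, 2, 0, 5)},
    {"id": 2, "event_time": datetime(2024, 4, 1, 20, 0), "arrival_time": datetime(2024, 4, 2, 9, 0)},
]

ALLOWED_LATENESS = timedelta(hours=1)  # illustrative threshold

def is_late(event):
    # An event is "late-arriving" when it lands well after the window
    # for its event_time has already been processed.
    return event["arrival_time"] - event["event_time"] > ALLOWED_LATENESS

late = [e for e in events if is_late(e)]
print(f"{len(late)} late event(s); these would trigger a backfill of closed partitions")
```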
A summary of sessions at the first Data Engineering Open Forum at Netflix on April 18th, 2024. At Netflix, we aspire to entertain the world, and our data engineering teams play a crucial role in this mission by enabling data-driven decision-making at scale.
Data Engineers of Netflix: Interview with Samuel Setegne. This post is part of our "Data Engineers of Netflix" interview series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. For example: clinical…
Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability and Efficiency. By Di Lin, Girish Lingappa, Jitender Aswani. Imagine yourself in the role of a data-inspired decision maker, staring at a metric on a dashboard, about to make a critical business decision, but pausing to ask: "Can…
Every image you hover over isn't just a visual placeholder; it's a critical data point that fuels our sophisticated personalization engine. This nuanced integration of data and technology empowers us to offer bespoke content recommendations. This queue ensures we are consistently capturing raw events from our global user base.
By Anupom Syam. Background: At Netflix, our current data warehouse contains hundreds of petabytes of data stored in AWS S3, and each day we ingest and create additional petabytes. We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing their cost and performance benefits.
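As a rough illustration of one optimization in this space, many small files in a partition can be greedily packed into fewer, larger ones so readers open fewer objects; this sketch shows the general technique only, not AutoOptimize's actual algorithm:

```python
def pack_files(sizes_mb, target_mb=128):
    """Greedily group small files into bins of roughly target_mb each,
    so a partition ends up with fewer, larger files to open at read time."""
    bins, current, current_size = [], [], 0
    for size in sorted(sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Six small files (sizes in MB, illustrative) compact into three outputs.
print(pack_files([12, 40, 7, 90, 64, 30]))  # [[90], [64, 40], [30, 12, 7]]
```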
By Jun He, Akash Dwivedi, Natallia Dzenisenka, Snehal Chennuru, Praneeth Yenugutala, Pawan Dixit. At Netflix, Data and Machine Learning (ML) pipelines are widely used and have become central to the business, representing diverse use cases that go beyond recommendations, predictions, and data transformations.
Here we describe the role of Experimentation and A/B testing within the larger Data Science and Engineering organization at Netflix, including how our platform investments support running tests at scale while enabling innovation. Curious to learn more about other Data Science and Engineering functions at Netflix?
While our engineering teams have built, and continue to build, solutions to lighten this cognitive load (better guardrails, improved tooling, …), data and its derived products are critical elements in understanding, optimizing, and abstracting our infrastructure. In the Reliability space, our data teams focus on two main approaches.
Edge computing has transformed how businesses and industries process and manage data. By bringing computation closer to the data source, edge-based deployments reduce latency, enhance real-time capabilities, and optimize network bandwidth. Challenges remain, however: data interception during transit, and redundancy and inefficiency in data aggregation.
After recreating the dataset, you can plot the raw numbers and perform custom analyses to understand the distribution of the data across test cells. However, the default reports only provide a summary view of the data with some powerful but limited filtering options.
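A sketch of that kind of custom analysis, assuming the recreated dataset is a pandas DataFrame with an allocation column and a raw metric (all names and values are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative recreated dataset: one row per member, with the test cell
# they were allocated to and the raw metric of interest.
df = pd.DataFrame({
    "test_cell": ["control", "control", "cell_1", "cell_1", "cell_2", "cell_2"],
    "hours_viewed": [2.1, 3.4, 4.0, 3.8, 1.9, 2.5],
})

# Go beyond the default summary view: inspect the full distribution per cell.
print(df.groupby("test_cell")["hours_viewed"].describe())

df.boxplot(column="hours_viewed", by="test_cell")
plt.suptitle("")
plt.title("Raw metric distribution across test cells")
plt.show()
```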
Operational automation, including but not limited to auto-diagnosis, auto-remediation, auto-configuration, auto-tuning, auto-scaling, auto-debugging, and auto-testing, is key to the success of modern data platforms. We have also noted great potential for further improvement through model tuning (see the Rollout in Production section).
Uber uses Presto, an open-source distributed SQL query engine, to provide analytics across several data sources, including Apache Hive, Apache Pinot, MySQL, and Apache Kafka. To improve its performance, Uber engineers explored the advantages of dealing with quick queries, a.k.a.
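One way to picture the idea: a hypothetical router that sends short interactive queries to a dedicated cluster. The hostnames, threshold, and routing rule below are invented for illustration; Uber's actual policy and runtime-estimation approach are not shown here:

```python
# Route short interactive queries to a dedicated "quick" Presto cluster
# so they are not queued behind long-running analytical queries.
QUICK_CLUSTER = "presto-quick.internal:8080"  # illustrative hostnames
MAIN_CLUSTER = "presto-main.internal:8080"

def route_query(estimated_runtime_s: float) -> str:
    # The 30-second cutoff is made up for illustration.
    return QUICK_CLUSTER if estimated_runtime_s < 30 else MAIN_CLUSTER

print(route_query(5))    # -> presto-quick.internal:8080
print(route_query(300))  # -> presto-main.internal:8080
```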
It’s also the data source for our annual usage study, which examines the most-used topics and the top search terms [1]. This year’s growth in Python usage was buoyed by its increasing popularity among data scientists and machine learning (ML) and artificial intelligence (AI) engineers.
Scripts and procedures usually focus on a particular task, such as deploying a new microservice to a Kubernetes cluster, implementing data retention policies on archived files in the cloud, or running a vulnerability scanner over code before it’s deployed. The range of use cases for automating IT is as broad as IT itself.
By Jun He, Yingyi Zhang, and Pawan Dixit. Incremental processing is an approach to processing new or changed data in workflows. The key advantage is that it incrementally processes only the data newly added or updated in a dataset, instead of re-processing the complete dataset.
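A minimal sketch of the pattern, assuming each record carries a last-updated timestamp and the workflow tracks a high-water mark (names are illustrative, not Netflix's implementation):

```python
from datetime import date

# Illustrative dataset: each record notes when it was added or last updated.
records = [
    {"key": "a", "value": 10, "updated": date(2024, 5, 1)},
    {"key": "b", "value": 7,  "updated": date(2024, 5, 3)},
    {"key": "c", "value": 4,  "updated": date(2024, 5, 3)},
]

last_processed = date(2024, 5, 2)  # high-water mark from the previous run

# Incremental processing: touch only rows added or updated since the last
# run, instead of rescanning the entire dataset.
new_or_changed = [r for r in records if r["updated"] > last_processed]
print(f"processing {len(new_or_changed)} of {len(records)} records")
```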
The evolution of your technology architecture should depend on the size, culture, and skill set of your engineering organization. There are no hard-and-fast rules for figuring out the interdependency between technology architecture and engineering organization, but below is what I think can work well for a product startup.
Maintaining Uber’s large-scale data warehouse comes with an operational cost in terms of ETL functions and storage. Once identified, … The post Less is More: Engineering Data Warehouse Efficiency with Minimalist Design appeared first on Uber Engineering Blog.
Data ingestion is the foremost layer in a data engineering pipeline, acting as a vital pillar in the overall analytics architecture. Thus, it is essential to implement data ingestion just right. Here is everything you need to know to take the first step toward a flawless data pipeline.
At the QCon London 2024 conference, Félix GV from LinkedIn discussed the AI/ML platform powering the company’s products. He specifically delved into Venice DB, the NoSQL data store used for feature persistence. By Rafal Gancarz
By collecting, accessing, and analyzing network data from a variety of sources like VPC Flow Logs, ELB Access Logs, and Custom Exporter Agents, we can provide Network Insight to users through multiple data visualization techniques like Lumen and Atlas. At Netflix we publish the Flow Log data to Amazon S3.
Kubernetes has emerged as the go-to container orchestration platform for data engineering teams. In 2018, widespread adoption of Kubernetes for big data processing is anticipated. Organisations are already using Kubernetes for a variety of workloads [1] [2], and data workloads are up next. Key challenges: performance.
Building data pipelines can offer strategic advantages to the business, yet companies often underestimate the effort and cost involved in building and maintaining them, and data pipeline initiatives frequently remain unfinished projects. In this post, we will discuss why you should avoid building data pipelines in the first place.
Businesses can unlock the value of data only after it is transformed into actionable insights and when those insights are delivered promptly. But implementing such robust data pipelines can be complex and challenging. This blog discusses all the ins and outs of building data pipelines and how they can help strengthen businesses.
Software engineers comprise the survey audience’s single largest cluster, at over one quarter (27%) of respondents (Figure 1). If you combine the different architectural roles with the engineers, roughly 55% of the respondents are directly involved in software development. Figure 1: Respondent roles.
ETL refers to extract, transform, load, and it is generally used for data warehousing and data integration. There are several emerging data trends that will define the future of ETL in 2018. A common theme across all these trends is removing complexity by simplifying data management as a whole.
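For readers new to the term, a toy end-to-end ETL run (sqlite3 standing in for both the application database and the warehouse; schema and values are invented):

```python
import sqlite3

# Extract: pull rows from the source application database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1250), (2, 400)])
rows = source.execute("SELECT id, amount_cents FROM orders").fetchall()

# Transform: clean/convert outside the warehouse (here, cents to dollars).
transformed = [(order_id, cents / 100.0) for order_id, cents in rows]

# Load: write the shaped rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (id INTEGER, amount_usd REAL)")
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", transformed)
print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
```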
First, the behavior of an AI application depends on a model , which is built from source code and training data. A model isn’t source code, and it isn’t data; it’s an artifact built from the two. You need a repository for models and for the training data. Models almost certainly react to incoming data; that’s their point.
Previously, I wrote about Amazon QuickSight , a new service targeted at business users that aims to simplify the process of deriving insights from a wide variety of data sources quickly, easily, and at a low cost. Put simply, data is not always readily available and accessible to organizational end users.
Scrapinghub is hiring a Senior Software Engineer (Big Data/AI). You will be designing and implementing distributed systems: a large-scale web crawling platform, integrating Deep Learning based web data extraction components, working on queue algorithms and large datasets, creating a development platform for other company departments, etc.
Zendesk reduced its data storage costs by over 80% by migrating from DynamoDB to a tiered storage solution using MySQL and S3. The company considered different storage technologies and decided to combine the relational database and the object store to strike a balance between queryability and scalability while keeping costs down.
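The read path of such a tiered design can be sketched as follows (plain dicts stand in for MySQL and S3; keys and records are invented, not Zendesk's schema):

```python
# Hot tier: recent, frequently queried rows (MySQL in Zendesk's design).
mysql_hot_tier = {"ticket:123": {"status": "open"}}
# Cold tier: older records offloaded to cheap object storage (S3).
s3_cold_tier = {"ticket:7": {"status": "closed"}}

def get_record(key: str):
    # Serve from the indexed, queryable hot tier first...
    if key in mysql_hot_tier:
        return mysql_hot_tier[key]
    # ...and fall back to the cheaper cold tier for archived data.
    return s3_cold_tier.get(key)

print(get_record("ticket:123"))  # hot-tier hit
print(get_record("ticket:7"))    # cold-tier fallback
```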
Data integration generally requires in-depth domain knowledge and a strong understanding of data schemas and underlying relationships. This can be time-consuming and a bit challenging if you are dealing with hundreds of data sources and thousands of event types (see my recent article on ELT architecture).
STP213 Scaling global carbon footprint management, by Blake Blackwell (Persefoni Manager, Data Engineering) and Michael Floyd (AWS Head of Sustainability Solutions). SUS302 Optimizing architectures for sustainability, by Katja Philipp (AWS SA) and Szymon Kochanski (AWS SA). A good example, well presented, with interesting new material.
Canva evaluated different data messaging solutions for its Product Analytics Platform, including the combination of AWS SNS and SQS, Amazon MSK, and Amazon KDS, and eventually chose the latter, primarily based on its much lower costs. The company compared many aspects of these solutions, like performance, maintenance effort, and cost.
Stream processing has become a core part of enterprise data architecture today due to the explosive growth of data from sources such as IoT sensors, security logs, and web applications. This blog discusses the topic of stream processing in detail to help you navigate its landscape with ease.
The engineering organisation described may not work for you, because a team of 8-10 people is still a very big overhead. In this model, software architecture and code ownership is a reflection of the organisational model. Thirdly, let engineers themselves choose the delivery teams and organise them around the initiative.
I was fortunate both to present a two-day workshop (on AWS Serverless Architectures and Continuous Deployment) and to host a full-day Serverless track of talks. One of the catalysts for starting Symphonia was the massive interest in my article on "Serverless Architectures" published on Martin Fowler’s site.
Cheap storage and on-demand compute in the cloud, coupled with the emergence of new big data frameworks and tools, are forcing us to rethink the whole ETL and data warehousing architecture. Classic ETL: we perform frequent batch ETL from application databases to a data warehouse. If …, then ELT is a more preferred option.
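To contrast with the classic flow, a toy ELT run: raw data is loaded first and transformed inside the warehouse with its own SQL engine (sqlite3 stands in for the warehouse; schema and values are invented):

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load: raw events land untransformed in a staging table.
warehouse.execute("CREATE TABLE raw_events (user_id INTEGER, amount_cents INTEGER)")
warehouse.executemany("INSERT INTO raw_events VALUES (?, ?)",
                      [(1, 500), (1, 250), (2, 900)])

# Transform: derive the analytics table inside the warehouse, on demand.
warehouse.execute("""
    CREATE TABLE revenue_by_user AS
    SELECT user_id, SUM(amount_cents) / 100.0 AS revenue_usd
    FROM raw_events
    GROUP BY user_id
""")
print(warehouse.execute("SELECT * FROM revenue_by_user").fetchall())
```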
Mei-Chin Tsai and Vinod Sridharan discuss the internal architecture of Azure Cosmos DB and how it achieves high availability, low latency, and scalability. By Mei-Chin Tsai, Vinod Sridharan
InfoQ is delighted to announce a new two-day conference, InfoQ Dev Summit Boston 2024, taking place June 24-25, 2024. This event is designed to help senior developers navigate their immediate development challenges, focusing exclusively on the technical aspects that matter right now. By Artenisa Chatziou
HubSpot adopted routing messages over multiple Kafka topics (called swimlanes) for the same producer to avoid build-up in consumer group lag and to prioritize the processing of real-time traffic.
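The swimlane idea reduces to a routing decision at produce time; here is a sketch under assumed topic names and priority field (not HubSpot's actual schema):

```python
# Route the same logical stream to separate Kafka topics by priority, so
# a backlog of bulk traffic cannot delay real-time messages.
SWIMLANES = {"realtime": "events.realtime", "bulk": "events.bulk"}

def pick_topic(message: dict) -> str:
    lane = "realtime" if message.get("priority") == "high" else "bulk"
    return SWIMLANES[lane]

# With an actual Kafka producer this would be, e.g.:
#   producer.send(pick_topic(msg), value=serialize(msg))
print(pick_topic({"priority": "high"}))  # -> events.realtime
print(pick_topic({"priority": "low"}))   # -> events.bulk
```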
LinkedIn introduced Couchbase as a centralized caching tier for scaling member profile reads to handle increasing traffic that has outgrown their existing database cluster. The new solution achieved over 99% hit rate, helped reduce tail latencies by more than 60% and costs by 10% annually. By Rafal Gancarz
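The behaviour behind those hit rates is the classic read-through caching pattern; a sketch with plain dicts standing in for Couchbase and the source database (illustrative keys, not LinkedIn's schema):

```python
cache = {}                                 # stands in for Couchbase
database = {"member:42": {"name": "Ada"}}  # stands in for the source DB

def read_profile(key: str):
    if key in cache:              # cache hit: the cheap, fast path
        return cache[key]
    value = database.get(key)     # miss: read from the source of truth...
    if value is not None:
        cache[key] = value        # ...and populate the cache for next time
    return value

read_profile("member:42")         # first read misses and fills the cache
print(read_profile("member:42"))  # second read is served from the cache
```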