Data Engineering, Development and Scalability - Technology Performance Pulse

A Recap of the Data Engineering Open Forum at Netflix

The Netflix TechBlog

JUNE 20, 2024

A summary of sessions at the first Data Engineering Open Forum at Netflix on April 18th, 2024. At Netflix, we aspire to entertain the world, and our data engineering teams play a crucial role in this mission by enabling data-driven decision-making at scale.

Data Engineering

Data Engineering Engineering Entertainment Software Engineering

How Data Inspires Building a Scalable, Resilient and Secure Cloud Infrastructure At Netflix

The Netflix TechBlog

MARCH 5, 2019

As a micro-service owner, a Netflix engineer is responsible for its innovation as well as its operation, which includes making sure the service is reliable, secure, efficient and performant. How can we develop templated detection modules (rules- and ML-based) and data streams to increase the speed of development?

Infrastructure

Infrastructure Cloud Scalability AWS

Analytics at Netflix: Who we are and what we do

The Netflix TechBlog

SEPTEMBER 18, 2020

Full ownership often means building new data pipelines, navigating complex schemas and large data sets, developing or improving metrics for business performance, and creating intuitive visualizations and dashboards, always… Others have grown into new areas as part of their professional development at Netflix.

Analytics

Analytics Engineering Data Engineering Tuning

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

The Netflix TechBlog

MARCH 25, 2019

We adopted the following mission statement to guide our investments: “Provide a complete and accurate data lineage system enabling decision-makers to win moments of truth.” Nonetheless, Netflix's data landscape (see below) is complex, and many teams collaborate effectively to share responsibility for managing our data systems.

Infrastructure

Infrastructure Big Data Transportation Architecture

Scaling Appsec at Netflix (Part 2)

The Netflix TechBlog

JUNE 6, 2022

Our goal is to manage security risks to Netflix via clear, opinionated security guidance, and by providing risk context to Netflix engineering teams to make pragmatic risk decisions at scale. …including bug bounty, pentesting, PSIRT (product security incident response), security reviews, and developer security education, via…

Software Engineering

Software Engineering Scalability Education Engineering

Supporting Diverse ML Systems at Netflix

The Netflix TechBlog

MARCH 7, 2024

In addition to Spark, we want to support last-mile data processing in Python, addressing use cases such as feature transformations, batch inference, and training. Occasionally, these use cases involve terabytes of data, so we have to pay attention to performance. Internally, we use a production workflow orchestrator called Maestro.

Systems

Systems Media Cache Open Source

Orchestrating Data/ML Workflows at Scale With Netflix Maestro

The Netflix TechBlog

OCTOBER 18, 2022

As big data and ML became more prevalent and impactful, the scalability, reliability, and usability of the orchestrating ecosystem have become increasingly important for our data scientists and the company. Another dimension of scalability to consider is the size of the workflow.

Java

Java Scalability Traffic Architecture

Netflix at AWS re:Invent 2019

The Netflix TechBlog

NOVEMBER 22, 2019

1pm-2pm, NFX 207: Benchmarking stateful services in the cloud. Vinay Chella, Data Platform Engineering Manager. Abstract: AWS cloud services make it possible to achieve millions of operations per second in a scalable fashion across multiple regions. We explore all the systems necessary to make and stream content from Netflix.

AWS

AWS Entertainment Open Source Benchmarking

Optimizing data warehouse storage

The Netflix TechBlog

DECEMBER 21, 2020

At this scale, we can gain significant performance and cost benefits by optimizing the storage layout (records, objects, partitions) as the data lands in our warehouse. Some of the optimizations are prerequisites for a high-performance data warehouse. Other components: Iceberg. We use Apache Iceberg as the table format.

Storage

Storage Latency Efficiency Data Engineering

Incremental Processing using Netflix Maestro and Apache Iceberg

The Netflix TechBlog

NOVEMBER 20, 2023

Whether in analyzing A/B tests, optimizing studio production, training algorithms, investing in content acquisition, detecting security breaches, or optimizing payments, well structured and accurate data is foundational. Users configure the workflow to read the data in a window (e.g. data arrives too late to be useful).

Processing

Processing Big Data Efficiency Engineering

Post: InterviewCamp.io, Scrapinghub, Fauna, Sisu, Educative, PA File Sight, Etleap, Triplebyte, Stream

High Scalability

APRIL 14, 2020

You will be designing and implementing distributed systems: large-scale web crawling platform, integrating Deep Learning based web data extraction components, working on queue algorithms, large datasets, creating a development platform for other company departments, etc. Please apply here. Advertise your job here!

Education

Education Software Engineering Scalability Engineering

Post: InterviewCamp.io, Scrapinghub, Fauna, Sisu, Educative, PA File Sight, Etleap, Triplebyte, Stream

High Scalability

APRIL 28, 2020

You will be designing and implementing distributed systems: large-scale web crawling platform, integrating Deep Learning based web data extraction components, working on queue algorithms, large datasets, creating a development platform for other company departments, etc. Please apply here. Advertise your job here!

Education

Education Software Engineering Scalability Engineering

Zendesk Moves from DynamoDB to MySQL and S3 to Save over 80% in Costs

InfoQ

DECEMBER 29, 2023

Zendesk reduced its data storage costs by over 80% by migrating from DynamoDB to a tiered storage solution using MySQL and S3. The company considered different storage technologies and decided to combine the relational database and the object store to strike a balance between queryability and scalability while keeping costs down.

Storage

Storage Scalability Database Technology

Sponsored Post: Fauna, Sisu, Educative, PA File Sight, Etleap, PerfOps, Triplebyte, Stream

High Scalability

NOVEMBER 12, 2019

Etleap is analyst-friendly, enterprise-grade ETL-as-a-service, built for Redshift and Snowflake data warehouses and S3/Glue data lakes. Our intuitive software allows data engineers to maintain pipelines without writing code, and lets analysts gain access to data in minutes instead of months.

Education

Education Java Software Engineering Engineering

Sponsored Post: Fauna, Sisu, Educative, PA File Sight, Etleap, PerfOps, Triplebyte, Stream

High Scalability

OCTOBER 29, 2019

Etleap is analyst-friendly, enterprise-grade ETL-as-a-service, built for Redshift and Snowflake data warehouses and S3/Glue data lakes. Our intuitive software allows data engineers to maintain pipelines without writing code, and lets analysts gain access to data in minutes instead of months.

Education

Education Java Software Engineering Engineering

Post: Essilen Research, Fauna, Sisu, Educative, PA File Sight, Etleap, Triplebyte, Stream

High Scalability

MARCH 3, 2020

Etleap is analyst-friendly, enterprise-grade ETL-as-a-service, built for Redshift and Snowflake data warehouses and S3/Glue data lakes. Our intuitive software allows data engineers to maintain pipelines without writing code, and lets analysts gain access to data in minutes instead of months.

Education

Education Engineering Games Java

InfoQ Dev Summit in Boston: Two Days of Talks for Senior Developers

InfoQ

DECEMBER 20, 2023

This event is designed to help senior developers navigate their immediate development challenges, focusing exclusively on the technical aspects that matter right now. InfoQ is delighted to announce a new two-day conference, InfoQ Dev Summit Boston 2024, taking place June 24-25, 2024. By Artenisa Chatziou

Development

Development Design Data Engineering Scalability

Post: Essilen Research, Fauna, Sisu, Educative, PA File Sight, Etleap, Triplebyte, Stream

High Scalability

FEBRUARY 18, 2020

Etleap is analyst-friendly, enterprise-grade ETL-as-a-service, built for Redshift and Snowflake data warehouses and S3/Glue data lakes. Our intuitive software allows data engineers to maintain pipelines without writing code, and lets analysts gain access to data in minutes instead of months.

Education

Education Engineering Games Java

Sponsored Post: Fauna, Sisu, Educative, PA File Sight, Etleap, PerfOps, Triplebyte, Stream

High Scalability

JANUARY 7, 2020

Etleap is analyst-friendly, enterprise-grade ETL-as-a-service, built for Redshift and Snowflake data warehouses and S3/Glue data lakes. Our intuitive software allows data engineers to maintain pipelines without writing code, and lets analysts gain access to data in minutes instead of months.

Education

Education Java Software Engineering Engineering

Sponsored Post: Fauna, Sisu, Educative, PA File Sight, Etleap, PerfOps, Triplebyte, Stream

High Scalability

DECEMBER 12, 2019

Etleap is analyst-friendly, enterprise-grade ETL-as-a-service, built for Redshift and Snowflake data warehouses and S3/Glue data lakes. Our intuitive software allows data engineers to maintain pipelines without writing code, and lets analysts gain access to data in minutes instead of months.

Education

Education Java Software Engineering Engineering

Post: Fauna, Sisu, Educative, PA File Sight, Etleap, Triplebyte, Stream

High Scalability

MARCH 17, 2020

Etleap is analyst-friendly, enterprise-grade ETL-as-a-service, built for Redshift and Snowflake data warehouses and S3/Glue data lakes. Our intuitive software allows data engineers to maintain pipelines without writing code, and lets analysts gain access to data in minutes instead of months.

Education

Education Engineering Java Servers

Post: Scrapinghub, Fauna, Sisu, Educative, PA File Sight, Etleap, Triplebyte, Stream

High Scalability

MARCH 24, 2020

You will be designing and implementing distributed systems: large-scale web crawling platform, integrating Deep Learning based web data extraction components, working on queue algorithms, large datasets, creating a development platform for other company departments, etc. Please apply here. Try the 30-day free trial!

Education

Education Software Engineering Engineering Big Data

Post: Essilen Research, Fauna, Sisu, Educative, PA File Sight, Etleap, Triplebyte, Stream

High Scalability

FEBRUARY 9, 2020

Etleap is analyst-friendly, enterprise-grade ETL-as-a-service, built for Redshift and Snowflake data warehouses and S3/Glue data lakes. Our intuitive software allows data engineers to maintain pipelines without writing code, and lets analysts gain access to data in minutes instead of months.

Education

Education Engineering Games Java

Post: InterviewCamp.io, Scrapinghub, Fauna, Sisu, Educative, PA File Sight, Etleap, Triplebyte, Stream

High Scalability

MARCH 30, 2020

You will be designing and implementing distributed systems: large-scale web crawling platform, integrating Deep Learning based web data extraction components, working on queue algorithms, large datasets, creating a development platform for other company departments, etc. Please apply here. Try the 30-day free trial!

Education

Education Software Engineering Engineering Big Data

AWS Launches General Availability of Amazon EC2 P5 Instances for AI/ML and HPC Workloads

InfoQ

AUGUST 3, 2023

AWS recently announced the general availability (GA) of Amazon EC2 P5 instances powered by the latest NVIDIA H100 Tensor Core GPUs suitable for users that require high performance and scalability in AI/ML and HPC workloads. The GA is a follow-up to the earlier announcement of the development of the infrastructure.

AWS

AWS Availability Scalability Infrastructure

Microservices Adoption in 2020

O'Reilly

JULY 15, 2020

Adding architects and engineers, we see that roughly 55% of the respondents are directly involved in software development. Technical roles represented in the “Other” category include IT managers, data engineers, DevOps practitioners, data scientists, systems engineers, and systems administrators.

Database

Database Architecture Education Systems

Friends don't let friends build data pipelines

Abhishek Tiwari

JULY 12, 2018

Data Pipeline: A data pipeline is software that ingests data from multiple sources, transforms it, and finally makes it available to internal or external products. Unfortunately, building data pipelines remains a daunting, time-consuming, and costly activity. Depending on frameworks, data processing units (a.k.a

Latency

Latency Analytics Scalability Engineering

Expanding the Cloud: Introducing Amazon QuickSight

All Things Distributed

OCTOBER 7, 2015

In such a data-intensive environment, making key business decisions such as running marketing and sales campaigns, logistic planning, financial analysis and ad targeting requires deriving insights from these data. However, the data infrastructure to collect, store and process data is geared toward developers (e.g.,

Cloud

Cloud Big Data AWS Analytics

How Uber Sped Up SQL-based Data Analytics with Presto and Express Queries

InfoQ

NOVEMBER 18, 2024

Uber uses Presto, an open-source distributed SQL query engine, to provide analytics across several data sources, including Apache Hive, Apache Pinot, MySQL, and Apache Kafka. To improve its performance, Uber engineers explored the advantages of dealing with quick queries, a.k.a.

Analytics

Analytics Open Source Engineering Performance

Kubernetes for Big Data Workloads

Abhishek Tiwari

DECEMBER 27, 2017

Kubernetes has emerged as the go-to container orchestration platform for data engineering teams. In 2018, widespread adoption of Kubernetes for big data processing is anticipated. Organisations are already using Kubernetes for a variety of workloads [1] [2] and data workloads are up next. Key challenges. Performance.

Big Data

Big Data Storage Benchmarking Hardware

Hugging Face's Guide to Optimizing LLMs in Production

InfoQ

SEPTEMBER 25, 2023

When it comes to deploying Large Language Models (LLMs) in production, the two major challenges originate from the huge number of parameters they require and the necessity of handling very long input sequences to represent contextual information.

Data Engineering

Data Engineering Scalability Engineering Development

How HubSpot Uses Apache Kafka Swimlanes for Timely Processing of Workflow Actions

InfoQ

NOVEMBER 29, 2023

HubSpot adopted routing messages over multiple Kafka topics (called swimlanes) for the same producer to avoid the build-up in the consumer group lag and prioritize the processing of real-time traffic.

Processing

Processing Traffic Data Engineering Scalability

How LinkedIn Serves Over 4.8 Million Member Profiles per Second

InfoQ

JULY 3, 2023

LinkedIn introduced Couchbase as a centralized caching tier for scaling member profile reads to handle increasing traffic that has outgrown their existing database cluster. The new solution achieved over 99% hit rate, helped reduce tail latencies by more than 60% and costs by 10% annually. By Rafal Gancarz

Cache

Cache Latency Traffic Database

Google Announces the General Availability of A2 Virtual Machines

InfoQ

APRIL 7, 2021

Google recently announced the general availability of A2 Virtual Machines (VMs), based on the NVIDIA Ampere A100 Tensor Core GPUs, in Compute Engine.

Virtualization

Virtualization Google Availability Engineering

Microsoft Azure Managed Lustre for HPC and AI Workloads Now Generally Available

InfoQ

JULY 20, 2023

Microsoft recently announced the general availability (GA) of Azure Managed Lustre, a managed file system for high-performance computing (HPC) and AI workloads. By Steef-Jan Wiggers

Azure

Azure Availability Systems Performance

Cloud Efficiency at Netflix

The Netflix TechBlog

DECEMBER 17, 2024

This diverse technological landscape generates extensive and rich data from various infrastructure entities, from which data engineers and analysts collaborate to provide actionable insights to the engineering organization in a continuous feedback loop that ultimately enhances the business.

Efficiency

Efficiency Cloud Analytics Infrastructure
