This article describes three tricks I used when dealing with big data sets (on the order of 10 million records), each of which proved to enhance performance dramatically. Trick 1: CLOB Instead of Result Set.
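The excerpt doesn't include the code for Trick 1, but a minimal sketch of the idea using python-oracledb might look like the following; the function name, schema, and record format are hypothetical placeholders, not the article's actual implementation:

```python
# Sketch of the "CLOB instead of result set" trick: have the database
# concatenate matching rows into one CLOB server-side, then fetch it in a
# single round trip instead of pulling millions of rows through a cursor.
# get_orders_clob is a hypothetical PL/SQL function; names are illustrative.
import oracledb

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# One call returning a single CLOB rather than a large result set.
clob = cur.callfunc("get_orders_clob", oracledb.DB_TYPE_CLOB, ["2024-01-01"])
payload = clob.read()  # the whole result as one string

# Assume one logical record per line, pipe-delimited (illustrative format).
rows = [line.split("|") for line in payload.splitlines()]
print(f"parsed {len(rows)} records from a single CLOB fetch")
```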
Apache Spark is a powerful open-source distributed computing framework that provides a variety of APIs to support big data processing. In addition, PySpark applications can be tuned to optimize performance and achieve better execution time, scalability, and resource utilization.
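As a minimal sketch of that kind of tuning, here is a PySpark session configured with a few common knobs; the specific values and the input path are illustrative starting points, not recommendations from the excerpt:

```python
# A minimal PySpark tuning sketch: shuffle parallelism, adaptive query
# execution, and executor sizing are the usual first levers to adjust.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-job")
    # Match shuffle parallelism to the cluster instead of the default 200.
    .config("spark.sql.shuffle.partitions", "400")
    # Let AQE coalesce shuffle partitions and mitigate skew at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    # Give executors memory headroom for large aggregations.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

df = spark.read.parquet("s3://bucket/events/")  # hypothetical input path
df.groupBy("event_type").count().show()
```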
Expect to spend time fine-tuning automation scripts as you find the right balance between automated and manual processing. This kind of automation can support key IT operations such as infrastructure, digital processes, business processes, and big data automation. Big data automation tools.
Migrating Critical Traffic At Scale with No Downtime — Part 1, by Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. This technique facilitates validation on multiple fronts.
This blog will explore these two systems and how they perform auto-diagnosis and remediation across our big data platform and real-time infrastructure. One example where this can help dramatically is Spark jobs, where memory tuning is a significant challenge. Expand Pensive with Machine Learning classifiers.
Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behaviour with the goal of maximizing user joy. The processed data is typically stored as data warehouse tables in AWS S3.
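As a rough sketch of that last step, the following writes processed data to S3 as a partitioned warehouse table with PySpark; the bucket, paths, and column names are hypothetical, and `spark` is assumed to be an existing session:

```python
# Persisting processed data as a warehouse table in S3: columnar Parquet
# files, partitioned by date, is a common layout for downstream analytics.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-write").getOrCreate()

processed = spark.read.parquet("s3://bucket/raw/playback-events/")

(
    processed
    .write
    .mode("overwrite")
    .partitionBy("ds")  # assumes a date-string partition column "ds"
    .parquet("s3://bucket/warehouse/playback_summary/")
)
```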
Operational automation, including but not limited to auto-diagnosis, auto-remediation, auto-configuration, auto-tuning, auto-scaling, auto-debugging, and auto-testing, is key to the success of modern data platforms. We have also noted great potential for further improvement through model tuning (see the Rollout in Production section).
I took a big-data-analysis approach, which started with another problem visualization. I wanted to understand how I could tune Dynatrace’s problem detection, but to do that I first needed to understand the situation. To achieve that, I took two approaches: visualizing historic problem data via a “Swimlane Visualization”.
Netflix’s diverse data landscape made it challenging to capture all the right data and conform it to a common data model. Spark is the primary big data compute engine at Netflix, and with pretty much every Spark upgrade the Spark plan changed as well, springing continuous and unexpected surprises on us.
Apache Spark is a leading platform in the field of big data processing, known for its speed, versatility, and ease of use. However, getting the most out of Spark often involves fine-tuning and optimization. Understanding Apache Spark: Apache Spark is a unified computing engine designed for large-scale data processing.
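One concrete example of the fine-tuning this excerpt alludes to is broadcasting a small dimension table so a join avoids a full shuffle; the tables and paths below are hypothetical:

```python
# Broadcast join: ship the small lookup table to every executor so the
# join with the large fact table becomes a map-side hash join (no shuffle
# of the large side).
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-tuning").getOrCreate()

facts = spark.read.parquet("s3://bucket/facts/")  # large table
dims = spark.read.parquet("s3://bucket/dims/")    # small lookup table

joined = facts.join(broadcast(dims), on="dim_id", how="left")
joined.write.mode("overwrite").parquet("s3://bucket/joined/")
```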
Our Infrastructure Security team leverages Python to help with IAM permission tuning using Repokid, and we leverage Python to protect our SSH resources using Bless. Orchestration: The Big Data Orchestration team is responsible for providing all of the services and tooling to schedule and execute ETL and ad hoc pipelines.
After several iterations of the architecture and some tuning, the solution has proven able to scale. Summary: Providing network insight into cloud network infrastructure using eBPF flow logs at scale is made possible by eBPF and a highly scalable, efficient flow-collection pipeline.
To gain visibility into these logs, we need to ingest and enrich this data somehow. It is easier to tune a large Spark job for a consistent volume of data; in other words, we are able to ensure that our Spark app does not “eat” more data than it was tuned to handle. We named this library Sqooby.
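The excerpt doesn't show how Sqooby enforces that cap, but a sketch of the same "don't eat more than you were tuned for" idea using Spark Structured Streaming's per-batch limits might look like this; the schema, paths, and limit are illustrative stand-ins:

```python
# Bounding ingestion volume per micro-batch so each batch stays close to
# the data volume the job was tuned to handle.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bounded-ingest").getOrCreate()

logs = (
    spark.readStream
    .format("json")
    .schema("host STRING, msg STRING, ts TIMESTAMP")  # illustrative schema
    # Cap how many new files each micro-batch may consume.
    .option("maxFilesPerTrigger", 100)
    .load("s3://bucket/raw-logs/")
)

query = (
    logs.writeStream
    .format("parquet")
    .option("path", "s3://bucket/enriched-logs/")
    .option("checkpointLocation", "s3://bucket/checkpoints/logs/")
    .start()
)
```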
If you want to see a more hands-on approach, I encourage you to watch the recording as Stefano did a live demo of Akamas’s integration with Dynatrace, showing how to minimize the footprint of a Java application with automated JVM tuning.
Last but not least, thank you to the organizers of the Data Engineering Open Forum: Chris Colburn, Xinran Waibel, Jai Balani, Rashmi Shamprasad, and Patricia Ho. If you are interested in attending a future Data Engineering Open Forum, we highly recommend you join our Google Group to stay tuned to event announcements.
We at Netflix, as a streaming service running on millions of devices, have a tremendous amount of data about device capabilities/characteristics and runtime data in our big data platform. With large data comes the opportunity to leverage it for predictive and classification-based analysis.
Using Grail to heal observability pains: Grail logs not only store big data, but also map out dependencies to enable fast analytics and data reasoning. Business leaders can decide which logs they want to use and tune storage to their data needs. Seamless integration.
As teams try to gain insight into this data deluge, they have to balance the need for speed, data fidelity, and scale with capacity constraints and cost. To solve this problem, Dynatrace launched Grail, its causational data lakehouse, in 2022.
However, it is paramount that we validate the complete set of identifiers, such as a list of movie ids, across producers and consumers for higher overall confidence in the data transport layer of choice. Please stay tuned!
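A minimal sketch of that identifier-set validation: diff the complete sets of ids seen on each side of the transport layer. How the ids are collected is out of scope here, so the sets are passed in directly:

```python
# Validate that every id the producer emitted was seen by the consumer,
# and that the consumer saw nothing the producer never emitted.
def validate_ids(producer_ids: set[int], consumer_ids: set[int]) -> bool:
    missing = producer_ids - consumer_ids      # produced but never consumed
    unexpected = consumer_ids - producer_ids   # consumed but never produced
    if missing or unexpected:
        print(f"missing on consumer side: {sorted(missing)[:10]}")
        print(f"unexpected on consumer side: {sorted(unexpected)[:10]}")
        return False
    return True

assert validate_ids({1, 2, 3}, {1, 2, 3})
assert not validate_ids({1, 2, 3}, {1, 2})
```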
by Jun He, Akash Dwivedi, Natallia Dzenisenka, Snehal Chennuru, Praneeth Yenugutala, and Pawan Dixit. At Netflix, Data and Machine Learning (ML) pipelines are widely used and have become central to the business, representing diverse use cases that go beyond recommendations, predictions, and data transformations.
AutoAnalyze: In short, AutoAnalyze finds the best tuning/configuration parameters for a table. The work done in the service can be further broken down into the following 3 steps: Observe: listen to changes in the warehouse in near real-time. Orient: gather tuning parameters for a particular table that changed.
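A hedged sketch of how the Observe/Orient split might be wired together; the event format, table statistics, and threshold below are hypothetical placeholders, not the actual AutoAnalyze implementation:

```python
# Observe: surface tables whose data changed; Orient: gather candidate
# tuning parameters for each changed table. All values are illustrative.
def observe(event_stream):
    """Yield the name of each table that changed, in arrival order."""
    for event in event_stream:
        if event["type"] == "table_changed":
            yield event["table"]

def orient(table: str) -> dict:
    """Gather candidate tuning parameters for the changed table."""
    stats = {"row_count": 10_000_000, "avg_file_mb": 8}  # stand-in stats
    target_file_mb = 256  # hypothetical target file size
    return {
        "table": table,
        "compaction_target_mb": target_file_mb,
        "needs_compaction": stats["avg_file_mb"] < target_file_mb,
    }

events = [{"type": "table_changed", "table": "warehouse.playback_summary"}]
for table in observe(events):
    print(orient(table))
```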
Backfill: Backfilling datasets is a common operation in big data processing. For example, a job would reprocess aggregates for the past 3 days because it assumes there will be late-arriving data, but data older than 3 days isn't worth the cost of reprocessing.
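A small sketch of that 3-day late-data window: recompute only the partitions young enough to still receive late events. The daily-partition naming is an assumption for illustration:

```python
# Only partitions within the late-data window are worth reprocessing.
from datetime import date, timedelta

LATE_DATA_WINDOW_DAYS = 3  # older partitions aren't worth the cost

def partitions_to_backfill(run_date: date) -> list[str]:
    return [
        (run_date - timedelta(days=d)).isoformat()
        for d in range(LATE_DATA_WINDOW_DAYS)
    ]

print(partitions_to_backfill(date(2024, 1, 10)))
# ['2024-01-10', '2024-01-09', '2024-01-08']
```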
They keep the features that developers like but can handle much more data, similar to NoSQL systems. Notably, they simplify handling big data flows, offer consistent transactions, and sustain high performance even when used for real-time data analysis and complex queries.
We see that with our Amazon customers: when they hear a great tune on the radio, they may identify it using the Shazam or SoundHound apps on their mobile phone and buy that song instantly from the Amazon MP3 store. Driving down the cost of Big-Data analytics. Introducing the AWS South America (Sao Paulo) Region.
Each time, the underlying implementation changed a bit while still staying true to the larger phenomenon of “Analyzing Data for Fun and Profit.” They weren’t quite sure what this “data” substance was, but they’d convinced themselves that they had tons of it that they could monetize.
It was developed to optimize data storage and access for big data sets. There is a cool blog post from Vadim covering big data sets in MyRocks: MyRocks Use Case: Big Dataset. Query tuning: It is common to find applications that perform very well at the beginning, but as data grows the performance starts to decrease.
Take, for example, The Web Almanac, the golden collection of big data combined with the collective intelligence of most of the authors listed below, brilliantly spearheaded by Google’s @rick_viscomi. Web Performance Tuning. Professional Website Performance. Website Optimization.
At the QCon London 2024 conference, Félix GV from LinkedIn discussed the AI/ML platform powering the company’s products. He specifically delved into Venice DB, the NoSQL data store used for feature persistence. By Rafal Gancarz.
In this year's CFP we’re looking for topics covering the latest trends and best practices in cloud computing, containerization, machine learning, big data, infrastructure, scalability, DevOps, IT management, automation, reliability, monitoring, performance tuning, security, databases, programming, datacenters, and more.
Best practices on Building a Big Data Analytics Solution – Michael Rys. I’ve known Michael for a very long time; if you want to learn about Azure Data Lake, there is no one better. Friday Sessions: SQL Intelligence excels your tuning and security expertise – Veljko Vasic, Ron Matchoro and Frans Lytzen.
Below is a view of the high-level architecture of the Delta platform. High Level Architecture of Delta. Stay Tuned: We will publish follow-up blogs about technical details of key components such as Delta-Connector and the Delta Stream Processing Framework. Please stay tuned.
We’ve been working with our partner teams to prioritize and build the next set of features to extend the SQL Processor. Stay tuned for more updates! Streaming SQL in Data Mesh was originally published in Netflix TechBlog on Medium.
Effectively applying AI involves extensive manual effort to develop and tune many different types of machine learning and deep learning algorithms (e.g. automatic speech recognition, natural language understanding, image classification), collect and clean the training data, and train and tune the machine learning models.
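As a generic illustration of that "train and tune" step, here is a small hyperparameter search with scikit-learn; it is not tied to any specific system mentioned above, and the grid values are arbitrary examples:

```python
# Grid-searching hyperparameters for a small classifier: the manual "tune"
# effort the paragraph describes, partially automated by cross-validation.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [8, None]},
    cv=3,        # 3-fold cross-validation per candidate
    n_jobs=-1,   # evaluate candidates in parallel
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```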
Discover how Scepter, Inc. uses big data to reduce methane emissions. Trace gases including methane and carbon dioxide contribute to climate change and impact the health of millions of people across the globe. Scepter aggregates vast datasets, pinpoints emissions, and helps customers like ExxonMobil monitor and mitigate methane releases.
Doug Sillars. Doug is a freelance mobile performance expert, a popular speaker – particularly on the topics of web tuning and image optimization – and the author of High Performance Android Apps. He has a keen interest in web technologies, performance tuning, security, and the practical use of technology.