This article describes three tricks I used when dealing with big data sets (on the order of 10 million records), each of which improved performance dramatically. Trick 1: CLOB Instead of Result Set.
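A minimal sketch of what that first trick can look like from Python, assuming an Oracle backend (the natural context for CLOBs); the python-oracledb driver is real, but the PL/SQL function pkg_export.rows_as_clob and the connection details are hypothetical:

```python
import oracledb  # python-oracledb, the maintained successor to cx_Oracle

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# Instead of streaming ~10M rows through a result set (millions of fetches),
# have the database aggregate the rows server-side into a single CLOB and
# fetch it once. pkg_export.rows_as_clob is a hypothetical PL/SQL function.
clob = cur.callfunc("pkg_export.rows_as_clob", oracledb.DB_TYPE_CLOB, ["2024-01-01"])
payload = clob.read()  # one large transfer instead of row-by-row round trips
```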
Apache Spark is a powerful open-source distributed computing framework that provides a variety of APIs to support big data processing. PySpark applications can be tuned to optimize performance and achieve better execution time, scalability, and resource utilization.
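As an illustration, a hedged PySpark sketch of this kind of tuning; the configuration keys are real Spark options, but the values are placeholders whose optimal settings depend on cluster size and data volume:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-etl")
    # Match shuffle width to data size instead of the default 200 partitions.
    .config("spark.sql.shuffle.partitions", "400")
    # Let Adaptive Query Execution coalesce small partitions at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    # Kryo is usually faster and more compact than Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```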
In my recent Performance Clinic with Stefano Doni, CTO & Co-Founder of Akamas, I made the statement: “Application development and release cycles today are measured in days instead of months. Increased environment complexity and increased delivery frequency require a novel approach to performance optimization.”
Migrating Critical Traffic At Scale with No Downtime — Part 1, by Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah. Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience.
Expect to spend time fine-tuning automation scripts as you find the right balance between automated and manual processing. This kind of automation can support key IT operations such as infrastructure, digital processes, business processes, and big data automation. Big data automation tools.
Apache Spark is a leading platform in the field of big data processing, known for its speed, versatility, and ease of use. However, getting the most out of Spark often involves fine-tuning and optimization. Understanding Apache Spark: Apache Spark is a unified computing engine designed for large-scale data processing.
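One concrete example of the fine-tuning the excerpt mentions is a broadcast join, which ships a small table to every executor instead of shuffling the large one; this is an illustrative sketch, not code from the article:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

facts = spark.createDataFrame([(1, 10.0), (2, 5.5)], ["dim_id", "amount"])
dims = spark.createDataFrame([(1, "US"), (2, "BR")], ["dim_id", "region"])

# Broadcasting the small dimension table avoids a full shuffle of the facts.
joined = facts.join(F.broadcast(dims), "dim_id")
joined.show()
```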
This blog will explore these two systems and how they perform auto-diagnosis and remediation across our Big Data Platform and real-time infrastructure. One example where it can dramatically help is Spark jobs, where memory tuning is a significant challenge. Expand Pensive with Machine Learning classifiers.
The sidecar has been implemented by leveraging the highly performant eBPF along with carefully chosen transport protocols, providing flow data at scale for network insight while consuming less than 1% of CPU and memory on any instance in our fleet.
Operational automation, including but not limited to auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing, is key to the success of modern data platforms. Auto Remediation generates recommendations by considering multiple objectives, performance among them (multi-objective optimization).
Data scientists and engineers usually write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video. The processed data is typically stored as data warehouse tables in AWS S3.
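A minimal sketch of that ETL pattern in PySpark, with made-up S3 paths and column names; it aggregates per-member, per-video data and persists the result as a warehouse table:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("member-video-agg").getOrCreate()

events = spark.read.parquet("s3://raw-bucket/playback_events/")  # placeholder
daily = (
    events.groupBy("member_id", "video_id")
          .agg(F.count("*").alias("view_count"),
               F.sum("watch_seconds").alias("total_watch_seconds"))
)
daily.write.mode("overwrite").parquet("s3://warehouse-bucket/member_video_daily/")
```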
I took a big-data-analysis approach, which started with another problem visualization. I wanted to understand how I could tune Dynatrace’s problem detection, but to do that I needed to understand the situation first. To achieve that, I took two approaches, starting with visualizing historic problem data via a “swimlane visualization.”
As observability and security data converge in modern multicloud environments, there’s more data than ever to orchestrate and analyze. The goal is to turn more data into insights so the whole organization can make data-driven decisions and automate processes.
The service that orchestrates failover uses numpy and scipy to perform numerical analysis, boto3 to make changes to our AWS infrastructure, and rq to run asynchronous workloads, and we wrap it all up in a thin layer of Flask APIs. Our Infrastructure Security team leverages Python to help with IAM permission tuning using Repokid.
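A toy sketch of how those pieces can fit together, using only the libraries the excerpt names (scipy omitted for brevity); every endpoint, queue, and AWS call here is a stand-in, not the actual failover service:

```python
import numpy as np
import boto3
from flask import Flask, jsonify
from redis import Redis
from rq import Queue

app = Flask(__name__)
queue = Queue(connection=Redis())  # rq for asynchronous workloads

def shift_traffic(region: str) -> None:
    # boto3 standing in for "make changes to our AWS infrastructure"
    boto3.client("elbv2", region_name=region).describe_load_balancers()

@app.route("/failover/<region>", methods=["POST"])
def failover(region: str):
    load = float(np.mean([0.7, 0.9, 0.8]))  # placeholder numerical analysis
    queue.enqueue(shift_traffic, region)     # run the heavy work off-request
    return jsonify({"region": region, "estimated_load": load})
```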
In this talk, Jessica Larson shares her takeaways from building a new data platform post-GDPR. Clark Wright, Staff Analytics Engineer at Airbnb, talked about the concept of Data Quality Score at Airbnb. To handle errors efficiently, Netflix developed a rule-based classifier for error classification called “Pensive.” Until next time!
In order to gain visibility into these logs, we need to somehow ingest and enrich this data. It is easier to tune a large Spark job for a consistent volume of data; in other words, we are able to ensure that our Spark app does not “eat” more data than it was tuned to handle. We named this library Sqooby.
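Sqooby itself isn't shown in the excerpt, but one standard way to guarantee a Spark job never ingests more than it was tuned for is the Kafka source's maxOffsetsPerTrigger cap; a stand-in sketch with placeholder broker and topic names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bounded-ingest").getOrCreate()

# maxOffsetsPerTrigger bounds how many records each micro-batch reads, so the
# job processes a consistent, tuned-for volume even if the topic backs up.
stream = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "raw-logs")
         .option("maxOffsetsPerTrigger", "1000000")
         .load()
)
```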
Causal AI—which brings AI-enabled actionable insights to IT operations—and a data lakehouse, such as Dynatrace Grail , can help break down silos among ITOps, DevSecOps, site reliability engineering, and business analytics teams. Business leaders can decide which logs they want to use and tune storage to their data needs.
We at Netflix, as a streaming service running on millions of devices, have a tremendous amount of data about device capabilities and characteristics, as well as runtime data, in our big data platform. With large data comes the opportunity to leverage it for predictive and classification-based analysis.
Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities¹ and processes of a business domain. The compaction process is needed to optimize the performance of downstream queries on the business view as well as lower costs of S3 GET OBJECT operations.
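A hedged sketch of what such a compaction step can look like in Spark: rewriting many small files into a few larger ones, so downstream queries scan fewer objects and issue fewer S3 GETs. Paths and the target file count are placeholders, not the article's actual pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("business-view-compaction").getOrCreate()

partition = "s3://reports-bucket/business_view/date=2024-01-01/"   # placeholder
compacted = "s3://reports-bucket/business_view_compacted/date=2024-01-01/"

# coalesce(16) merges the partition's many small files into 16 larger ones.
spark.read.parquet(partition).coalesce(16) \
     .write.mode("overwrite").parquet(compacted)
```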
By Anupom Syam Background At Netflix, our current data warehouse contains hundreds of Petabytes of data stored in AWS S3 , and each day we ingest and create additional Petabytes. We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing their cost and performance benefits.
by Jun He , Akash Dwivedi , Natallia Dzenisenka , Snehal Chennuru , Praneeth Yenugutala , Pawan Dixit At Netflix, Data and Machine Learning (ML) pipelines are widely used and have become central for the business, representing diverse use cases that go beyond recommendations, predictions and data transformations.
They keep the features that developers like but can handle much more data, similar to NoSQL systems. Notably, they simplify handling big data flows, offer consistent transactions, and sustain high performance even when used for real-time data analysis and complex queries.
For example, a job would reprocess aggregates for the past 3 days because it assumes there is late-arriving data, but data older than 3 days isn’t worth the cost of reprocessing. Backfill: backfilling datasets is a common operation in big data processing.
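The windowed-reprocessing policy described above is simple to make concrete; a minimal sketch assuming a 3-day late-arrival window:

```python
from datetime import date, timedelta

LATE_ARRIVAL_WINDOW_DAYS = 3  # older data isn't worth the reprocessing cost

def partitions_to_reprocess(run_date: date) -> list[date]:
    """Daily partitions the job should recompute on each run."""
    return [run_date - timedelta(days=d) for d in range(LATE_ARRIVAL_WINDOW_DAYS)]

print(partitions_to_reprocess(date(2024, 1, 10)))
# [datetime.date(2024, 1, 10), datetime.date(2024, 1, 9), datetime.date(2024, 1, 8)]
```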
Reading time: 1 min. Why share a library of web performance books when there’s already a substantial collection of fantastic websites and articles on the net? High Performance Browser Networking: this book is about performance problems and the various technologies created to fight them. High Performance Websites.
We see that with our Amazon customers; when they hear a great tune on the radio they may identify it using the Shazam or SoundHound apps on their mobile phone and buy that song instantly from the Amazon MP3 store. Driving down the cost of Big-Data analytics. Introducing the AWS South America (Sao Paulo) Region.
While the technologies have evolved and matured enough, some people still think that MySQL is only for small projects or that it can’t perform well with large tables. With disks being faster nowadays and CPU and memory resources being cheaper, we can easily say MySQL handles TBs of data with good performance.
Each time, the underlying implementation changed a bit while still staying true to the larger phenomenon of “Analyzing Data for Fun and Profit.” They weren’t quite sure what this “data” substance was, but they’d convinced themselves that they had tons of it that they could monetize.
At the QCon London 2024 conference, Félix GV from LinkedIn discussed the AI/ML platform powering the company’s products. He specifically delved into Venice DB, the NoSQL data store used for feature persistence. By Rafal Gancarz.
In this year's CFP we’re looking for topics covering the latest trends and best practices in cloud computing, containerization, machine learning, big data, infrastructure, scalability, DevOps, IT management, automation, reliability, monitoring, performance tuning, security, databases, programming, datacenters, and more.
Microsoft engineering is actually sending quite a few folks over the Atlantic to come talk about SQL Server 2017, SQL Server on Linux, GDPR, Performance, Security, Azure Data Lake, Azure SQL Database, Azure SQL Data Warehouse, and Azure Cosmos DB. Best practices on Building a Big Data Analytics Solution – Michael Rys.
Delta is an eventually consistent, event-driven data synchronization and enrichment platform. Existing Solutions: Dual Writes. In order to keep two datastores in sync, one could perform a dual write: executing a write to one datastore followed by a second write to the other. Please stay tuned.
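A minimal sketch of the dual-write approach with hypothetical datastore clients; its weakness, which motivates a log-based platform like Delta, is that the two writes can fail independently and silently leave the stores out of sync:

```python
def dual_write(record: dict, primary, secondary) -> None:
    """Naive sync: write to one datastore, then the other.

    If the process dies (or the call fails) between the two writes,
    the datastores diverge with no record of the missed update.
    """
    primary.put(record)    # hypothetical client API
    secondary.put(record)  # may fail independently of the first write
```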
Effectively applying AI involves extensive manual effort to develop and tune many different types of machine learning and deep learning algorithms (e.g. automatic speech recognition, natural language understanding, image classification), collect and clean the training data, and train and tune the machine learning models.
Reading time: 16 min. Whether you’re a web performance expert, an evangelist for the culture of performance, a web engineer incorporating performance into your process, or someone entirely new to web performance, you probably identify as curious, excited about new ideas, and always learning. Rick Byers.
In this lightning talk, learn how customers are using AWS to perform millions of calculations on real-time grid data to execute the scenario analysis, simulations, and operational planning necessary to operate a dynamic power grid. This presentation is brought to you by Wherobots, an AWS Partner. Discover how Scepter, Inc.
With the latest ClickHouse version, all of these features are available, but some of them may not perform fast enough. Currently, an issue is open to make the “tailing” based on the primary key much faster: slow ORDER BY primary key with small LIMIT on big data. Updating a single message in the past (e.g.
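The “tailing” pattern that issue is about, sketched with the clickhouse-driver Python client; the table and column names are illustrative:

```python
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client("localhost")

# Newest rows first with a small LIMIT: cheap in principle, but historically
# slow in ClickHouse when the ORDER BY runs against the primary key of a
# very large table, which is what the open issue tracks.
rows = client.execute("SELECT * FROM logs ORDER BY timestamp DESC LIMIT 100")
```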