Big Data - Technology Performance Pulse

3 Performance Tricks for Dealing With Big Data Sets

DZone

AUGUST 21, 2021

This article describes 3 different tricks that I used in dealing with big data sets (order of 10 million records) and that proved to enhance performance dramatically. Trick 1: CLOB Instead of Result Set.

Big Data

Big Data Performance Tuning Mobile

Write Optimized Spark Code for Big Data Applications

DZone

MARCH 7, 2023

Apache Spark is a powerful open-source distributed computing framework that provides a variety of APIs to support big data processing. PySpark is the Python API for Apache Spark, which allows Python developers to write Spark applications using Python instead of Scala or Java.

Big Data

Big Data Code Tuning Open Source

Cutting Big Data Costs: Effective Data Processing With Apache Spark

DZone

SEPTEMBER 14, 2023

In today's data-driven world, efficient data processing plays a pivotal role in the success of any project. Apache Spark, a robust open-source data processing framework, has emerged as a game-changer in this domain.

Big Data

Big Data Processing Games Open Source

DataCentral: Uber’s Big Data Observability and Chargeback Platform

Uber Engineering

MARCH 21, 2024

Discover real-time query analytics and governance with DataCentral: Uber’s big data observability powerhouse, tackling millions of queries in petabyte-scale environments.

Big Data

Big Data Government Analytics

In-Stream Big Data Processing

Highly Scalable

AUGUST 20, 2013

The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. Other flows are more sophisticated: one Storm topology can pass the data to another topology via Kafka or Cassandra. Towards Unified Big Data Processing. Apache Spark [10].

Big Data

Big Data Processing Lambda Database

Uber’s Big Data Platform: 100+ Petabytes with Minute Latency

Uber Engineering

OCTOBER 17, 2018

To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks …

Big Data

Big Data Latency Transportation Traffic

How Amazon is solving big-data challenges with data lakes

All Things Distributed

JANUARY 20, 2020

Back when Jeff Bezos filled orders in his garage and drove packages to the post office himself, crunching the numbers on costs, tracking inventory, and forecasting future demand was relatively simple.

Big Data

Big Data Logistics Retail Government

Kubernetes for Big Data Workloads

Abhishek Tiwari

DECEMBER 27, 2017

Kubernetes has emerged as the go-to container orchestration platform for data engineering teams. In 2018, widespread adoption of Kubernetes for big data processing is anticipated. Organisations are already using Kubernetes for a variety of workloads [1] [2] and data workloads are up next. Key challenges. Performance.

Big Data

Big Data Storage Benchmarking Hardware

Databook: Turning Big Data into Knowledge with Metadata at Uber

Uber Engineering

AUGUST 3, 2018

From driver and rider locations and destinations, to restaurant orders and payment transactions, every interaction on Uber’s transportation platform is driven by data.

Big Data

Big Data Transportation Engineering Storage

An overview of end-to-end entity resolution for big data

The Morning Paper

DECEMBER 13, 2020

An overview of end-to-end entity resolution for big data, Christophides et al. It’s an important part of many modern data workflows, and an area I’ve been wrestling with in one of my own projects. ACM Computing Surveys, Dec. 2020, Article No.

Big Data

Big Data Open Source Processing Analytics

Experiences with approximating queries in Microsoft’s production big-data clusters

The Morning Paper

SEPTEMBER 8, 2019

Experiences with approximating queries in Microsoft’s production big-data clusters, Kandula et al. Microsoft’s big data clusters have tens of thousands of machines, and are used by thousands of users to run some pretty complex queries. VLDB’19. For the larger more production-like query analysed in §4.2.1,

Big Data

Big Data Analytics Latency Azure

Snowflake Workload Optimization

DZone

AUGUST 23, 2023

In the era of big data, efficient data management and query performance are critical for organizations that want to get the best operational performance from their data investments.

Big Data

Big Data Analytics Innovation Efficiency

Driving down the cost of Big-Data analytics - All Things Distributed

All Things Distributed

AUGUST 18, 2011

The Amazon Elastic MapReduce (EMR) team announced today the ability to seamlessly use Amazon EC2 Spot Instances with their service, significantly driving down the cost of data analytics in the cloud.

Big Data

Big Data Analytics AWS Scalability

Scaling for Success: Why Scalability Is the Forefront of Modern Applications

DZone

JUNE 13, 2023

The reason is straightforward: today, applications generate enormous amounts of data. As we embrace new technologies like cloud computing, big data analysis, and the Internet of Things (IoT), there is a noticeable spike in the amount of data generated from different applications.

Scalability

Scalability IoT Big Data Internet

Introduction to Grafana, Prometheus, and Zabbix

DZone

FEBRUARY 6, 2024

If the data sources are not available, then customized plugins can be developed to integrate these data sources. Grafana is used widely these days to monitor and visualize the metrics for hundreds or thousands of servers, Kubernetes Platforms, Virtual Machines, Big Data Platforms, etc.

Big Data

Big Data Open Source Virtualization Metrics

Turbocharge Your Apache Spark Jobs for Unmatched Performance

DZone

JULY 17, 2023

Apache Spark is a leading platform in the field of big data processing, known for its speed, versatility, and ease of use. Understanding Apache Spark. Apache Spark is a unified computing engine designed for large-scale data processing. However, getting the most out of Spark often involves fine-tuning and optimization.

Big Data

Big Data Performance Open Source Tuning

Introduction to Azure Data Lake Storage Gen2

DZone

FEBRUARY 1, 2023

Built on Azure Blob Storage, Azure Data Lake Storage Gen2 is a suite of features for big data analytics. Azure Data Lake Storage Gen1 and Azure Blob Storage's capabilities are combined in Data Lake Storage Gen2. For instance, Data Lake Storage Gen2 offers scale, file-level security, and file system semantics.

Azure

Azure Storage Big Data Analytics

Data Engineers of Netflix — Interview with Pallavi Phadnis

The Netflix TechBlog

OCTOBER 28, 2021

Netflix’s unique work culture and petabyte-scale data problems are what drew me to Netflix. During earlier years of my career, I primarily worked as a backend software engineer, designing and building the backend systems that enable big data analytics.

Data Engineering

Data Engineering Engineering Big Data Software Engineering

Data Engineers of Netflix — Interview with Kevin Wylie

The Netflix TechBlog

JULY 15, 2021

I was later hired into my first purely data gig where I was able to deepen my knowledge of big data. After that, I joined MySpace back at its peak as a data engineer and got my first taste of data warehousing at internet-scale. In the data engineering space, very little of the same technology remains.

Data Engineering

Data Engineering Engineering Entertainment Big Data

Stream Processing vs. Batch Processing: What to Know

DZone

JANUARY 31, 2023

Big data is at the center of all business decisions these days. It refers to large volumes of data generated through different sources, and this data then provides the foundation for business decisions. There are different ways through which we can process data.

Processing

Processing Big Data Systems

What is software automation? Optimize the software lifecycle with intelligent automation

Dynatrace

JUNE 26, 2023

Software analytics offers the ability to gain and share insights from data emitted by software systems and related operational processes to develop higher-quality software faster while operating it efficiently and securely. This involves big data analytics and applying advanced AI and machine learning techniques, such as causal AI.

Software

Software Software Analytics Big Data

Auto-Diagnosis and Remediation in Netflix Data Platform

The Netflix TechBlog

JANUARY 13, 2022

This blog will explore these two systems and how they perform auto-diagnosis and remediation across our Big Data Platform and Real-time infrastructure. The streaming platform recently added Data Mesh, and we need to expand Streaming Pensive to cover that.

Big Data

Big Data Infrastructure Metrics Hardware

Offline Data Pipeline Best Practices Part 1: Optimizing Airflow Job Parameters for Apache Hive

DZone

DECEMBER 27, 2023

Optimizing offline data pipelines has become a necessity with the growing complexity and scale of modern data pipelines. In this kickoff post, we delve into the intricacies of Apache Airflow and AWS EMR, a managed cluster platform for big data processing.

Best Practices

Best Practices Data Engineering Big Data Games

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices

The Morning Paper

MAY 14, 2019

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices, Gan et al., ASPLOS’19.

Big Data

Big Data Cloud Performance Hardware

Revolutionizing System Testing With AI and ML

DZone

JUNE 6, 2023

This can include the use of cloud computing, artificial intelligence, big data analytics, the Internet of Things (IoT), and other digital tools. The digital transformation of businesses involves the adoption of digital technologies to change the way companies operate and deliver value to their customers.

Artificial Intelligence

Artificial Intelligence Systems IoT Testing

Moving HPC to the Cloud: A Guide for 2020

High Scalability

SEPTEMBER 14, 2020

This is a guest post by Limor Maayan-Wainstein, a senior technical writer with 10 years of experience writing about cybersecurity, big data, cloud computing, web development, and more. High performance computing (HPC) enables you to solve complex problems that cannot be solved by regular computing.

Cloud

Cloud Big Data Virtualization Efficiency

What is Greenplum Database? Intro to the Big Data Database

Scalegrid

MAY 13, 2020

When handling large amounts of complex data, or big data, chances are that your main machine might start getting crushed by all of the data it has to process in order to produce your analytics results. Greenplum features a cost-based query optimizer for large-scale, big data workloads. Query Optimization.

Big Data

Big Data Database Artificial Intelligence Open Source

EDI and API: Which Trends Are Transforming the Modern Supply Chain Management?

DZone

JULY 22, 2022

Honestly, these two terms have recently been doing rounds in the big data world. These technologies specialize in transmitting large amounts of data across different trading partners and companies.

Big Data

Big Data Technology Technology Systems

Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data…

The Netflix TechBlog

MARCH 4, 2024

Operational automation (including but not limited to auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing) is key to the success of modern data platforms.

Tuning

Tuning Efficiency Big Data Engineering

How to Optimize Elasticsearch for Better Search Performance

DZone

JULY 29, 2019

In today's world, data is generated in high volumes, and to make something out of it, the extracted data needs to be transformed, stored, maintained, governed, and analyzed. These processes are only possible with the distributed architecture and parallel processing mechanisms that Big Data tools are based on.

Big Data

Big Data Government Open Source Storage

Big / Bug Data: Analyzing the Apache Flink Source Code

DZone

DECEMBER 21, 2020

Applications used in the field of Big Data process huge amounts of information, and this often happens in real time. Naturally, such applications must be highly reliable so that no error in the code can interfere with data processing. It is an open-source framework for distributed processing of large amounts of data.

Code

Code Java Big Data Open Source

What Should You Know About Graph Database’s Scalability?

DZone

JANUARY 20, 2023

Having a distributed and scalable graph database system is highly sought after in many enterprise scenarios.

Scalability

Scalability Big Data Hardware Internet

Optimizing dbt and Google’s BigQuery

DZone

DECEMBER 21, 2020

Setting up a data warehouse is the first step towards fully utilizing big data analysis. Still, it is one of many that need to be taken before you can generate value from the data you gather. An important step in that chain of the process is data modeling and transformation.

Big Data

Big Data Google Scalability Processing

What is IT automation?

Dynatrace

JULY 6, 2022

This kind of automation can support key IT operations, such as infrastructure, digital processes, business processes, and big-data automation. Big data automation tools. These tools provide the means to collect, transfer, and process large volumes of data that are increasingly common in analytics applications.

Artificial Intelligence

Artificial Intelligence Tuning Strategy Big Data

Top 15 Software Testing Trends to Watch Out in 2021

DZone

DECEMBER 28, 2020

Nowadays, Big Data tests mainly include data testing, paving the way for the Internet of Things to become the center point. Digital transformation is yet another significant focus point for the sectors and the enterprises that are ranking top on cloud and business analytics. Besides, AI and ML seem to reach a new level.

Software

Software Software Testing Big Data

ScyllaDB Trends – How Users Deploy The Real-Time Big Data Database

Scalegrid

NOVEMBER 25, 2019

Google Cloud does offer its own wide-column store and big data database, called Bigtable, which is actually ranked #111, one under ScyllaDB at #110 on DB-Engines. Google Cloud Platform (GCP) was the second most popular cloud provider for ScyllaDB, coming in at 30.4% of all cloud deployments.

Big Data

Big Data Database Open Source Azure

Spark Analysers: Catching Anti-Patterns In Spark Apps

Uber Engineering

JUNE 1, 2023

Uber runs more than 100K big data workloads per day using Apache Spark; at that scale it’s crucial to write optimized apps. The Delivery Data Solutions team built Spark Analysers, a real-time system to catch anti-patterns in the Spark application at Uber scale, helping Uber developers optimize their apps.

Big Data

Big Data Systems Development

No need to compromise visibility in public clouds with the new Azure services supported by Dynatrace

Dynatrace

JULY 6, 2020

Our customers have frequently requested support for this first new batch of services, which cover databases, big data, networks, and computing. See the health of your big data resources at a glance. Azure HDInsight supports a broad range of use cases including data warehousing, machine learning, and IoT analytics.

Azure

Azure Cloud Big Data Virtualization

Scaling Uber’s Apache Hadoop Distributed File System for Growth

Uber Engineering

APRIL 5, 2018

Three years ago, Uber Engineering adopted Hadoop as the storage (HDFS) and compute (YARN) infrastructure for our organization’s big data analysis.

Systems

Systems Big Data Storage Infrastructure

The Need for Real-Time Device Tracking

ScaleOut Software

JULY 19, 2021

And it can maintain contextual information about every data source (like the medical history of a device wearer or the maintenance history of a refrigeration system) and keep it immediately at hand to enhance the analysis.

IoT

IoT Analytics Big Data Architecture

Data Engineers of Netflix — Interview with Samuel Setegne

The Netflix TechBlog

JUNE 1, 2021

clinical data was often small enough to fit into memory on an average computer and only in rare cases would its computation require any technical ingenuity or massive computing power. There was not enough scope to explore the distributed and large-scale computing challenges that usually come with big data processing.

Data Engineering

Data Engineering Engineering Big Data Healthcare

Data Engineers of Netflix — Interview with Dhevi Rajendran

The Netflix TechBlog

JUNE 1, 2021

At Netflix, the work that data engineers do to produce data in a robust, scalable way is incredibly important to provide the best experience to our members as they interact with our service.

Data Engineering

Data Engineering Engineering Software Engineering Big Data

When Performance Matters, Think NVMe

DZone

MAY 21, 2019

The demand for more IT resource-intensive applications has significantly increased today, whether it is to process quicker transactions, gain real-time insight, crunch big data sets, or to meet customer expectations. That’s because NVMe provides 6x higher bandwidth and IOPS advantage compared to SAS/SATA SSD.

Performance

Performance Big Data Storage Processing

Kubernetes in the wild report 2023

Dynatrace

JANUARY 16, 2023

Big data: To store, search, and analyze large datasets, 32% of organizations use Elasticsearch. Continuous integration and delivery: ArgoCD, Flux, Gitlab, and Jenkins are the most widely adopted CI/CD tools. Security: For security, organizations mostly use policy checkers and enforcers, such as Gatekeeper.

Open Source

Open Source Java Operating System Programming
