DEV Community: apachespark

Aligning Timeouts in Distributed Orchestration: Why Equal Airflow and Spark Limits Lead to Race Conditions

Reinaldo Del Dotore — Sun, 17 May 2026 18:19:24 +0000

Recently, I reviewed an Airflow DAG where each task submits a single Spark job. I found this configuration:

execution_timeout_minutes: 60 
spark-job-timeout-minutes: 60

At first glance, it looks redundant. Two timeouts, same value. Why do both exist? The answer reveals something important about how Airflow and Spark interact.

Different Layers, Different Clocks

execution_timeout_minutes is the Airflow task timeout. Its clock starts when the task enters the running state and covers everything: job submission to the cluster, queue wait time, Spark execution, status polling, and cleanup.
spark-job-timeout-minutes applies only to the Spark application running on the cluster. It says: "if this application runs longer than X, abort it". It does not include submission overhead, queueing time, or any work the Airflow task performs before or after Spark execution.

Key Takeaway
These are two different clocks measuring two different things, and the Airflow clock starts ticking before the Spark application even exists.

The Problem with Setting Them Equal

With a 60/60 configuration, which timeout triggers first becomes timing-dependent. And because Airflow starts counting earlier, it tends to hit its timeout first in practice.

That is the worst-case scenario: Airflow terminates the task before Spark shuts down properly. Depending on the integration being used, the Spark job may keep running on the cluster as an orphan, consuming resources until someone notices. Orphaned jobs are one of the biggest hidden cost drivers in shared clusters: they consume CPU and memory, and sometimes keep autoscaled nodes alive long after the orchestrator has given up.

The desired behavior is the opposite: Spark should hit its own timeout first, fail cleanly, and allow the Airflow task to receive that failure within its own execution window. In distributed systems, the layer responsible for the actual processing should ideally detect and terminate problematic execution first.
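The race can be made concrete with a small sketch. It assumes a Spark job that hangs forever and a hypothetical 3 minutes of submission and queueing before the Spark application starts; the function name and that overhead value are illustrative, not taken from the DAG.

```python
def first_timeout_to_fire(airflow_timeout_min, spark_timeout_min, pre_spark_overhead_min):
    """Return which timeout fires first for a Spark job that never finishes.

    All times are measured from the moment the Airflow task starts running.
    """
    airflow_fires_at = airflow_timeout_min
    # The Spark clock only starts once the application is running on the cluster.
    spark_fires_at = pre_spark_overhead_min + spark_timeout_min
    # On a tie, Airflow is reported first, because it still has to observe the Spark failure.
    return "spark" if spark_fires_at < airflow_fires_at else "airflow"


print(first_timeout_to_fire(60, 60, 3))  # airflow (Spark would only fire at minute 63)
print(first_timeout_to_fire(20, 15, 3))  # spark (fires at minute 18, before Airflow's 20)
```

In the second case Airflow still has about two minutes to poll the status and record the failure.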

A Practical Rule
execution_timeout_minutes > spark-job-timeout-minutes

The gap between them must absorb submission time, queueing, polling, and cleanup: components that typically add a few minutes even for small jobs.

Since this overhead tends to vary little within the same environment, think in absolute time, not percentages:

Warm, fixed cluster: +5 min
Livy/REST submission with moderate queueing: +10 min
Ephemeral clusters (EMR on-demand, Databricks job clusters): +15 to 20 min
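As a rough sketch, the margins above can be encoded as a lookup so the Airflow limit is always derived from the Spark limit instead of being set independently. The environment keys and the function name are hypothetical, and the upper bound (20 min) is assumed for ephemeral clusters.

```python
# Margins in minutes, taken from the list above. Key names are illustrative.
MARGIN_MINUTES = {
    "warm_fixed_cluster": 5,
    "livy_rest_moderate_queue": 10,
    "ephemeral_cluster": 20,  # upper bound of the 15 to 20 minute range
}


def airflow_timeout_minutes(spark_timeout_minutes, environment):
    """Derive the Airflow task timeout from the Spark timeout plus a fixed margin."""
    return spark_timeout_minutes + MARGIN_MINUTES[environment]


print(airflow_timeout_minutes(15, "warm_fixed_cluster"))  # 20
```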

The Adjustment

Looking at the execution history, this DAG usually completed in 4 to 5 minutes. The original 60-minute limits were simply inherited defensive defaults nobody had revisited.

I reduced them to:

spark-job-timeout-minutes: 15 
execution_timeout_minutes: 20

This is roughly three to four times the observed average runtime — enough to absorb normal variance and occasional spikes without masking real hangs.

Inflated timeouts do not protect anything: they only delay alerts when something is genuinely stuck.

Final Thoughts

Timeouts are not arbitrary numbers. Each exists at a different layer (the orchestrator and the execution engine) with different responsibilities.
When they are aligned correctly (the orchestrator having some margin over the execution engine), failures become predictable.
When they are equal, you create a race that hides real problems.
And the correct value is rarely the one someone set two years ago and never reviewed again.
Timeouts are not safety nets: they are alarms. And alarms only work when they ring at the right time.


Broadcast Joins vs. Sort-Merge Joins: Choosing the Right Join Strategy in Apache Spark

harshvardhan — Tue, 12 May 2026 17:17:41 +0000

In distributed data processing systems such as Apache Spark, joins are among the most expensive operations. The strategy used to join datasets can significantly impact execution time, memory consumption, and overall cluster performance. Two of the most widely used join techniques are Broadcast Joins and Sort-Merge Joins.

Although both are designed to combine datasets efficiently, they solve different performance challenges. Understanding when to use each can help optimize ETL pipelines, analytics workloads, and large-scale data processing applications.

What Is a Broadcast Join?

A Broadcast Join is typically used when one dataset is very small compared to the other. Instead of shuffling both datasets across the cluster, the smaller table is copied, or “broadcast,” to every executor. Each executor then performs the join locally with its partitions of the larger dataset.

For example:

Orders table → 2 TB
Product table → 10 MB

Rather than moving the 2 TB dataset over the network, the system distributes the 10 MB product table to all executors and joins locally. This avoids expensive shuffle operations and greatly improves performance.

In Apache Spark, Broadcast Joins are commonly implemented using hash joins internally and are especially effective in star-schema data warehouse models where large fact tables are joined with small dimension tables.
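To illustrate the idea without a cluster, here is a minimal pure-Python sketch of a broadcast hash join. It is not Spark's implementation: partitions are plain lists, the join is an inner join, and the function and variable names are illustrative.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join partitions of (key, value) rows against a small (key, value) table."""
    # Build the hash table once from the small side; this is the part that is "broadcast".
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined = []
    # Each large-side partition is probed independently, so no large-side rows move.
    for partition in large_partitions:
        for key, value in partition:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined


orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
products = [(1, "widget"), (2, "gadget")]
print(broadcast_hash_join(orders, products))
# [(1, 'order-a', 'widget'), (2, 'order-b', 'gadget'), (1, 'order-c', 'widget')]
```

The order with key 3 has no matching product and is dropped, as expected for an inner join.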

Benefits of Broadcast Joins

Broadcast Joins are extremely fast for small-large joins because they minimize network shuffling. Since the large dataset remains partitioned as-is, execution becomes more efficient and query latency decreases significantly.

Other advantages include:

Reduced shuffle and disk spill.
Faster execution for lookup-style joins.
Excellent performance for dimension tables.
Ideal for interactive analytics workloads.

However, Broadcast Joins also have limitations. The smaller dataset must fit comfortably into executor memory. Broadcasting a table that is too large can cause memory pressure, garbage collection overhead, or executor failures. In very large clusters, repeatedly distributing even moderately sized tables can also become expensive.

A typical Spark example looks like this:

from pyspark.sql.functions import broadcast

result = large_df.join(
    broadcast(small_df),
    "customer_id"
)

Here, small_df is explicitly broadcast to all worker nodes.

What Is a Sort-Merge Join?

A Sort-Merge Join (SMJ) is designed for situations where both datasets are large and broadcasting is impractical. Instead of replicating data, both datasets are shuffled across the cluster so rows with matching join keys end up on the same executor.

The process usually involves three stages:

Repartitioning both datasets on the join key
Sorting data within each partition
Merging sorted partitions to generate joined rows
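The three stages can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: it assumes integer join keys, uses modulo as a stand-in for hash partitioning, and performs an inner join.

```python
from collections import defaultdict


def sort_merge_join(left_rows, right_rows, num_partitions=2):
    """Inner-join two lists of (int_key, value) rows with a sort-merge strategy."""
    # Stage 1: repartition both sides on the join key.
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for key, value in left_rows:
        left_parts[key % num_partitions].append((key, value))
    for key, value in right_rows:
        right_parts[key % num_partitions].append((key, value))

    joined = []
    for p in range(num_partitions):
        # Stage 2: sort each partition by key.
        left = sorted(left_parts[p], key=lambda row: row[0])
        right = sorted(right_parts[p], key=lambda row: row[0])
        # Stage 3: merge the two sorted runs with two cursors.
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i][0] < right[j][0]:
                i += 1
            elif left[i][0] > right[j][0]:
                j += 1
            else:
                key = left[i][0]
                # Find the end of the run of equal keys on the right side.
                j_end = j
                while j_end < len(right) and right[j_end][0] == key:
                    j_end += 1
                # Pair every left row carrying this key with the right-side run.
                while i < len(left) and left[i][0] == key:
                    for _, right_value in right[j:j_end]:
                        joined.append((key, left[i][1], right_value))
                    i += 1
                j = j_end
    return joined


left = [(3, "l3"), (1, "l1a"), (2, "l2"), (1, "l1b")]
right = [(1, "r1"), (2, "r2a"), (2, "r2b"), (4, "r4")]
print(sorted(sort_merge_join(left, right)))
```

Only one run of equal keys per side is held at a time during the merge, which is why this strategy does not need either input to fit in memory.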

Consider this example:

Customer events → 4 TB
Transaction logs → 3 TB

Since neither table is small enough to broadcast, a Sort-Merge Join becomes the preferred strategy.

Sort-Merge Joins are highly scalable and are commonly used in enterprise ETL pipelines and large data lake architectures. Unlike Broadcast Joins, they process sorted streams incrementally, making them more memory-efficient for huge datasets.

Benefits of Sort-Merge Joins

The biggest advantage of Sort-Merge Joins is scalability. They can efficiently handle joins involving terabytes or petabytes of data without requiring one dataset to fit in memory.

Additional advantages include:

Suitable for very large distributed joins
More stable for batch processing workloads
Better memory handling for massive datasets
Works well with partitioned or pre-sorted data

Despite these strengths, Sort-Merge Joins are more expensive than Broadcast Joins because they involve heavy shuffling and sorting operations. Network transfer, CPU usage, and disk I/O can become significant bottlenecks, especially when data skew exists.

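In Spark, Sort-Merge Join is the usual default for large equi-joins. The example below disables automatic broadcasting so the planner cannot choose a Broadcast Join on its own: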

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

result = large_df1.join(
    large_df2,
    "customer_id"
)

Disabling automatic broadcast forces Spark to select another strategy, commonly Sort-Merge Join.

How Spark Automatically Chooses Join Types

Apache Spark uses the Catalyst Optimizer and cost-based optimization techniques to decide which join strategy to use.

By default:

Small tables below the broadcast threshold are broadcast
Large joins typically use Sort-Merge Join

The key configuration is:

spark.sql.autoBroadcastJoinThreshold

Default value: 10 MB

If a dataset is smaller than this threshold, Spark may automatically choose a Broadcast Join.

Modern Spark versions also support Adaptive Query Execution (AQE), which can dynamically switch join strategies during runtime. For instance, Spark may initially plan a Sort-Merge Join but later convert it into a Broadcast Join if runtime statistics reveal that one dataset is small enough.
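The threshold rule can be sketched as a small decision function. This is a deliberate simplification: the real planner also considers join type, hints, and the quality of its statistics. The function name is illustrative, and the comparison is assumed to be inclusive.

```python
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10 MB default


def choose_join_strategy(left_bytes, right_bytes,
                         threshold_bytes=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """Pick a join strategy from estimated table sizes (simplified sketch)."""
    # A threshold of -1 disables automatic broadcasting entirely.
    if threshold_bytes >= 0 and min(left_bytes, right_bytes) <= threshold_bytes:
        return "broadcast"
    return "sort_merge"


TB = 1024 ** 4
MB = 1024 ** 2
print(choose_join_strategy(2 * TB, 10 * MB))                   # broadcast
print(choose_join_strategy(4 * TB, 3 * TB))                    # sort_merge
print(choose_join_strategy(2 * TB, 10 * MB, threshold_bytes=-1))  # sort_merge
```

Calling the function once with planning-time estimates and again with sizes measured at runtime mirrors the idea behind AQE: the same rule, re-evaluated with better numbers.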

Performance Optimization Tips

For Broadcast Joins:

Keep broadcast tables small
Remove unnecessary columns before joining
Apply filters early
Avoid broadcasting medium-sized datasets without memory analysis

For Sort-Merge Joins:

Repartition datasets carefully
Use high-cardinality join keys when possible
Optimize skewed data distributions
Enable adaptive query execution

Data skew remains one of the biggest challenges in distributed joins. A few heavily repeated keys can overload certain executors and slow down the entire pipeline. Techniques such as salting and skew join optimization can help mitigate these issues.
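Salting can be illustrated with a small pure-Python sketch. It assumes a round-robin salt so the result is deterministic (real pipelines often use a random salt), and all names are illustrative.

```python
from collections import Counter, defaultdict


def salted_join(large_rows, small_rows, num_salts):
    """Inner-join two lists of (key, value) rows using a salted key."""
    # 1) Spread each large-side key over num_salts sub-keys (round-robin salt).
    salted_large = [((key, i % num_salts), value)
                    for i, (key, value) in enumerate(large_rows)]

    # 2) Replicate every small-side row once per salt so each sub-key can match.
    small_index = defaultdict(list)
    for key, value in small_rows:
        for salt in range(num_salts):
            small_index[(key, salt)].append(value)

    # 3) Join on the salted key, then drop the salt from the output.
    joined = [(key, large_value, small_value)
              for (key, salt), large_value in salted_large
              for small_value in small_index.get((key, salt), [])]

    # Rows per salted key: a proxy for how much work lands on a single task.
    group_sizes = Counter(salted_key for salted_key, _ in salted_large)
    return joined, group_sizes


large = [("hot", i) for i in range(8)] + [("cold", 0)]
small = [("hot", "H"), ("cold", "C")]
joined, group_sizes = salted_join(large, small, num_salts=4)
print(len(joined), max(group_sizes.values()))  # 9 2
```

Without salting, all 8 "hot" rows would land in one group. With 4 salts the largest group holds 2 rows, at the cost of replicating the small side 4 times.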

How I debugged a Delta Lake DESCRIBE HISTORY timeout (and what's actually causing it)

Abhishek Ambare — Mon, 04 May 2026 06:13:16 +0000

If you have ever run DESCRIBE HISTORY on a Delta table that receives streaming data every 60 seconds and watched it either hang for hours or crash with an OutOfMemoryError, you are not alone and you are not doing anything wrong. The problem is architectural, and once you understand the internals, the fix becomes a lot clearer.

Here is what I learned after digging into why this happens and what you can actually do about it.

How the Delta transaction log works
Every write to a Delta table (INSERT, UPDATE, DELETE, MERGE, or a schema change) is recorded as a JSON file in a directory called _delta_log at the root of the table. Files are named with zero-padded twenty-digit integers:

_delta_log/
├── 00000000000000000000.json
├── 00000000000000000001.json
├── 00000000000000000002.json
...
├── 00000000000000000010.checkpoint.parquet
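The naming scheme for commit files is simple enough to sketch in one line. The helper name is illustrative.

```python
def commit_file_name(version):
    """Return the _delta_log file name for a commit version (zero-padded to 20 digits)."""
    return f"{version:020d}.json"


print(commit_file_name(0))  # 00000000000000000000.json
print(commit_file_name(2))  # 00000000000000000002.json
```

Zero-padding keeps lexicographic order identical to numeric order, so a sorted directory listing is also a version-ordered listing.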

Each JSON file contains a sequence of "actions", one JSON object per line:

{
  "commitInfo": {
    "timestamp": 1714915200000,
    "operation": "STREAMING UPDATE",
    "operationMetrics": {
      "numOutputRows": "1240",
      "scanTimeMs": "320"
    },
    "isolationLevel": "WriteSerializable",
    "isBlindAppend": true
  }
}

{
  "add": {
    "path": "part-00001-abc123.snappy.parquet",
    "partitionValues": {},
    "size": 1048576,
    "stats": "{\"numRecords\":1240,\"minValues\":{...},\"maxValues\":{...}}"
  }
}
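As a rough sketch of what a history query has to do for every commit, the snippet below pulls the commitInfo action out of one commit file's text. It assumes one JSON action per line; the sample content is abbreviated and made up for illustration, and the function name is hypothetical.

```python
import json


def extract_commit_info(commit_text):
    """Return the commitInfo action from newline-delimited commit JSON, or None."""
    for line in commit_text.splitlines():
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "commitInfo" in action:
            return action["commitInfo"]
    return None


# Abbreviated sample commit: one commitInfo action and one add action.
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"timestamp": 1714915200000, "operation": "STREAMING UPDATE"}}),
    json.dumps({"add": {"path": "part-00001-abc123.snappy.parquet", "size": 1048576}}),
])
print(extract_commit_info(sample_commit)["operation"])  # STREAMING UPDATE
```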

By default, every 10 commits, Delta generates a Parquet checkpoint file that captures the entire active table state as a compressed, columnar snapshot. When you run a normal query, Spark reads the latest checkpoint and applies only the few JSON commits written after it, which is why standard queries stay fast.
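A minimal sketch of which log files a normal snapshot read needs, assuming a checkpoint is written at every multiple of the checkpoint interval and none at version 0. The function name is illustrative and real readers handle more cases (multi-part checkpoints, missing files).

```python
def files_for_snapshot(latest_version, checkpoint_interval=10):
    """Return (checkpoint_version or None, commit versions to replay after it)."""
    checkpoint_version = (latest_version // checkpoint_interval) * checkpoint_interval
    if checkpoint_version == 0:
        # No checkpoint yet: replay every commit from version 0.
        return None, list(range(0, latest_version + 1))
    return checkpoint_version, list(range(checkpoint_version + 1, latest_version + 1))


print(files_for_snapshot(27))  # (20, [21, 22, 23, 24, 25, 26, 27])
print(files_for_snapshot(7))   # (None, [0, 1, 2, 3, 4, 5, 6, 7])
```

Under these assumptions a snapshot read never replays more than nine JSON commits, no matter how long the table has existed.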

Why DESCRIBE HISTORY cannot use checkpoints
This is the core issue. The Delta protocol explicitly drops commitInfo when writing checkpoints. Checkpoints are optimized for state reconstruction, not provenance. So when you run:

DESCRIBE HISTORY my_streaming_table;

or in Python:

deltaTable.history().show()

Spark gets zero benefit from checkpoints. It has to parse every JSON file in _delta_log from scratch to extract the commitInfo blocks.

A pipeline that triggers every 60 seconds generates 1,440 commits per day. After a year, that is over half a million JSON files Spark has to read sequentially for a single DESCRIBE HISTORY call.
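The arithmetic is easy to check, assuming every trigger produces exactly one commit and no log files are cleaned up. The function name is illustrative, and the 5-minute trigger is included only for comparison.

```python
def commits_per_day(trigger_seconds):
    """Number of commits per day if every trigger writes exactly one commit."""
    return 24 * 60 * 60 // trigger_seconds


print(commits_per_day(60))        # 1440
print(commits_per_day(60) * 365)  # 525600
print(commits_per_day(300))       # 288 (a 5-minute trigger, for comparison)
```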

The three things that actually make it slow

Cloud storage listing overhead

AWS S3, Azure ADLS, and GCS do not have real directory structures. Listing _delta_log requires paginated API calls. S3's ListObjectsV2 returns at most 1,000 keys per request, so listing one million JSON files means 1,000 sequential HTTP requests before a single read task is scheduled. This is a pure I/O bottleneck. Adding more workers does not help here.
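The listing cost follows directly from the page size. A small sketch, assuming the 1,000-key page limit quoted above and a hypothetical helper name:

```python
import math


def list_requests_needed(num_objects, page_size=1000):
    """Number of paginated list calls needed to enumerate num_objects keys."""
    return math.ceil(num_objects / page_size)


print(list_requests_needed(1_000_000))  # 1000
print(list_requests_needed(525_600))    # 526 (one year of 60-second commits)
```

Each page depends on the continuation token returned by the previous one, so these calls cannot be issued in parallel.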

Small file JSON parsing

JSON is row-based text. Each two-kilobyte file requires its own storage request to open, a full text parse to find the nested commitInfo struct, and type casting on every field. Multiply that by millions of files and executor CPU gets overwhelmed.

Driver OOM on shuffle

After executor nodes parse the JSON files, they shuffle the commitInfo structs back to the driver for aggregation and sorting. The driver's JVM heap has to hold all of this at once. When millions of records with nested maps like operationMetrics and operationParameters hit the driver simultaneously, you get:

java.lang.OutOfMemoryError: GC overhead limit exceeded

And the query dies.

What you can do about it
Reduce log retention (immediate impact)

ALTER TABLE my_streaming_table
SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 7 days',
  'delta.deletedFileRetentionDuration' = 'interval 7 days'
);

This tells Delta to purge old JSON commit files during checkpointing. DESCRIBE HISTORY will now only parse 7 days of history instead of 30. One constraint to know: starting with Databricks Runtime 18.0, logRetentionDuration must be greater than or equal to deletedFileRetentionDuration, otherwise you get a validation error.

Enable Minor Log Compaction (Delta 3.0+)

Delta 3.0 introduced Minor Log Compaction, which combines multiple sequential JSON commits into a single consolidated file:

_delta_log/00000000000000000100.00000000000000000200.compacted.json

This dramatically reduces the file count DESCRIBE HISTORY has to work through. It is enabled by default in modern runtimes, but you can explicitly control it with:

spark.conf.set(
  "spark.databricks.delta.deltaLog.minorCompaction.useForReads", "true"
)

Use Unity Catalog system tables instead

For systematic auditing, querying system.access.audit is significantly faster than DESCRIBE HISTORY because it is a pre-optimized Delta table, not a raw JSON parse:

SELECT
  event_time,
  user_identity.email,
  action_name,
  request_params
FROM system.access.audit
WHERE request_params['table_full_name'] = 'my_catalog.my_schema.my_table'
ORDER BY event_time DESC;

Similarly, system.query.history gives you execution metrics and durations for writes without ever touching the transaction log.

Upgrade driver memory

When you cannot avoid querying large histories, switch to a memory-optimized driver instance. On AWS, migrating from m5.xlarge to r5.4xlarge gives the JVM enough heap to aggregate millions of records without hitting OOM.

Medallion Architecture for high-frequency pipelines

If your pipeline runs MERGE operations frequently against a table that also gets queried, the pattern that works is to ingest raw streaming data as append-only writes into a Bronze table, then run a scheduled bulk MERGE on an hourly cadence into Silver or Gold. This keeps downstream tables clean while the Bronze table handles the commit volume.

Also worth looking at: Deletion Vectors (available in modern Databricks runtimes), which mark rows as logically deleted via compressed bitmap files instead of rewriting the entire Parquet file on every UPDATE or MERGE. This cuts AddFile and RemoveFile churn in the JSON commits significantly.

What I would do differently
If I were designing a high-frequency Kafka-to-Delta pipeline today, I would set a 7-day log retention from day one, enable Minor Log Compaction, route all compliance auditing to Unity Catalog system tables rather than DESCRIBE HISTORY, and extend the streaming trigger to at least 5 minutes unless the downstream business process genuinely needs sub-minute freshness. The transaction log bloat problem is much easier to prevent than to fix after the fact.

Developing Scalable Data Pipelines with Modern Tools

Daniel Capelari — Sun, 26 Apr 2026 19:54:40 +0000

Building efficient data pipelines is crucial for companies that handle large volumes of data. The ability to process and analyze that data quickly and at scale can be a deciding factor in decision-making. However, building pipelines that meet a company's needs can be challenging, especially when dealing with diverse data sources and varied data formats.

Introduction to Modern Tools

Modern data processing tools such as Apache Beam and Apache Spark offer solutions to these challenges. These technologies let data engineers build scalable, efficient data pipelines that can handle large volumes of data quickly and reliably.

Data Pipeline Architecture

A well-designed data pipeline architecture should account for the data source, processing, storage, and analysis. Apache Beam, for example, lets developers build pipelines that can run in different environments, such as Google Cloud Dataflow, Apache Flink, and Apache Spark.

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


class ProcessarDados(beam.DoFn):
    """Hypothetical parser: assumes each input line has the form 'key,value'."""

    def process(self, element):
        chave, valor = element.split(',', 1)
        # GroupByKey requires (key, value) pairs
        yield chave, valor


# Define the pipeline options
options = PipelineOptions()

# Create the pipeline
with beam.Pipeline(options=options) as p:
    # Read the data from a source
    dados = p | ReadFromText('dados.txt')

    # Process the data into (key, value) pairs
    dados_processados = dados | beam.ParDo(ProcessarDados())

    # Group the data by key
    dados_agrupados = dados_processados | beam.GroupByKey()

    # Write the processed data to a file
    dados_agrupados | WriteToText('dados_processados.txt')

Performance Considerations

To ensure pipeline performance, it is important to consider factors such as parallel processing, memory usage, and algorithm optimization. Apache Spark, for example, offers features such as automatic parallelization of processing and query optimization to improve performance.

In Practice

To start building scalable, efficient data pipelines, follow the steps below:

Choose a data processing tool that fits your company's needs, such as Apache Beam or Apache Spark.
Define the pipeline architecture, considering the data source, processing, storage, and analysis.
Implement the pipeline with the chosen tool, taking into account factors such as parallel processing and algorithm optimization.
Test and tune the pipeline to ensure performance and scalability.

In short, building scalable, efficient data pipelines is essential for companies that handle large volumes of data. With modern data processing tools such as Apache Beam and Apache Spark, data engineers can build pipelines that meet the company's needs quickly and reliably. Action for today: explore Apache Beam or Apache Spark and start building a simple data pipeline to better understand how these tools can help your company.

Spark Solved Distributed Compute. QIS Solves Distributed Intelligence.

Rory | QIS PROTOCOL — Thu, 09 Apr 2026 22:45:54 +0000

The Layer Zaharia Solved

When Matei Zaharia introduced Resilient Distributed Datasets in his 2012 NSDI paper, he solved a problem that had bottlenecked every large-scale data pipeline for a decade: how do you keep intermediate results in memory across a cluster without rewriting them to disk between every transformation step?

The answer was RDDs — immutable, partitioned collections that could be rebuilt from lineage rather than replicated eagerly. MapReduce had forced a disk-write-read cycle between every stage. Spark eliminated it. The DAG scheduler could pipeline transformations, coalesce shuffles, and exploit data locality so that compute moved to where the data already sat.

This was not an incremental improvement. It was an architectural insight. And it earned Zaharia the ACM Prize in Computing because the downstream impact was enormous: Spark became the substrate for batch analytics, stream processing, ML pipelines, and graph computation across the industry. Flink, Ray, and Dask each extended the idea in different directions, but they all operate on the same fundamental assumption.

That assumption is the subject of this article.

The Assumption Underneath

Every system in the Spark lineage — including Spark itself, Flink, Ray, Dask, and the various managed query engines — operates on a shared premise:

Raw data exists. It must be processed. Processing requires coordinated compute. Coordinated compute requires cluster management, schema agreement, and shuffle optimization.

This is correct. And for the problems these systems address, it is the right framing. If you have 40 TB of event logs and you need to compute a sessionized funnel, you need distributed compute. You need to partition the data, schedule tasks, manage stragglers, and materialize results. Spark does this brilliantly.

But notice what this framing takes for granted: that the valuable artifact is the raw data, and that intelligence emerges only after you process it centrally (or in coordinated clusters).

What if the intelligence already exists at the edge — and the problem is not processing, but routing?

A Different Layer

Consider a concrete scenario. A hospital in Nairobi has a patient with an unusual drug interaction outcome. The clinician documents the result. That result — the outcome — is roughly 512 bytes of structured data: the drug pair, the patient phenotype cluster, the observed effect, the confidence, the timestamp.

In the Spark paradigm, this outcome would need to be ingested into a centralized data lake, harmonized against a common schema (OMOP, FHIR, whatever the institution uses), joined against the global dataset, and then surfaced through a query or ML pipeline. The infrastructure cost for that pipeline — the cluster, the ETL, the schema mapping, the governance — is why most hospitals in low-resource settings never participate in global evidence networks. They produce outcomes every day. Those outcomes never leave the building.

QIS — Quadratic Intelligence Swarm — operates at a different layer entirely. It does not process raw data. It routes pre-distilled outcome packets by semantic similarity.

The Nairobi clinician's outcome gets emitted as a ~512-byte packet. No raw patient data leaves the facility. No schema harmonization is required. The packet contains only the distilled outcome and enough metadata for similarity matching. QIS routes it to every node in the network whose registered interest profile matches — and those nodes' outcomes route back.

This is not a data processing problem. It is a routing problem. And the distinction matters because the scaling properties are fundamentally different.

Scaling: Linear vs. Quadratic

Spark scales compute linearly. Add N worker nodes, get roughly N times the throughput (minus shuffle overhead, straggler effects, and the usual distributed systems tax). This is good. Linear scaling is what enabled Spark to handle petabyte workloads.

QIS scales intelligence quadratically. N nodes in a QIS network produce N(N-1)/2 unique synthesis paths — every pair of nodes can generate a novel insight from the combination of their outcomes. The intelligence function is:

I(N) = Θ(N²)

And the cost per node is:

C = O(log N)   [O(1) achievable with locality-aware routing]

This is not marketing language. It is a direct consequence of the architecture. Each node emits a ~512-byte outcome packet. The routing layer matches packets by semantic similarity. Every pair of matched outcomes creates a synthesis opportunity. The number of pairs grows quadratically with N. The cost per node grows logarithmically (or stays constant) because each node only processes its own matches, not the entire network's traffic.

For a distributed systems engineer, the analogy is: imagine if adding a node to your Spark cluster didn't just add linear throughput — it added combinatorial insight from every pairwise interaction with existing nodes. And imagine if the shuffle cost for that interaction was bounded by log N rather than N.

That is what happens when you shift from the compute layer to the routing layer.

Why the Layers Are Different

Let me be precise about what separates these layers, because the distinction is architectural, not just rhetorical.

The Spark layer (distributed compute):

Input: raw data (events, logs, records, streams)
Operation: transformation (map, reduce, join, aggregate, window)
Coordination: DAG scheduling, shuffle, partition management
Output: computed results (tables, models, aggregates)
Scaling unit: compute throughput per node
Failure mode: data loss, straggler delay, shuffle bottleneck

The QIS layer (distributed intelligence routing):

Input: pre-distilled outcome packets (~512 bytes each)
Operation: semantic similarity matching and routing
Coordination: Three Elections (Hiring, The Math, Darwinism)
Output: routed insights, synthesis across matched pairs
Scaling unit: synthesis paths per node (quadratic)
Failure mode: routing misalignment (corrected by Darwinism election)

These are genuinely different layers. Spark needs to know the schema of your data. QIS needs to know the semantic fingerprint of your outcome. Spark coordinates a cluster to process a job. QIS coordinates a network to route outcomes. Spark fails when shuffles get too expensive. QIS fails when similarity matching drifts — and self-corrects through competitive network selection (the Darwinism election).

A useful mental model: Spark is Layer 4-5 in the intelligence stack (transport and processing). QIS is Layer 6-7 (presentation and application of intelligence). They do not compete. They compose.

The Three Elections

For readers unfamiliar with QIS internals, the coordination mechanism is worth examining because it solves the same class of problems that Spark's DAG scheduler solves — just at a different layer.

1. The Hiring Election. Domain experts define what "similar" means for a given context. In a pharmacovigilance network, similarity might weight drug class and patient phenotype heavily. In a climate science network, it might weight geographic region and measurement modality. This is analogous to how a Spark developer defines partitioning keys — except in QIS, the partitioning is semantic rather than structural.

2. The Math Election. Outcomes accumulate through the network. As more nodes contribute outcomes for a given similarity cluster, the aggregate signal strengthens. This is pure mathematics — no central coordinator decides when enough evidence exists. The accumulation is the evidence. Byzantine fault tolerance emerges naturally: a single malicious or erroneous node cannot distort an aggregate of hundreds of independent outcome packets. The math does not require consensus. It requires accumulation.

3. The Darwinism Election. Multiple QIS networks can operate simultaneously over the same node population. Networks that route outcomes more effectively — measured by downstream validation — survive and grow. Networks that route poorly lose participants. This is the self-correction mechanism that prevents drift in the similarity matching over time.

Together, these three elections form a complete loop. Hiring defines the routing criteria. The Math validates through accumulation. Darwinism selects for effective routing configurations. The breakthrough is the loop itself — not any single component.

What This Means for the Spark Engineer

If you have spent years optimizing shuffle operations, tuning partition counts, and fighting data skew, you understand viscerally that the hardest problems in distributed systems are coordination problems. Getting the right data to the right place at the right time is harder than the actual computation.

QIS takes that intuition and applies it one layer up. The "data" is already processed — it is a 512-byte outcome. The "right place" is any node whose interest profile matches semantically. The "right time" is whenever the outcome is emitted. There is no batch window. There is no job scheduling. There is no shuffle.

This is why QIS is protocol-agnostic on transport. It does not care whether outcome packets travel over Kafka, NATS, MQTT, gRPC, REST, Redis Pub/Sub, Apache Pulsar, or even a folder on a shared drive. The routing logic is independent of the transport layer. If you can move 512 bytes from point A to point B, you can participate in a QIS network.

Compare this to Spark, which requires a specific cluster manager (YARN, Mesos, Kubernetes, standalone), a shared storage layer (HDFS, S3, etc.), and compatible serialization formats. These requirements exist because Spark must coordinate compute across nodes. QIS does not coordinate compute. It routes outcomes. The coordination requirements collapse.

Complementary, Not Competing

The strongest deployment architecture uses both layers. A hospital might run Spark (or Ray, or Dask) locally to process its own patient records, generate ML model outputs, and produce distilled outcomes. Those outcomes — 512 bytes each — then enter the QIS routing layer and propagate to every semantically matched node in the network.

Spark produces the insight. QIS routes it.

This is the same pattern that made the internet powerful: the application layer does not compete with TCP/IP. It depends on it. And TCP/IP does not compete with the physical layer. Each layer solves a different problem, and the stack composes.

The engineers who built Spark understood this. Zaharia's insight was not "processing data is important" — everyone knew that. His insight was that the right abstraction (RDDs, lineage-based recovery, in-memory DAG execution) could make distributed compute radically more efficient. The abstraction was the contribution.

QIS is an analogous abstraction for the next layer. The insight is not "routing intelligence is important" — anyone coordinating multi-site studies or federated learning networks already knows that. The insight is that the right abstraction (~512-byte outcome packets, semantic similarity routing, three self-correcting elections) can make distributed intelligence routing radically more efficient. And that efficiency scales quadratically with participation.

The Numbers

A QIS network of 1,000 nodes produces 499,500 unique synthesis paths. A network of 10,000 nodes produces 49,995,000. A network of 100,000 nodes produces 4,999,950,000.

The cost per node at 100,000 participants is O(log 100,000) ≈ 17 routing hops, or O(1) with locality-aware optimization. Each hop processes a 512-byte packet. The total bandwidth per node is measured in kilobytes per second, not gigabytes.
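The figures above can be reproduced with a few lines of arithmetic. The sketch assumes the logarithm is base 2, which the text does not state, and the function name is illustrative.

```python
import math


def synthesis_paths(n):
    """Number of unordered node pairs, N(N-1)/2."""
    return n * (n - 1) // 2


for n in (1_000, 10_000, 100_000):
    print(n, synthesis_paths(n))
# 1000 499500
# 10000 49995000
# 100000 4999950000

print(math.ceil(math.log2(100_000)))  # 17
```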

Compare this to what it would cost to ingest 100,000 sites' raw data into a centralized Spark cluster, harmonize schemas, and run global queries. The infrastructure alone would cost millions. The governance and compliance overhead would take years. And many of those 100,000 sites — the rural clinics, the under-resourced research labs, the field stations — would never participate because they cannot afford the ETL pipeline.

QIS lets them participate with a device that can emit 512 bytes.

The Paradigm Shift

Zaharia showed us: stop moving data to compute. Move compute to data.

QIS shows us: the insight already exists at the edge. Stop trying to centralize it. Route it.

These are the same class of insight — an architectural recognition that the bottleneck is not computation but coordination, and that the right abstraction can eliminate unnecessary movement. Spark eliminated unnecessary disk I/O between stages. QIS eliminates unnecessary centralization of raw data between institutions.

The result in both cases is that participation barriers collapse, scaling properties improve by orders of magnitude, and problems that were previously intractable become tractable.

For distributed systems researchers who spent a career thinking about data locality, shuffle optimization, and partition strategies: the QIS routing layer is the next problem worth studying. The math is clean. The architecture is complete. The scaling is quadratic.

QIS — Quadratic Intelligence Swarm — was discovered by Christopher Thomas Trevethan on June 16, 2025. The protocol is protected by 39 provisional patents. It is free for nonprofits, research institutions, and educational use. Commercial licenses fund humanitarian deployment to underserved regions.

For technical specifications, architecture documentation, and the full mathematical framework, visit the QIS knowledge base.

Published by Rory | QIS Protocol Deep Dives

Your Customer Table Has Duplicates You Can't See With SQL: How I Built a Cross-Platform Identity Resolution Layer for a Dark Kitchen Data Platform

SARAN TEJA MALLELA — Thu, 09 Apr 2026 16:29:22 +0000

I'm Saran Teja Mallela — a data engineer based in Houston, TX. I build batch and streaming pipelines on Azure during the day at URL Systems, and at night I've been working on GhostKitchen: a full data platform for cloud kitchen operations.

GhostKitchen simulates 50 dark kitchens across 10 Texas cities — Houston, Dallas, Austin, San Antonio, and six others — each running 3–5 virtual restaurant brands. Orders flow in from three delivery platforms (Uber Eats, DoorDash, and the kitchen's own app), alongside kitchen IoT sensor readings, delivery GPS pings, and menu change events. The platform processes 7,500+ events per minute across four Kafka topics and lands everything in Delta Lake using a Medallion Architecture.

But here's the thing — none of that engineering mattered until I could answer a deceptively simple question: who is this customer?

The Problem Nobody Warns You About

Let me show you exactly what I mean. Here's a real scenario from my Houston test data:

Uber Eats order, 7:42 PM:

{
  "customer_uid": "ue_cust_48291",
  "customer_email": "Saran.Mallela@Gmail.com",
  "customer_name": "Saran Mallela",
  "total_amount": 18.75
}

DoorDash order, 8:15 PM, same evening:

{
  "dasher_customer_id": "dd_u_73625",
  "customer_email": "saran.mallela@gmail.com",
  "customer_name": "Saran T Mallela",
  "order_value": 22.40
}

OwnApp order, next day:

{
  "user_id": "app_saran_482",
  "customer_email": "SARAN.MALLELA@GMAIL.COM",
  "customer_name": "Saran Teja Mallela",
  "amount_cents": 1650
}

Three orders. Three platforms. Three different customer IDs (customer_uid, dasher_customer_id, user_id). Three email capitalizations. Three name variations. Two currency formats (dollars vs cents). Zero shared keys.

The kitchen thinks it served three different people. It served me three times.

Run a JOIN ON customer_id — zero matches. Try JOIN ON email — the casing kills it. Even after you lowercase everything, 2% of my test events have null emails entirely (I injected those deliberately to simulate real-world incomplete data).

Without resolving this, customer lifetime value is understated by roughly 35%. Cohort analysis breaks. Personalization is impossible. The business literally doesn't know who its best customers are.

Why I Chose Lambda Over Kappa for This

Quick architecture context before the resolution logic.

GhostKitchen uses Lambda Architecture — dual batch + streaming paths. I know the data engineering internet has strong opinions about this, so here's my reasoning:

Orders are stateful. A single order transitions through up to 7 states: placed → confirmed → preparing → ready → picked_up → delivered (or cancelled). Getting exact daily revenue requires knowing each order's final state, which sometimes needs late corrections — a "delivered" status might arrive hours after the order was placed.

The streaming path (Kafka → Spark Structured Streaming → Delta Lake) gives approximate numbers in ~30 seconds. The batch path (Airflow → PySpark → Delta Lake MERGE) recomputes exact numbers overnight. A reconciliation DAG at 02:00 UTC makes batch authoritative.

For identity resolution specifically, this dual-path matters: a customer's first order might arrive via streaming with a null email. A later batch correction fills it in. The batch path catches matches that streaming missed.

I built PulseTrack (a healthcare IoT pipeline on Azure) with pure Kappa, because wearable sensor readings are append-only — a heart rate of 72 bpm at 2:34 PM never gets "corrected." Different data semantics → different architecture. That distinction matters more than any blanket rule about which pattern is "better."

The Data Model That Makes Resolution Possible

Before I could resolve identities, I needed a model that could handle three conflicting schemas without forcing premature alignment.

Silver layer: Data Vault 2.0. This is where the magic happens.

Data Vault separates data into three table types:

Hubs — core business entities with their business keys. One row per unique entity. hub_customer holds the resolved customer identity.
Links — relationships between hubs. link_order_customer connects which customer placed which order.
Satellites — descriptive attributes with full history (SCD Type 2). sat_customer_profile tracks name, email, and platform IDs over time.

Why not Star Schema in Silver? Because Uber defines a customer as customer_uid + total_amount (float). DoorDash uses dasher_customer_id + order_value (float). OwnApp uses user_id + amount_cents (integer). Star Schema would force me to pick one representation and lose the others. Data Vault keeps every source's raw attributes in separate satellites while unifying identity in hubs.
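Even with raw attributes kept per source, the hub load still has to read the identity fields out of each platform's shape. A minimal sketch of that mapping in plain Python, using the field names from the sample payloads earlier in the article. The function name and the output keys are hypothetical, the platform labels are borrowed from the audit table later in the article, and rounding dollar amounts to integer cents is an assumption about how the two currency formats would be reconciled.

```python
def to_common_record(platform: str, event: dict) -> dict:
    """Map one platform's order payload onto shared identity and amount fields."""
    if platform == "uber_eats":
        platform_id = event["customer_uid"]
        cents = round(event["total_amount"] * 100)  # dollars (float) -> cents
    elif platform == "doordash":
        platform_id = event["dasher_customer_id"]
        cents = round(event["order_value"] * 100)  # dollars (float) -> cents
    elif platform == "own_app":
        platform_id = event["user_id"]
        cents = event["amount_cents"]  # already integer cents
    else:
        raise ValueError(f"unknown platform: {platform}")
    return {
        "platform": platform,
        "platform_id": platform_id,
        "customer_email": event.get("customer_email"),
        "customer_name": event.get("customer_name"),
        "amount_cents": cents,
    }


uber = to_common_record("uber_eats", {
    "customer_uid": "ue_cust_48291",
    "customer_email": "Saran.Mallela@Gmail.com",
    "customer_name": "Saran Mallela",
    "total_amount": 18.75,
})
print(uber)
```

The original payload would still be stored unchanged in its own satellite; this mapping only feeds the columns that identity resolution needs.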

Gold layer: Star Schema. Once identities are resolved, analysts need simple queries: SELECT sum(order_total) FROM fact_order JOIN dim_kitchen GROUP BY city. Star makes that a 2-table join. Dimensions are flat (kitchen, brand, delivery zone) — no deep hierarchies that would justify a Snowflake Schema. PII stays locked in Silver; only hashed surrogate keys propagate to Gold.

The Resolution Pipeline: Four Stages

Stage 1: Email Normalization

Strip whitespace, lowercase everything, remove dots before the @ in Gmail addresses (Gmail ignores dots — saran.mallela@gmail.com and saranmallela@gmail.com deliver to the same inbox).

After normalization, Saran.Mallela@Gmail.com, saran.mallela@gmail.com, and SARAN.MALLELA@GMAIL.COM all become saranmallela@gmail.com.

This alone resolves about 82% of cross-platform matches in my test data.

Stage 2: Deterministic Grouping via MD5 Hash

Hash the normalized email with MD5. All records sharing the same hash receive the same customer_hk (hash key) in the Data Vault hub.

Why hash instead of a database-generated auto-increment ID? Because three platforms run as independent Spark jobs. There's no central coordinator to hand out sequential IDs. A hash is deterministic — same email always produces the same key, on any node, any platform, independently. No coordination needed.

from pyspark.sql import functions as F

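# Trim and lowercase first so the Gmail check also matches "@Gmail.com" and "@GMAIL.COM",
# then drop dots from the local part of Gmail addresses only.
df = df.withColumn(
    "email_normalized",
    F.regexp_replace(
        F.lower(F.trim(F.col("customer_email"))),
        r"\.(?=.*@gmail\.com$)", ""
    )
)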

df = df.withColumn(
    "customer_hk",
    F.md5(F.col("email_normalized"))
)

After this stage, my three Houston test orders — the pad thai on Uber Eats, the burrito bowl on DoorDash, and the fried chicken on OwnApp — all share one customer_hk. They're one person in the system now.
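The same two steps can be tried without a Spark cluster. The sketch below is a plain-Python stand-in for the normalization and hashing described above, not the pipeline's actual code: the function names are hypothetical, and it assumes emails are UTF-8 strings and that a null email should produce a null key.

```python
import hashlib


def normalize_email(email):
    """Trim, lowercase, and drop dots from the local part of Gmail addresses."""
    if email is None:
        return None
    cleaned = email.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if sep and domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}{sep}{domain}" if sep else cleaned


def customer_hk(email):
    """MD5 hex digest of the normalized email, or None when the email is missing."""
    normalized = normalize_email(email)
    if normalized is None:
        return None
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


emails = ["Saran.Mallela@Gmail.com", "saran.mallela@gmail.com", "SARAN.MALLELA@GMAIL.COM"]
keys = {customer_hk(e) for e in emails}
print(len(keys))  # one distinct key for all three spellings
```

Because the key depends only on the normalized email, each platform job can compute it independently and still land on the same value.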

Stage 3: Fuzzy Matching for the Remaining 18%

What about the 2% with null emails? And the cases where the same person uses different email addresses across platforms?

This is where I added probabilistic matching using Jaro-Winkler string similarity on name + delivery address combinations:

"Saran Mallela" vs "Saran T Mallela" → similarity: 0.94 → match
"Saran Mallela" vs "Sarah Miller" → similarity: 0.68 → no match

I also incorporated Soundex phonetic matching as a fallback — it catches typos and transliterations that string distance alone misses. "Mallela" and "Malela" produce the same Soundex code.
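To see why the two spellings collapse to one code, here is a compact version of the classic American Soundex rules in plain Python. It is an illustrative sketch rather than the implementation used in the pipeline (PySpark also ships a built-in soundex function), and the helper name is an assumption.

```python
# Consonant groups that share a Soundex digit.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or '' if the name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two letters with the same digit
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated digit after it is kept
    return (code + "000")[:4]


print(soundex("Mallela"), soundex("Malela"))
```

The doubled "l" is coded once, so the extra letter does not change the result.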

The confidence threshold is 0.88. Above that → automatic merge. Between 0.70 and 0.88 → flagged for manual review. Below 0.70 → treated as distinct customers.
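The three bands translate into a small routing function. This is a sketch under one assumption the text leaves open: scores exactly at 0.88 or 0.70 are treated as belonging to the higher band. The function and label names are hypothetical.

```python
AUTO_MERGE_THRESHOLD = 0.88
REVIEW_THRESHOLD = 0.70


def route_match(confidence: float) -> str:
    """Decide what to do with a candidate pair based on its similarity score."""
    if confidence >= AUTO_MERGE_THRESHOLD:
        return "auto_merge"
    if confidence >= REVIEW_THRESHOLD:
        return "manual_review"
    return "distinct"


for score in (0.94, 0.75, 0.68):
    print(score, route_match(score))
```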

Stage 4: Confidence Scoring and Audit Trail

Every single match gets a confidence score and a matched_via field:

customer_hk	platform	platform_id	email	matched_via	confidence
a3f8c2...	uber_eats	ue_cust_48291	saran.mallela@gmail.com	email_hash	1.00
a3f8c2...	doordash	dd_u_73625	saran.mallela@gmail.com	email_hash	1.00
a3f8c2...	own_app	app_saran_482	saran.mallela@gmail.com	email_hash	1.00
b7d1e9...	uber_eats	ue_cust_91034	NULL	fuzzy_name_addr	0.91

The audit trail is critical. When an analyst questions a number, you can trace it back to exactly why two records were merged and how confident the system was about it. No black boxes.

Stress-Testing With Deliberately Dirty Data

Here's what most identity resolution tutorials skip: if you test on clean data, you're testing your assumptions, not your algorithm.

I injected five types of noise into my data generators:

5% duplicate events — simulating Kafka consumer retries and at-least-once delivery
2% null emails — simulating incomplete user profiles on mobile sign-ups
3% late-arriving events — orders showing up 6–24 hours after they happened
Name variations — middle names, initials, typos (Mallela vs Malela vs mallela)
Address inconsistencies — "Houston, TX" vs "Houston, Texas" vs "HTX"

Against this deliberately hostile test data, the resolution pipeline achieves 94% match accuracy across 200 customer identities.

The remaining 6%? Documented as known unresolvable ambiguity — cases where two records have no email, different names, and addresses that are close but not close enough. That's not a failure. That's honest engineering. Because 100% accuracy in identity resolution doesn't exist. You're always trading precision for recall, and pretending otherwise is how you end up merging two different people into one profile.

Handling Late-Arriving Data

The 3% late-arriving events deserve their own section because they interact with identity resolution in a way that isn't obvious.

The streaming path uses a 24-hour watermark. Any event older than 24 hours gets routed to a Dead Letter Queue (DLQ). The nightly reconciliation DAG picks up DLQ events and reprocesses them through the batch path.
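The routing rule itself is simple enough to sketch in plain Python. The names here are hypothetical, and the `now` argument is a simplification: Spark derives its watermark from the maximum event time it has observed, not from the wall clock.

```python
from datetime import datetime, timedelta, timezone

WATERMARK = timedelta(hours=24)


def route_event(event_time: datetime, now: datetime) -> str:
    """Return 'stream' for events inside the watermark and 'dlq' for older ones."""
    return "dlq" if now - event_time > WATERMARK else "stream"


now = datetime(2026, 4, 9, 2, 0, tzinfo=timezone.utc)
print(route_event(now - timedelta(hours=6), now))   # late, but inside the watermark
print(route_event(now - timedelta(hours=30), now))  # too old for the streaming path
```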

Why does this matter for identity? Imagine this sequence:

8:00 PM — An order arrives via streaming with a null email. Gets assigned a new customer_hk as an unmatched record.
2:00 AM — A batch correction arrives with the email filled in.
Reconciliation — Batch path re-runs identity resolution, discovers this "new" customer actually matches an existing one. Delta Lake MERGE updates the Gold layer. The temporary unmatched record gets absorbed into the correct customer profile.

Without the batch path, that customer remains a phantom forever — inflating your customer count and deflating per-customer metrics. This is why Lambda exists for this use case. It's not about speed. It's about eventual correctness.

What I'd Do Differently in Production

Three things I'd add for a production system at scale:

1. Graph-based resolution. My current approach is pairwise — compare record A to record B. A graph database would let me do transitive matching: if A matches B, and B matches C, then A matches C — even if A and C share zero attributes. I actually built this pattern in PulseTrack, where patient identity resolution chains 7 identifier types in a multi-hop graph: device_account → email → MRN → pharmacy_id → insurance_id → phone_hash → ssn_hash.

2. ML-based scoring. Replace hand-tuned Jaro-Winkler with a trained classifier that learns match weights from confirmed true matches. The 0.88 threshold is hand-tuned — a model could optimize it per-field and adapt as data distributions change.

3. Real-time resolution in the streaming path. Currently, fuzzy matching only runs in the batch path because it's computationally expensive. With a pre-computed blocking index in Redis, the streaming path could do approximate matching at ingestion time instead of waiting for the nightly batch.
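The transitive matching described in the first item does not strictly need a graph database to prototype. A union-find structure over pairwise matches gives the same "A matches B, B matches C, so A matches C" behaviour. This is a minimal sketch with hypothetical names, not the PulseTrack implementation.

```python
def resolve_clusters(records, matches):
    """Group records into identity clusters given pairwise (a, b) matches."""
    parent = {r: r for r in records}

    def find(x):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())


# A and C never match directly, but both match B.
print(resolve_clusters(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

One caveat worth keeping in mind: transitive merging also propagates mistakes, so a single false-positive pair can join two otherwise unrelated clusters.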

Try It Yourself

The entire platform is open source and runs locally with one command:

docker-compose up -d

GitHub: github.com/Nerdboss-stm/ghostkitchen
Live demo: ghostkitchen-portfolio.vercel.app

The live demo lets you run the full pipeline, explore the schema (including the identity resolution layer), browse the Data Vault Silver and Star Schema Gold tables, and see 43 automated data quality checks per pipeline run.

The Gold layer has fact_order (2,274 rows), fact_sensor_hourly, dim_customer (576 resolved profiles from ~800 raw platform records), dim_kitchen, dim_brand, dim_date, dim_time, and more — all with full data lineage from source to serving.

If you work on multi-source identity problems — in food delivery, retail, fintech, healthcare, or any multi-marketplace domain — I'd genuinely like to hear how you're solving it. What matching strategies work for your data? Where do you draw the precision/recall line?

I'm Saran Teja Mallela — data engineer, University of Houston MS grad (Data Science, 4.0 GPA), currently building data platforms at URL Systems in Houston. You can find me on LinkedIn or GitHub.

I write about data architecture decisions, not just code. Because the hardest part of engineering is never the syntax — it's deciding what "good enough" means.

🚀 Apache Spark Just Killed the Microbatch Barrier (And Why Flink Should Be Worried)

Siddhesh Surve — Wed, 18 Mar 2026 02:30:02 +0000

If you've spent any time working in Big Data and Cloud Computing, you know the classic dilemma: Throughput vs. Latency.

Historically, if you needed high-throughput ETL processing, you spun up Apache Spark. But if you needed ultra-low-latency, real-time event streaming (like fraud detection or live telemetry), you had to build an entirely separate architecture using something like Apache Flink.

That era is officially over.

Databricks just detailed the architectural changes behind Apache Spark 4.1’s new Real-Time Mode (RTM), and it is a massive paradigm shift. Spark Structured Streaming can now achieve millisecond-level latencies, effectively eliminating the need to maintain two separate streaming engines.

Here is a breakdown of how Databricks broke the microbatch barrier, the clever architecture behind it, and why this is a game-changer for data engineering.

🛑 The Problem with Microbatches

Spark’s legacy superpower was the microbatch architecture. It gathers a chunk of data, processes it, writes state to object storage (for fault tolerance), and spits it out. This is incredible for high-throughput because it amortizes overhead and utilizes hardware efficiently.

So, why not just make the batches smaller to get lower latency?

Because of the fixed costs. Every microbatch carries fixed overhead: planning the batch, task serialization, scheduling, and writing state/logs to durable storage. If you shrink a batch to 10ms, the fixed overhead might still take 500ms. You hit a mathematical wall where smaller batches actually increase end-to-end latency.
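A toy model makes the wall visible. It assumes a constant arrival rate, a fixed per-batch overhead, and a small per-record cost. All numbers and function names below are made up for illustration and are not measurements from Spark.

```python
def steady_state_cycle(trigger_s, fixed_s, per_record_s, rate_per_s):
    """Length of one microbatch cycle once the stream has settled (toy model)."""
    busy_fraction = per_record_s * rate_per_s  # share of time spent on per-record work
    if busy_fraction >= 1:
        raise ValueError("the stream cannot keep up at any batch size")
    # A cycle must be long enough to process everything that arrived during it:
    # cycle >= fixed_s + busy_fraction * cycle  =>  cycle >= fixed_s / (1 - busy_fraction)
    return max(trigger_s, fixed_s / (1 - busy_fraction))


def mean_latency(trigger_s, fixed_s, per_record_s, rate_per_s):
    """Average wait for the next batch plus that batch's processing time."""
    cycle = steady_state_cycle(trigger_s, fixed_s, per_record_s, rate_per_s)
    processing = fixed_s + per_record_s * rate_per_s * cycle
    return cycle / 2 + processing


for trigger in (1.0, 0.1, 0.01):
    latency = mean_latency(trigger, fixed_s=0.5, per_record_s=1e-5, rate_per_s=10_000)
    print(f"trigger {trigger:>5}s -> mean latency {latency:.3f}s")
```

In this model, once the trigger interval drops below the time a batch needs, the cycle length is set by the fixed overhead, so a 10 ms trigger gives the same latency as a 100 ms one.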

🧠 The Hybrid Execution Solution

To solve this, the Databricks engineering team couldn't just tweak settings; they had to fundamentally evolve how Spark handles data flow. They introduced a Hybrid Execution Model built on three core pillars:

1. Longer Epochs, Continuous Flow

Instead of chopping data into tiny microbatches, RTM uses longer duration epochs. However, it changes how data behaves inside that epoch. Instead of waiting for a batch to fill, data streams continuously through the stages without blocking. The epoch boundary essentially becomes a checkpoint interval for fault tolerance, rather than a processing bottleneck.

2. Concurrent Processing Stages

In traditional Structured Streaming, stages ran sequentially. Reducers sat idle, waiting for mappers to completely finish their jobs.
With RTM, stages are concurrent. As soon as a mapper processes a row and generates a shuffle file, the reducer starts processing it immediately. No more waiting.

3. Non-Blocking Operators

Classic batch operators love to buffer. A groupBy aggregation would traditionally buffer all records, pre-aggregate, and emit at the very end. RTM introduces non-blocking operators that minimize buffering, emitting results continuously as data flows through the pipeline.

💻 What This Looks Like for Developers

The beauty of this update is that you don't need to learn a new framework or rewrite your complex business logic in Flink. You just flip the switch on your existing Structured Streaming jobs.

Here is a conceptual example of how you might configure a continuous, ultra-low latency stream in Spark 4.1:

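from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object, struct, to_json, window

# Initialize Spark Session with RTM enabled.
# The config key is part of this conceptual sketch; confirm the exact name in the Spark 4.1 docs.
spark = (
    SparkSession.builder
    .appName("UltraLowLatencyFraudDetection")
    .config("spark.sql.streaming.realTimeMode.enabled", "true")
    .getOrCreate()
)

# Read from a high-throughput source like Kafka
transactions = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("subscribe", "financial-transactions")
    .load()
)

# Apply your existing business logic.
# Keep the Kafka record timestamp and pull account_id out of the payload
# (assumes the message value is JSON with an account_id field).
fraud_alerts = (
    transactions
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .filter(col("payload").contains("suspicious_pattern"))
    .withColumn("account_id", get_json_object(col("payload"), "$.account_id"))
    .groupBy(window(col("timestamp"), "1 second"), col("account_id"))
    .count()
)

# Write the stream. The Kafka sink expects a "value" column and a checkpoint location
# (the path below is a placeholder).
query = (
    fraud_alerts
    .select(to_json(struct("window", "account_id", "count")).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("topic", "fraud-alerts")
    .option("checkpointLocation", "/tmp/checkpoints/fraud-alerts")
    .outputMode("update")  # aggregations without a watermark cannot use append mode
    .trigger(continuous="100 milliseconds")  # The magic happens here (conceptual; see the RTM docs for the actual trigger)
    .start()
)

query.awaitTermination()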

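With RTM enabled, the intent is that standard Structured Streaming code like this processes records with sub-100ms latency, bypassing the traditional microbatch blocking phases. Treat the snippet as conceptual: trigger(continuous=...) is the experimental continuous-processing trigger that has existed since Spark 2.3 and supports only map-like operations, so check the Spark 4.1 documentation for the exact RTM trigger and configuration key before copying it.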

🌍 The Big Picture

Over on the AI Tooling Academy channel, we talk constantly about simplifying architectures. Managing a Lambda architecture (maintaining both a batch layer and a speed layer) has always been an expensive, operational nightmare.

Databricks benchmarked this new real-time mode against Flink for feature engineering workloads, and Spark actually outperformed it in many scenarios.

We are finally entering an era of unified data processing. If you are handling large-scale telemetry, live AI feature extraction, or financial feeds, you no longer have to choose between throughput and latency.

Are you planning to migrate your Flink workloads back to Spark now that RTM is here? Let's debate it in the comments below! 👇

Should you join Data Engineering? A guide to the tools you'll use

Collins Njeru — Mon, 16 Mar 2026 12:11:50 +0000

Introduction

Many aspiring technologists find themselves at a crossroads: is data engineering the right career path for me? The hesitation often comes from uncertainty about the tools and technologies involved. This article breaks down the core categories of data engineering tools, giving you a clear picture of what you’ll be working with if you decide to join the field.

Core categories of data engineering tools

1. Data Ingestion & Integration

Data engineering starts with collecting information from multiple sources.

Fivetran / Stitch / Hevo Data: Automate extraction from SaaS apps and databases.

Apache Kafka: Real-time streaming and event-driven pipelines.

Apache NiFi: Flow-based ingestion and routing.

2. Data Storage & Warehousing

Once data is ingested, it needs a reliable home.

Snowflake: Cloud-native warehouse with scalability.

Google BigQuery: Serverless, highly scalable analytics warehouse.

Amazon Redshift: AWS-based warehouse optimized for queries.

3. Data Processing & Transformation

Raw data must be cleaned and transformed before use.

Apache Spark: Distributed computing for batch and streaming.

Hadoop: Large-scale storage and batch processing.

dbt (Data Build Tool): SQL-based transformations for analytics teams.

4. Workflow & Orchestration

Pipelines need automation and scheduling.

Apache Airflow: Workflow automation and DAG scheduling.

Prefect / Luigi: Alternatives for managing complex workflows.

5. Infrastructure & Deployment

Behind the scenes, infrastructure ensures scalability.

Docker & Kubernetes: Containerization and orchestration.

Terraform: Infrastructure as Code for cloud resources.

6. Monitoring & Quality

Data must be trustworthy and pipelines reliable.

Great Expectations: Data validation and quality checks.

Datadog / Prometheus: Monitoring pipelines and infrastructure.

Key Considerations

Scalability: Spark and Snowflake excel with large datasets.

Real-Time vs Batch: Kafka is unmatched for streaming; Hadoop and Spark dominate batch workloads.

Cloud Integration: Align tools with your provider (AWS Redshift, GCP BigQuery, Azure Synapse).

Cost: Open-source tools are free but require setup; managed services reduce overhead but add licensing costs.

Conclusion

Joining data engineering means stepping into a field where you’ll design the backbone of modern businesses. The tools may seem overwhelming at first, but each one solves a specific problem; together, they form a powerful toolkit. If you’re excited about building systems that move, store, and transform data at scale, then data engineering isn’t just a career option; it’s a future-proof calling.

Using Gravitino with Apache Spark for ETL

Yue @ Datastrato (Admin) — Thu, 05 Feb 2026 23:24:45 +0000

Author: Minghuang Li

Last Updated: 2026-01-31

Overview

In this tutorial, you will learn how to use Apache Gravitino with Apache Spark for ETL (Extract, Transform, Load) operations. By the end of this guide, you'll be able to build data pipelines that seamlessly access multiple heterogeneous data sources through a unified catalog interface.

What you'll accomplish:

Configure Gravitino Spark Connector to enable unified access to multiple data sources in Spark
Register multiple catalogs including MySQL and Iceberg in Gravitino for federated access
Build an ETL pipeline that extracts data from MySQL, transforms it, and loads it into Iceberg
Execute federated queries across different data sources using Spark SQL and PySpark

Apache Spark is one of the most popular unified analytics engines for large-scale data processing. In a typical ETL pipeline, Spark often needs to interact with multiple heterogeneous data sources (like MySQL, HDFS, S3, Hive, Iceberg). Managing connectivity, credentials, and schema information for these diverse sources can be complex and error-prone.

Apache Gravitino simplifies this by acting as a unified metadata lake. By using the Gravitino Spark Connector, you can access multiple data sources through a single catalog interface in Spark, without having to manually configure each source's connection details in your Spark jobs.

Key benefits:

Unified catalog: Access Hive, Iceberg, MySQL, PostgreSQL, and other sources under a unified namespace
Centralized metadata: Metadata is managed in Gravitino, and changes are reflected immediately
Simplified configuration: Configure the Gravitino connector once, and access all managed catalogs
Federated querying: Easily join data across different sources (e.g., join MySQL data with an Iceberg table)

Prerequisites

Before starting this tutorial, you will need:

System Requirements:

Linux or macOS operating system with outbound internet access for downloads
JDK 17 or higher installed and properly configured
Apache Spark 3.3, 3.4, or 3.5 installed

Required Components:

Gravitino server installed and running (see 02-setup-guide/README.md)
MySQL instance for testing JDBC catalog functionality

Optional Components:

HDFS or S3 for Iceberg data storage in production environments

Before proceeding, verify your Java and Spark installation:

${JAVA_HOME}/bin/java -version
${SPARK_HOME}/bin/spark-submit --version

Setup

Step 1: Download Gravitino Spark Connector

You need the Gravitino Spark Connector jar file to enable Spark integration with Gravitino.

Obtain the connector

Download from Maven Central Repository

For Spark 3.5, download the connector from:
gravitino-spark-connector-runtime-3.5

Additional dependencies

For JDBC sources (MySQL, PostgreSQL), you also need the specific JDBC driver jar (e.g., mysql-connector-j for MySQL) in your classpath.

Step 2: Configure Spark Session

To use Gravitino with Spark, you need to configure the specialized Gravitino Spark IO plugin.

Configure Spark SQL with Gravitino

Start Spark SQL with the Gravitino connector

# Set the location of your Gravitino server
GRAVITINO_URI="http://localhost:8090"
# The metalake you want to access
METALAKE_NAME="default_metalake"

spark-sql \
  --packages org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1 \
  --conf spark.plugins=org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin \
  --conf spark.sql.gravitino.metalake=$METALAKE_NAME \
  --conf spark.sql.gravitino.uri=$GRAVITINO_URI \
  --conf spark.sql.gravitino.enableIcebergSupport=true

Configuration notes:

Replace 1.1.0 with the actual version you are using
Ensure the Spark connector version matches your Spark version
Set spark.sql.gravitino.enableIcebergSupport=true to enable Iceberg catalog support

Step 3: Prepare Metadata in Gravitino

Before running ETL jobs, you need to register the catalogs for your data sources in Gravitino. You can do this via the Gravitino REST API or Web UI.

Register MySQL Catalog

Create a MySQL catalog in Gravitino

curl -X POST -H "Content-Type: application/json" -d '{
  "name": "mysql_catalog",
  "type": "relational",
  "provider": "jdbc-mysql",
  "properties": {
    "jdbc-url": "jdbc:mysql://localhost:3306",
    "jdbc-user": "root",
    "jdbc-password": "password",
    "jdbc-driver": "com.mysql.cj.jdbc.Driver"
  }
}' http://localhost:8090/api/metalakes/default_metalake/catalogs

Register Iceberg Catalog

Create an Iceberg catalog in Gravitino

curl -X POST -H "Content-Type: application/json" -d '{
  "name": "iceberg_catalog",
  "type": "relational",
  "provider": "lakehouse-iceberg",
  "properties": {
    "warehouse": "file:///tmp/iceberg-warehouse",
    "catalog-backend": "jdbc",
    "uri": "jdbc:mysql://localhost:3306/iceberg_metadata",
    "jdbc-driver": "com.mysql.cj.jdbc.Driver",
    "jdbc-user": "root",
    "jdbc-password": "password",
    "jdbc-initialize": "true"
  }
}' http://localhost:8090/api/metalakes/default_metalake/catalogs

Note: This example uses a local file system for Iceberg data storage. For production environments, consider using HDFS or S3. For more detailed Iceberg catalog configuration options, see 03-iceberg-catalog/README.md.
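If you would rather script the registration than run curl by hand, the same request can be built with Python's standard library. This is a sketch, not part of the Gravitino tooling: the helper names are hypothetical, the endpoint, metalake name, and properties are copied from the curl commands above, and it assumes the server answers with a JSON body.

```python
import json
import urllib.request

GRAVITINO_URI = "http://localhost:8090"
METALAKE = "default_metalake"


def build_catalog_request(name, provider, properties):
    """Build the POST request that registers a relational catalog."""
    body = json.dumps({
        "name": name,
        "type": "relational",
        "provider": provider,
        "properties": properties,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{GRAVITINO_URI}/api/metalakes/{METALAKE}/catalogs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register_catalog(name, provider, properties):
    """Send the request and return the parsed response (assumed to be JSON)."""
    with urllib.request.urlopen(build_catalog_request(name, provider, properties)) as resp:
        return json.load(resp)


mysql_request = build_catalog_request("mysql_catalog", "jdbc-mysql", {
    "jdbc-url": "jdbc:mysql://localhost:3306",
    "jdbc-user": "root",
    "jdbc-password": "password",
    "jdbc-driver": "com.mysql.cj.jdbc.Driver",
})
print(mysql_request.get_method(), mysql_request.full_url)
```

Calling register_catalog("mysql_catalog", "jdbc-mysql", {...}) with the same properties sends the request to a running Gravitino server. Building the request separately keeps the payload easy to inspect before anything goes over the network.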

Step 4: Build an ETL Pipeline from MySQL to Iceberg

In this scenario, we will extract user data from a MySQL database, perform some transformations, and load it into an Apache Iceberg table for analytical queries, all managed through Gravitino.

Verify Catalogs in Spark

1. Start your Spark SQL session

Use the configuration from Step 2 to start your Spark SQL session.

2. Verify catalog visibility

-- Due to Spark catalog manager limitations, SHOW CATALOGS only displays 'spark_catalog' initially
SHOW CATALOGS;

-- Use a Gravitino-managed catalog to make it visible
USE mysql_catalog;
USE iceberg_catalog;

-- Now both catalogs are visible in the output
SHOW CATALOGS;

Note: The SHOW CATALOGS command initially only displays the Spark default catalog (spark_catalog). After explicitly using a Gravitino-managed catalog with the USE command, that catalog becomes visible in subsequent SHOW CATALOGS output.

Prepare Sample Data in MySQL

1. Create a sample database and table

-- Switch to MySQL catalog
USE mysql_catalog;

-- Create a sample database
CREATE DATABASE IF NOT EXISTS users_db;
USE users_db;

-- Create a users table
CREATE TABLE IF NOT EXISTS users (
  id INT,
  username STRING,
  email STRING,
  status STRING,
  created_at TIMESTAMP
);

2. Insert sample data

-- Insert sample data
INSERT INTO users VALUES 
  (1, 'Alice', 'alice@example.com', 'active', TIMESTAMP '2024-01-15 10:00:00'),
  (2, 'Bob', 'bob@example.com', 'active', TIMESTAMP '2024-02-20 14:30:00'),
  (3, 'Charlie', 'charlie@example.com', 'inactive', TIMESTAMP '2024-03-10 09:15:00'),
  (4, 'Diana', 'diana@example.com', 'active', TIMESTAMP '2024-04-05 16:45:00'),
  (5, 'Eve', 'eve@example.com', 'inactive', TIMESTAMP '2024-05-12 11:20:00');

-- Verify the data
SELECT * FROM users;

Extract Data from MySQL

Verify data extraction

-- Read data from MySQL
SELECT * FROM mysql_catalog.users_db.users LIMIT 10;

Transform and Load Data to Iceberg

1. Create an Iceberg table

-- Switch to Iceberg catalog
USE iceberg_catalog;
CREATE DATABASE IF NOT EXISTS analytics;

CREATE TABLE IF NOT EXISTS analytics.active_users (
  user_id INT,
  username STRING,
  email STRING,
  created_at TIMESTAMP
) USING iceberg;

2. Execute ETL query

-- ETL Query: Insert into Iceberg from MySQL with transformation
INSERT INTO analytics.active_users
SELECT 
  id as user_id, 
  LOWER(username) as username, 
  LOWER(email) as email, 
  created_at 
FROM mysql_catalog.users_db.users 
WHERE status = 'active';

Note: For JDBC catalogs (like MySQL), operations UPDATE, DELETE, and TRUNCATE are NOT supported. Only SELECT and INSERT are supported.

Verify ETL Results

Query the target Iceberg table

SELECT count(*) FROM analytics.active_users;
SELECT * FROM analytics.active_users LIMIT 5;

PySpark Example

If you prefer using Python, the logic is very similar using the DataFrame API.

Configure PySpark Session

Create a PySpark session with Gravitino connector

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("GravitinoSparkETL") \
    .config("spark.jars.packages", "org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1") \
    .config("spark.plugins", "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin") \
    .config("spark.sql.gravitino.metalake", "default_metalake") \
    .config("spark.sql.gravitino.uri", "http://localhost:8090") \
    .config("spark.sql.gravitino.enableIcebergSupport", "true") \
    .getOrCreate()

Execute ETL Pipeline

Read, transform, and write data using DataFrame API

# Read from MySQL
mysql_df = spark.table("mysql_catalog.users_db.users")

# Transform
active_users = mysql_df.filter("status = 'active'") \
    .selectExpr("id as user_id", "lower(username) as username", "lower(email) as email", "created_at")

# Write to Iceberg
active_users.write \
    .format("iceberg") \
    .mode("append") \
    .saveAsTable("iceberg_catalog.analytics.active_users")

print("ETL Job Completed successfully.")

Troubleshooting

Common issues and their solutions:

Connector and classpath issues:

ClassNotFoundException: org.apache.gravitino.spark.connector.GravitinoCatalog: The Gravitino Spark Connector JAR is missing from the classpath. Ensure you added the correct package with --packages or placed the JAR in $SPARK_HOME/jars
Missing JDBC Driver: When connecting to JDBC sources (MySQL/PostgreSQL) via Gravitino, Spark still needs the JDBC driver JARs in its classpath. Add the MySQL/PostgreSQL JDBC driver packages to your Spark startup command (e.g., --packages mysql:mysql-connector-java:8.0.33) or put the jar in jars/ folder

Connection issues:

Connection refused to Gravitino Server: Spark cannot reach the Gravitino server. Check that the Gravitino server is running and that the spark.sql.gravitino.uri config is correct.
Catalog not found: Ensure the catalogs are properly registered in Gravitino and the metalake name is correct.

Query execution issues:

UPDATE/DELETE not supported on JDBC catalogs: For JDBC catalogs (like MySQL), only SELECT and INSERT operations are supported through Gravitino.
Table not found: Verify that the table name is fully qualified in the format catalog.schema.table.
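
For the "Table not found" case, a small pre-flight check can confirm that a name has all three parts before it reaches Spark. This is only a minimal sketch: split_table_name is a hypothetical helper (not part of Gravitino or PySpark), and it assumes identifiers without embedded dots or backtick quoting.

```python
def split_table_name(name):
    """Split 'catalog.schema.table' into its three parts, or raise ValueError."""
    parts = [p.strip() for p in name.split(".")]
    if len(parts) != 3 or any(not p for p in parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    catalog, schema, table = parts
    return catalog, schema, table


print(split_table_name("mysql_catalog.users_db.users"))
```

Calling it with a two-part name such as users_db.users raises a ValueError, which is usually easier to read than the analysis error Spark would report later.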

Congratulations

You have successfully completed the Gravitino Spark ETL tutorial!

You now have a fully functional Spark environment with Gravitino integration, including:

A configured Gravitino Spark Connector for unified catalog access
Multiple registered catalogs (MySQL and Iceberg) in Gravitino
A working ETL pipeline that extracts, transforms, and loads data across heterogeneous sources
Understanding of federated query capabilities and PySpark integration

Your Spark environment is now ready to leverage Gravitino for unified metadata management across your data ecosystem.

Next Steps

Explore Using Gravitino with Trino for federated querying
Follow and star Apache Gravitino Repository

Apache Gravitino is rapidly evolving, and this article is written based on the latest version 1.1.0. If you encounter issues, please refer to the official documentation or submit issues on GitHub.

Apache Spark vs Apache Hadoop—10 Crucial Differences (2025)

Pramit Marattha — Mon, 17 Nov 2025 03:29:16 +0000

Big data—it's a whole lot to handle, and it's only getting bigger. In just a few years, the amount of data has ballooned, changing how we store, process, and analyze it. To manage all this data, big data frameworks have become a must-have. Apache Hadoop and Apache Spark are two of the biggest names in the game. They're both built for handling massive datasets, but they have different approaches and are better suited for different tasks. Apache Hadoop came first, starting the big data revolution by providing an affordable way to store massive datasets (via Hadoop Distributed File System (HDFS)) and process them in batches (via Hadoop MapReduce). Spark arrived later, building on Hadoop's strengths and focusing on speed and versatility, especially with its in-memory capabilities. But here's the thing—Hadoop and Spark aren't always competitors; often, they work together.

In this article, we'll break down the 10 key differences between Apache Spark and Apache Hadoop. We'll dig into their guts—architecture, speed, ecosystems, and more—so you can figure out what works for your needs. Batch processing? Real-time analytics? Machine learning? We've got you covered.

So, What Exactly is Apache Hadoop?

Alright, let's talk about Apache Hadoop. Apache Hadoop is an open source big data processing framework. It's designed to tackle a specific challenge: efficiently storing and processing huge datasets across clusters of computers. We're talking massive amounts of data here—from gigabytes to terabytes to petabytes. What makes Apache Hadoop unique is its ability to use clusters of regular, off-the-shelf hardware, rather than requiring a single high-powered (and expensive) machine.

What is Apache Hadoop, Really?

Apache Hadoop is built for distributed computing. It breaks down big data problems into smaller pieces and distributes the work across many machines, processing them in parallel. Because of this, handling huge amounts of data is faster and more manageable.
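Apache Hadoop isn't just one thing; it's a collection of modules working together. The main ones you'll hear about are:

Hadoop Distributed File System (HDFS): distributed storage
YARN (Yet Another Resource Negotiator): cluster resource management
Hadoop MapReduce: batch processing
Hadoop Common: shared utilities and libraries

We'll go over these in further detail later.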

Apache Hadoop Features

So, why did Apache Hadoop become so popular for big data? It boils down to these key features derived from its architecture:

1) Open Source Framework
Apache Hadoop’s source code is freely available. It is fully open sourced (licensed under Apache 2.0). You can modify it to fit your project’s needs without paying licensing fees.

2) It's Built for Scale (Scalability)
Apache Hadoop is fundamentally designed to scale horizontally. You can increase the cluster's storage and processing capacity by adding more commodity hardware machines (nodes).

3) Handles Hardware Failure Smoothly (Fault Tolerance)
Hadoop is designed to handle hardware failures within large clusters.
Data Resilience — The Hadoop Distributed File System (HDFS) automatically replicates data blocks (typically 3 times by default) across different nodes and racks. If a node fails, data remains accessible from other replicas.
Computation Resilience — The cluster resource manager, YARN (Yet Another Resource Negotiator), monitors running tasks. If a node executing a task fails, YARN can reschedule that task on a healthy node.

4) High Data Availability
Apache Hadoop's replication and distributed storage mean that your data stays accessible even when individual nodes fail. The system also tries to assign tasks to nodes that hold the data they need.

5) Distributed Storage and Processing
Apache Hadoop processes data where it is stored by using the Hadoop Distributed File System (HDFS) for storage and Apache Hadoop MapReduce for computation.

6) Stores All Kinds of Data (Flexibility)
Apache Hadoop doesn't force your data into a rigid structure beforehand. Apache Hadoop accepts structured data (like from databases), semi-structured data (like XML or JSON files), or completely unstructured data (like text documents or images). You don’t have to convert or predefine schemas before storing your data, giving you the freedom to work with a variety of formats.

7) High Throughput Batch Processing
Hadoop is optimized for high throughput on very large datasets by distributing data and processing tasks across many nodes in parallel. It excels at large-scale batch processing workloads such as ETL, log analysis, and data mining, and can handle vast amounts of data efficiently.

8) Rich Ecosystem
Aside from its fundamental components (HDFS, YARN, MapReduce, and Common Utilities), Hadoop is supported by a large ecosystem of complementary projects that provide higher-level services and tools. These include Apache Hive (SQL interface), Apache Pig (data flow scripting), Apache HBase (NoSQL database), Apache Spark (often used with Hadoop for advanced processing), Apache Sqoop (data import/export), Apache Oozie (workflow scheduling), and many more.

9) Brings Computation to the Data (Data Locality)
Hadoop attempts to move the computation to the data to minimize costly network data transfers. YARN's scheduler, in coordination with HDFS, tries to assign processing tasks to nodes where the required data blocks reside locally, or at least within the same network rack, resulting in dramatically improved performance.
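
The preference order described above (same node first, then same rack, then anywhere) can be sketched in a few lines. This is an illustrative toy, not YARN's actual scheduler; pick_node and its arguments are made-up names, and real schedulers also weigh queues, fairness, and how long to wait for a local slot.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Pick a node for a task, preferring node-local, then rack-local, then any.

    block_replicas: set of nodes holding a replica of the task's input block
    free_nodes:     list of nodes that currently have spare capacity
    rack_of:        mapping of node name -> rack name
    """
    # 1) Node-local: a free node that already stores the block.
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"
    # 2) Rack-local: a free node in the same rack as some replica.
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3) Off-rack: any free node; the block has to cross racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "no capacity"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_node({"n1"}, ["n4", "n1"], rack_of))  # n1 holds the block
print(pick_node({"n1"}, ["n2", "n4"], rack_of))  # n2 shares rack r1 with n1
print(pick_node({"n1"}, ["n3"], rack_of))        # only an off-rack node is free
```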

And What About Apache Spark?

Apache Spark is a different beast. So, what is Apache Spark?

Apache Spark is also an open source analytics engine that can handle large-scale data processing tasks. It's designed for speed, simplicity, and adaptability, making it a popular choice for big data tasks. So, whether you're working with batch processing or real-time analytics, Spark provides a consistent framework that makes these tasks easier. Spark was developed at UC Berkeley in 2009 as a quicker alternative to Hadoop MapReduce architecture, capable of processing jobs up to 100 times faster in memory and 10 times faster on disk.

Spark’s architecture is built around several high‑level abstractions:

Apache Spark Features

Alright, let's look under the hood. What capabilities does Apache Spark bring to the table?

1) Speed
Spark processes data incredibly fast compared to traditional systems like Apache Hadoop. Its in-memory computing reduces disk I/O operations, enabling applications to run up to 100 times faster in memory and significantly faster on disk.

2) Simplicity
Apache Spark simplifies application development by providing APIs in many languages (Java, Python, Scala, and R). Its high-level operators simplify distributed processing tasks.

3) Fault Tolerance
Spark achieves fault tolerance through its primary data abstraction, the Resilient Distributed Dataset (RDD), and by extension, DataFrames/Datasets which are built upon RDDs.

4) Scalability
You can scale Spark horizontally by adding more nodes to your cluster. It handles large datasets efficiently across distributed environments.

5) In-Memory Processing
Spark is not entirely in-memory; rather, it uses memory (caching and persistence) to store intermediate datasets throughout multi-step operations. This is especially useful for iterative algorithms (common in machine learning) and interactive data processing, because it eliminates repeated disk reads. It can spill data to disk when memory runs short.

6) Multi-Language Support
Spark’s APIs support Java, Python, Scala, and R—giving you flexibility in choosing your preferred programming language.

7) Machine Learning Integration
Spark includes Spark MLlib, a library for machine learning tasks like classification, regression, clustering, and collaborative filtering. This makes it ideal for building predictive models directly within the framework.

8) Structured Streaming
Apache Spark Structured Streaming is a high-level, fault-tolerant stream processing engine built on the Spark SQL engine. It treats data streams as continuously appended, unbounded tables, allowing developers to use the same batch-like DataFrame/Dataset API for stream processing, which simplifies the development of end-to-end applications. (This largely supersedes the older RDD-based Spark Streaming/DStreams micro-batching model.)

9) Graph Processing
Spark GraphX (built-in Spark library) enables graph-based computations such as social network analysis or recommendation systems within Spark’s ecosystem.

10) Compatibility
Spark can read from and write to a wide variety of data sources, including:

Distributed file systems: Hadoop Distributed File System (HDFS), Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS).
NoSQL databases: Apache Cassandra, HBase, MongoDB.
Relational databases: Via JDBC/ODBC.
Message queues: Apache Kafka, Flume.
Data formats: Apache Parquet, Avro, ORC, JSON, CSV, text files, sequence files, and more.

It integrates closely with Apache Hive, often leveraging the Hive Metastore for persistent table metadata. It can run on various cluster managers like Standalone, Apache Mesos, Hadoop YARN, and Kubernetes.

Apache Spark’s a compute engine, not a storage system. It often piggybacks on Hadoop Distributed File System (HDFS) or other storage like S3. That’s where Apache Spark vs Apache Hadoop starts to get interesting—they’re not always rivals.

What Is the Difference Between Apache Hadoop and Apache Spark?

Okay, before we dive deep into the differences, here’s a snapshot of Apache Spark vs Apache Hadoop:

Apache Spark vs Apache Hadoop—Head-to-Head Comparison

	Apache Hadoop	Apache Spark
Main Role	Storage (HDFS), Resource Mgmt (YARN), Batch Processing (MapReduce)	Fast, Unified Processing Engine
Architecture	Master-slave (HDFS, YARN, MapReduce)	Driver, Executors, Cluster Manager
Performance	Disk-based, slower	In-memory, up to 100x faster*
Ecosystem	Full-stack platform	Compute-focused, pairs with HDFS
Memory Usage	Low RAM, disk-driven	High RAM, memory-hungry
Languages	Java + streaming APIs	Scala, Java, Python, R, SQL
Cluster Management	YARN	YARN, Mesos, Kubernetes, Standalone
Storage	Includes native distributed storage (HDFS)	Relies on external storage (HDFS, S3, etc.)
APIs / Ease of Use	Files/Blocks (HDFS), Key-Value Pairs (MapReduce)	Resilient Distributed Datasets (RDDs), DataFrames, Datasets
Data Processing	Primarily Batch (MapReduce)	Batch, Interactive SQL, Streaming, ML, Graph
Real-Time Processing	No (MapReduce is batch-only)	Yes (Spark Streaming, Structured Streaming)
Fault Tolerance	HDFS replication, Task retries (YARN/MapReduce)	RDD/DataFrame lineage, Checkpointing (optional)
Security	Robust (Kerberos, Ranger)	Basic, leans on Apache Hadoop’s tools
Machine Learning	Mahout	Spark MLlib, Spark GraphX

Now, let’s break it down piece by piece.

1) Apache Spark vs Apache Hadoop—Architecture Breakdown

Apache Hadoop Architecture

Apache Hadoop's architecture is set up to handle massive amounts of data across distributed clusters. If you're dealing with big data, understanding how Hadoop works can help you store and process information efficiently. Let’s break down its components and how they work together.

➥ Hadoop Distributed File System (HDFS)
HDFS stores your data across multiple machines, splitting files into blocks (default size: 128 MB) and replicating them for fault tolerance. The NameNode (master) tracks where data blocks are stored, while DataNodes (workers) hold the actual data. If a node fails, HDFS automatically uses a replica—no manual intervention needed.
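
To make the block and replica arithmetic concrete, here is a small sketch using the defaults mentioned in this article (128 MB blocks, three replicas). The function name hdfs_footprint is made up for illustration, and the calculation ignores NameNode metadata and any compression.

```python
import math

BLOCK_SIZE_MB = 128   # default HDFS block size cited above
REPLICATION = 3       # default HDFS replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return the block count and raw storage used by a single file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The final block only occupies the bytes it actually holds,
        # so raw usage is the file size times the replication factor.
        "raw_storage_mb": file_size_mb * replication,
    }


print(hdfs_footprint(1024))   # a 1 GiB file
```

A 1 GiB file therefore becomes 8 blocks, 24 block replicas spread over the DataNodes, and about 3 GiB of raw disk usage.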

➥ YARN (Yet Another Resource Negotiator)
YARN manages cluster resources like CPU and memory. It separates processing from resource management, letting you run multiple workloads simultaneously.

ResourceManager (RM): There's usually one global RM. It's the ultimate authority that knows the overall resource availability in the cluster. It decides which applications get resources and when.
NodeManager (NM): Each machine in the cluster runs a NodeManager. It manages the resources on that specific machine and reports back to the ResourceManager. It's also responsible for launching and monitoring the actual tasks.
ApplicationMaster (AM): When you submit a job (an "application" in YARN terms), YARN starts a dedicated ApplicationMaster for it. The AM negotiates resources from the ResourceManager and works with the NodeManagers to get the application's tasks running. It oversees the execution of that specific job.

➥ MapReduce
This processing model splits tasks into smaller chunks. A Map function filters and sorts data, while a Reduce function aggregates results.
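
The classic word count shows the shape of the model. The sketch below runs in a single Python process purely to illustrate the map, shuffle, and reduce steps; it is not Hadoop's Java API, and in a real cluster each phase runs in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    # Keys reach the reducers in sorted order.
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


print(word_count(["spark and hadoop", "hadoop and yarn"]))
```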

➥ Hadoop Common
Shared utilities and libraries (e.g., file system access, authentication) that support other modules. Without this, tools like Hive or Pig couldn’t interact with HDFS.

So, a typical flow looks like this:

You load data into HDFS. It gets broken into blocks and replicated across DataNodes. The NameNode keeps track of everything.
You submit an application (like a MapReduce job or a Spark job) to the YARN ResourceManager.
The ResourceManager finds a NodeManager with available resources and tells it to launch an ApplicationMaster for your job.
The ApplicationMaster figures out what tasks need to run and asks the ResourceManager for resource containers.
The ResourceManager grants containers on various NodeManagers (ideally close to the data needed).
The ApplicationMaster tells the relevant NodeManagers to launch the tasks within the allocated containers.
Tasks read data from HDFS, do their processing (Map, Reduce, or other operations), and write results back to HDFS.
Once the job is done, the ApplicationMaster shuts down, and its resources are released back to YARN.

Apache Spark Architecture

Apache Spark architecture follows a master-worker pattern. Let’s break down how its components interact and why they matter for your data pipelines.

➥ Driver Program
The driver is the control center of a Spark application. When you submit a job, it translates your code into a series of tasks. It creates a SparkContext or SparkSession (the entry point for all operations) and communicates with the cluster manager to allocate resources.

➥ Executors
Executors are worker processes on cluster nodes that run tasks and store data in memory or on disk. Each application gets its own executors, which:

Execute tasks sent by the driver.
Cache frequently accessed data (like RDDs) to speed up repeated operations.
Report task status back to the driver.

The number of executors directly impacts parallelism—more executors mean more tasks can run simultaneously.
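
As a rough back-of-the-envelope sketch of that relationship: assuming each task occupies one core (Spark's default for spark.task.cpus), the number of tasks that can run at once is executors times cores per executor. The function name task_waves and the numbers below are illustrative only.

```python
import math


def task_waves(num_tasks, num_executors, cores_per_executor):
    """Return (slots, waves): concurrent task slots and rounds needed to finish."""
    slots = num_executors * cores_per_executor
    waves = math.ceil(num_tasks / slots)
    return slots, waves


print(task_waves(200, 5, 4))    # 20 slots, so 200 tasks need 10 waves
print(task_waves(200, 10, 4))   # doubling the executors halves the waves
```

This ignores task skew and scheduling overhead, so treat it as an upper bound on the benefit of adding executors.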

➥ Cluster Manager
Spark relies on cluster managers (like Kubernetes, YARN, or Mesos) to allocate CPU, memory, and network resources. The cluster manager launches executors on worker nodes, monitors resource usage, and redistributes workloads if nodes fail.

➥ Worker Nodes
Worker nodes are the machines in the cluster where executors run. Each worker node can host multiple executors, and the tasks are distributed among these executors for parallel processing.

So, a typical flow looks like this:

When a user submits a Spark application, the driver program is launched. The driver communicates with the cluster manager to request resources for the application.
The driver converts the user's code into jobs, which are divided into stages. Each stage is further divided into tasks. The driver creates a logical DAG representing the sequence of stages and tasks.
The DAG scheduler divides the DAG into stages, each containing multiple tasks. The task scheduler assigns tasks to executors based on the available resources and data locality.
Executors run the tasks on the worker nodes, process the data, and return the results to the driver. The driver aggregates the results and presents them to the user.
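
How a job is cut into stages can be sketched with a deliberately simplified model. In Spark, stage boundaries fall where a shuffle is required (a "wide" dependency); the toy below treats a job as a linear chain of operation names and cuts it at each wide operation. The function split_into_stages and the two operation sets are assumptions for illustration; the real DAG scheduler works on a graph of RDD dependencies, and some operations (joins on co-partitioned data, for example) do not always shuffle.

```python
NARROW = {"map", "filter", "flatMap"}            # no data movement between partitions
WIDE = {"groupByKey", "reduceByKey", "join"}     # modeled here as always shuffling


def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting at each shuffle."""
    stages, current = [], []
    for op in ops:
        if op not in NARROW and op not in WIDE:
            raise ValueError(f"unknown operation: {op}")
        if op in WIDE and current:
            # A wide dependency closes the current stage; the next
            # stage starts by reading the shuffled data.
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


job = ["map", "filter", "reduceByKey", "map", "join", "filter"]
for i, stage in enumerate(split_into_stages(job)):
    print(f"stage {i}: {stage}")
```

Each stage is then run as one task per partition, which is the unit the task scheduler hands to executors.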

Check out the following articles for an in-depth analysis:

Apache Spark architecture 101: How Spark works (2026)

Apache Spark 101—its origins, key features, architecture and applications in big data, machine learning and real-time processing.

flexera.com

Comparing Apache Spark alternatives: Storm, Flink, Hadoop and more

Find out the top 7 Apache Spark alternatives that provide fast, fault-tolerant processing for modern real-time and batch workloads.

flexera.com

2) Apache Spark vs Apache Hadoop—Performance & Speed

Right off the bat, Apache Spark is generally faster than Apache Hadoop's MapReduce, its original processing engine. How much faster? You'll often hear figures up to 100 times faster, but take that with a grain of salt—it highly depends on the specific job you're running.

Why the speed difference? It's mostly about memory.

Apache Spark processes data in-memory. Spark uses Resilient Distributed Datasets (RDDs), DataFrames or Datasets, which let it keep intermediate data (the results between steps of your job) in the memory of the worker nodes across multiple operations. It only goes to disk when absolutely necessary or explicitly told to. This avoids the time-consuming process of reading and writing to physical disks repeatedly. Spark also uses a more advanced Directed Acyclic Graph (DAG) execution engine, which allows for more efficient scheduling of tasks compared to Hadoop MapReduce's rigid Map -> Reduce steps.

Hadoop MapReduce, on the other hand, was designed when RAM was more expensive and clusters were often disk-heavy. Hadoop MapReduce writes the results of its map and reduce tasks back to the Hadoop Distributed File System (HDFS) on disk. If you have a multi-step job, each step involves reading from the disk and writing back to the disk. Disk I/O (Input/Output) is way slower than accessing RAM. That's the primary bottleneck Hadoop MapReduce faces compared to Spark for many data processing tasks.

3) Apache Spark vs Apache Hadoop—Ecosystem Integration & Compatibility

Alright, let's dive into how Apache Spark and Apache Hadoop play together, focusing on Apache Spark vs Apache Hadoop ecosystem integration & compatibility. It's less of a competition and more about how they can work in tandem, though they do have different strengths.

Apache Hadoop has a very rich and mature ecosystem that has grown over many years. Beyond Hadoop Distributed File System (HDFS), YARN, and Hadoop MapReduce, you have:

Apache Hive — Provides a SQL-like interface to query data stored in Hadoop Distributed File System (HDFS) or other compatible stores.
Apache Pig — Offers a high-level scripting language (Pig Latin) for data analysis flows.
Apache HBase — A NoSQL, column-oriented database that runs on top of Hadoop Distributed File System (HDFS), good for real-time random read/write access.
Apache Sqoop — Tool for transferring bulk data between Apache Hadoop and structured datastores like relational databases.
Apache Flume — For collecting, aggregating, and moving large amounts of log data.
Apache Oozie — A workflow scheduler system to manage Hadoop jobs.

And many more...

Because of this rich ecosystem, Apache Hadoop can often act as a more complete, end-to-end platform for distributed storage and batch processing needs.

Apache Spark itself, on the other hand, is more focused on the compute aspect. While it includes libraries like Spark SQL, Spark MLlib, Spark Streaming, and Spark GraphX, it's designed to integrate smoothly with various storage systems and resource managers rather than providing its own comprehensive storage solution.

➥ Storage Integration: Spark integrates seamlessly with Apache Hadoop's HDFS. In fact, running Spark on YARN with HDFS for storage is arguably the most common deployment pattern. But Spark isn't limited to HDFS; it can read from and write to many other systems, such as Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), Apache Cassandra, HBase, MongoDB, Apache Kafka, Flume, and Apache Hive.

➥ Compute Layer: Spark is often used as the compute layer within a broader Apache Hadoop ecosystem or a modern data platform due to its versatility. It can replace or supplement Hadoop MapReduce for processing data stored in HDFS or accessed via other Apache Hadoop tools.

So, while Apache Hadoop offers a wider built-in ecosystem, Spark offers greater flexibility in integrating with different storage and cluster management systems, often leveraging Hadoop components.

4) Apache Spark vs Apache Hadoop—Memory & Hardware

What do they demand from your machines?

Apache Hadoop MapReduce was fundamentally designed for large-scale batch processing, prioritizing throughput and fault tolerance using commodity hardware. Its processing model inherently relies heavily on disk I/O:

➥ Intermediate Data Storage: After each Map and Reduce phase, Hadoop MapReduce writes intermediate results back to the Hadoop Distributed File System (HDFS) or local disk. This persistence ensures fault tolerance but introduces significant disk I/O latency, often becoming the primary performance bottleneck.

➥ Memory Requirements: Consequently, Hadoop MapReduce tasks generally have lower active memory requirements compared to Spark for holding data during computation. Clusters running primarily Hadoop MapReduce workloads could often be built with nodes having moderate RAM, focusing instead on sufficient disk capacity and throughput.

➥ Hardware Cost Profile: Historically, this disk-centric approach allowed Hadoop clusters to be built using less expensive "commodity" hardware with substantial disk storage but relatively less RAM per node. While Hadoop MapReduce can utilize available RAM for buffering, it's not optimized for keeping large working datasets entirely in memory across stages.

Apache Spark was developed to overcome the latency limitations of Hadoop MapReduce, particularly for iterative algorithms (like machine learning) and interactive analytics, by leveraging in-memory processing:

➥ In-Memory Data Storage: Apache Spark processes data primarily in RAM using Resilient Distributed Datasets (RDDs) or DataFrames/Datasets. It keeps intermediate data in memory between stages within a job, avoiding costly disk writes whenever possible.

➥ Memory Requirements: To achieve its performance potential, Spark benefits greatly from having sufficient RAM across the cluster to hold the data partitions being actively processed. While Spark can operate with less memory by "spilling" excess data to disk, this incurs substantial performance penalties as disk I/O becomes involved. Therefore, Spark clusters are typically provisioned with significantly more RAM per node (often ranging from tens to hundreds of GiB) compared to traditional Hadoop MapReduce clusters designed for similar data scales.

➥ Hardware Cost Profile: The need for larger amounts of RAM generally makes the hardware for a Spark-optimized cluster more expensive on a per-node basis compared to a traditional, disk-focused Hadoop MapReduce node. However, the Total Cost of Ownership (TCO) comparison can be complex: Spark's speed might allow for smaller clusters or faster job completion (reducing operational costs, especially in cloud environments).

TL;DR: Apache Hadoop MapReduce is a cost-effective option upfront since it gets by with less RAM and leans on disk storage. The downside is, it can be sluggish with batch processing. Apache Spark, though, is typically way faster, especially when it comes to iterative or interactive tasks. The catch is you'll need to spend more on memory-rich hardware to get that speed.

5) Apache Spark vs Apache Hadoop—Programming Language Support

How easy is it for developers to work with them?

Apache Hadoop is primarily written in Java and—via mechanisms like Hadoop Streaming—allows developers to write Hadoop MapReduce programs in virtually any language (such as Python, Ruby, or others). However, its native API is Java, which often results in verbose, low-level code when writing Hadoop MapReduce jobs directly. On the flip side, Apache Spark was developed in Scala and provides robust, first‐class APIs in Scala, Java, Python (via PySpark), R, and SQL (via Spark SQL). This multi-language support lets developers choose the programming language they are most comfortable with, thereby reducing the learning curve.

A key advantage of Apache Spark is its interactive development mode. Spark offers REPLs—such as the spark‑shell for Scala and PySpark for Python—that allow developers to explore and manipulate data interactively. On top of that, Spark’s high‑level abstractions (originally built around Resilient Distributed Datasets, and now primarily through DataFrames and Datasets) provide a rich set of operators that simplify complex data transformations and iterative processing.
On the other hand, Hadoop MapReduce development typically requires a deeper understanding of low‑level APIs and often involves writing extensive boilerplate code, making it more cumbersome and less flexible for rapid development.

6) Apache Spark vs Apache Hadoop—Scheduling and Resource Management

Apache Spark and Apache Hadoop use distinct approaches to scheduling computations and managing cluster resources.

Apache Spark uses the Spark Scheduler to manage task execution across a cluster. The Spark Scheduler is responsible for breaking down the Directed Acyclic Graph (DAG) into stages, each containing multiple tasks. These tasks are then scheduled to executors, which are computing units that run on worker nodes. The Spark Scheduler, in conjunction with the Block Manager, handles job scheduling, monitoring, and data distribution across the cluster. The Block Manager acts as a key-value store for blocks of data, enabling efficient data management and fault tolerance within Spark.

On the other hand, Apache Hadoop's resource management is natively handled by YARN (Yet Another Resource Negotiator), which consists of:

ResourceManager — Global resource arbitrator allocating cluster resources
NodeManager — Per-node agent managing containers (resource units)
ApplicationMaster — Per-application component negotiating resources and monitoring tasks

For workflow scheduling, Hadoop can be integrated with Apache Oozie – a separate service that orchestrates Directed Acyclic Graphs of dependent jobs (MapReduce, Hive, Pig) through XML-defined workflows.

7) Apache Spark vs Apache Hadoop—Latency & Real-Time Analytics Capabilities

How quickly can you get results? What about live data?

Apache Hadoop MapReduce was designed primarily as a batch-processing system. In a typical Hadoop MapReduce job, data is read from the Hadoop Distributed File System (HDFS), processed by map tasks, written back to disk as intermediate output, and then read again by reduce tasks before the final output is written to disk. This heavy reliance on disk I/O at multiple stages, especially between the Map and Reduce phases, introduces significant latency. As a result, Hadoop MapReduce jobs generally take minutes or even hours to complete, making them unsuitable for real-time or near-real-time data processing. Despite this, Hadoop MapReduce remains effective for processing massive datasets when throughput matters more than latency.

Apache Spark was engineered to overcome the latency challenges of Hadoop MapReduce. Its key innovation is in-memory processing—loading data into RAM across the cluster and retaining intermediate data in memory between stages whenever possible. Because of this design, it dramatically reduces disk I/O overhead and significantly speeds up processing, especially for iterative algorithms (such as those used in machine learning) and interactive data analysis.

Spark provides specialized streaming libraries for real-time and near real-time processing:
➥ Spark Streaming (DStreams) — Processes data streams by breaking them into micro-batches, allowing near-real-time processing.
➥ Structured Streaming — This newer API treats incoming data streams as continuously appended tables. It also typically operates on a micro-batching engine—achieving end-to-end latencies that can be as low as around 100 milliseconds while providing exactly-once fault tolerance.
➥ Continuous Processing Mode (Experimental) — Introduced in Spark 2.3, this mode aims to reduce latency further—potentially into the low-millisecond range—but comes with certain limitations (e.g., limited API support and at-least-once processing guarantees).
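
The micro-batch idea behind the first two options can be illustrated without Spark at all. The sketch below groups events into fixed-interval batches by their arrival timestamp; micro_batches is a hypothetical helper, and the real engines additionally handle triggers, state, and late data.

```python
from collections import defaultdict


def micro_batches(events, interval_ms):
    """Group (arrival_ms, payload) events into fixed-interval batches."""
    batches = defaultdict(list)
    for arrival_ms, payload in events:
        # Every event falls into the batch whose window contains its arrival time.
        batch_start = (arrival_ms // interval_ms) * interval_ms
        batches[batch_start].append(payload)
    return dict(sorted(batches.items()))


events = [(5, "a"), (40, "b"), (120, "c"), (199, "d"), (200, "e")]
print(micro_batches(events, 100))
```

Each batch is then processed like a small batch job, which is why latency in this model is bounded below by the batch interval.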

Thus, while Hadoop MapReduce is confined to high-latency batch processing, Apache Spark offers a unified platform that can efficiently handle both batch and low-latency stream processing.

8) Apache Spark vs Apache Hadoop—Fault Tolerance

What happens when things go wrong?

Apache Spark and Apache Hadoop both have strong fault-tolerance mechanisms to keep failures from forcing a complete restart of apps. However, they tackle this challenge in different ways.

Apache Hadoop's fault tolerance is built into its core components. In Hadoop Distributed File System (HDFS), data is broken down into blocks that are replicated (three copies by default) across different nodes. If a DataNode fails, the data is still available from another node that holds a replica. Within the Hadoop MapReduce framework, a master process (the JobTracker in classic MapReduce, or the per-job ApplicationMaster under YARN) monitors task execution. If a task fails, for example because a node crashes, the framework automatically retries the task on another node. This two-part approach (HDFS replicates data, Hadoop MapReduce re-executes tasks) makes Hadoop robust against node failures, but it does add some overhead from writing intermediate data to disk.

Spark's fault tolerance is achieved at the application level using Resilient Distributed Datasets (RDDs). Each RDD maintains a complete lineage: a record of the transformations (stored in the DAG) used to derive it. If a partition is lost due to an executor failure, Spark can recompute that partition from its lineage without restarting the entire job. On top of that, Spark supports checkpointing, where RDDs or streaming state are periodically saved to reliable storage (like HDFS) to truncate long lineages and speed up recovery. For streaming applications, Spark's Structured Streaming also leverages write-ahead logs and state checkpointing to provide exactly-once processing guarantees.
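
Lineage-based recovery is easier to see in a toy model. The class below is not Spark's RDD implementation; ToyRDD and its methods are invented for illustration. Each derived dataset remembers its parent and the function used to build it, so a lost partition can be rebuilt on demand.

```python
class ToyRDD:
    """A toy dataset whose partitions can be rebuilt from a parent dataset."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent   # lineage: where this dataset came from
        self.fn = fn           # lineage: how each partition is derived
        if partitions is not None:
            self.cache = dict(enumerate(partitions))
            self.num_partitions = len(partitions)
        else:
            self.cache = {}
            self.num_partitions = parent.num_partitions

    def map(self, fn):
        # Record the transformation instead of running it right away.
        return ToyRDD(parent=self, fn=lambda part: [fn(x) for x in part])

    def compute(self, i):
        """Return partition i, recomputing it from lineage if it is missing."""
        if i not in self.cache:
            if self.parent is None:
                raise RuntimeError("source partition lost and no lineage to rebuild it")
            self.cache[i] = self.fn(self.parent.compute(i))
        return self.cache[i]

    def lose_partition(self, i):
        self.cache.pop(i, None)   # simulate an executor failure


source = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)
print(doubled.compute(1))     # computed on first use
doubled.lose_partition(1)
print(doubled.compute(1))     # rebuilt from the parent via the lineage
```

Only the lost partition is recomputed; partition 0 is untouched. The RuntimeError branch also hints at why checkpointing to reliable storage matters: lineage can only rebuild data whose source is still available.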

TL;DR: Apache Hadoop relies on block-level replication and task re-execution within Hadoop MapReduce to handle failures, which is well-suited for disk-based batch processing. Apache Spark, on the other hand, uses in-memory recomputation based on RDD lineage (supplemented by checkpointing when needed), providing a more flexible and often faster recovery for interactive and iterative workloads.

9) Apache Spark vs Apache Hadoop—Security & Data Governance

How secure are they, and how well can you manage access?

Apache Hadoop is built with security in mind. Most modern Hadoop distributions offer secure configurations by default. They use strong authentication mechanisms (most notably Kerberos) as well as fine-grained authorization with tools like Apache Ranger and LDAP integration. Hadoop's file system also enforces POSIX-style file permissions and supports access control lists (ACLs), so access to stored data is restricted to authorized users and groups. These security features, combined with auditing and metadata management (supported by Apache Atlas), provide a comprehensive data governance framework for enterprises.

Apache Spark can be made equally secure, though its default configuration (especially in standalone mode) is not as locked down, so a standalone Spark deployment may be vulnerable if not properly secured. Spark’s built-in authentication mechanism, enabled by setting spark.authenticate to true, relies on a shared secret for communication between Spark processes such as the driver and executors. When Spark is deployed within a secure Apache Hadoop ecosystem (such as on Yet Another Resource Negotiator (YARN) with Kerberos enabled), it can inherit many of the underlying security features. It can also be set up with SSL/TLS encryption for data in transit. Moreover, integrations with external security frameworks (such as Apache Ranger) are available to extend Spark’s access controls and audit capabilities. In essence, while Spark’s default settings are less secure, it can be hardened significantly when deployed in a secured environment.
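As a rough illustration of the hardening mentioned above, the sketch below assembles a handful of Spark security properties into spark-submit arguments. The helper names are hypothetical, the selection of properties is an assumption about a reasonable starting point, and a real deployment needs more (for example, keystore settings for TLS, and a shared secret via spark.authenticate.secret in standalone mode).

```python
def hardening_conf(enable_tls=True):
    """Return a small set of Spark security properties (illustrative, not exhaustive)."""
    conf = {
        "spark.authenticate": "true",            # shared-secret authentication
        "spark.network.crypto.enabled": "true",  # encrypt RPC traffic between processes
        "spark.io.encryption.enabled": "true",   # encrypt shuffle and spill files on local disk
    }
    if enable_tls:
        # TLS also requires keystore/truststore properties, omitted in this sketch.
        conf["spark.ssl.enabled"] = "true"
    return conf


def to_submit_args(conf):
    """Render properties as `--conf key=value` pairs for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(hardening_conf())
print(" ".join(submit_args))
```

Keeping the properties in one function makes it easier to apply the same baseline to every job instead of relying on each submitter to remember the flags.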

10) Apache Spark vs Apache Hadoop—Machine Learning & Advanced Analytics

What about running complex analytics like ML?

Apache Hadoop’s core MapReduce framework does not include native machine learning libraries. Historically, developers used external libraries such as Apache Mahout to implement ML algorithms on Hadoop. Mahout’s early implementations relied on Hadoop MapReduce, which, because of its disk-based, batch-oriented design, incurred significant latency and inefficiency for the iterative algorithms common in machine learning. These limitations often resulted in performance bottlenecks, particularly when processing large datasets. In response, recent versions of Mahout have shifted toward Spark’s in-memory processing rather than Hadoop MapReduce to overcome these challenges.

Apache Spark was designed with iterative and interactive analytics in mind. Its native machine learning library, Spark MLlib, offers high-level APIs for tasks such as classification, regression, clustering, collaborative filtering, dimensionality reduction, and more. Spark MLlib benefits from Spark’s in-memory computing model, which minimizes the latency inherent in disk-based processing and dramatically accelerates iterative computations. Due to this integration, it is considerably easier to develop, prototype, and deploy machine learning applications. Moreover, Spark’s active community and extensive ecosystem further simplify the development of advanced analytics applications, enabling real-time analytics, interactive data exploration, and seamless integration with other Spark components.

Apache Spark vs Apache Hadoop—Use Cases

Knowing the technical differences helps, sure, but the real question for you is probably: when should you pick one over the other, or maybe even use them together? Let's break down the typical scenarios for Apache Spark vs Apache Hadoop.

Apache Spark Use Cases—When to Use Apache Spark?

🔮 Use Apache Spark When:

You need fast processing — Spark processes data in memory (RAM) using Resilient Distributed Datasets (RDDs), which is way faster than Hadoop MapReduce's approach of writing intermediate results to disk.
You're doing machine learning — Spark's speed is a huge advantage for iterative algorithms common in machine learning (training models often involve repeatedly processing the same data). Its built-in Spark MLlib library is designed for large-scale ML tasks and integrates well with other ML tools.
You need to process streaming data — Spark Streaming (and its successor, Structured Streaming) handles real-time data streams effectively, processing data in small batches (micro-batching).
You want a unified platform — Spark offers APIs for SQL (Spark SQL), streaming, ML (Spark MLlib), and graph processing (Spark GraphX), letting you combine different types of processing in a single application.
Ease of use is important — Spark offers high-level APIs in Python, Scala, Java, and R, which many find easier to work with than writing Java MapReduce code. Its interactive shells (like PySpark) are also handy for exploration.

Apache Hadoop Use Cases—When to Use Apache Hadoop?

🔮 Use Apache Hadoop When:

You need massive, affordable, reliable storage — Hadoop Distributed File System (HDFS) is designed for storing enormous files across clusters of commodity hardware. It's highly scalable and fault-tolerant through data replication. If your data volume is truly massive and doesn't fit comfortably in RAM across your cluster, HDFS is a solid, cost-effective storage foundation.
Cost is a major factor — Apache Hadoop clusters can be built using relatively inexpensive commodity hardware. Since Hadoop MapReduce (if used) is disk-based, it doesn't demand the high RAM requirements that Spark's in-memory approach does, making the hardware potentially cheaper.
Batch processing is sufficient — If you have large jobs that can run overnight or don't require immediate results (like generating monthly reports, large-scale ETL, log analysis for historical trends), Hadoop MapReduce (or Hive on Hadoop) is perfectly capable and economical. Its processing model is well-suited for linear processing of large data volumes.
Data archiving — Hadoop Distributed File System (HDFS) provides a cost-effective way to archive massive datasets for long-term retention or compliance.

Which is better: Apache Spark vs Apache Hadoop? (Apache Spark vs Apache Hadoop—Pros & Cons)

No tool is perfect. Let's weigh the advantages and disadvantages.

Apache Spark Benefits and Apache Spark Limitations

Apache Spark Benefits:

Fast in-memory processing speeds up iterative tasks and interactive queries.
Supports batch, streaming, SQL, machine learning, and graph processing in one framework.
Provides user-friendly APIs in Scala, Java, Python, and R for ease of development.
Offers high-level abstractions (DataFrames/Datasets) that simplify distributed data handling.
Strong community support.
Robust fault tolerance; recovers from failures via lineage and optional checkpointing.

Apache Spark Limitations:

High memory usage can lead to increased infrastructure cost and requires careful tuning.
Lacks a built-in file system and depends on external storage systems like Hadoop Distributed File System (HDFS) or cloud services.
Micro-batch streaming introduces latency that may not suit true real-time needs.
Demands manual adjustments and performance tuning for complex jobs.

Apache Hadoop Advantages and Apache Hadoop Limitations

Apache Hadoop Advantages:

Designed for batch processing of massive datasets using cost-effective commodity hardware.
Uses Hadoop Distributed File System (HDFS) to replicate data, providing robust fault tolerance and resilience.
Comes with a wide ecosystem (Hive, Pig, HBase, etc.) that extends its capabilities.
Operates at a lower per-unit cost due to disk-based processing.

Apache Hadoop Limitations:

Disk I/O in Hadoop MapReduce slows performance compared to in-memory solutions.
Programming with Hadoop MapReduce can be less intuitive for iterative or interactive workloads.
Not built for low-latency or near-real-time processing without adding extra tools.
Handling a large number of small files can strain the NameNode and reduce efficiency.

Conclusion

And that’s a wrap! So, when comparing Apache Spark vs Apache Hadoop, it's clear they address different (though related) problems, and they often work better together.
Apache Hadoop, particularly HDFS and YARN, laid the groundwork, offering a way to store and manage resources for truly massive datasets. Its original processing engine, Hadoop MapReduce, was revolutionary for its time but showed its age in terms of speed and flexibility.
Apache Spark emerged as a powerful successor to the Hadoop MapReduce processing component. It delivered speed through in-memory computation and versatility through its unified engine for batch, streaming, SQL, ML, and graph workloads.

The key takeaway? It's rarely a strict "either/or" choice today. More often, the question is how to best combine them or which components to use. You might use:
➤ Spark on YARN with Hadoop Distributed File System (HDFS) (a common on-prem setup).
➤ Spark on Kubernetes with cloud storage (a common cloud-native setup).
➤ Just Hadoop Distributed File System (HDFS) for cheap, large-scale storage, accessed by various tools.
➤ Just YARN to manage resources for diverse applications.

Spark is undeniably the leading engine for large-scale data processing now. Hadoop's components, especially Hadoop Distributed File System (HDFS) and YARN, remain relevant as infrastructure elements, although cloud alternatives and Kubernetes are changing the landscape. Understanding their distinct strengths helps you build the right data platform for your specific challenges.

In this article, we have covered:

What is Apache Hadoop? -- What is Apache Hadoop used for?
What is Apache Spark? -- What is Apache Spark used for?
What Is the Difference Between Apache Hadoop and Apache Spark? -- Apache Spark vs Apache Hadoop—Architecture Breakdown -- Apache Spark vs Apache Hadoop—Performance & Speed -- Apache Spark vs Apache Hadoop—Ecosystem Integration -- Apache Spark vs Apache Hadoop—Memory & Hardware -- Apache Spark vs Apache Hadoop—Programming Language Support -- Apache Spark vs Apache Hadoop—Scheduling & Resource Management -- Apache Spark vs Apache Hadoop—Latency & Real-Time Analytics -- Apache Spark vs Apache Hadoop—Fault Tolerance -- Apache Spark vs Apache Hadoop—Security & Data Governance -- Apache Spark vs Apache Hadoop—ML & Advanced Analytics
Apache Spark vs Apache Hadoop—Use Cases -- When to Use Apache Spark -- When to Use Apache Hadoop
Apache Spark vs Apache Hadoop — Pros & Cons … and so much more!!

FAQs

What is Apache Spark used for?
Apache Spark is used for fast data processing across various workloads: quick batch jobs, interactive SQL queries, real-time stream analysis, large-scale machine learning, and graph computations.

Should I learn Hadoop or Spark?
Spark is usually the better choice for data engineering and science roles. It's flexible and can handle various tasks. However, understanding basic Hadoop concepts like HDFS and YARN is still important. You can ignore Hadoop MapReduce unless you work with older systems.

Does Apache Spark run on Hadoop?
Yes, very commonly. Spark can run on Apache Hadoop's YARN resource manager and use HDFS for storage. This is a popular deployment model, allowing Spark to leverage existing Apache Hadoop clusters and infrastructure. Spark can also run independently (standalone mode, Kubernetes, Mesos) using other storage systems (like S3).

Why is Spark faster than Hadoop?
The main reason is Spark's ability to perform computations in memory, drastically reducing the slow disk read/write operations that bottleneck Hadoop MapReduce. Spark also uses optimized execution plans (DAGs).

Is Apache Spark used for big data?
Absolutely. Apache Spark was specifically designed for big data workloads. Its ability to distribute processing across a cluster and handle large datasets (both in-memory and spilling to disk when necessary) makes it a cornerstone technology for big data analytics, ETL (Extract, Transform, Load), machine learning on large datasets, and real-time data processing.

Are Apache Spark and Hadoop the same?
Nope, definitely not. Spark is primarily a processing engine, while Hadoop (originally) bundled storage (HDFS) and processing (Hadoop MapReduce) with resource management (YARN). Spark is generally focused on computation speed and flexibility, often leveraging memory. Hadoop MapReduce, its traditional processing counterpart, is more disk-based and batch-oriented.

Is Spark outdated?
No, Apache Spark is far from outdated. It's actively developed, with new releases bringing performance improvements and features. It has a large, vibrant community and is a core technology in the big data and machine learning landscape, widely used across many industries and integrated into major cloud platforms.

Is Hadoop Still Used? Is It Outdated?
Let's break it down:
➥ HDFS & YARN: These components of Hadoop are still widely used. Hadoop Distributed File System (HDFS) is a great option for large-scale, cost-effective storage, especially if you're on-premises. That said, cloud object storage like S3 is a strong competitor. Yet Another Resource Negotiator (YARN) remains a popular resource manager in many established clusters.
➥ Hadoop MapReduce: The original Hadoop MapReduce engine isn't the go-to choice for new development anymore. Instead, Spark, Flink, and other engines offer better performance and are more user-friendly for most tasks. However, some organizations still have legacy Hadoop MapReduce jobs running.
➥ The Ecosystem: Many tools that were developed within the Hadoop ecosystem, like Hive, HBase, and Pig, are still in use. They're often used alongside Spark.

What Replaced Hadoop (MapReduce)?
For the processing part (Hadoop MapReduce), Apache Spark is the most prominent replacement. Other frameworks like Apache Flink (especially for streaming) and query engines like Presto/Trino also serve as alternatives or complementary tools in the big data space. For storage (HDFS), cloud object stores like Amazon S3, Google Cloud Storage, Azure Blob Storage are very popular alternatives, especially in cloud environments.

Is Hadoop easy to learn?
"Easy" is relative. Hadoop (especially the full ecosystem including Hadoop MapReduce) generally has a steeper learning curve than some newer tools. It involves understanding distributed systems concepts, configuring clusters (though this is often handled by specific platforms or cloud services now), and learning the specifics of Hadoop Distributed File System (HDFS), YARN, and potentially Hadoop MapReduce programming (primarily in Java).

Is Hadoop a programming language?
No, Hadoop is not a programming language. It's a framework written primarily in Java. You typically write applications for Hadoop (like Hadoop MapReduce jobs) using languages like Java, or use tools within the ecosystem (like Hive with SQL-like HQL, Pig with Pig Latin, or Spark with Python, Scala, Java, R, SQL) that interact with Hadoop components.

Who uses Apache Hadoop?
Many large organizations across various sectors (finance, healthcare, tech, retail, government) still use components of the Hadoop ecosystem, particularly Hadoop Distributed File System (HDFS) for storage and YARN for resource management, often in conjunction with Spark or other processing engines for analytics, data warehousing, and handling large batch jobs. While newer cloud-native stacks are popular for new projects, established big data infrastructure often involves Hadoop elements.

[Apache Iceberg] Iceberg Performance: The Hidden Cost of NULLS FIRST

Yu-Chuan Hung — Sun, 16 Nov 2025 16:47:45 +0000

Introduction

Apache Iceberg is a widely-used table format in Data Lakehouse architectures. It provides flexibility in how data is written, with two key optimizations: partition, which splits data into segments, and sort, which reorders data within those segments. These optimizations can significantly reduce the amount of data scanned by query engines, ultimately boosting query performance.

When querying data with high-cardinality columns (e.g., IDs or serial numbers), quickly filtering out unnecessary values is crucial. Sorting becomes particularly valuable in these scenarios. The rationale is simple: if data is written in order, query engines can rapidly locate the needed data rather than performing a full table scan and discarding irrelevant rows.

When configuring Iceberg table sort properties, engineers can specify whether sorting follows ascending or descending order—with ascending as the default. While reading about this configuration, a question came to mind: Is there any performance difference between these two ordering approaches? If so, which one performs better, and why? To answer these questions, I designed an experiment to find out.

Experiment

Detailed code and performance analysis can be found in my repo: https://github.com/CuteChuanChuan/Dive-Into-Iceberg

Testing Materials

Generated 1,000,000 rows with 30% null values
Created two identically configured Iceberg tables with different null sorting orders (i.e., NULLS FIRST vs. NULLS LAST)
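For readers who want to reproduce a similar setup, the following is a minimal sketch of the test data and sort-order configuration, not the author's actual code (which lives in the linked repo). The value range 0 to 999, the seed, and the helper name are assumptions; the table names are taken from the query plans shown later, and the DDL strings use Iceberg's Spark SQL extension syntax for write ordering.

```python
import random


def generate_rows(n, null_fraction=0.3, seed=42):
    """Build n (id, value) rows where a fixed fraction of the values is None."""
    rng = random.Random(seed)
    null_count = round(n * null_fraction)
    values = [None] * null_count + [rng.randint(0, 999) for _ in range(n - null_count)]
    rng.shuffle(values)
    return list(enumerate(values))


rows = generate_rows(1_000_000)
null_share = sum(1 for _, v in rows if v is None) / len(rows)
print(f"{len(rows)} rows, {null_share:.0%} null")

# Assumed DDL for the two sort orders; it would be run through a Spark session
# with the Iceberg SQL extensions enabled.
SORT_ORDER_DDL = {
    "nulls_first": "ALTER TABLE local.db.test_nulls_first WRITE ORDERED BY value ASC NULLS FIRST",
    "nulls_last": "ALTER TABLE local.db.test_nulls_last WRITE ORDERED BY value ASC NULLS LAST",
}
```

Fixing the seed keeps both tables fed with identical rows, so any timing difference can be attributed to the sort order rather than to the data.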

Queries Executed to Evaluate Performance

select count(*) from table where value is not null
select sum(value) from table where value is not null
select avg(value) from table where value is not null
select count(*) from table where value is null
select count(*) from table

Performance Evaluation Metrics

Query plan: Whether different sorting orders generate different execution plans
Execution time with statistical analysis: Overall query time comparison
CPU profiling: Detailed CPU usage analysis

Findings

To obtain a complete picture, I planned to conduct three types of analysis. First, I compared query plans to see whether different null placements generate different plans, which might influence query performance. Second, I conducted statistical analysis on execution times for rigorous examination. Since query time differences are the observable outcome, we need to identify the root cause if significant differences exist. Therefore, if statistical significance is found, CPU profiling will be conducted in the final phase.

Query Plan

Details

`select count(*) from table where value is not null`

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1557]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnotnull(value#508)
               +- BatchScan local.db.test_nulls_first[value#508] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1574]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnotnull(value#521)
               +- BatchScan local.db.test_nulls_last[value#521] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

`select sum(value) from table where value is not null`

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#886)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3045]
      +- HashAggregate(keys=[], functions=[partial_sum(value#886)])
         +- Filter isnotnull(value#886)
            +- BatchScan local.db.test_nulls_first[value#886] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#899)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3064]
      +- HashAggregate(keys=[], functions=[partial_sum(value#899)])
         +- Filter isnotnull(value#899)
            +- BatchScan local.db.test_nulls_last[value#899] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

`select avg(value) from table where value is not null`

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1264)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4535]
      +- HashAggregate(keys=[], functions=[partial_avg(value#1264)])
         +- Filter isnotnull(value#1264)
            +- BatchScan local.db.test_nulls_first[value#1264] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1279)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4554]
      +- HashAggregate(keys=[], functions=[partial_avg(value#1279)])
         +- Filter isnotnull(value#1279)
            +- BatchScan local.db.test_nulls_last[value#1279] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

`select count(*) from table where value is null`

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6023]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnull(value#1646)
               +- BatchScan local.db.test_nulls_first[value#1646] local.db.test_nulls_first (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6040]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnull(value#1659)
               +- BatchScan local.db.test_nulls_last[value#1659] local.db.test_nulls_last (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []

`select count(*) from table`

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1895L)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7045]
      +- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1895L)])
         +- Project [count(*)#1896L AS agg_func_0#1895L]
            +- LocalTableScan [count(*)#1896L]

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1904L)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7060]
      +- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1904L)])
         +- Project [count(*)#1905L AS agg_func_0#1904L]
            +- LocalTableScan [count(*)#1905L]

Conclusion

For both tables, the execution plans for all queries are identical.

File-Level Statistics Analysis

Although the query plans are the same, a deeper look at the Parquet file statistics reveals important differences in how data is physically organized.

Partition Statistics Comparison

Below are the min/max statistics for each partition in both configurations:

Partition	NULLS FIRST	NULLS LAST	Min Value Difference
cat_0-2	All nulls	All nulls	N/A
cat_3	min=103, max=993	min=103, max=993	Same
cat_4	min=4, max=994	min=4, max=994	Same
cat_5	min=405, max=995	min=355, max=995	-50
cat_6	min=106, max=996	min=6, max=996	-100
cat_7	min=517, max=997	min=487, max=997	-30
cat_8	min=228, max=998	min=208, max=998	-20
cat_9	min=619, max=999	min=609, max=999	-10

Why Are Statistics Different?

The different min/max values reveal that physical data layout differs between the two configurations:

Different File Boundaries: When sorting with NULLS FIRST vs. NULLS LAST, Spark writes data in different orders, causing file splits to occur at different points. Even though both tables contain identical data, the way rows are distributed across files differs.
File Organization Pattern:

NULLS FIRST: Files begin with null values, followed by non-null values. The minimum non-null value appears after skipping nulls within each file.
NULLS LAST: Files begin with non-null values immediately. The minimum value is at or near the start of the file.

Metadata Quality: NULLS LAST produces "better" statistics for non-null queries:

In NULLS FIRST (e.g., cat_6): min=106 means the file starts with nulls, and 106 is the first non-null value encountered.
In NULLS LAST (e.g., cat_6): min=6 means the file immediately starts with value 6, providing more accurate bounds.
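The effect of null placement on file boundaries can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the fixed rows-per-file split and the helper names are hypothetical, and real Iceberg writers decide file boundaries by size, not by a row count.

```python
def sort_nulls_first(values):
    """Ascending sort with None values placed before all non-null values."""
    return sorted(values, key=lambda v: (v is not None, v if v is not None else 0))


def sort_nulls_last(values):
    """Ascending sort with None values placed after all non-null values."""
    return sorted(values, key=lambda v: (v is None, v if v is not None else 0))


def file_stats(sorted_values, rows_per_file):
    """Split rows into fixed-size 'files' and report (min, max) of non-null values per file."""
    stats = []
    for start in range(0, len(sorted_values), rows_per_file):
        chunk = [v for v in sorted_values[start:start + rows_per_file] if v is not None]
        stats.append((min(chunk), max(chunk)) if chunk else None)
    return stats


values = [6, None, 106, None, 996]
first = sort_nulls_first(values)
last = sort_nulls_last(values)

# Same rows, different order, so the split points fall between different rows.
print(first, file_stats(first, rows_per_file=3))
print(last, file_stats(last, rows_per_file=3))
```

Both orderings hold the same rows, yet the per-file bounds differ because the file split lands in a different place, which mirrors the "Different File Boundaries" point above.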

Impact on Query Execution

For queries with WHERE value IS NOT NULL:

NULLS FIRST:

Files contain nulls at the beginning, causing mixed value distribution
Query engine must scan through null values before reaching non-null data
Statistics indicate the presence of non-null values, but they're not immediately accessible

NULLS LAST:

Files with non-null data have those values at the beginning
Query engine can immediately start processing valid values
Better sequential access pattern for counting non-null values

This file-level organization difference, combined with CPU microarchitecture optimizations, explains why NULLS LAST performs better for counting non-null values even though logical query plans are identical.

Execution Time Analysis

Data Collection

5 queries, each executed 100 times

Statistical Methods

T-test: Compare whether query times are statistically different
Cohen's d: Calculate the effect size of null ordering settings
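The two statistics can be computed with the standard library alone. This sketch assumes the paired-samples form of Cohen's d (mean difference divided by the standard deviation of the differences); the sample timings are made up for illustration, and the p-value is left out because it needs a t-distribution CDF (for example from SciPy).

```python
import math
import statistics


def paired_t_and_d(a, b):
    """Return (t statistic, Cohen's d) for paired measurements a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff
    return t_stat, cohens_d


# Hypothetical timings in milliseconds for the same runs under two configurations.
nulls_first_ms = [41.0, 43.5, 39.0, 44.0]
nulls_last_ms = [31.0, 32.5, 30.5, 32.0]

t_stat, cohens_d = paired_t_and_d(nulls_first_ms, nulls_last_ms)
print(f"t = {t_stat:.4f}, d = {cohens_d:.4f}")
```

Under this definition t equals d times the square root of the number of pairs. The figures reported below fit that relationship: with 100 runs, t = 11.9367 corresponds to d = 1.1937.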

Details

`select count(*) from table where value is not null`: Null Last performs better

Descriptive Statistics:
  NULLS FIRST: mean=41.46ms, sd=8.38ms
  NULLS LAST:  mean=31.55ms, sd=2.40ms

Paired t-test:
  t-statistic = 11.9367 
  p-value = 0.000000 
  95% CI: [8.26, 11.55] ms
  Result: *** HIGHLY SIGNIFICANT (p < 0.001)

Effect Size (Cohen's d):
  d = 1.1937 
  Interpretation: Large 

Summary:
  Mean difference: 9.91 ms
  Percentage difference: 23.90 %
  Winner: NULLS LAST

`select sum(value) from table where value is not null`: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=34.14ms, sd=5.12ms
  NULLS LAST:  mean=33.40ms, sd=6.43ms

Paired t-test:
  t-statistic = 0.8759 
  p-value = 0.383195 
  95% CI: [-0.94, 2.43] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

`select avg(value) from table where value is not null`: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=28.84ms, sd=3.42ms
  NULLS LAST:  mean=27.95ms, sd=3.26ms

Paired t-test:
  t-statistic = 1.9654 
  p-value = 0.052165 
  95% CI: [-0.01, 1.80] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

`select count(*) from table where value is null`: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=24.00ms, sd=4.64ms
  NULLS LAST:  mean=23.16ms, sd=3.43ms

Paired t-test:
  t-statistic = 1.3804 
  p-value = 0.170582 
  95% CI: [-0.37, 2.05] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

`select count(*) from table`: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=14.95ms, sd=2.41ms
  NULLS LAST:  mean=14.39ms, sd=2.45ms

Paired t-test:
  t-statistic = 1.6356 
  p-value = 0.105090 
  95% CI: [-0.12, 1.25] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

Conclusion

NULLS LAST is significantly faster than NULLS FIRST when counting non-null values.

CPU Profiling: Analyzing Count Non-Null Values Query

Details

Please refer to the flame graphs in my repo.

The performance difference observed in execution time analysis can be attributed to both file-level organization and CPU microarchitecture optimizations:

File-Level Organization Impact: As shown in the file statistics analysis, NULLS LAST creates files where non-null values are positioned at the beginning. This layout means when the query engine scans data with WHERE value IS NOT NULL, it immediately encounters a continuous block of valid values rather than having to skip over nulls first. This reduces unnecessary I/O operations and deserialization overhead.
CPU Microarchitecture Optimizations:
1. SIMD (Single Instruction, Multiple Data): Modern CPUs can process multiple data elements simultaneously using SIMD instructions. When counting non-null values with NULLS LAST, the query engine encounters a continuous block of non-null values at the start of each file. This layout allows SIMD instructions to efficiently process multiple valid values in parallel. For example, when checking isnotnull(value) on 8 consecutive values that are all non-null, a single SIMD instruction can validate and count them in one operation.
2. Branch Prediction: Modern CPUs use branch predictors to anticipate the outcome of conditional statements (like if (value != null)). With NULLS LAST, the query engine scans data following a highly predictable pattern: a long sequence of non-null values followed by nulls. This consistency allows the branch predictor to achieve high accuracy, keeping the CPU pipeline running smoothly. In contrast, NULLS FIRST presents a less predictable pattern at file boundaries where nulls transition to non-nulls, potentially causing pipeline stalls.

The CPU profiling data supports these optimizations: NULLS LAST (2,238 samples) uses approximately 11.7% less CPU time than NULLS FIRST (2,536 samples). This reduction results from the combined effects of better file organization, improved SIMD vectorization, and enhanced branch prediction accuracy.
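The relative reductions quoted in this article can be recomputed from the reported figures. The helper name is an assumption; the inputs are the sample counts and mean execution times stated above.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# CPU profiler samples: NULLS FIRST vs. NULLS LAST.
cpu_reduction = relative_reduction(2536, 2238)

# Mean execution time in ms for the count-non-null query.
time_reduction = relative_reduction(41.46, 31.55)

print(f"CPU samples: {cpu_reduction:.2%} fewer")
print(f"Mean time:   {time_reduction:.2%} lower")
```

Both ratios use NULLS FIRST as the baseline, so they read as "how much NULLS LAST saves relative to NULLS FIRST".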

Conclusion

NULLS LAST occupies less CPU time due to a combination of better file-level data organization and CPU microarchitecture optimizations.

Conclusion and Future Exploration

This exploration reveals that while different null value placements do not create different query plans, they significantly impact query performance through physical data organization.

Key Findings:

File-Level Statistics Matter: NULLS LAST produces better min/max statistics, with non-null values positioned at file beginnings. This creates more favorable data layouts for queries filtering on non-null values.
CPU Microarchitecture Synergy: The continuous blocks of non-null values in NULLS LAST enable CPU optimizations including SIMD vectorization and improved branch prediction, resulting in ~11.7% less CPU time.
Significant Performance Impact: For SELECT COUNT(*) WHERE value IS NOT NULL, NULLS LAST reduces mean execution time by 23.90% (31.55 ms vs. 41.46 ms), a substantial improvement for such a common OLAP operation.

Practical Recommendations:

If counting non-null values is a frequent operation in your workload—which is common in OLAP scenarios—configuring Iceberg tables with NULLS LAST can provide measurable performance improvements. The benefits stem from both better file organization and CPU-level optimizations working in tandem.

Future Exploration:

This experiment tested 5 queries on a 1-million-row dataset with 30% null values. Future investigations could explore:

Various query patterns frequently used in OLAP scenarios (e.g., window functions like LAG, complex aggregations)
Larger datasets with multiple files per partition to amplify metadata pruning effects
Different null percentage distributions (10%, 50%, 70%) to understand the threshold where NULLS LAST benefits diminish
Impact on different data types (strings, decimals) and column cardinalities
Performance with Iceberg's metadata-based filtering in more complex predicates

These investigations would provide a more complete understanding of optimal Iceberg table sorting configurations across diverse workloads.

HOW TO: Run Spark on Kubernetes with AWS EMR on EKS (2025)

Pramit Marattha — Sat, 15 Nov 2025 11:00:51 +0000

Running Apache Spark on Kubernetes with AWS EMR on EKS gives you the best of both worlds: AWS EMR's optimized Spark runtime and AWS EKS's container orchestration come together in one managed platform. You could run Spark on Kubernetes yourself, but it's a lot of manual work: you'd need to create a custom container image, set up networking, and handle a bunch of other configuration. With EMR on EKS, AWS supplies the Spark runtime as a ready-to-use container image, handles job orchestration, and ties it all into EKS. Just submit your Spark job to an EMR virtual cluster (which maps to an EKS namespace), and its driver and executors run as Kubernetes pods under EMR’s control. You still handle some IAM and networking setup, but the heavy lifting (runtime tuning, job scheduling, and container builds) is handled for you.
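As a preview of the submission model described above, here is a hedged Python sketch that builds a job-run request for an EMR virtual cluster and hands it to the boto3 `emr-containers` client. The helper names, the virtual cluster ID, the role ARN, the S3 path, and the release label are all placeholders chosen for illustration; substitute the values from your own account.

```python
def build_job_run_request(virtual_cluster_id, execution_role_arn, entry_point,
                          release_label="emr-6.15.0-latest", name="sample-spark-job"):
    """Assemble the keyword arguments for an EMR on EKS StartJobRun call."""
    return {
        "virtualClusterId": virtual_cluster_id,
        "name": name,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,  # placeholder; pick a release available to you
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": "--conf spark.executor.instances=2",
            }
        },
    }


def submit(request):
    """Send the request with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request can be built without boto3 installed

    client = boto3.client("emr-containers")
    return client.start_job_run(**request)


# Placeholder identifiers, not real resources.
request = build_job_run_request(
    virtual_cluster_id="abc123exampleid",
    execution_role_arn="arn:aws:iam::111122223333:role/example-emr-eks-job-role",
    entry_point="s3://example-bucket/scripts/job.py",
)
print(request["jobDriver"]["sparkSubmitJobDriver"]["entryPoint"])
```

Separating request construction from submission lets you inspect or unit-test the payload without touching AWS; calling `submit(request)` is what would actually start the job.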

In this article, we will first explain why EMR on EKS is useful, show how its architecture works, and compare EMR on EC2 vs EMR on EKS. Finally, we will give you a step-by-step recipe (with actual AWS CLI commands and config samples) to get a Spark job running on Kubernetes via EMR on EKS.

Why Use AWS EMR on EKS for Spark Workloads?

First, why use AWS EMR on EKS at all? What do you gain by running Spark on Kubernetes under EMR instead of the familiar EMR on EC2 or even self-managed Spark on EKS? The short answer is flexibility and ease of management. EMR on EKS offers the best of both worlds: managed Spark plus Kubernetes. It avoids the hassle of building Spark containers and managing Spark clusters by hand.

What are the benefits of EMR on EKS?

The AWS EMR on EKS model offers several advantages:

Benefit 1: Simplified Spark Runtime Management
You get the same managed Spark experience that EMR on EC2 provides, but on Kubernetes. EMR takes care of provisioning the Spark runtime (with pre-built, optimized Spark versions), auto-scaling, and provides development tools like EMR Studio and the Spark UI. AWS handles the Spark container images and integration so you don’t have to assemble them yourself.

Benefit 2: Cost Optimization via Kubernetes Resource Sharing
Your Spark jobs run as pods on an EKS cluster that can also host other workloads, so you avoid waste from idle clusters. Nodes come up and down automatically, and you pay only for actual usage. AWS specifically points out that with EMR on EKS “compute resources can be shared” and removed “on demand to eliminate over-provisioning”, leading to lower costs.

Benefit 3: Fast Job Startup and Performance Improvements
You can reuse an existing Kubernetes node pool, so there’s no need to spin up a fresh cluster for each job. This eliminates the startup lag of launching EC2 instances. In fact, AWS claims EMR’s optimized Spark runtime can run some workloads up to 3× faster than default Spark on Kubernetes.

Benefit 4: Flexible Spark and EMR Version Management
You can run different Spark/EMR versions side by side on the same cluster. EMR on EKS lets one EKS namespace host Spark 2.4 apps and another host Spark 3.0. According to AWS, you can use a single EKS cluster to run applications that require different Apache Spark versions and configurations. This is handy if some jobs need legacy code while others take advantage of newer Spark features.

Benefit 5: Native Integration with Kubernetes and AWS Tools
EMR on EKS ties into Kubernetes APIs and IAM Roles for Service Accounts (IRSA). You can use your existing EKS authentication methods, networking, logging, and autoscaler to manage Spark pods.

Benefit 6: EMR Cloud-Native Experience on Kubernetes
Finally, you still get EMR conveniences like EMRFS (optimized S3 access), default security and logging settings, and support for EMR Studio or Step Functions. AWS even provides AWS Step Functions and EMR on EKS templates to streamline workflows.

All in all, EMR on EKS is great if you already have (or plan to use) Kubernetes for container workloads and want the managed Spark experience. It avoids the manual work of installing Spark on Kubernetes (which you’d have to do if you ran open-source Spark on EKS).

EMR on EKS System Architecture Explained

At a very high level, EMR on EKS loosely couples Spark to Kubernetes. EMR (the control plane) simply tells EKS what pods to run, and EKS handles the actual compute (EC2 / Fargate). Here’s how it works under the hood:

The EMR on EKS architecture is a multi-layer pipeline. At the top level you have AWS EMR, which now has a “virtual cluster” registered to a namespace in your AWS EKS cluster. When you submit a Spark job through EMR (for example, using aws emr-containers start-job-run), EMR takes your job parameters and tells Kubernetes what to run. Under the hood, EMR creates one or more Kubernetes pods for the Spark driver and executors. Each pod pulls a container image provided by EMR (Amazon Linux 2 with Spark installed) and begins processing.

The Kubernetes layer (AWS EKS) is responsible for scheduling these pods onto available compute. It can use either self-managed EC2 nodes or Fargate to supply the necessary CPU and memory. In practice, you often configure an EC2 Auto Scaling Group behind EKS so that new nodes spin up as Spark executors need them. The architecture supports multi-AZ deployments: pods can run on nodes in different availability zones, giving resilience and access to a larger pool of instances.

Below the compute layer, your data lives in services like AWS S3, and your logs/metrics flow to CloudWatch (or another sink). EMR on EKS handles the wiring: it automatically ships driver and executor logs to CloudWatch Logs and S3 if you configure it, and even lets you view the Spark History UI from the EMR console after a job completes.

TL;DR: EMR on EKS decouples analytics from infrastructure: EMR builds the Spark application environment and Kubernetes provides the execution environment.

[Figure: EMR on EKS Architecture]

EMR on EKS “loosely couples” Spark to your Kubernetes cluster. When you run a job, EMR uses your job definition (entry point, arguments, configs) to tell EKS exactly what pods to run. Kubernetes does the pod scheduling onto EC2/Fargate nodes. Because it’s loose, you can run multiple isolated Spark workloads on the same cluster (even in different namespaces) and mix them with other container apps.

EMR on EC2 vs EMR on EKS: Detailed Comparison

It’s worth understanding the difference between the old-school EMR on EC2 vs EMR on EKS, so you know when to pick each. With EMR on EC2, Amazon launches a dedicated Spark cluster for you on EC2 instances (possibly with EC2 Spot for cost savings). Those instances are dedicated to EMR, and YARN or another scheduler allocates resources. You have full control of the cluster’s Hadoop/Spark config and node sizes, but the resources are siloed. In contrast, with EMR on EKS, you reuse your shared Kubernetes cluster. EMR on EKS simply runs Spark on that cluster’s nodes (alongside other apps).

EMR on EC2	Aspect	EMR on EKS
Dedicated EC2 instances	Resource Allocation	Shared Kubernetes cluster
YARN-based scheduling	Orchestration	Kubernetes-native scheduling
Pay for dedicated instances	Cost Model	Pay only for actual resource usage
Limited to single EMR version per cluster	Multi-tenancy	Multiple versions and configurations
Slower due to EC2 instance provisioning	Startup Time	Faster using existing node pools
Native Hadoop ecosystem support	Integration	Cloud-native Kubernetes ecosystem
EMR managed scaling	Scaling	Kubernetes autoscaling + Karpenter/Fargate

🔮 Use EMR on EC2 when you want a standalone cluster per workload. If you have a stable, heavy Spark job schedule and don’t already have Kubernetes in the picture, EMR on EC2 can be straightforward. It’s the classic way to run Hadoop/Spark and it integrates with HDFS/other Hadoop ecosystem tools out of the box. EMR on EC2 might also make sense if you need features currently only in EMR’s YARN-based mode, or if containerization is not a requirement.

🔮 Use EMR on EKS when you have a Kubernetes environment (or plan to) and want to colocate Spark with other container workloads. It’s great for multi-tenancy and agility – one EKS cluster can host multiple Spark applications (even with different EMR versions) and also run other services (like Airflow, machine learning apps, etc.). If you’re already managing infrastructure with EKS and Helm or Terraform, adding Spark workloads there avoids siloing. EMR on EKS also handles the complex AWS integration (EMRFS, S3, IAM) for you, whereas manually running Spark on vanilla Kubernetes would require gluing together a lot of pieces.

Step-By-Step Guide to Run Spark on Kubernetes with AWS EMR on EKS

Now we get hands-on. We'll walk through all the setup steps, including code snippets and JSON where appropriate. You can run these commands in any region (just add --region and adjust ARNs/URIs as needed).

Prerequisite:

First things first, make sure you have the following things configured:

AWS Management Console access with appropriate permissions
Basic understanding of EMR cluster architecture and Spark fundamentals
Familiarity with the AWS Management Console navigation
AWS CLI configured with appropriate credentials and permissions
kubectl (Kubernetes CLI) installed and configured
eksctl (EKS cluster CLI) installed and configured
Basic understanding of Kubernetes concepts (pods, namespaces, services)
An existing VPC with appropriate subnets or permission to create new networking resources
Understanding of IAM roles and policies for service integration

Step 1—AWS Console Access and CLI Setup for EMR and EKS

Log in to the AWS Console or make sure your AWS CLI is authenticated. If using the CLI, you should have a profile set up (using aws configure or environment variables) with credentials. You can test by running something like:

aws sts get-caller-identity

If this returns your account and user/role info, you’re ready. No specific AWS region is required for EMR on EKS itself, but keep in mind you’ll launch resources (like EKS nodes) in some region or AZs when prompted.

Note: Many AWS CLI commands require specifying a region or having a default region configured (~/.aws/config). Pick one (for example, us-west-2) and use it consistently.

Step 2—Creating an AWS EKS Kubernetes Cluster

Now create an EKS cluster that Spark will run on. You can use eksctl for a simple setup.

eksctl create cluster \
  --name my-emr-eks-cluster \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 4 \
  --managed

This command (in your default region) creates a new EKS cluster named my-emr-eks-cluster with a managed node group of 3 Linux instances (m5.large by default; pass --node-type if you need something different). The --nodes-min and --nodes-max flags set the node group's scaling bounds (min 1, max 4); the nodes only scale automatically once an autoscaler such as Cluster Autoscaler or Karpenter is running on the cluster.

Once it completes, eksctl updates your ~/.kube/config so that kubectl knows about this cluster. You can verify:

kubectl get nodes -o wide

You should see 3 (or up to 4 as they scale) EC2 instances ready. To view the workloads running on your cluster:

kubectl get pods -A -o wide

Note: In production, you might want to create nodegroups in multiple AZs, use Spot instances, a wider node type mix, etc. This example uses a simple default setup for clarity.

Step 3—Setting Up Kubernetes Namespace and EMR Access

We’ll dedicate a Kubernetes namespace for EMR Spark jobs. A “namespace” in Kubernetes isolates resources. Let’s make one (called spark for example):

kubectl create namespace spark

Next, we must let EMR’s service account access this namespace. AWS provides the eksctl create iamidentitymapping command to link EMR’s service-linked role to the namespace. Run:

eksctl create iamidentitymapping \
  --cluster my-emr-eks-cluster \
  --namespace spark \
  --service-name emr-containers

This command creates the necessary Kubernetes RBAC (Role & RoleBinding) and updates the aws-auth ConfigMap so that the AWSServiceRoleForAmazonEMRContainers role is mapped to the user emr-containers in the spark namespace. In other words, it gives EMR on EKS permission to create pods, services, etc. in spark. (If this fails, ensure you’re using a recent eksctl version and that your AWS credentials can modify the cluster’s IAM config).

Step 4—Create a Virtual Cluster for EMR (Register EKS Cluster with EMR)

Now register this namespace as an EMR virtual cluster. A virtual cluster in EMR on EKS terms is just the glue that tells EMR “use this EKS cluster and namespace for job runs”. It does not create new nodes; it just links to the existing cluster.

Use the AWS CLI emr-containers command:

aws emr-containers create-virtual-cluster \
    --name spark-vc \
    --container-provider '{
         "type": "EKS",
         "id": "my-emr-eks-cluster",
         "info": {"eksInfo": {"namespace": "spark"}}
    }'

Replace my-emr-eks-cluster with your cluster name (as above). The response is JSON containing the virtual cluster's id, name, and arn; the id is the value later commands expect as --virtual-cluster-id.

After running, you can verify the virtual cluster with:

aws emr-containers list-virtual-clusters

And note the ID for the one named spark-vc. We’ll use that in the next step. (The virtual cluster itself doesn’t create any servers; it just links EMR to the namespace).
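
If you script this step, the ID can be pulled out of the list-virtual-clusters output instead of copied by hand. The following is a minimal Python sketch, assuming the response has the documented shape (a virtualClusters array whose entries carry id, name, and state). The helper name find_virtual_cluster_id and the sample IDs are hypothetical.

```python
import json


def find_virtual_cluster_id(list_output: str, name: str) -> str:
    """Return the id of the RUNNING virtual cluster with the given name.

    `list_output` is the JSON text printed by
    `aws emr-containers list-virtual-clusters`.
    """
    clusters = json.loads(list_output).get("virtualClusters", [])
    # Deleted virtual clusters can still be listed as TERMINATED,
    # so filter on state as well as name.
    matches = [
        c["id"]
        for c in clusters
        if c.get("name") == name and c.get("state") == "RUNNING"
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected one RUNNING cluster named {name!r}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical sample output, trimmed to the fields used here.
sample = json.dumps({"virtualClusters": [
    {"id": "oldid0000000000000000000", "name": "spark-vc", "state": "TERMINATED"},
    {"id": "exampleid1234567890abcde", "name": "spark-vc", "state": "RUNNING"},
]})
print(find_virtual_cluster_id(sample, "spark-vc"))
```

Failing loudly when zero or several clusters match avoids submitting jobs to the wrong namespace.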

Step 5—Creating the IAM Job Execution Role for EMR on EKS

Spark jobs running on EMR on EKS need an AWS IAM role to access AWS resources (for example, S3 buckets). This is called the job execution role. We create an AWS IAM role that EMR can assume, and attach a policy for S3 and CloudWatch logs.

5a—Define and Create the IAM Role (EMR Job Execution Role)

We’ll create a role that trusts EMR. One way is to trust the elasticmapreduce.amazonaws.com service and then update it for IRSA.

For example:

aws iam create-role --role-name EMROnEKSExecutionRole \
    --assume-role-policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
         "Effect": "Allow",
         "Principal": {"Service": "elasticmapreduce.amazonaws.com"},
         "Action": "sts:AssumeRole"
      }]
    }'

Replace EMROnEKSExecutionRole with your own name. This sets up the role so EMR (service name elasticmapreduce.amazonaws.com) can assume it.

5b—Attach Required AWS Policies and Permissions

Next, attach an AWS IAM policy that grants permissions to this role. At minimum, give it read/write access to your S3 buckets and permission to write logs.

For example:

aws iam put-role-policy --role-name EMROnEKSExecutionRole --policy-name EMROnEKSExecutionPolicy \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:DeleteObject"
          ],
          "Resource": [
            "arn:aws:s3:::YOUR-LOGS-BUCKET",
            "arn:aws:s3:::YOUR-LOGS-BUCKET/*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:DescribeLogStreams",
            "logs:DescribeLogGroups"
          ],
          "Resource": "arn:aws:logs:*:*:log-group:/aws/emr-containers/*"
        }
      ]
    }'

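Replace YOUR-LOGS-BUCKET with your S3 bucket name (or use * to allow all buckets, but locking it down is better). If your job script and output live in a different bucket, such as the my-bucket used in Step 7, add that bucket's ARNs to the Resource list too. This grants S3 and CloudWatch Logs access.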
After this, note the role ARN (you can fetch it with aws iam get-role). We’ll use that in the job submission.

aws iam get-role --role-name EMROnEKSExecutionRole --query 'Role.Arn' --output text
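
When several buckets are involved, generating the Step 5b policy document is less error-prone than hand-editing JSON. Here is a small Python sketch that rebuilds the same policy for a list of buckets. The function name build_execution_policy is hypothetical, and the bucket names are the placeholders used in this article.

```python
import json


def build_execution_policy(buckets, log_group_prefix="/aws/emr-containers/"):
    """Build the inline policy from Step 5b for one or more S3 buckets."""
    s3_resources = []
    for bucket in buckets:
        # s3:ListBucket applies to the bucket ARN itself,
        # the object actions apply to bucket/*.
        s3_resources += [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject",
                           "s3:ListBucket", "s3:DeleteObject"],
                "Resource": s3_resources,
            },
            {
                "Effect": "Allow",
                "Action": ["logs:CreateLogGroup", "logs:CreateLogStream",
                           "logs:PutLogEvents", "logs:DescribeLogStreams",
                           "logs:DescribeLogGroups"],
                "Resource": f"arn:aws:logs:*:*:log-group:{log_group_prefix}*",
            },
        ],
    }


# Placeholder bucket names: the data bucket from Step 7 plus a logs bucket.
policy = build_execution_policy(["my-bucket", "YOUR-LOGS-BUCKET"])
print(json.dumps(policy, indent=2))
```

You could write the printed JSON to a file such as policy.json and pass it to aws iam put-role-policy with --policy-document file://policy.json.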

Step 6—Enabling IRSA (IAM Roles for Service Accounts) in EKS

Prerequisites: Before running the update-role-trust-policy command, make sure that your EKS cluster has an OIDC identity provider associated. You can set this up with:

eksctl utils associate-iam-oidc-provider --cluster my-emr-eks-cluster --approve

AWS EMR on EKS uses AWS IAM Roles for Service Accounts (IRSA) under the hood. To let Spark pods assume our role, we update its trust policy. AWS provides a handy command:

aws emr-containers update-role-trust-policy \
    --cluster-name my-emr-eks-cluster \
    --namespace spark \
    --role-name EMROnEKSExecutionRole

This command modifies the role's trust policy to trust the OIDC provider for your EKS cluster, specifically allowing any service account in the spark namespace whose name matches emr-containers-sa-*-*-<ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME> to assume it. Essentially, it ties the role to the Kubernetes service accounts that EMR creates for each job. After running this, your Spark driver and executor pods (which use those service accounts) will be able to use the permissions of EMROnEKSExecutionRole.

You can verify the trust policy was updated correctly by checking the role:

aws iam get-role --role-name EMROnEKSExecutionRole --query 'Role.AssumeRolePolicyDocument'

The output should now include entries for both the EMR service and your EKS cluster's OIDC provider.
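
That check can be automated. The sketch below is one possible way to do it in Python: it scans a trust policy document for a statement that allows sts:AssumeRoleWithWebIdentity from an OIDC provider, restricted to service accounts in a given namespace. The function names are hypothetical, and the sample policy (account ID, provider URL, role-name suffix) is illustrative rather than real output.

```python
def _as_list(value):
    """IAM JSON lets most fields be either a single value or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def has_irsa_statement(trust_policy: dict, namespace: str) -> bool:
    """True if an OIDC provider may assume the role for service
    accounts in `namespace`."""
    prefix = f"system:serviceaccount:{namespace}:"
    for stmt in _as_list(trust_policy.get("Statement")):
        principal = stmt.get("Principal")
        if stmt.get("Effect") != "Allow" or not isinstance(principal, dict):
            continue
        if "sts:AssumeRoleWithWebIdentity" not in _as_list(stmt.get("Action")):
            continue
        federated = _as_list(principal.get("Federated"))
        if not any(":oidc-provider/" in arn for arn in federated):
            continue
        # Look for a "<provider>:sub" condition scoped to the namespace.
        for operator in stmt.get("Condition", {}).values():
            for key, subjects in operator.items():
                if key.endswith(":sub") and any(
                    s.startswith(prefix) for s in _as_list(subjects)
                ):
                    return True
    return False


# Illustrative trust policy, shaped like the output of the command above.
sample = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Principal": {"Service": "elasticmapreduce.amazonaws.com"},
         "Action": "sts:AssumeRole"},
        {"Effect": "Allow",
         "Principal": {"Federated": "arn:aws:iam::123456789012:oidc-provider/"
                                    "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"},
         "Action": "sts:AssumeRoleWithWebIdentity",
         "Condition": {"StringLike": {
             "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE:sub":
                 "system:serviceaccount:spark:emr-containers-sa-*-*-123456789012-examplebase36"}}},
    ],
}
print(has_irsa_statement(sample, "spark"))
```

To use it for real, feed it the JSON printed by the get-role command (parsed with json.loads) instead of the sample.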

Step 7—Submitting Apache Spark Jobs to EMR Virtual Cluster

We’re ready to run a Spark job. Let’s assume you have a PySpark script my_spark_job.py in S3 (s3://my-bucket/scripts/my_spark_job.py) and you want the output in s3://my-bucket/output/. We’ll ask for 2 executors with 4 GiB each as a simple example.

Use the start-job-run command:

aws emr-containers start-job-run \
  --virtual-cluster-id <my-virtual-cluster-id> \
  --name example-spark-job \
  --execution-role-arn arn:aws:iam::123456789012:role/EMROnEKSExecutionRole \
  --release-label emr-6.10.0-latest \
  --job-driver '{
      "sparkSubmitJobDriver": {
          "entryPoint": "s3://my-bucket/scripts/my_spark_job.py",
          "entryPointArguments": ["s3://my-bucket/output/"],
          "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=4G"
      }
  }'

Replace <my-virtual-cluster-id> with the ID from Step 4.
Set the --execution-role-arn to your role’s ARN from Step 5.
--release-label chooses the EMR/Spark version (6.10.0 is Spark 3.x; pick as needed).
The JSON under --job-driver tells EMR to run spark-submit with our script. We pass the output path as an argument, and set Spark configs for 2 executors of 4 GiB memory.

You can add --configuration-overrides (in JSON) if you want to enable additional logging or set extra Spark configs. But the above is the basic form. After you run it, you’ll get a job-run ID. EMR on EKS will then schedule the Spark driver pod and executor pods on the cluster.
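
As an illustration of what --configuration-overrides can carry, here is a Python sketch that assembles the whole start-job-run request as one JSON document, including a monitoringConfiguration that sends logs to CloudWatch Logs and S3. The field names follow the StartJobRun API. The function name, the log group name, and the S3 log path are assumptions made for this example.

```python
import json


def build_start_job_run_request(virtual_cluster_id, role_arn, entry_point, args,
                                log_group, log_uri,
                                executors=2, executor_memory="4G"):
    """Assemble a StartJobRun request equivalent to the CLI call in Step 7,
    plus a logging override."""
    spark_params = (
        f"--conf spark.executor.instances={executors} "
        f"--conf spark.executor.memory={executor_memory}"
    )
    return {
        "virtualClusterId": virtual_cluster_id,
        "name": "example-spark-job",
        "executionRoleArn": role_arn,
        "releaseLabel": "emr-6.10.0-latest",
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "entryPointArguments": list(args),
                "sparkSubmitParameters": spark_params,
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "cloudWatchMonitoringConfiguration": {
                    "logGroupName": log_group,
                    "logStreamNamePrefix": "example-spark-job",
                },
                "s3MonitoringConfiguration": {"logUri": log_uri},
            }
        },
    }


request = build_start_job_run_request(
    virtual_cluster_id="<my-virtual-cluster-id>",       # placeholder from Step 4
    role_arn="arn:aws:iam::123456789012:role/EMROnEKSExecutionRole",
    entry_point="s3://my-bucket/scripts/my_spark_job.py",
    args=["s3://my-bucket/output/"],
    log_group="/aws/emr-containers/jobs",               # assumed log group name
    log_uri="s3://my-bucket/logs/",                     # assumed log location
)
print(json.dumps(request, indent=2))
```

Saved to a file such as job.json, the document can be submitted with aws emr-containers start-job-run --cli-input-json file://job.json, which keeps long JSON out of shell quoting.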

Step 8—Monitoring Spark Job Status and Viewing Results

After submission, you can track the job status. Use:

aws emr-containers describe-job-run \
  --virtual-cluster-id <virtual-cluster-id> \
  --id <job-run-id>

This will show status (PENDING, RUNNING, etc.) and more details. You can also see the job in the EMR console under Virtual Clusters, or use EMR Studio if you have it set up.

Logs: EMR on EKS ships driver and executor logs to CloudWatch Logs and S3 only when the job run includes a monitoringConfiguration (passed via --configuration-overrides). The CloudWatch log group is the one you name there; the policy from Step 5b permits groups under /aws/emr-containers/. Within it you should see log streams for your driver and executors. Also, EMR keeps the Spark History. In the EMR console’s “Job runs” details, there’s a link to the Spark UI logs for debugging.
For example, after starting the job, you can run:

aws emr-containers list-job-runs \
  --virtual-cluster-id <virtual-cluster-id>

to see the job's progress and current status. Use describe-job-run for details like log URIs or final status.
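
For unattended runs, the status check is usually wrapped in a polling loop with a deadline. Below is a minimal Python sketch of that pattern. The function wait_for_job_run is hypothetical and takes any callable that returns the current state string, so the loop can be tried without AWS access. The terminal states listed are assumed from the job-run states EMR on EKS reports.

```python
import time

# States after which a job run will not change any more.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job_run(get_state, poll_seconds=30.0, timeout_seconds=3600.0,
                     sleep=time.sleep):
    """Poll get_state() until the job run reaches a terminal state.

    get_state is any zero-argument callable returning the state string.
    Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job run still {state} after {timeout_seconds}s")
        sleep(poll_seconds)


# Dry run with a scripted sequence of states instead of real API calls.
fake_states = iter(["PENDING", "SUBMITTED", "RUNNING", "COMPLETED"])
print(wait_for_job_run(lambda: next(fake_states), sleep=lambda seconds: None))
```

In a real script, get_state could shell out to aws emr-containers describe-job-run with --query jobRun.state --output text and return the trimmed output. Keeping the timeout here shorter than any orchestrator-level timeout lets the polling layer fail first and report the job's last known state.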

Collecting and Viewing Job Output and Logs
Once the job completes, any output will be in your S3 path (e.g. s3://my-bucket/output/). Check there for results. You can also open the Spark History Server UI via the EMR console to inspect job stages and metrics (just click the link for that job’s Spark UI). All the data-processing was done by pods on your EKS cluster, so there’s no EMR cluster to terminate – it was purely virtual.

Step 9—Resource Cleanup: Deleting EMR Virtual Clusters, EKS Namespace, and Roles

When you’re done, you’ll want to delete what you created to avoid charges.
Job runs are ephemeral, so there is nothing to delete for them; start with the virtual cluster.

1) Delete the EMR virtual cluster:

aws emr-containers delete-virtual-cluster --id <my-virtual-cluster-id>

(You can list your virtual clusters to get the ID, or use the one from creation). This removes EMR’s registration.

2) Delete the Kubernetes namespace:

kubectl delete namespace spark

3) Delete the EKS cluster:

eksctl delete cluster --name my-emr-eks-cluster

4) Remove the AWS IAM role and policies:

aws iam delete-role-policy \
  --role-name EMROnEKSExecutionRole \
  --policy-name EMROnEKSExecutionPolicy

aws iam delete-role \
  --role-name EMROnEKSExecutionRole

(The inline policy from Step 5b must be deleted before the role can be removed. If you also attached any managed policies, detach them first with aws iam detach-role-policy).

Once cleaned up, you’ll only be charged for the time your nodes were up and any storage/transfer. There’s no separate “EMR on EKS” fee beyond normal EMR and EC2 usage.

Troubleshooting and Diagnosing Common EMR on EKS Issues

1) Fixing Pod Failures and Resource Constraint Errors

Issue: Jobs fail with insufficient resources errors.
Solution: Check node groups have adequate capacity and use appropriate instance types:

# Check node capacity
kubectl describe nodes

# Verify resource requests vs available capacity
kubectl top nodes
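
A quick back-of-the-envelope check often explains these failures: an executor pod requests more memory than spark.executor.memory, because Spark adds a memory overhead on top. The Python sketch below works through the arithmetic for the 4G executors from Step 7. It assumes Spark's documented Kubernetes defaults (overhead is the larger of a factor times executor memory and 384 MiB, with a factor of 0.4 for PySpark and 0.1 for JVM jobs). The 7000 MiB allocatable figure for a node is a hypothetical value; read the real one from the Allocatable section of kubectl describe nodes.

```python
def executor_pod_memory_mib(executor_memory_mib, overhead_factor=0.4,
                            min_overhead_mib=384):
    """Approximate memory request of one executor pod: heap plus overhead.

    Assumed defaults: overhead = max(factor * executor memory, 384 MiB),
    factor 0.4 for non-JVM (PySpark) jobs, 0.1 for JVM jobs.
    """
    overhead = max(int(executor_memory_mib * overhead_factor), min_overhead_mib)
    return executor_memory_mib + overhead


def executors_per_node(node_allocatable_mib, executor_memory_mib, **kwargs):
    """How many executor pods fit on one node, by memory alone."""
    return node_allocatable_mib // executor_pod_memory_mib(
        executor_memory_mib, **kwargs
    )


pod_mib = executor_pod_memory_mib(4096)     # 4G PySpark executor from Step 7
print(pod_mib)
print(executors_per_node(7000, 4096))       # hypothetical allocatable memory
```

Under these assumptions a 4G PySpark executor requests roughly 5.6 GiB, so a node with about 7 GiB allocatable holds only one executor, and the driver pod needs room as well. If pods stay Pending, either lower spark.executor.memory or use a larger instance type.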

2) Resolving IRSA and AWS IAM Authentication Problems

Issue: Jobs fail with AWS permission errors despite correct AWS IAM policies.
Solution: Verify OIDC provider configuration and trust policy:

# Check OIDC provider exists
aws iam list-open-id-connect-providers

# Verify trust policy includes correct OIDC provider
aws iam get-role --role-name EMROnEKSExecutionRole \
  --query 'Role.AssumeRolePolicyDocument'

3) Addressing Networking and DNS Issues with Spark on EKS

Issue: Jobs cannot access S3 or other AWS services.
Solution: Verify VPC endpoints, security groups, and DNS configuration:

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Verify VPC endpoints
aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=<your-vpc-id>"

Conclusion

And that’s a wrap! You have successfully set up and run Apache Spark applications on Kubernetes using AWS EMR on EKS. This powerful combination provides the flexibility of Kubernetes with the managed capabilities of EMR, enabling you to run scalable analytics workloads efficiently. EMR on EKS offers significant advantages in terms of resource utilization, cost optimization, and operational simplicity while maintaining the performance benefits of EMR's optimized Spark runtime. This makes it an excellent choice for organizations looking to modernize their big data infrastructure and adopt container-based architectures.

In this article, we have covered:

Why AWS EMR on EKS?
Architecture of EMR on EKS
Difference between EMR on EC2 vs EMR on EKS
Step-by-Step Guide to Run Spark on Kubernetes with AWS EMR on EKS

… and so much more!

Frequently Asked Questions (FAQs)

What is EMR on EKS?
AWS EMR on EKS is a deployment option for AWS EMR that enables running Apache Spark applications on AWS EKS clusters instead of dedicated EC2 instances. It combines EMR's performance-optimized runtime with Kubernetes orchestration capabilities.

What are the benefits of EMR on EKS?
The benefits of EMR on EKS include shared resource utilization, managed Spark versions, and faster startup. EMR on EKS allows you to consolidate analytical Spark workloads with other Kubernetes-based applications for better resource use. You get EMR’s automatic provisioning and EMR Studio support, and you only pay for the containers you run (nodes can scale down to zero). AWS also reports big performance gains using the EMR-optimized Spark runtime.

Why run Spark on Kubernetes instead of YARN?
Running Spark on Kubernetes can be simpler if you’re already using Kubernetes for other workloads. It lets you treat Spark jobs as container apps, using Kubernetes scheduling, monitoring, and autoscaling. As AWS explains, if you already run big data on EKS, EMR on EKS automates provisioning so you can run Spark more quickly. In contrast, YARN requires dedicated clusters and is tied to the Hadoop ecosystem. Kubernetes offers a unified platform and can make multi-tenancy and version management easier.

Do I need to build my own Spark Docker image?
No. EMR on EKS uses Amazon-provided container images with optimized Spark runtime. AWS manages the container image lifecycle, including security updates and performance optimizations, eliminating the need for custom image management.

Can I run multiple Spark versions on one EKS cluster?
Yes. EMR on EKS supports running different EMR release labels across separate virtual clusters (namespaces) on the same EKS cluster. This enables testing different Spark versions or maintaining legacy applications alongside modern workloads.

Is EMR on EKS more expensive than EMR on EC2?
Cost depends on usage patterns. EMR on EKS has no additional charges beyond standard EMR and compute costs. The shared resource model often reduces costs by eliminating idle cluster capacity, making it particularly cost-effective for variable or bursty workloads.

Can I use EMR Studio with EMR on EKS?
Yes. EMR Studio fully supports EMR on EKS virtual clusters through EMR interactive endpoints. You can attach Studio workspaces to virtual clusters for interactive development, debugging, and job authoring.

What is a virtual cluster in EMR on EKS?
A virtual cluster is a logical construct that maps AWS EMR to a specific Kubernetes namespace. It doesn't create physical resources but serves as the registration point for job submission and management within that namespace.

Does EMR on EKS use HDFS?
No. EMR on EKS typically uses AWS S3 via EMRFS for data storage rather than HDFS. This approach provides better durability, scalability, and cost-effectiveness for cloud-native architectures, though custom HDFS deployments are possible if required.

Do I need to manage Spark Operator or Spark-submit jobs?
EMR on EKS offers flexibility in job submission methods. You can use the AWS CLI/SDK with emr-containers commands for simplicity, or leverage Kubernetes-native approaches like the Spark Operator for more advanced orchestration scenarios.