DEV Community

Cover image for Apache Spark Interview Questions: Architecture, Shuffle, Caching, Tuning
Gowtham Potureddi
Gowtham Potureddi

Posted on

Apache Spark Interview Questions: Architecture, Shuffle, Caching, Tuning

apache spark interview questions focus on five distributed-systems pillars every Spark loop tests: architecture (driver, executors, cluster manager, the cluster modes), DAG and stage boundaries (lazy planning, narrow vs wide transformations, shuffle as the stage boundary), shuffle internals (sort-based shuffle, spill, network I/O, the dominant job cost), caching and persistence (storage levels, when to cache, when to checkpoint), and tuning (spark-submit configs, executor sizing, AQE, the Catalyst optimizer). Whether you're prepping for spark interview questions at a startup or apache spark interview questions and answers for experienced at FAANG, the same five pillars show up — and senior loops add streaming, Delta Lake / Iceberg, and the unified-batch-streaming story on top.

This guide walks through every theme in the spark interview questions and answers pdf ecosystem that reviewers love to test in data engineering interview questions: spark architecture with the master-worker model, the DAG scheduler and stage boundaries, shuffle in spark internals (sort-based shuffle, external sort, spill thresholds), spark caching and storage levels (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY), partitioning strategy for parallelism and skew, the Catalyst optimizer + Tungsten execution engine, Adaptive Query Execution (AQE), spark sql vs dataframe vs rdd comparison, and the seven gotchas (driver OOM, skewed stages, small files, GC pauses, broadcast threshold) that fail most candidates. Every section ends as spark interview questions with answers: a runnable PySpark snippet, a traced execution, a sample output, and a concept-by-concept why this works breakdown.

PipeCode blog header for an Apache Spark interview tutorial — bold white headline 'Apache Spark · Interview Questions' with subtitle 'architecture · shuffle · caching · tuning' and a minimal cluster diagram on a dark gradient with purple, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse Python practice library →, drill ETL Python drills →, sharpen data-manipulation patterns →, rehearse streaming Python drills →, reinforce real-time analytics patterns →, or widen coverage on the full data-analysis library →.


On this page


1. Why Apache Spark interview questions test five distributed-systems pillars

Spark interviews stay focused — five pillars cover 90% of the prompts

The one-sentence invariant: apache spark interview questions recycle the same five distributed-systems pillars (architecture, DAG / stages, shuffle, caching, tuning) regardless of seniority — spark interview questions for freshers cuts the depth on each, while apache spark interview questions and answers for experienced adds streaming, Delta / Iceberg, and the unified-batch-streaming story on top. Once you know the five pillars, every prompt becomes "which pillar is the reviewer probing?"

The five pillars at a glance.

  • Architecture — driver runs the program; executors run tasks; cluster manager (YARN / Kubernetes / standalone / Mesos) allocates resources.
  • DAG and stages — every action triggers a DAG; stages are bounded by shuffles; tasks process one partition each.
  • Shuffle internals — sort-based shuffle, external sort spill to disk, network I/O; the dominant job cost.
  • Caching and persistencecache() / persist(level) for reused DataFrames; checkpoint() to break long DAG lineage.
  • Tuning — executor sizing, spark.sql.shuffle.partitions, AQE, broadcast threshold; the operational reality of every Spark cluster.

Why interviewers love Spark.

  • It's the de facto big-data engine — most modern data lakes / lakehouses run on Spark.
  • It tests distributed-systems thinking — partitioning, shuffles, OOM, GC, lock-step coordination.
  • The optimization surface is rich — Catalyst, AQE, broadcast joins, caching, partitioning — every senior loop probes here.
  • Production-realism questions — "your job is slow; what's the first thing you check?".

What interviewers listen for.

  • Do you name the driver / executor / cluster-manager triad when asked about architecture? — basic-but-tested.
  • Do you reach for df.explain() before guessing? — senior signal.
  • Do you mention AQE when asked about modern Spark tuning? — Spark 3.0+ awareness.
  • Do you know that shuffles dominate cost and reach for broadcast joins, partition pruning, and column pruning first? — optimization fluency.

spark vs hadoop — when does Spark win?

  • In-memory speed — Spark caches intermediate results in RAM; Hadoop MapReduce writes every stage to disk.
  • Iterative workloads — ML training, graph algorithms; Spark's caching makes these 10-100× faster.
  • Unified API — batch (DataFrame), streaming (Structured Streaming), ML (MLlib), graph (GraphX) all in one engine.
  • Modern Spark also runs on YARN / Kubernetes / S3 — same deployment surface as Hadoop; you don't have to choose.

spark vs pyspark — same engine, different language.

  • Apache Spark — the underlying JVM engine, written in Scala.
  • PySpark — the Python API to Spark; thin wrapper around the JVM engine.
  • Performance — DataFrame ops are identical; Python UDFs are slow because of JVM-Python serialization (use pandas_udf).
  • Choose — Python (PySpark) for DE work and ML interop; Scala for performance-critical or type-safe code.

Worked example — a Spark job that touches all five pillars

Detailed explanation. A realistic Spark interview answer combines pillars. The snippet below configures the driver / executor sizing (pillar 1), reads from Parquet and builds a lazy DAG (pillar 2), avoids a shuffle via broadcast() (pillar 3), caches a reused intermediate (pillar 4), and tunes AQE on (pillar 5).

Question. Compute total paid revenue per region from a 1TB orders Parquet, joined to a 5MB customers lookup, on a cluster with 10 executors × 8 cores × 32GB.

Code (PySpark + Spark configuration).

spark = (
    SparkSession.builder
    .appName("revenue_by_region")
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "8")
    .config("spark.executor.memory", "28g")
    .config("spark.sql.shuffle.partitions", "400")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.sql.autoBroadcastJoinThreshold", "104857600")  # 100MB
    .getOrCreate()
)

orders    = spark.read.parquet("s3://bucket/orders/")    # lazy
customers = spark.read.parquet("s3://bucket/customers/") # 5MB, will broadcast

paid = (
    orders
    .filter(col("status") == "paid")
    .select("order_id", "customer_id", "amount")   # column pruning
    .cache()
)

result = (
    paid
    .join(broadcast(customers), on="customer_id", how="inner")
    .groupBy("region")
    .agg(sum("amount").alias("total_revenue"))
)

result.explain("formatted")   # verify BroadcastHashJoin
result.write.mode("overwrite").parquet("s3://bucket/output/revenue/")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Architecture — 10 executors × 8 cores = 80 task slots; 28GB executor memory (leave 4GB for OS / JVM overhead).
  2. DAG / stagesread → filter → select are all narrow; join would normally shuffle, but the broadcast() skips it; groupBy then triggers a shuffle.
  3. Shuffle — only the groupBy shuffles, partitioned to 400 (matching cluster cores × 5).
  4. Cachingpaid.cache() materialises once; reused for the join and the count.
  5. Tuning — AQE on, skew join handling on, broadcast threshold bumped to 100MB.

Output. Top regions by total paid revenue.

region total_revenue
US 12500.00
EU 8750.00
APAC 5400.00

Rule of thumb: tune Spark by understanding the five pillars, not by copy-pasting spark-submit configs; every config change should map to a specific bottleneck you've identified in the Spark UI.

Python
Topic — etl
Spark ETL drills

Practice →

Python
Topic — data-analysis
Distributed compute patterns

Practice →


2. Spark architecture — driver, executors, cluster manager

Diagram of Spark architecture — a labelled cluster diagram with the driver on the left, three executor boxes in the middle (each holding tasks and a cache slot), and a cluster manager box on the right (YARN / Kubernetes / standalone), connected by arrows labelled 'tasks' and 'results', on a light PipeCode card.

spark architecture — driver, executors, cluster manager — the three roles you must name

spark architecture is the #1 most-asked apache spark interview questions prompt. The senior answer in one sentence: Spark uses a master-worker model — the driver runs your program and the DAG scheduler; executors are JVMs that run tasks on data partitions; the cluster manager (YARN / Kubernetes / standalone / Mesos) allocates resources between them.

The driver — the master.

  • Runs your main() function — the Python or Scala program.
  • Hosts the SparkContext / SparkSession — the entry point to the cluster.
  • Builds the DAG — the logical and physical plans from transformations.
  • DAG scheduler — splits the DAG into stages bounded by shuffles.
  • Task scheduler — dispatches tasks to executors based on data locality.
  • Single point of failure — if the driver dies, the job dies.

Executors — the workers.

  • JVM processes that run on cluster nodes.
  • Run tasks — one task per partition; configurable cores per executor.
  • Cache blocks — store cached DataFrames / RDDs in memory or on disk.
  • Heartbeat to the driver — report task progress and metrics.
  • Sized via spark.executor.cores (slots per executor) and spark.executor.memory (heap).

Cluster manager — resource allocator.

  • YARN — the Hadoop ecosystem default; most production clusters.
  • Kubernetes — increasingly common in cloud-native deployments.
  • Standalone — Spark's built-in cluster manager; simplest setup.
  • Mesos — legacy; rarely seen in modern stacks.
  • Local modespark-submit --master local[*]; single-machine development.

The job → stage → task hierarchy.

  • Job — one action triggers one job (df.write.parquet(...) is one job).
  • Stage — DAG split bounded by shuffles; pipelined tasks within a stage run as one unit.
  • Task — one partition processed by one executor core; the smallest unit of work.
  • Partition — one slice of data; typically 128MB-1GB; processed by one task.

spark client mode vs cluster mode — where does the driver run?

  • Client mode — driver runs on the submitting machine (laptop / edge node); good for interactive work; bad for production (laptop quits = job dies).
  • Cluster mode — driver runs inside the cluster (one of the worker nodes); good for production; the standard spark-submit --deploy-mode cluster.
  • For PySpark — both modes work; cluster mode is the canonical production answer.

Resource calculation rules of thumb.

  • Executor cores — 4-5 cores per executor; more = JVM context-switch overhead.
  • Executor memorycores × 4GB is a typical minimum; tune up for join-heavy workloads.
  • Memory overhead — Spark reserves ~10% above heap for JVM overhead; configure spark.executor.memoryOverhead.
  • Driver memory — start with 4GB; bump if you collect() to driver or run a metadata-heavy job.

Dynamic allocation.

  • spark.dynamicAllocation.enabled=true — executors scale up/down based on workload.
  • Useful for shared YARN clusters and bursty workloads.
  • Catch — requires external shuffle service to enable safe executor removal.

Python
Topic — etl
Spark architecture drills

Practice →

Python
Topic — data-analysis
Distributed system patterns

Practice →


3. DAG, stages, tasks — how Spark plans and schedules a job

Diagram of the DAG → stage → task hierarchy — top row shows a DAG of operations (read, filter, withColumn, groupBy, agg, write); middle row groups them into two stages with a shuffle boundary between; bottom row shows each stage broken into N parallel tasks running on executors, on a light PipeCode card.

spark dag — actions trigger jobs, shuffles bound stages, partitions become tasks

The #2 most-asked apache spark interview questions topic. The senior answer: every action triggers a job; the job is split into stages at shuffle boundaries; each stage runs many tasks in parallel, one per data partition.

The full pipeline — action → job → stages → tasks.

  • Actionshow, count, collect, write, take triggers execution.
  • Job — one logical query plan from inputs to outputs.
  • Stage — bounded by shuffles; one stage's tasks all pipeline together (no shuffle inside).
  • Task — one partition processed by one core; the smallest schedulable unit.

Narrow vs wide transformations — the stage-boundary rule.

  • Narrow transformation — each output partition depends on one input partition; no shuffle. Examples: select, filter, withColumn, map, union.
  • Wide transformation — each output partition depends on multiple input partitions; shuffle required. Examples: groupBy, join, distinct, orderBy, repartition.
  • Stage boundary — every wide transformation creates a new stage; narrow transformations are pipelined inside one stage.

Reading df.explain().

  • Parsed logical plan — unresolved AST from your code.
  • Analyzed logical plan — column / type resolution done.
  • Optimized logical plan — Catalyst rewrites (push-down, pruning).
  • Physical plan — the actual execution plan with operator selection (BroadcastHashJoin, SortMergeJoin, HashAggregate, etc.).
  • df.explain("formatted") — Spark 3.0+ readable format.
  • df.explain("cost") — adds row count estimates.

Spark UI — the operational view.

  • Jobs tab — every action that ran, with duration and status.
  • Stages tab — stages within each job, with task counts and shuffle metrics.
  • SQL tab — per-query DAG diagram and per-operator timing.
  • Storage tab — cached DataFrames and their memory footprint.
  • Environment tab — all Spark configs in effect.
  • Executors tab — per-executor task counts, memory, GC time, shuffle volume.

The four key metrics on a stage.

  • Duration — wall-clock time of the stage.
  • Shuffle Read Size / Records — bytes pulled from upstream shuffle.
  • Shuffle Write Size / Records — bytes shuffled to downstream stage.
  • Spill (Memory) / Spill (Disk) — bytes evicted from RAM under pressure; a sign of OOM.

Speculative execution.

  • spark.speculation=true — Spark re-launches slow tasks on different executors; whichever finishes first wins.
  • Use case — long-tail tasks caused by a slow node or skewed partition.
  • Cost — extra cluster work for tasks that might already be progressing.

Task / partition rules of thumb.

  • Target partition size — 128MB-1GB; smaller wastes scheduling, larger risks OOM.
  • Task count per stage — at least 2-4 × total_cores; defaults to spark.sql.shuffle.partitions (200) for shuffled stages.
  • Tune spark.sql.shuffle.partitions — to 2-4 × total_cores of your cluster.

Job failure modes.

  • Driver OOMcollect() on too much data; metadata explosion on too many partitions.
  • Executor OOM — partition too large for memory; skewed key; insufficient executor memory.
  • Lost executor — node failure or eviction; Spark retries tasks (spark.task.maxFailures=4 default).
  • Long GC pauses — heap pressure; tune executor memory or enable G1GC.

Python
Topic — etl
DAG / stages drills

Practice →

Python
Topic — data-manipulation
Spark execution patterns

Practice →


4. Shuffle internals — sort-based shuffle, spill, network I/O

shuffle in spark — the operation that dominates almost every Spark job's cost

shuffle in spark is the operation Spark uses to redistribute data across executors based on a key. Every wide transformation (groupBy, join, distinct, orderBy) shuffles. The shuffle costs network I/O + disk I/O + serialisation + sort; on most clusters, shuffles are the #1 cost contributor by a large margin.

Sort-based shuffle — the default since Spark 1.2.

  • Step 1 — map-side write — each task on the source side writes its output partitioned by the target key into one file per downstream partition; sorted within each.
  • Step 2 — reduce-side read — each downstream task pulls its assigned partition from every map task.
  • Step 3 — merge-sort — downstream task merge-sorts the streams from each map task.
  • External sort — if data exceeds memory, spills sorted chunks to disk and merges them.

Where shuffle data lives.

  • Local disk on each executorspark.local.dir config (defaults to /tmp — change for production!).
  • External shuffle service — separate JVM that serves shuffle data even after the producing executor dies; required for dynamic allocation.
  • Disk-IO bound — shuffles are sequential reads/writes; SSDs help.

Shuffle spill — when memory isn't enough.

  • Thresholdspark.shuffle.spill.numElementsForceSpillThreshold or memory pressure triggers spill.
  • Symptom in Spark UI — non-zero "Spill (Memory)" / "Spill (Disk)" columns.
  • Cost — every spilled chunk has to be re-read and merge-sorted; can multiply stage cost by 2-5×.
  • Fix — increase spark.executor.memory, increase spark.sql.shuffle.partitions to make partitions smaller, or remove the shuffle entirely (broadcast join).

Skew — the long-tail killer.

  • Symptom — one task in a shuffle stage takes 10-100× longer than peers; stage's tail "drags".
  • Cause — one key has 90% of the rows; one partition is huge.
  • Diagnosis — Spark UI's Tasks tab; "Shuffle Read Size" hugely uneven across tasks.
  • AQE fix (Spark 3.0+)spark.sql.adaptive.skewJoin.enabled=true auto-splits skewed partitions.
  • Manual fix — salt the key: salted_key = concat(key, "_", rand_int); join on salted; aggregate post-join.
  • Broadcast fix — broadcast the small side if applicable.

Reducing shuffle volume.

  • Filter earlydf.filter(...) before join / groupBy reduces the bytes shuffled.
  • Project narrowdf.select("k", "a") before shuffle cuts row width.
  • Pre-aggregategroupBy → join instead of join → groupBy when semantics allow.
  • Bucketed tables — if you join the same way repeatedly, bucket the storage by the join key (no shuffle on re-read).

spark.sql.shuffle.partitions — the key tuning knob.

  • Default — 200; almost always wrong.
  • Rule of thumb2-4 × total_cores of the cluster (e.g. 10 executors × 8 cores = 80 cores → 200-400 partitions).
  • AQE coalescespark.sql.adaptive.coalescePartitions.enabled=true dynamically reduces partitions for small shuffles.
  • Too few partitions — under-parallelism; some tasks too big.
  • Too many partitions — scheduling overhead dominates; tiny tasks.

Shuffle compression and serialisation.

  • spark.shuffle.compress=true — default; compresses shuffle files.
  • spark.io.compression.codec=lz4 — fastest codec; snappy and zstd are alternatives.
  • spark.serializer=org.apache.spark.serializer.KryoSerializer — faster than the default Java serialiser; recommended for production.

Python
Topic — etl
Shuffle internals drills

Practice →

Python
Topic — data-analysis
Skew handling patterns

Practice →


5. Caching and persistence — storage levels and checkpoint

spark caching — reuse intermediate DataFrames across multiple actions

spark caching is the #1 production tuning move. The pattern: a DataFrame used by N actions runs the DAG N times unless cached. With df.cache(), the first action materialises it; subsequent actions read from RAM (or disk fallback).

cache() vs persist(level) — the two entry points.

  • df.cache() — shorthand for df.persist(MEMORY_AND_DISK).
  • df.persist(StorageLevel.X) — explicit storage level (six choices, see below).
  • Both are lazy — DataFrame is marked for caching, but doesn't actually cache until an action fires.
  • Pair with df.count() to force materialisation: df.cache().count().

Storage levels.

  • MEMORY_ONLY — RAM only; spills are recomputed (not retained); fastest; OOM-prone.
  • MEMORY_AND_DISK — RAM with disk fallback; the default for cache().
  • DISK_ONLY — disk only; slowest in-cache; most durable; cheap memory.
  • MEMORY_ONLY_SER / MEMORY_AND_DISK_SER — serialised storage; ~2× smaller; more CPU on access.
  • OFF_HEAP — Tungsten off-heap memory; bypasses JVM GC; advanced.
  • Replicated variants (_2 suffix) — MEMORY_AND_DISK_2 keeps two copies on different executors for fault tolerance.

When to cache.

  • DataFrame used by ≥ 2 actions — caching saves the second computation.
  • Iterative ML workloads — training loops reuse the same data N times.
  • Branching DAGs — multiple downstream pipelines from one source.
  • Spark-shell exploration — interactive sessions benefit massively.

When NOT to cache.

  • One-shot DAGs — only one action ever fires; caching is wasted work.
  • Tight memory — caching evicts other tenants; can slow other jobs.
  • Very large DataFramesMEMORY_ONLY OOMs; MEMORY_AND_DISK may not fit on local disk.

unpersist() — explicit release.

  • df.unpersist() — drops cached blocks; frees memory.
  • df.unpersist(blocking=True) — waits for removal to complete.
  • Good practice — call when the DataFrame is no longer needed; explicit > LRU eviction.

Checkpoint — break long DAG lineage.

  • spark.sparkContext.setCheckpointDir("/tmp/checkpoints") — set the directory first.
  • df.checkpoint() — materialises the DataFrame to disk and breaks the lineage; downstream operations re-read from the checkpoint, not the original DAG.
  • Use case — very long DAGs where re-computing on failure would be expensive; iterative algorithms with long histories.
  • Cost — disk I/O to write the checkpoint; future reads don't re-execute the DAG above the checkpoint.

cache vs checkpoint — when to pick which.

  • cache — keeps lineage; on cache eviction, recomputes from the DAG.
  • checkpoint — drops lineage; on read, just reads the checkpoint file; no recomputation possible.
  • cache for short-DAG reuse; checkpoint for long-DAG reliability.

Inspecting cached data.

  • Spark UI Storage tab — shows all cached RDDs / DataFrames with size and storage level.
  • df.is_cached (DataFrame attribute) — True / False.
  • spark.catalog.clearCache() — drops all cached data.

Python
Topic — etl
Cache / persist drills

Practice →

Python
Topic — data-manipulation
Spark reuse patterns

Practice →


6. Catalyst optimizer, Tungsten engine, and Adaptive Query Execution

Diagram of the Catalyst + Tungsten + AQE pipeline — three labelled boxes connected by arrows (Catalyst optimizer turns the logical plan into a physical plan; Tungsten generates code and runs columnar execution; AQE re-plans at runtime based on shuffle statistics), with annotations under each, on a light PipeCode card.

catalyst optimizer + tungsten + aqe — the three engines making Spark fast

Three Spark-internal engines handle optimization: Catalyst rewrites your query plan; Tungsten code-generates the physical execution; Adaptive Query Execution (AQE) re-plans at runtime based on observed statistics. Senior apache spark interview questions probe all three.

Catalyst optimizer — query plan rewriting.

  • Four-phase pipeline — Parsed → Analyzed → Optimized → Physical Plan.
  • Optimizations — predicate pushdown, column pruning, constant folding, filter reordering, join reordering, broadcast detection, sub-expression elimination.
  • Rule-based + cost-based — modern Catalyst uses both; CBO requires spark.sql.cbo.enabled=true and table stats.
  • Source pushdown — for Parquet / ORC / JDBC sources, pushes filters and projections all the way into the file format.

Common Catalyst rewrites.

  • Predicate pushdowndf.filter(...).select(...) rewrites to read only filtered rows from source.
  • Column pruningdf.select("a", "b") reads only columns a and b from Parquet.
  • Constant foldingcol("x") + 0 simplifies to col("x").
  • Boolean simplificationWHERE x > 5 AND x > 10 simplifies to WHERE x > 10.
  • Subquery flattening — IN subqueries become semi-joins.

Tungsten — code generation and columnar execution.

  • Whole-stage code generation — multiple operators fuse into one JVM function with no virtual calls.
  • Vectorized Parquet reader — reads batches of rows columnarly.
  • Off-heap memory — escapes JVM GC for stable performance.
  • Cache-friendly layout — columnar batches fit in L2/L3 cache.
  • Result — 5-10× speedup over the pre-Tungsten row-by-row engine.

AQE — Adaptive Query Execution (Spark 3.0+).

  • spark.sql.adaptive.enabled=true — enable AQE; should be the default in production.
  • Three runtime optimizations:
    • Coalesce shuffle partitions — merges small post-shuffle partitions; controlled by spark.sql.adaptive.coalescePartitions.enabled.
    • Switch join strategy — promotes sort-merge to broadcast at runtime if one side turns out small.
    • Skew join handling — splits skewed partitions across multiple tasks; controlled by spark.sql.adaptive.skewJoin.enabled.

AQE in the Spark UI.

  • Logical plan vs final plan — UI shows both; the final plan reflects AQE rewrites.
  • "AQE optimized" badges — indicate runtime rewrites.
  • Final plan stage count — often fewer than the initial plan due to coalesce / broadcast promotion.

CBO — Cost-Based Optimizer.

  • spark.sql.cbo.enabled=true — enable CBO.
  • ANALYZE TABLE t COMPUTE STATISTICS — populate table stats.
  • ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS … — column-level stats.
  • CBO uses stats for join reordering, filter selectivity estimation, broadcast threshold decisions.

Spark SQL vs DataFrame API — same optimizer.

  • Both compile to the same Catalyst plan — no performance difference.
  • Pick the surface that's clearer — SQL for set-based queries; DataFrame API for programmatic transforms.
  • Mix freelydf.createOrReplaceTempView("t"); spark.sql("…").createOrReplaceTempView("u").

Examining the plan with df.explain().

df.explain("formatted")     # readable
df.explain("cost")          # adds row count estimates
df.explain(extended=True)   # all four phases
Enter fullscreen mode Exit fullscreen mode
  • Read the physical plan first — what actually runs.
  • Verify — broadcast hash join vs sort-merge join, partition count, predicate pushdown.

Python
Topic — etl
Catalyst / AQE drills

Practice →

Python
Topic — data-analysis
Spark optimization patterns

Practice →


7. Spark tuning — executor sizing, configs, and the seven gotchas

Seven tuning gotchas every senior Spark loop tests

Catalyst and AQE do a lot automatically, but seven gotchas separate the senior from the junior on operational questions.

Executor sizing — the foundation.

  • spark.executor.cores=4-5 — sweet spot; more = JVM context-switch overhead.
  • spark.executor.memorycores × 4-6GB typical; tune up for join-heavy or cache-heavy workloads.
  • spark.executor.memoryOverhead — ~10% of executor memory; reserved for JVM overhead, Python interpreter, etc.
  • spark.executor.instances — total executor count; depends on cluster size and other tenants.
  • Driver — start at 4GB; bump for metadata-heavy or collect()-heavy jobs.

spark.sql.shuffle.partitions — the most-tuned knob.

  • Default — 200; almost always wrong.
  • Rule of thumb2-4 × total_cores.
  • Symptom of too few — long-running stages with few tasks; under-parallelism.
  • Symptom of too many — scheduling overhead; many tiny tasks; OOM on small partitions.
  • AQE coalesce — reduces this need; modern clusters can start at 400-800 and let AQE optimize.

Gotcha 1 — Driver OOM from collect().

  • The bugdf.collect() pulls every row to the driver; OOMs the driver on > driver heap.
  • Fixdf.show(20), df.take(n), or write to storage and read back.

Gotcha 2 — Skewed stages causing one-task stragglers.

  • The bug — one key has 90% of rows; one task processes them all while others sit idle.
  • Spark UI sign — one task with 10-100× more rows than peers.
  • Fix — AQE skew join, salted keys, or broadcast the smaller side.

Gotcha 3 — Too many small output files.

  • The bugdf.write.parquet(out) with 1000 partitions produces 1000 tiny files; HDFS / S3 suffer.
  • Fixdf.coalesce(20).write.parquet(out) for ~128MB-1GB per file.
  • AQE — auto-coalesces small post-shuffle partitions if enabled.

Gotcha 4 — GC pauses dominating stage time.

  • The bug — long stop-the-world GC pauses inflate task duration; Spark UI shows >10% time in GC.
  • Fix — switch to G1GC (spark.executor.extraJavaOptions=-XX:+UseG1GC), increase executor memory, or reduce partition size to lower heap pressure.

Gotcha 5 — Broadcast threshold too low (or too high).

  • The bug at low — joins that should broadcast use sort-merge instead; slow.
  • The bug at high — Spark tries to broadcast a too-large table; driver OOM during broadcast collection.
  • Fixspark.sql.autoBroadcastJoinThreshold=104857600 (100MB) is a safe production default; tune based on executor memory.

Gotcha 6 — Python UDFs slowing PySpark jobs.

  • The bug — Python UDFs serialise every row JVM → Python → JVM; 10-100× slower than native.
  • Fix — prefer native pyspark.sql.functions; use pandas_udf (vectorised) when truly needed.

Gotcha 7 — Caching the wrong DataFrame.

  • The bug — caching a DF that's only used once wastes memory; not caching one used 5× re-runs the DAG 5×.
  • Fix — check the DAG; cache the branching point in a multi-action workflow.

spark-submit template — production starting point.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 5 \
  --executor-memory 28g \
  --driver-memory 8g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  --conf spark.sql.adaptive.skewJoin.enabled=true \
  --conf spark.sql.autoBroadcastJoinThreshold=104857600 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  job.py
Enter fullscreen mode Exit fullscreen mode

Python
Topic — etl
Spark tuning drills

Practice →

Python
Topic — real-time-analytics
Real-time analytics patterns

Practice →


Choosing the right Spark pattern (cheat sheet)

A one-screen cheat sheet for the most-asked apache spark interview questions patterns.

You want to … Spark pattern Notes
Build a Spark session SparkSession.builder.appName("…").getOrCreate() One per app
Read structured data spark.read.parquet(path) Columnar; lazy
Filter + project df.filter(col("x") > 100).select("a", "b") Catalyst pushes down
Aggregate by key df.groupBy("k").agg(sum("v").alias("total")) Triggers shuffle
Join df.join(other, on="k", how="inner") Sort-merge default
Broadcast small side df.join(broadcast(small), ...) Skip shuffle
Cache reused DF df.cache().count() Pair with action
Force N partitions df.repartition(200) Full shuffle
Reduce partitions df.coalesce(20) No shuffle; only down
Tune shuffle partitions spark.sql.shuffle.partitions = 2-4× cores Default 200
Enable AQE spark.sql.adaptive.enabled=true Spark 3.0+
Handle skew spark.sql.adaptive.skewJoin.enabled=true AQE auto-split
Inspect plan df.explain("formatted") Physical plan
Write Parquet df.write.mode("overwrite").parquet(path) Always pick mode
Avoid driver OOM df.show(20) / df.take(n) instead of collect() Driver safety
Submit to cluster spark-submit --master yarn --deploy-mode cluster … Production default
Sizing executors 4-5 cores × 4-6GB/core + 10% overhead
Checkpoint long DAG df.checkpoint() Breaks lineage

Frequently asked questions

What is Apache Spark architecture?

Apache Spark uses a master-worker architecture. The driver runs your program, hosts the SparkContext, builds the DAG of operations, and schedules tasks. Executors are JVM processes running on cluster nodes; each executor has multiple cores and runs tasks (one per partition per core). The cluster manager (YARN, Kubernetes, standalone, or Mesos) allocates resources between the driver and executors. The execution hierarchy is job → stage → task: every action triggers a job; the job is split into stages bounded by shuffles; each stage runs N parallel tasks (one per data partition). Production deployments use --deploy-mode cluster so the driver runs inside the cluster (worker node) rather than on the submitting machine; this is the answer to the spark client mode vs cluster mode interview question.

What's the difference between Spark and Hadoop MapReduce?

spark vs hadoop is one of the most-asked apache spark interview questions. Spark caches intermediate results in RAM (with disk fallback); Hadoop MapReduce writes every stage to disk. For iterative workloads (ML training, graph algorithms) Spark is 10-100× faster because the data stays in memory across iterations. Spark also provides a unified API for batch (DataFrame), streaming (Structured Streaming), ML (MLlib), and graph (GraphX) — all in one engine — while Hadoop's ecosystem is more fragmented. Both run on YARN, both can read from HDFS / S3 / GCS, and modern data lakes commonly use Hadoop's storage layer with Spark's compute layer. The senior interview answer: Spark won the in-memory / iterative / unified-API battle; Hadoop's HDFS is still common storage; MapReduce as a compute engine is largely deprecated in modern stacks.

How does shuffle work in Spark?

shuffle in spark is the operation that redistributes data across executors by a key. Every wide transformation (groupBy, join, distinct, orderBy, repartition) shuffles. Step-by-step: (1) each source task partitions its output by the target key into one file per downstream partition, sorted within each; (2) each downstream task pulls its assigned partition from every source task; (3) the downstream task merge-sorts the streams. If memory runs out, external sort spills sorted chunks to local disk and merges them. Shuffles cost network I/O + disk I/O + serialisation + sort; on most clusters, shuffles dominate job runtime. Reduce shuffle cost by filtering and projecting before the shuffle, broadcasting the small side of joins, pre-aggregating where semantics allow, bucketing storage by frequent join keys, and tuning spark.sql.shuffle.partitions to roughly 2-4 × total cores.

What is the Catalyst optimizer in Spark?

The Catalyst optimizer is Spark SQL's query planner. It transforms a query through four phases: Parsed (raw AST from your code), Analyzed (column / type resolution), Optimized (rule-based and cost-based rewrites), and Physical Plan (operator selection — BroadcastHashJoin vs SortMergeJoin, HashAggregate, etc.). Key rewrites include predicate pushdown (push filters into the source format like Parquet), column pruning (read only the columns you select), constant folding, filter reordering, join reordering, and broadcast detection for small tables. Catalyst applies to both the DataFrame API and Spark SQL strings — they compile to the same plan. Adaptive Query Execution (AQE) in Spark 3.0+ extends Catalyst with runtime optimizations based on observed shuffle statistics: coalesce small post-shuffle partitions, switch sort-merge to broadcast when one side proves small, and split skewed partitions across multiple tasks. Inspect the plan with df.explain("formatted") to verify the rewrites you expected fired.

How do I tune a slow Spark job?

Start with df.explain("formatted") and the Spark UI. The five-step diagnosis: (1) Which stage is slow? — check the Stages tab; long-tail tasks indicate skew, lots of "Shuffle Read" indicates over-shuffling. (2) Are partitions sized right? — aim for 128MB-1GB per partition; tune spark.sql.shuffle.partitions to 2-4 × total cores. (3) Can a join be broadcast? — if one side is < 100MB, use broadcast(small_df) or bump spark.sql.autoBroadcastJoinThreshold. (4) Is there skew? — enable AQE skew join (spark.sql.adaptive.skewJoin.enabled=true) in Spark 3.0+ or manually salt the join key. (5) Are intermediate DataFrames cached? — if used by ≥ 2 actions, df.cache().count() saves the recomputation. Other quick wins: enable AQE (spark.sql.adaptive.enabled=true), filter and project before shuffles, use pandas_udf instead of scalar Python UDFs, and switch to KryoSerializer for production. The senior apache spark interview questions and answers for experienced answer always references the Spark UI before guessing.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including Python practice keyed to Apache Spark architecture, DataFrame operations, distributed-compute thinking, and the optimization patterns every senior Spark loop tests. Whether you're drilling spark interview questions for freshers or grinding apache spark interview questions and answers for experienced, the practice library mirrors the same five-pillar mental model this guide teaches.

Kick off via Explore practice →; drill the Python practice lane →; fan out into the ETL lane →; rehearse data-manipulation patterns →; reinforce data-analysis drills →; widen coverage on the full streaming Python library →.

Top comments (0)