Gowtham Potureddi

Posted on May 24

PySpark Interview Questions: Top DataFrame, RDD & Optimization Patterns

#python #sql #dataengineering #interview

pyspark interview questions circle around six themes every DE loop tests: the DataFrame API vs the lower-level RDD API, lazy evaluation and the directed-acyclic-graph (DAG) execution model, transformations (lazy, return a new DataFrame) vs actions (eager, trigger computation), joins and shuffle (the most expensive operation), broadcast joins for small-table side joins, caching and persist() for repeated access, and partitioning for parallelism. Whether you're prepping for pyspark interview questions for data engineer at a startup or pyspark interview questions and answers for experienced at FAANG, the same six themes show up — and the senior loops add the Catalyst optimizer, adaptive query execution (AQE), and skew handling on top.

This guide walks through every theme in the pyspark interview questions and answers pdf ecosystem that reviewers love to test in data engineering interview questions: the DataFrame vs RDD comparison, pyspark dataframe transforms (select, filter, groupBy, agg, join, withColumn), the lazy DAG and Spark's execution model, shuffle in spark and why it dominates job cost, broadcast join in pyspark for skewed and small-table joins, cache vs persist pyspark, pyspark repartition vs coalesce, the Catalyst optimizer, and the seven gotchas (skew, OOM, wide vs narrow transformations, lazy-eval surprises, UDF overhead) that fail most candidates. Every section ends as pyspark interview questions with answers: a runnable PySpark snippet, a traced execution, a sample output, and a concept-by-concept why this works breakdown.

When you want hands-on reps immediately after reading, browse Python practice library →, drill ETL Python drills →, sharpen data-manipulation patterns →, rehearse streaming Python drills →, or widen coverage on the full data-analysis library →.

On this page

Why PySpark interview questions test the same six themes
DataFrame API vs RDD API — pick the high-level abstraction
Lazy evaluation, the DAG, and transformations vs actions
Joins and the shuffle — the most expensive operation in Spark
Broadcast joins — when one side fits in memory
Caching, persist, repartition and coalesce
Catalyst, AQE, and PySpark optimization gotchas
Choosing the right PySpark pattern (cheat sheet)
Frequently asked questions
Practice on PipeCode

1. Why PySpark interview questions test the same six themes

The six-theme map covers 90% of PySpark interview prompts

The one-sentence invariant: pyspark interview questions recycle the same six themes (DataFrame vs RDD, lazy evaluation, transformations vs actions, joins / shuffle, broadcast joins, caching / partitioning) — pyspark interview questions for freshers cuts the depth on each, while pyspark interview questions and answers for experienced adds Catalyst, AQE, and skew handling on top. Once you know the six themes, every prompt becomes "which theme is the reviewer probing?"

The six themes at a glance.

DataFrame API vs RDD API — DataFrame is the high-level structured API (use it 95% of the time); RDD is the low-level interface (rare in modern code).
Lazy evaluation and the DAG — every transformation builds a logical plan; nothing runs until an action fires.
Transformations vs actions — transformations are lazy (select, filter, groupBy); actions are eager (show, collect, count, write).
Joins and shuffle — join reshuffles data across the cluster; the most expensive operation in any Spark job.
Broadcast joins — when one side fits in memory, broadcast it to every executor and skip the shuffle.
Caching and partitioning — cache() / persist() to reuse a DataFrame; repartition() / coalesce() to control parallelism.

Why interviewers love PySpark.

It maps to real DE production work — most modern data lakes / lakehouses run on Spark.
It tests distributed-systems thinking — partitioning, shuffles, skew, OOM are all distributed-systems concerns.
The DataFrame API mirrors SQL — engineers fluent in SQL can pick up PySpark quickly; interviews probe the gap.
Optimization is genuinely subtle — broadcast joins, caching, partitioning all have trade-offs.

What interviewers listen for.

Do you default to the DataFrame API instead of reaching for RDDs? — basic-but-tested.
Do you name shuffles as the dominant cost when discussing join performance? — senior signal.
Do you reach for broadcast() when one side of a join is small? — optimization signal.
Do you mention AQE (spark.sql.adaptive.enabled=true) when asked about modern Spark tuning? — bonus points (Spark 3.0+).

pyspark vs pandas — when does Spark win?

Data size > single-node memory — Spark scales horizontally; pandas is single-process.
Distributed compute available — Spark uses every core of every executor in the cluster.
Production batch / streaming pipelines — Spark's fault-tolerance and recovery beat pandas.
Ad-hoc analysis on small data (<10 GB) — pandas is simpler and often faster.

Worked example — a PySpark DataFrame query that hits all six themes

Detailed explanation. Real PySpark scripts combine multiple themes. The snippet below reads a Parquet, filters and aggregates with the DataFrame API (theme 1), chains lazy transformations (theme 3), broadcasts a lookup table for a small-side join (theme 5), caches a reused intermediate (theme 6), and triggers the DAG with .show() (theme 3).

Question. From orders (large Parquet) and customers (small Parquet), compute total paid revenue per region with the customer's country name, broadcasting the small customers table.

Code (PySpark).

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col, sum as _sum

spark = SparkSession.builder.appName("revenue_by_region").getOrCreate()

orders    = spark.read.parquet("s3://my-bucket/orders/")
customers = spark.read.parquet("s3://my-bucket/customers/")

paid = (
    orders
    .filter(col("status") == "paid")
    .cache()                                          # reused later
)

result = (
    paid
    .join(broadcast(customers), on="customer_id", how="inner")
    .groupBy("region", "country")
    .agg(_sum("amount").alias("total_revenue"))
    .orderBy(col("total_revenue").desc())
)

result.show(20, truncate=False)
result.write.mode("overwrite").parquet("s3://my-bucket/output/revenue/")

Step-by-step explanation.

spark.read.parquet(...) returns a lazy DataFrame — no data is read yet.
.filter(...) and .cache() add to the DAG; still lazy.
.join(broadcast(customers), ...) tells Spark to broadcast the small side — every executor gets a full copy of customers, no shuffle.
.groupBy().agg() is a wide transformation — would normally shuffle; but the broadcast join means only the post-aggregate exchange happens.
.show(20) is an action — the DAG executes; rows materialise.
.write.parquet(...) is also an action — runs the DAG again unless paid is cached (it is).

Output. Top regions × countries by total paid revenue.

region	country	total_revenue
US	United States	12500.00
EU	Germany	4200.00
EU	France	4150.00
APAC	Japan	3100.00

Rule of thumb: always prefer DataFrame API; broadcast the small side of small/big joins; cache reused intermediates; verify with .explain() that the planner picked the right strategy.

Python
Topic — etl
PySpark ETL drills

Practice →

Python
Topic — data-analysis
DataFrame Python practice

Practice →

2. DataFrame API vs RDD API — pick the high-level abstraction

`pyspark dataframe vs rdd` — default to DataFrame; reach for RDD only for control

The first pyspark interview questions and answers prompt every loop asks: when do you use DataFrame vs RDD? The senior answer in one sentence: always default to DataFrame for structured data — it goes through the Catalyst optimizer; reach for RDD only when you need fine-grained control over partitioning or when the data is unstructured (binary, log lines without a schema).

DataFrame — the high-level structured API.

What it is — a distributed table with named columns and types; conceptually a SQL table.
Lazy — transformations build a query plan; actions trigger execution.
Optimized — every query goes through the Catalyst logical/physical optimizer.
Tungsten execution — code-generated, off-heap, columnar; far faster than naive row-by-row.
SQL-friendly — df.createOrReplaceTempView("t"); spark.sql("SELECT ... FROM t").
The right default for 95% of PySpark work.

RDD — the low-level resilient distributed dataset.

What it is — a distributed collection of untyped Python objects.
Lazy — transformations build a DAG; actions trigger execution.
NOT optimized — the Catalyst optimizer doesn't apply; you control everything.
Manual partitioning — rdd.partitionBy(num_partitions, hash_fn); explicit control.
Unstructured-data friendly — raw text, binary, custom serialisation.
Rare in modern code — most PySpark 3+ codebases avoid RDDs except for low-level operations.

Dataset — the JVM-typed middle ground (Scala / Java only).

Type-safe DataFrame — case-class-backed in Scala / Java.
Not available in Python — PySpark only has DataFrame and RDD.
Bridge between the two — has the optimizer benefits of DataFrame and the type safety of RDD.

pyspark dataframe vs rdd — the decision table.

Concern	DataFrame	RDD
Structured tables	✓	manual schema
Optimizer (Catalyst)	✓	✗
Tungsten exec engine	✓	✗
Spark SQL integration	✓	manual conversion
Type safety (Python)	dynamic	dynamic
Custom partitioning	limited	full control
Raw / binary data	awkward	natural fit
Performance default	fast	usually slower
Use in modern code	95%+	< 5%

Common DataFrame transformations.

df.select("a", "b") — project columns; equivalent to SQL SELECT a, b.
df.filter(col("amount") > 100) — row filter; SQL WHERE.
df.withColumn("rev_eur", col("amount") * 0.92) — add a derived column.
df.groupBy("region").agg(sum("amount").alias("total")) — aggregate; SQL GROUP BY.
df.join(other, on="key", how="inner") — JOIN.
df.orderBy(col("total").desc()) — ORDER BY.
df.distinct() / df.dropDuplicates(["col"]) — dedupe.
df.union(other) — UNION ALL; df.unionByName(other) — same but match by column name.

Spark SQL — write SQL strings directly.

df.createOrReplaceTempView("orders")
result = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM   orders
    WHERE  status = 'paid'
    GROUP BY region
    ORDER BY total DESC
""")

Same optimizer — spark.sql(...) and the DataFrame API both compile to the same Catalyst plan.
Pick what's clearer — SQL for set-based queries; DataFrame API for programmatic transforms.
Temporary views — session-scoped; createOrReplaceGlobalTempView for cross-session.

Python
Topic — data-manipulation
DataFrame transform drills

Practice →

Python
Topic — etl
PySpark ETL patterns

Practice →

3. Lazy evaluation, the DAG, and transformations vs actions

`lazy evaluation in spark` — nothing runs until an action fires

The #2 most-asked pyspark interview questions topic. The senior answer: Spark builds a logical DAG (directed acyclic graph) from every transformation; the DAG only executes when an action fires (show, collect, count, write). This lets the Catalyst optimizer see the whole plan and rewrite it — push filters down, fuse operations, pick the best join strategy.

Transformations — lazy.

Return a new DataFrame without computing anything.
Build the DAG node by node as you chain calls.
Examples — select, filter, withColumn, groupBy, join, union, repartition, cache.
Narrow vs wide (next section) — determines whether a shuffle is needed.

Actions — eager.

Trigger DAG execution and materialise a result.
Examples — show, collect, count, take, first, head, write, foreach.
collect() warning — pulls every row to the driver; OOMs the driver on large data; use take(n) for sampling.

pyspark transformations vs actions — the decision rule.

If it returns a DataFrame → transformation (lazy).
If it returns a value (int, list, None) or writes to storage → action (eager).
Chain transformations freely — they're cheap; the DAG builds in memory.
Be careful with actions — each one re-runs the DAG from the source unless the DataFrame is cached.

Narrow vs wide transformations.

Narrow — each output partition depends on one input partition; no shuffle. Examples: map, filter, select, withColumn.
Wide — each output partition depends on multiple input partitions; requires a shuffle. Examples: groupBy, join, distinct, orderBy, repartition.
Shuffle is expensive — data crosses network boundaries; the dominant cost in most Spark jobs.

The DAG and Spark UI.

df.explain(extended=True) — prints the parsed logical plan, analysed plan, optimized plan, and physical plan.
df.explain("formatted") — newer, more readable format (Spark 3.0+).
Spark UI's SQL / DAG tab — visual graph of stages, tasks, shuffles, and timings.
Stages and tasks — a stage is bounded by shuffles; a task is one partition processed by one executor.

spark.read.parquet(...) is also lazy.

Reads schema metadata at definition time (so subsequent transformations type-check).
Doesn't read data until an action fires.
Column pruning and predicate pushdown are applied automatically based on subsequent transformations.

Common lazy-evaluation traps.

Side effects in transformations don't fire — print() inside a map only runs when the action triggers the DAG.
Caching without an action doesn't materialise — df.cache() marks the DataFrame for caching but doesn't actually cache until an action runs.
Re-running the same DataFrame re-runs the DAG — unless cached.
Variable mutation inside a transformation closure — Spark serialises the closure; mutations in the driver don't propagate to executors.

Forcing materialisation.

df.cache().count() — caches the DataFrame and immediately forces it to materialise.
df.write.mode("overwrite").parquet("/tmp/checkpoint") — writes to disk; downstream reads can re-read instead of re-computing.
df.checkpoint() — Spark-managed checkpoint that breaks the DAG lineage (for very long DAGs that get expensive to recompute).

Python
Topic — etl
Lazy DAG drills

Practice →

Python
Topic — data-analysis
PySpark execution patterns

Practice →

4. Joins and the shuffle — the most expensive operation in Spark

`shuffle in spark` — every join (without broadcast) means data crosses the network

shuffle in spark is the single biggest cost in most jobs and the #3 most-tested pyspark interview questions and answers topic. The senior signal: knowing exactly when a shuffle fires, what it costs, and how to avoid one.

What is a shuffle?

Spark moves data between executors based on key-based partitioning.
For a join(other, on="key"), both sides need the same "key" partitioned the same way; the engine repartitions both sides.
For a groupBy("region"), rows with the same region need to land on the same executor; data is shuffled by region.
Cost — network I/O + disk I/O (intermediate files) + serialisation; can dominate job runtime.

The default sort-merge join (SMJ).

Spark's default join strategy for non-broadcast joins.
Step 1 — partition both sides by the join key (shuffle).
Step 2 — sort each partition by the join key.
Step 3 — merge-join the sorted streams.
Cost — O((n + m) log p) where p is partition size; sort is the dominant cost after the shuffle.

Other join strategies.

Broadcast hash join (BHJ) — broadcast the small side; no shuffle on either side (see §5).
Shuffle hash join (SHJ) — like SMJ but builds a hash map of the smaller post-shuffle partition; faster when partitions fit in memory.
Cartesian product — no join key; every row of A pairs with every row of B; usually a sign of a bug.

spark join interview question — the canonical decision tree.

One side is small (< spark.sql.autoBroadcastJoinThreshold, default 10MB) → broadcast hash join (auto-picked by Catalyst).
Both sides are large, but one is bucketed by the join key → bucket-aware join (no shuffle).
Both sides are large and not bucketed → sort-merge join with shuffle.
Skewed join keys → AQE skew join optimization (Spark 3.0+).

Skew — the silent job-killer.

Skewed keys — one or a few key values appear in 90% of rows.
Symptom — one executor takes 10× longer than the others; the job is "almost done" forever.
Diagnosis — Spark UI shows one task with 10-100× more rows than peers.
Fixes:
- AQE skew join — spark.sql.adaptive.skewJoin.enabled=true (Spark 3.0+); automatic split of skewed partitions.
- Salted keys — add a random suffix to the skewed key, join on the salted key, then aggregate; manual but effective.
- Broadcast the smaller skewed table — if it fits.

Reducing shuffle volume.

Filter early — df.filter(...) before join reduces the data being shuffled.
Project only needed columns — df.select("k", "a") before join cuts row width.
Pre-aggregate — groupBy before join if it makes semantic sense.
Bucket both sides — if you join the same way repeatedly, bucket the storage layer.

Spark UI: reading the shuffle.

Stages with shuffle — visible as boundaries in the DAG; "Shuffle Write" and "Shuffle Read" columns.
Shuffle Read Size — total bytes read across all tasks of the stage; the dominant cost metric.
Spill to disk — when partitions exceed memory; visible in "Spill (Memory)" / "Spill (Disk)" columns; a sign of OOM pressure.

Python
Topic — etl
Shuffle / join drills

Practice →

Python
Topic — data-manipulation
PySpark join optimization

Practice →

5. Broadcast joins — when one side fits in memory

`broadcast join in pyspark` — eliminate the shuffle on small-side joins

When one side of a join is small enough to fit in each executor's memory, broadcasting it eliminates the shuffle entirely. The driver sends the small DataFrame to every executor; each executor joins its local partition of the large side against the in-memory broadcast copy. The result is a massive speedup — sometimes 10-100× — for the common "huge fact joined with small dimension" pattern.

broadcast() — the explicit hint.

from pyspark.sql.functions import broadcast

result = orders.join(broadcast(customers), on="customer_id", how="inner")

broadcast(df) — hint to the planner: "this side is small; broadcast it".
Catalyst still verifies — won't broadcast if the table is too large.
No shuffle on either side — every executor has the full small table locally.

Auto-broadcast threshold.

spark.sql.autoBroadcastJoinThreshold — default 10MB (10485760).
Catalyst auto-broadcasts any side smaller than this threshold without the explicit hint.
Tune up — increase to 100MB if executors have enough memory and you have many small lookup tables.
Disable — set to -1 to turn off auto-broadcast entirely; rarely needed.

Broadcast a Python dict — small-lookup pattern.

lookup = {1: "US", 2: "EU", 3: "APAC"}
bc_lookup = spark.sparkContext.broadcast(lookup)

def map_region(rid):
    return bc_lookup.value.get(rid, "unknown")

map_region_udf = udf(map_region, StringType())
result = orders.withColumn("region", map_region_udf(col("region_id")))

spark.sparkContext.broadcast(obj) — broadcasts any serializable Python object.
Access on executors — bc.value reads the local copy.
UDF overhead — Python UDFs are slow (Python ↔ JVM serialisation); prefer native functions or pandas_udf when possible.

When broadcast is the wrong choice.

Both sides are large — broadcasting a 5GB table to 100 executors uses 500GB of network bandwidth.
Driver memory pressure — broadcast() materialises the small DF on the driver before shipping; if it's bigger than driver memory, the job crashes there first.
Memory-constrained executors — broadcast copies take memory away from the join's working set.

Bloom filters — a less-heavy alternative.

spark.sql.optimizer.runtimeFilter.bloomFilter.enabled=true (Spark 3.2+) — Catalyst injects bloom filters into the large side based on the small side's keys.
Pre-filters the large side before the shuffle; reduces shuffle size.
Works well when the small-large ratio is in the middle (too big to broadcast, too small to ignore).

Verifying the broadcast.

df.explain() — look for BroadcastHashJoin in the physical plan instead of SortMergeJoin.
Spark UI — broadcast operations show as BroadcastExchange nodes.
If broadcast didn't fire — check the size of the broadcast side (df.count() / size estimation), bump autoBroadcastJoinThreshold, or add the broadcast() hint explicitly.

broadcast join interview question — the senior answer template.

"I'd broadcast customers because it's a small lookup (< 100MB), the join key is non-unique on the large side, and the alternative sort-merge join would cost a 10-billion-row shuffle. I'd verify with .explain() that the planner picked BroadcastHashJoin."

Python
Topic — etl
Broadcast join drills

Practice →

Python
Topic — data-analysis
PySpark optimization patterns

Practice →

6. Caching, persist, repartition and coalesce

`cache vs persist pyspark` and `repartition vs coalesce` — the two pairs you must know

Two PySpark interview pairs that come up in every loop: cache vs persist pyspark (when do you store intermediate results in memory?) and pyspark repartition vs coalesce (when do you reshape the partition layout?). Knowing the difference inside each pair is a sharp senior-vs-junior signal.

cache() and persist() — store intermediate results.

df.cache() — shorthand for df.persist(StorageLevel.MEMORY_AND_DISK).
df.persist(level) — explicit storage level: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, plus serialised variants.
Lazy — doesn't actually cache until an action fires.
Use case — DataFrame reused across multiple actions; without caching, each action re-runs the DAG.
Always pair with an action that materialises — df.cache().count() forces the cache to populate.

cache vs persist pyspark — the comparison.

Aspect	`cache()`	`persist(level)`
Storage level	`MEMORY_AND_DISK` only	any level
Verbosity	shorter	explicit
Memory tier	yes	configurable
Disk fallback	yes	configurable

Storage levels — pyspark.StorageLevel.

MEMORY_ONLY — RAM only; spills not retained (recomputed); fastest but OOM-prone.
MEMORY_AND_DISK — RAM with disk fallback; the safest default; what cache() uses.
DISK_ONLY — disk only; slowest but most durable.
MEMORY_ONLY_SER / MEMORY_AND_DISK_SER — serialised storage; smaller footprint, more CPU.
OFF_HEAP — Tungsten off-heap memory; advanced.

When to cache.

DataFrame used by 2+ actions — .cache() saves the second computation.
Iterative algorithms — ML training loops, graph traversals.
Complex DAGs — checkpoint to break long lineage chains.

When NOT to cache.

One-shot DAG — only one action ever runs; caching is wasted work.
Memory-constrained cluster — caching evicts other tenants from the cache pool.
Large DataFrames — MEMORY_ONLY may OOM; MEMORY_AND_DISK may not fit on local disks.

unpersist() — explicit release.

df.unpersist() — drops the cached blocks; frees memory.
Good practice — call when the DataFrame is no longer needed; reduces memory pressure on the cluster.
Spark auto-evicts under memory pressure too (LRU), but explicit is cleaner.

pyspark repartition vs coalesce — reshape partition layout.

df.repartition(N) — forces a full shuffle; redistributes data into exactly N partitions of roughly equal size.
df.repartition(N, col) — partitions by hash of col into N partitions; useful before key-based operations.
df.coalesce(N) — merges existing partitions down to N without a shuffle; only valid when reducing partition count.
coalesce never increases partitions — calling coalesce(100) on a 20-partition DataFrame leaves it at 20.

When to repartition / coalesce.

Too few partitions for parallelism — df.repartition(200) if you have 8 nodes × 4 cores = 32 cores but only 4 partitions.
Too many tiny partitions — df.coalesce(20) before writing to reduce small-file output.
Skewed partitions — df.repartition(N, "balanced_key") to redistribute.
Pre-write tuning — coalesce to match your target file count.

Common partitioning rule of thumb.

Target partition size — 128MB-1GB; smaller wastes scheduling overhead, larger risks OOM.
Partition count — at least 2-4 × total_cores for good parallelism on shuffles.
spark.sql.shuffle.partitions — default 200; tune based on data size.
AQE — dynamically coalesces small post-shuffle partitions in Spark 3.0+.

Verifying partition count.

df.rdd.getNumPartitions() — exact count.
Spark UI — Tasks per stage = partitions in the input.
df.write.option("maxRecordsPerFile", N) — control output file size.

Python
Topic — etl
Caching / partition drills

Practice →

Python
Topic — data-manipulation
PySpark performance tuning

Practice →

7. Catalyst, AQE, and PySpark optimization gotchas

Seven optimization gotchas every PySpark interview tests

Catalyst (the query optimizer) and AQE (Adaptive Query Execution) do a lot for you automatically, but seven gotchas separate the senior engineer from the junior on optimization questions.

Catalyst — the query optimizer.

Logical plan → analysed plan → optimized plan → physical plan.
Optimizations — predicate pushdown, column pruning, constant folding, filter reordering, join reordering, broadcast detection.
Source-side pushdown — Catalyst pushes filters and column projections into the source (Parquet, JDBC); reads only what's needed.
df.explain(extended=True) — shows all four plan stages.

AQE — Adaptive Query Execution (Spark 3.0+).

spark.sql.adaptive.enabled=true — turn it on in production.
Runtime decisions — uses actual row counts (from shuffle metadata) to re-plan subsequent stages.
Three automatic optimizations:
- Coalesce shuffle partitions — merges small post-shuffle partitions.
- Skew join handling — splits skewed partitions across multiple tasks.
- Switch to broadcast — promotes a sort-merge join to a broadcast join if one side turns out to be small.

Gotcha 1 — Python UDFs are slow.

The bug — df.withColumn("x", py_udf(col("y"))) serialises every row to Python and back; 10-100× slower than native functions.
Fix — use built-in pyspark.sql.functions; fall back to pandas_udf (vectorised, ~10× faster than scalar UDFs) when truly needed.

Gotcha 2 — collect() to driver on big data.

The bug — df.collect() pulls every row to the driver; OOMs on > driver memory.
Symptom — driver crash with OutOfMemoryError.
Fix — use .show(20) for sampling, .take(n) for limited fetch, or write to storage and read back.

Gotcha 3 — re-running the same DataFrame without cache.

The bug — df.count(); df.show(); df.write.parquet(...) re-runs the DAG three times.
Fix — df.cache().count() to materialise once, then re-use cheaply.

Gotcha 4 — too many small files on write.

The bug — df.write.parquet(out) produces one file per partition; with 1000 partitions you get 1000 tiny files; downstream readers (and HDFS / S3) suffer.
Fix — df.coalesce(20).write.parquet(out) reduces to 20 files; tune to ~128MB-1GB per file.

Gotcha 5 — implicit shuffle on orderBy / distinct / groupBy.

The bug — these all trigger a full-cluster shuffle.
Symptom — slow stage with high "Shuffle Read" bytes.
Fix — pre-aggregate / filter / project before; partition by the relevant key if you'll do this repeatedly.

Gotcha 6 — broadcast join that doesn't fire.

The bug — the small side is just above autoBroadcastJoinThreshold (10MB default); Catalyst picks sort-merge join.
Symptom — surprisingly slow join; .explain() shows SortMergeJoin.
Fix — bump the threshold (spark.sql.autoBroadcastJoinThreshold=104857600) or use explicit broadcast(df).

Gotcha 7 — skewed keys causing one-task stragglers.

The bug — one key value (e.g. "US" in a region column) has 90% of rows; one executor processes them all while others sit idle.
Symptom — Spark UI shows one task with 100× more rows than peers; stage's tail is dragging.
Fix — enable AQE skew join (spark.sql.adaptive.skewJoin.enabled=true) or manually salt the join key.

Spark configuration quick wins.

spark.sql.adaptive.enabled=true — turn on AQE (Spark 3.0+).
spark.sql.adaptive.coalescePartitions.enabled=true — auto-coalesce small post-shuffle partitions.
spark.sql.adaptive.skewJoin.enabled=true — auto-handle skewed joins.
spark.sql.shuffle.partitions=400 — bump from default 200 if cluster has > 200 cores.
spark.sql.autoBroadcastJoinThreshold=104857600 — 100MB if you have many small lookup tables.

Python
Topic — etl
PySpark optimization drills

Practice →

Python
Topic — real-time-analytics
Distributed-compute patterns

Practice →

Choosing the right PySpark pattern (cheat sheet)

A one-screen cheat sheet for the most-asked pyspark interview questions patterns.

You want to …	PySpark pattern	Notes
Read structured data	`spark.read.parquet(path)`	Lazy; columnar; production default
Filter rows	`df.filter(col("x") > 100)`	Pushed down to source where possible
Project columns	`df.select("a", "b")`	Reduces row width
Add a derived column	`df.withColumn("c", col("a") * 2)`	Lazy
GROUP BY + aggregate	`df.groupBy("k").agg(sum("v").alias("total"))`	Triggers shuffle (wide)
Join	`df.join(other, on="k", how="inner")`	Sort-merge default; broadcast if small
Broadcast small side	`df.join(broadcast(small), ...)`	Eliminates shuffle
Materialise reused DF	`df.cache().count()`	Pair with action
Force partition count	`df.repartition(200)`	Triggers full shuffle
Reduce partition count	`df.coalesce(20)`	No shuffle; only reduces
Trigger DAG	`df.show()` / `df.count()` / `df.write...`	Actions
Inspect plan	`df.explain("formatted")`	Read physical plan
Enable AQE	`spark.sql.adaptive.enabled=true`	Runtime re-planning
Handle skew	`spark.sql.adaptive.skewJoin.enabled=true`	AQE automatic split
Vectorise UDFs	`@pandas_udf(returnType)`	10× faster than scalar UDF
Write Parquet	`df.write.mode("overwrite").parquet(path)`	Always pick a `mode`
Avoid collect-OOM	`df.show(20)` / `df.take(n)`	Never `collect()` on big data

Frequently asked questions

What's the difference between RDD and DataFrame in PySpark?

RDD is the low-level resilient distributed dataset — a distributed collection of untyped Python objects with no schema; you manage partitioning, serialisation, and execution manually. DataFrame is the high-level structured API — a distributed table with named, typed columns; every query goes through the Catalyst optimizer (predicate pushdown, column pruning, broadcast detection, join reordering) and executes through the Tungsten code-generated engine. The interview-canonical answer to pyspark dataframe vs rdd: default to DataFrame for 95% of PySpark work; reach for RDD only when you need fine-grained control over partitioning or when the data is unstructured (binary, raw log lines). In modern PySpark 3+ codebases, RDDs are rare — most pipelines never touch them directly.

What is lazy evaluation in PySpark and why does it matter?

Lazy evaluation in Spark means every transformation (select, filter, groupBy, join, withColumn) returns a new DataFrame without computing anything — it just adds a node to the directed-acyclic graph (DAG) of operations. The DAG only executes when an action fires (show, count, collect, write, take). This lets the Catalyst optimizer see the whole plan before any work happens — push filters down to the source, prune unread columns, fuse operations, pick the best join strategy, even switch sort-merge to broadcast at runtime via AQE. The practical implication: chain transformations freely (they're free until an action fires); be careful with actions (each one re-runs the DAG unless the DataFrame is cached); use .explain() to inspect the plan; avoid collect() on large data (pulls every row to the driver and OOMs).

What's the difference between cache() and persist() in PySpark?

df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK) — the default storage level that keeps the DataFrame in RAM with disk fallback when memory is tight. df.persist(level) is the explicit form: you choose the storage tier (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, plus serialised _SER variants and OFF_HEAP). Both are lazy — neither actually stores the DataFrame until an action fires; the canonical pattern is df.cache().count() to force materialisation. Always call df.unpersist() when the DataFrame is no longer needed to free cluster memory. The interview answer to cache vs persist pyspark: use cache() for the safe default; use persist(specific_level) only when you need a different storage tier for memory or durability reasons.

When should I use a broadcast join in PySpark?

Use broadcast(df) when one side of a join is small enough to fit in each executor's memory — typically under 100MB but tunable via spark.sql.autoBroadcastJoinThreshold (default 10MB; Catalyst auto-broadcasts below this). The driver sends the small DataFrame to every executor; each executor joins its local partition of the large side against the in-memory broadcast copy, eliminating the shuffle that a sort-merge join would otherwise require. Typical speedups are 10-100× on huge-fact × small-dimension joins. Don't broadcast when (1) both sides are large — broadcasting a multi-GB DF to many executors wastes bandwidth; (2) executors are memory-constrained — the broadcast copy competes with other data; (3) the driver lacks memory — broadcast() materialises the small DF on the driver first. Verify with df.explain() that the physical plan shows BroadcastHashJoin instead of SortMergeJoin.

What's the difference between repartition and coalesce in PySpark?

df.repartition(N) forces a full shuffle and redistributes data into exactly N partitions of roughly equal size; df.repartition(N, "col") partitions by the hash of col. It can either increase or decrease the partition count. df.coalesce(N) merges existing partitions down to N without a shuffle — it just combines local partitions; it can only decrease the partition count (calling coalesce(100) on a 20-partition DataFrame leaves it at 20). The interview rule of thumb for pyspark repartition vs coalesce: use repartition when you need more partitions (parallelism) or when you need a specific key-based distribution (for subsequent joins); use coalesce when you need fewer partitions and want to skip the shuffle (e.g. before writing to avoid producing thousands of tiny files). Modern Spark 3.0+ AQE coalesces small post-shuffle partitions automatically — turn it on with spark.sql.adaptive.coalescePartitions.enabled=true.

Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including Python practice keyed to PySpark DataFrame operations, ETL workflows, distributed-compute thinking, and the optimization patterns every senior PySpark loop tests. Whether you're drilling pyspark interview questions for freshers or grinding pyspark interview questions and answers for experienced, the practice library mirrors the same six-theme mental model this guide teaches.

Kick off via Explore practice →; drill the Python practice lane →; fan out into the ETL lane →; rehearse data-manipulation patterns →; reinforce data-analysis drills →; widen coverage on the full streaming Python library →.

DEV Community

PySpark Interview Questions: Top DataFrame, RDD & Optimization Patterns

1. Why PySpark interview questions test the same six themes

The six-theme map covers 90% of PySpark interview prompts

Worked example — a PySpark DataFrame query that hits all six themes

2. DataFrame API vs RDD API — pick the high-level abstraction

`pyspark dataframe vs rdd` — default to DataFrame; reach for RDD only for control

3. Lazy evaluation, the DAG, and transformations vs actions

`lazy evaluation in spark` — nothing runs until an action fires

4. Joins and the shuffle — the most expensive operation in Spark

`shuffle in spark` — every join (without broadcast) means data crosses the network

5. Broadcast joins — when one side fits in memory

`broadcast join in pyspark` — eliminate the shuffle on small-side joins

6. Caching, persist, repartition and coalesce

`cache vs persist pyspark` and `repartition vs coalesce` — the two pairs you must know

7. Catalyst, AQE, and PySpark optimization gotchas

Seven optimization gotchas every PySpark interview tests

Choosing the right PySpark pattern (cheat sheet)

Frequently asked questions

What's the difference between RDD and DataFrame in PySpark?

What is lazy evaluation in PySpark and why does it matter?

What's the difference between cache() and persist() in PySpark?

When should I use a broadcast join in PySpark?

What's the difference between repartition and coalesce in PySpark?

Practice on PipeCode

Top comments (0)

1. Why PySpark interview questions test the same six themes

The six-theme map covers 90% of PySpark interview prompts

Worked example — a PySpark DataFrame query that hits all six themes

2. DataFrame API vs RDD API — pick the high-level abstraction

pyspark dataframe vs rdd — default to DataFrame; reach for RDD only for control

3. Lazy evaluation, the DAG, and transformations vs actions

lazy evaluation in spark — nothing runs until an action fires

4. Joins and the shuffle — the most expensive operation in Spark

shuffle in spark — every join (without broadcast) means data crosses the network

5. Broadcast joins — when one side fits in memory

broadcast join in pyspark — eliminate the shuffle on small-side joins

6. Caching, persist, repartition and coalesce

cache vs persist pyspark and repartition vs coalesce — the two pairs you must know

7. Catalyst, AQE, and PySpark optimization gotchas

Seven optimization gotchas every PySpark interview tests

Choosing the right PySpark pattern (cheat sheet)

Frequently asked questions

What's the difference between RDD and DataFrame in PySpark?

What is lazy evaluation in PySpark and why does it matter?

What's the difference between cache() and persist() in PySpark?

When should I use a broadcast join in PySpark?

What's the difference between repartition and coalesce in PySpark?

Practice on PipeCode

`pyspark dataframe vs rdd` — default to DataFrame; reach for RDD only for control

`lazy evaluation in spark` — nothing runs until an action fires

`shuffle in spark` — every join (without broadcast) means data crosses the network

`broadcast join in pyspark` — eliminate the shuffle on small-side joins

`cache vs persist pyspark` and `repartition vs coalesce` — the two pairs you must know