DEV Community

Hannah Usmedynska
100 Spark Interview Questions for Data Engineer

The framework sits at the center of most data pipelines, and interviewers test for it accordingly. Whether you are preparing as a candidate or building a question bank as a hiring manager, this set of 100 Spark interview questions for data engineer roles covers every seniority level and the most common technical topics.

Preparing for the Spark Interview

Both recruiters and technical specialists benefit from a structured question bank. It speeds up candidate screening and helps engineers close knowledge gaps before the real conversation.

How Sample Spark Interview Questions Help Recruiters

A structured set of the Apache Spark interview questions data engineer candidates face lets recruiters screen for technical depth without engineering support. Compare answers side by side and move qualified profiles forward faster.

How Sample Spark Interview Questions Help Data Engineers

Working through data engineer Spark interview questions before the call exposes blind spots in shuffle internals and pipeline design. Pair this list with Spark developer interview questions if your work extends into application-level code.

List of 100 Spark Interview Questions for Data Engineers

Each group opens with five bad-and-good answer pairs. These Spark data engineer interview questions cover fundamentals through production edge cases.

Spark Interview Questions for Junior Data Engineer

Start with the fundamentals every entry-level candidate should explain clearly.

1: What is Apache Spark and why do data engineers use it?

Bad Answer: It is a programming language for big data.

Good Answer: It is a distributed compute engine that processes data in memory across a cluster. It handles batch ETL, streaming, and SQL in one framework.

2: What is the difference between a transformation and an action?

Bad Answer: Transformations run the code, actions save it.

Good Answer: Transformations build a plan but do not execute until an action like count or write triggers the job.
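To make the distinction concrete, here is a toy pure-Python sketch of lazy evaluation (my own illustration, not Spark's actual API or internals): transformations only record a plan, and nothing executes until an action is called.

```python
class LazyPipeline:
    """Toy model of Spark's lazy evaluation: transformations record a plan,
    and only an action executes it."""

    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []          # recorded transformations, not yet run

    def map(self, fn):                  # transformation: just extends the plan
        return LazyPipeline(self.data, self.plan + [("map", fn)])

    def filter(self, pred):             # transformation: just extends the plan
        return LazyPipeline(self.data, self.plan + [("filter", pred)])

    def count(self):                    # action: executes the whole plan now
        rows = self.data
        for kind, fn in self.plan:
            rows = [fn(r) for r in rows] if kind == "map" else [r for r in rows if fn(r)]
        return len(rows)

pipeline = LazyPipeline(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
# No work has happened yet; calling the action triggers execution:
print(pipeline.count())  # → 4
```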

3: What are RDDs and when would you encounter them?

Bad Answer: RDDs are old and never used anymore.

Good Answer: An RDD is a low-level distributed collection with no schema. You see them with custom partitioning or legacy code.

4: How does the framework handle fault tolerance?

Bad Answer: It copies all data to every node.

Good Answer: Each DataFrame tracks its lineage. If a partition is lost, the engine recomputes it from the lineage.

5: What is the difference between a DataFrame and a Dataset?

Bad Answer: A Dataset is a DataFrame with another name.

Good Answer: A DataFrame is Dataset[Row], untyped. A Dataset carries compile-time type safety in Scala and Java.

6: What is a SparkSession?

The unified entry point introduced in Spark 2.0. It wraps SparkContext and replaces SQLContext and HiveContext.

7: What does lazy evaluation mean?

Transformations build a plan but nothing runs until an action is called.

8: What is a partition?

A chunk of data processed by one task. Partition count controls parallelism.

9: What file formats are supported natively?

Parquet, ORC, JSON, CSV, Avro, and text. Parquet and ORC support predicate pushdown.

10: What is the driver program?

It converts code into a DAG, negotiates resources, sends tasks, and collects results.

11: What does an executor do?

Runs tasks on a worker node, stores cached data, and reports back to the driver.

12: What is a DAG?

A Directed Acyclic Graph of transformations split into stages at shuffle boundaries.

13: How do you read a CSV into a DataFrame?

spark.read.option("header", "true").csv("path"). Add inferSchema or supply a StructType.

14: What is the difference between cache and persist?

cache() uses MEMORY_ONLY. persist() accepts a storage level for memory, disk, or both.

15: What is schema inference?

The engine samples the data to detect column types. Slower than an explicit schema.

16: What is a narrow transformation?

Each input partition maps to one output partition. No shuffle. Examples: map, filter.

17: What is a wide transformation?

Needs data from multiple partitions, triggering a shuffle. Examples: groupBy, join.

18: What does repartition do?

Redistributes data through a full shuffle. Can increase or decrease partition count.

19: How is coalesce different from repartition?

Coalesce merges partitions without a shuffle. It can only reduce, not increase the count.
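A toy pure-Python sketch of the difference (my own illustration, not Spark's implementation): coalesce only merges existing partitions, while repartition re-hashes every row.

```python
import itertools

def coalesce(partitions, n):
    """Toy coalesce: merge existing partitions into n groups, no shuffle.
    Whole partitions are concatenated; individual rows never move alone."""
    assert n <= len(partitions), "coalesce can only reduce the partition count"
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)      # keep each input partition intact
    return merged

def repartition(partitions, n):
    """Toy repartition: full shuffle — every row is re-hashed to a new
    partition, so the count can go up or down."""
    out = [[] for _ in range(n)]
    for row in itertools.chain(*partitions):
        out[hash(row) % n].append(row)
    return out

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(len(coalesce(parts, 2)), len(repartition(parts, 8)))  # → 2 8
```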

20: What is a broadcast variable?

A read-only copy sent once to every executor to avoid shipping data with each task.

21: What is an accumulator?

A write-only variable tasks add to. The driver reads the total after the job.

22: What does spark-submit do?

Packages the application and submits it to the cluster manager with config flags.

23: What is the default storage level for cache?

MEMORY_ONLY for RDDs, MEMORY_AND_DISK for Datasets.

24: How do you write to Parquet?

df.write.parquet("path"). Add mode and partitionBy as needed.

25: What is the web UI for?

Shows jobs, stages, tasks, storage, and SQL plans to spot skew and slow stages.

Spark Interview Questions for Mid-Level Data Engineer

These questions target mid-level engineers who run production jobs and handle tuning on a regular basis.

1: How does the Catalyst optimizer work?

Bad Answer: It caches queries automatically.

Good Answer: Catalyst parses a logical plan, applies predicate pushdown and constant folding, then generates a physical plan compiled by Tungsten.

2: What is the difference between groupByKey and reduceByKey?

Bad Answer: They are the same.

Good Answer: groupByKey shuffles all values before aggregation, risking OOM. reduceByKey reduces locally first, cutting shuffle volume.
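The shuffle-volume difference is easy to show with a toy pure-Python model (my own sketch, not Spark code): map-side combining means at most one record per distinct key leaves each partition.

```python
from collections import defaultdict

def shuffle_volume_groupByKey(partitions):
    """groupByKey-style: every (key, value) pair crosses the network."""
    return sum(len(part) for part in partitions)

def shuffle_volume_reduceByKey(partitions):
    """reduceByKey-style: values are combined per key on each partition
    first, so at most one record per distinct key leaves each partition."""
    volume = 0
    for part in partitions:
        combined = defaultdict(int)
        for key, value in part:
            combined[key] += value      # local pre-aggregation
        volume += len(combined)
    return volume

# 3 partitions, 2 distinct keys, 1000 records per partition
parts = [[("a", 1), ("b", 1)] * 500 for _ in range(3)]
print(shuffle_volume_groupByKey(parts))   # → 3000
print(shuffle_volume_reduceByKey(parts))  # → 6
```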

3: How would you handle data skew in a join?

Bad Answer: Add more memory.

Good Answer: Salt the skewed key, replicate the other side to match, and join on the salted key. AQE does this automatically when enabled.
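The salting trick can be sketched in plain Python (my own toy model; in Spark you would do this with withColumn, rand, and explode): salt the skewed side, replicate the small side once per salt, and join on the composite key.

```python
import random
from collections import defaultdict

def salted_join(left, right, num_salts):
    """Toy salted join: spread a skewed left-side key across num_salts
    buckets, replicate the right side to match, then join on (key, salt)."""
    salted_left = [((k, random.randrange(num_salts)), v) for k, v in left]
    # Replicate each right row once per salt so every salted key has a match:
    salted_right = [((k, s), v) for k, v in right for s in range(num_salts)]
    index = defaultdict(list)
    for key, v in salted_right:
        index[key].append(v)
    return [(key[0], lv, rv) for key, lv in salted_left for rv in index[key]]

left = [("hot", i) for i in range(1000)] + [("cold", 0)]   # heavily skewed key
right = [("hot", "dim"), ("cold", "dim")]
result = salted_join(left, right, num_salts=8)
print(len(result))  # → 1001: every left row still finds its match
```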

4: What is Adaptive Query Execution?

Bad Answer: It just means faster queries.

Good Answer: AQE re-optimizes the plan at runtime using shuffle statistics. It coalesces small partitions and switches join strategies.

5: How does partitioning affect shuffle performance?

Bad Answer: It does not matter with enough memory.

Good Answer: Poor partitioning forces extra shuffles. Partitioning by the join key eliminates the shuffle for that operation.

6: What is predicate pushdown?

Pushes filters to the source so only matching rows are read.

7: How does broadcast join work?

The small table is sent to every executor. Each joins locally without shuffling the large side.

8: What is whole-stage code generation?

Tungsten compiles operators in a stage into one Java function, removing virtual calls.

9: How do you tune shuffle partitions?

Default is 200. Raise for large data, lower for small jobs. AQE auto-coalesces.

10: What is a sort-merge join?

Both sides shuffle by key and sort. Good for large-to-large joins but costly.

11: What is the Tungsten engine?

Manages memory off-heap, avoids GC, and generates bytecode at runtime.

12: Client vs. cluster deploy mode?

Client runs the driver locally. Cluster runs it inside the cluster. Cluster is standard for production.

13: How does dynamic allocation work?

Requests executors when tasks queue, releases idle ones. Needs the external shuffle service.

14: What serialization formats are supported?

Java (slow default), Kryo (faster), and Tungsten binary for DataFrames.

15: When do you use checkpointing?

To truncate deep lineage in iterative or streaming jobs by writing to reliable storage.

16: How do you monitor a running job?

Web UI for stage metrics. Prometheus or Graphite sinks for production alerting.

17: What is speculative execution?

A duplicate launches for slow tasks. Whichever finishes first wins.

18: How do you size executor memory and cores?

Four to five cores per executor. Memory at 75% of node RAM with OS headroom.

19: What is a shuffle spill?

Shuffle data exceeds memory and spills to disk, slowing the job.

20: What happens at a stage boundary?

Shuffle output is written to disk, then the next stage reads and sorts it by key.

21: How do you handle nulls in a pipeline?

na.fill, na.drop, or coalesce/when expressions. Define handling early.

22: What is column pruning?

The optimizer drops unused columns so only needed data is read from storage.

23: What is the cost-based optimizer?

Uses ANALYZE TABLE statistics to pick the cheapest join order and strategy.

24: How do you read from Kafka?

spark.readStream.format("kafka") with broker addresses and topic subscription.

25: Append vs. complete output mode?

Append writes new rows only. Complete rewrites the full result table.

Spark Interview Questions for Senior Data Engineer

This section covers Spark interview questions for experienced data engineer profiles and Spark data engineer technical interview questions on architecture, streaming, and cluster management.

1: How would you design a multi-hop lakehouse architecture?

Bad Answer: Write everything to one big table.

Good Answer: Bronze for raw data, silver for cleaning and dedup, gold for aggregates. Delta Lake or Iceberg adds ACID across layers.

2: What is the impact of GC on long-running jobs?

Bad Answer: GC pauses are too small to notice.

Good Answer: Long pauses stall tasks and bloat stage duration. Tune G1GC regions and keep objects off-heap through Tungsten.

3: How do you achieve exactly-once in Structured Streaming?

Bad Answer: The engine handles it automatically.

Good Answer: Checkpoint to durable storage and use an idempotent sink. The engine replays uncommitted batches after failure.
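Why the sink must be idempotent is easy to demonstrate with a toy pure-Python sink (my own sketch, not a Spark API): when the engine replays a batch after a crash, a write keyed by batch ID is a no-op the second time.

```python
class IdempotentSink:
    """Toy idempotent sink: writes are keyed by batch_id, so replaying an
    uncommitted batch after a failure does not duplicate data."""
    def __init__(self):
        self.committed = {}

    def write(self, batch_id, rows):
        if batch_id in self.committed:   # replay after failure: no-op
            return
        self.committed[batch_id] = rows

sink = IdempotentSink()
sink.write(0, ["a", "b"])
sink.write(0, ["a", "b"])    # the engine replays batch 0 after a crash
sink.write(1, ["c"])
total = sum(len(rows) for rows in sink.committed.values())
print(total)  # → 3, not 5
```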

4: When would you pick RDDs over DataFrames in production?

Bad Answer: Always. RDDs give more control.

Good Answer: Only for custom partitioning, non-tabular data, or low-level APIs without a DataFrame equivalent.

5: How do you manage backpressure in streaming?

Bad Answer: Just add more executors.

Good Answer: Cap batch size with maxOffsetsPerTrigger. Monitor processing time vs trigger interval and scale before lag grows.

6: How does the external shuffle service work?

A separate process per node serves shuffle files so executors can leave without losing data.

7: V1 vs. V2 DataSource API?

V2 adds columnar reads, partition pushdown, transactional writes, and streaming.

8: How do you debug Container killed by YARN?

Memory plus overhead exceeds the container limit. Raise spark.executor.memoryOverhead; check broadcasts, UDFs, and concurrency.

9: How does the framework integrate with Hive?

SparkSession reads the metastore for schemas and partitions, replacing Hive’s engine with Catalyst.

10: How do you implement SCD Type 2 with Delta Lake?

MERGE on the business key. Close old versions, insert new ones in one transaction.
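The merge logic itself is worth being able to reproduce. Here is a toy pure-Python version of the SCD Type 2 semantics (my own sketch; in Delta Lake this is a single MERGE statement): close the current version of any changed key and insert a new current version.

```python
from datetime import date

def scd2_merge(dim_rows, updates, today):
    """Toy SCD Type 2 merge on a business key: close the current version of
    any changed key and insert the new version as one logical transaction."""
    changed = {u["key"]: u["value"] for u in updates}
    out = []
    for row in dim_rows:
        if row["is_current"] and changed.get(row["key"], row["value"]) != row["value"]:
            out.append({**row, "is_current": False, "end_date": today})  # close old
        else:
            out.append(row)
    for u in updates:
        current = [r for r in dim_rows if r["key"] == u["key"] and r["is_current"]]
        if not current or current[0]["value"] != u["value"]:
            out.append({"key": u["key"], "value": u["value"],
                        "start_date": today, "end_date": None, "is_current": True})
    return out

dim = [{"key": 1, "value": "NY", "start_date": date(2020, 1, 1),
        "end_date": None, "is_current": True}]
merged = scd2_merge(dim, [{"key": 1, "value": "LA"}], date(2024, 1, 1))
print(len(merged), sum(r["is_current"] for r in merged))  # → 2 1
```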

11: What is Z-ordering?

Interleaves column bits into one sort order so related values land in the same files.
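The bit interleaving behind Z-ordering is a Morton code, which can be sketched in a few lines of Python (my own illustration of the idea, not Delta Lake's implementation):

```python
def z_order(x, y, bits=16):
    """Interleave the bits of two column values into one sort key (a Morton
    code), so rows close in both dimensions sort near each other."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x contributes the even bits
        z |= ((y >> i) & 1) << (2 * i + 1)   # y contributes the odd bits
    return z

print(z_order(0, 0), z_order(1, 0), z_order(0, 1), z_order(1, 1))  # → 0 1 2 3
```

Sorting files by this key keeps rows with nearby (x, y) values in the same files, which is what makes data skipping effective on both columns.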

12: How do you manage cluster concurrency?

YARN or K8s queues with per-team limits. Cap per-app resources to prevent starvation.

13: What problems do small files cause?

High metadata overhead, excess tasks, and NameNode load. Compact small files before writing.

14: How do you migrate a MapReduce pipeline?

Mapper becomes filter/select/withColumn. Reducer becomes groupBy/agg. Validate on a sample.

15: How do you test data pipelines?

Unit-test transforms with static DataFrames. Integration-test writes on a local cluster.

16: Micro-batch vs. continuous processing?

Micro-batch is simpler with full Catalyst support. Continuous has lower latency but supports fewer operations.

17: How does the scheduler assign tasks?

By data locality: PROCESS_LOCAL first, then NODE_LOCAL, RACK_LOCAL, ANY.

18: What triggers a stage retry?

A FetchFailedException from a lost executor or missing shuffle file.

19: How do you profile executor memory?

Attach async-profiler via extraJavaOptions. Check the Storage tab and GC logs.

20: Local vs. distributed checkpointing?

Local stores on executor disk. Distributed writes to HDFS or S3, surviving executor loss.

21: How do you optimize multi-source reads?

Read sources in parallel, push filters, join after reducing row counts.

22: What causes serialization overhead?

Large closures, UDF arguments, and objects shipped between driver and executors.

23: How do you write a custom partitioner?

Extend Partitioner, override numPartitions and getPartition, pass to rdd.partitionBy.
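The Partitioner contract is small enough to sketch in Python (my own toy version of the two methods the Scala trait requires; names are illustrative):

```python
class RangePartitioner:
    """Toy version of Spark's Partitioner contract: a fixed number of
    partitions plus a deterministic getPartition(key). Keys route by range."""

    def __init__(self, boundaries):
        self.boundaries = boundaries         # e.g. [10, 20] → 3 partitions

    @property
    def num_partitions(self):                # Spark: numPartitions
        return len(self.boundaries) + 1

    def get_partition(self, key):            # Spark: getPartition(key)
        for i, bound in enumerate(self.boundaries):
            if key < bound:
                return i
        return len(self.boundaries)

p = RangePartitioner([10, 20])
print([p.get_partition(k) for k in (5, 10, 25)])  # → [0, 1, 2]
```

The key requirement is determinism: the same key must always map to the same partition, or shuffle reads break.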

24: What is bucket pruning?

Reads only matching bucket files when filtering or joining on the bucket column.

25: How do you handle schema evolution with Parquet?

mergeSchema on read. Nullable types for new columns. Delta Lake blocks incompatible changes.

Practice-Based Spark Questions for Data Engineer

These Spark data engineering interview questions focus on pipeline problems. Combine them with Spark scenario-based interview questions and answers for wider situational coverage.

1: A 10 TB join keeps failing with OOM. What do you do?

Bad Answer: Request more memory.

Good Answer: Check the UI for skew. Salt the hot key or use AQE. Try broadcast if one side fits after filters.

2: How do you build an incremental pipeline for new files only?

Bad Answer: Reprocess everything each time.

Good Answer: Track processed paths in a metadata table or use the Structured Streaming file source to pick up new files automatically.

3: Pipeline output has duplicates. How do you investigate?

Bad Answer: Add .distinct() and move on.

Good Answer: Check source duplication, many-to-many joins, and retry-caused double writes. Fix the root cause first.

4: How do you reduce file count in a partitioned Parquet table?

Bad Answer: Delete files manually.

Good Answer: Coalesce or repartition before writing. For Delta, run OPTIMIZE. Use maxRecordsPerFile to cap sizes.

5: Streaming throughput drops after an upstream schema change. What do you check?

Bad Answer: Roll back the schema change.

Good Answer: Verify deserialization works. Check for null-heavy columns causing GC pressure and checkpoint compatibility.

6: How do you orchestrate jobs in Airflow?

SparkSubmitOperator in a DAG. Parameterize by date, set retries and timeouts.

7: How do you validate data quality?

Assert row counts, null ratios, and key uniqueness per stage. Use Deequ or Great Expectations.

8: How do you roll back a bad Delta Lake write?

RESTORE TABLE to a previous version. Time travel for read-only access.

9: How do you handle late data with watermarks?

Set a watermark on event time. Records outside the window are dropped.
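The watermark rule can be modeled in a few lines of Python (my own toy sketch of the semantics, not Spark's stateful implementation): track the max event time seen, and drop anything older than that minus the allowed delay.

```python
def apply_watermark(events, delay):
    """Toy watermark: track the max event time seen; any record whose event
    time is older than (max_seen - delay) is dropped as too late."""
    max_seen, kept = float("-inf"), []
    for event_time, payload in events:
        max_seen = max(max_seen, event_time)
        if event_time >= max_seen - delay:
            kept.append((event_time, payload))
        # else: dropped — arrived after the watermark passed its window
    return kept

events = [(100, "a"), (105, "b"), (90, "late"), (104, "ok")]
print(apply_watermark(events, delay=10))  # keeps everything except the 90 record
```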

10: How do you break a monolithic job into modules?

One function per stage, intermediate writes to storage, Airflow for orchestration.

11: How do you estimate cluster sizing?

Profile on a sample, measure shuffle and peak memory, then scale linearly.

12: How do you migrate from YARN to Kubernetes?

Container image, --master k8s://, replace queues with namespace limits.

13: How do you manage config across environments?

Separate config files per environment. Pass the env flag at submit time.

14: How do you deduplicate by event time in streaming?

dropDuplicatesWithinWatermark on key and event-time column.

15: How do you set up production alerting?

Push metrics to Prometheus. Alert on batch duration, lag, and executor loss.

Tricky Spark Questions for Data Engineer

These questions test edge-case knowledge and tend to surface in senior-level rounds.

1: Why might a broadcast join fail even when the table seems small?

Bad Answer: It would not fail if the table is small.

Good Answer: Size estimates come from pre-filter statistics and can be wrong. After runtime filters, the actual data may exceed the broadcast threshold or driver memory.

2: What happens when you call collect on a huge DataFrame?

Bad Answer: The engine handles it efficiently.

Good Answer: All partitions ship to the driver as one array. Driver OOM if the result is too large. Use take or write instead.

3: How can a UDF silently produce wrong results?

Bad Answer: UDFs are correct if the code is correct.

Good Answer: Retries on exceptions create duplicates. Null handling differs from SQL. Type mismatches corrupt data silently.

4: Why does adding partitions sometimes slow a job?

Bad Answer: More partitions are always better.

Good Answer: Each partition adds scheduling and serialization cost. Tiny partitions make launch time dominate compute.
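A simple cost model makes this tangible (my own back-of-envelope sketch with assumed overhead numbers, not measured Spark figures): once per-task work shrinks below the fixed scheduling cost, adding partitions makes the job slower.

```python
import math

def job_time(total_work, num_partitions, per_task_overhead, num_cores):
    """Toy cost model: each task pays a fixed scheduling/serialization
    overhead, and tasks run num_cores at a time in waves."""
    work_per_task = total_work / num_partitions
    waves = math.ceil(num_partitions / num_cores)
    return waves * (per_task_overhead + work_per_task)

# 100s of total work on 8 cores with 50ms of overhead per task:
for n in (8, 64, 100_000):
    print(n, job_time(100, n, 0.05, 8))
# More partitions helps up to a point; at 100k tiny partitions,
# launch overhead dominates and total time explodes.
```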

5: Why can reusing a cached DataFrame hurt performance?

Bad Answer: Caching is always beneficial.

Good Answer: Cached data pins memory, shrinking shuffle buffers. Large or rarely used caches cause spills that slow other stages.

6: What is the risk of coalesce(1)?

One task writes everything. No parallelism, possible OOM.

7: How does partition pruning interact with bucketing?

Partition pruning cuts Hive partitions. Bucket pruning cuts bucket files. Both apply together.

8: repartition(n).write vs. maxRecordsPerFile?

Repartition shuffles. maxRecordsPerFile splits output without a shuffle.

9: Driver OOM vs. executor OOM?

Driver holds DAG, broadcasts, collected results. Executor holds cache and shuffle buffers. Driver OOM kills the app.

10: Why might SQL and RDD code return different results?

Catalyst applies null-safe comparisons and predicate ordering that RDD operations skip.

Tips for Spark Interview Preparation for Data Engineers

A few habits sharpen preparation beyond memorizing answers to the Apache Spark interview questions data engineering teams commonly ask.

  • Run a local cluster, submit a pipeline, and study the web UI stage graphs.
  • Build a small ETL that reads, transforms, and writes Parquet. Break it with skewed data and fix it.
  • Practice explaining DAG stages on a whiteboard.
  • Compare plans with explain(true) and learn physical plan operators.
  • Time yourself. Two minutes per answer is a solid pace.

Conclusion

These 100 questions cover everything from core distributed-computing concepts to debugging and streaming edge cases. Use them to map your weak spots, rehearse under time pressure, and build the kind of technical fluency that stands out in a live interview.


The post 100 Spark Interview Questions for Data Engineer first appeared on Jobs With Scala.
