Hannah Usmedynska
100 Spark Interview Questions and Answers for Middle Developers

Middle-level interviews go beyond definitions. Hiring teams look for candidates who can reason about execution plans, memory trade-offs, and production tuning. Walking through targeted Spark interview questions and answers for middle developers before the call tightens weak areas and sharpens how you explain decisions.

Getting Ready for a Middle Spark Developer Interview

Mid-level rounds sit between entry-level concept checks and deep architectural debates. Whether you are reviewing Spark interview questions for 3 years experience or preparing closer to the Spark interview questions for 5 years experience range, the sections below match what most panels cover.

How Spark Interview Questions Help Recruiters Assess Middles

A structured question set shows whether a candidate can trace a shuffle, pick the right join strategy, and explain why a job failed at 3 AM. Spark technical interview questions for middle developers give hiring panels a reliable baseline for every candidate.

How Sample Spark Interview Questions Help Middle Developers Improve Skills

Comparing bad and good answer pairs trains you to structure responses around reasoning. If you are stepping up from common Spark questions for beginners, these intermediate problems sharpen the depth that interviewers expect. Once comfortable, Apache Spark interview questions for experienced roles are the natural next step. Practising Spark Scala scenario-based interview questions alongside this list covers the full range of formats.

List of 100 Spark Interview Questions and Answers for Middle Developers

Five sections by topic. Each opens with five bad-and-good answer pairs; the rest give correct answers only. Together they form a full set of Spark interview questions for middle-level developers.

Basic Middle Spark Developer Interview Questions

These medium-difficulty Spark interview questions cover architecture, core abstractions, and cluster fundamentals. They test whether a candidate understands how the engine splits work across executors and manages memory. Expect questions about stages, shuffles, and the role of the optimizer in everyday query execution.

1: What happens internally when you call an action on a DataFrame?

Bad answer: It runs the code line by line.

Good answer: Catalyst builds and optimizes a logical plan, then Tungsten generates a physical plan compiled into RDD stages.

2: How does the DAGScheduler split a job into stages?

Bad answer: It creates one stage per transformation.

Good answer: It walks the RDD lineage backwards; every wide dependency (shuffle) marks a stage boundary.

3: What is the difference between client and cluster deploy mode?

Bad answer: Client is for testing, cluster is for production.

Good answer: In client mode the driver runs on the submitting machine. In cluster mode the manager launches it inside the cluster.

4: Why does the framework use lazy evaluation?

Bad answer: Because it is slow and waits until it has to.

Good answer: Lazy evaluation lets the optimizer see the full graph before running, combining transformations and pushing filters.

5: What is the role of the Catalyst optimizer?

Bad answer: It caches previous query results.

Good answer: Catalyst converts a logical plan through analysis, optimization, physical planning, and code generation.

6: What is Tungsten?

Answer: Execution backend with off-heap memory, whole-stage code generation, and cache-friendly layouts.

7: How does Adaptive Query Execution work?

Answer: AQE re-optimizes plans at runtime using shuffle stats to coalesce partitions and switch joins.

8: Narrow vs wide dependency?

Answer: Narrow: each parent partition feeds at most one child partition. Wide: a child partition reads from many parents, requiring a shuffle.

9: What does speculative execution do?

Answer: Launches duplicates of slow tasks; first to finish wins.

10: Purpose of the Spark UI?

Answer: Displays job, stage, and task metrics for bottleneck detection.

11: What happens when an executor is lost?

Answer: Driver reschedules tasks. Lost shuffle data triggers recomputation.

12: What is dynamic allocation?

Answer: Scales executors up or down based on pending tasks.

13: Checkpointing vs caching?

Answer: Caching keeps lineage. Checkpointing writes to storage and truncates it.

14: What is a broadcast variable?

Answer: Read-only data shipped to each executor once.

15: What is an accumulator?

Answer: Write-only variable updated by tasks; driver reads the final value.

16: How is serialization handled?

Answer: Java or Kryo. Kryo is faster but needs registration.

17: repartition vs coalesce?

Answer: repartition triggers a full shuffle. coalesce merges partitions without one.
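The difference shows up directly in the partition counts; a minimal sketch using a local session (names and counts are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("partitions").getOrCreate()
import spark.implicits._

val df = spark.range(1000).toDF("id")

val even = df.repartition(8) // full shuffle: rows redistributed evenly across 8 partitions
val few  = even.coalesce(2)  // no shuffle: existing partitions are merged in place

println(even.rdd.getNumPartitions) // 8
println(few.rdd.getNumPartitions)  // 2
```

Note that coalesce can only reduce the partition count; asking it for more partitions than exist is a no-op.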

18: Why are shuffles expensive?

Answer: Network transfer, disk writes, serialization overhead.

19: How to choose partition count?

Answer: 2-4 tasks per core. Too few underuse the cluster; too many add scheduling overhead.

20: DataFrame vs Dataset?

Answer: DataFrame is Dataset[Row] with no compile-time type safety. Dataset[T] catches type errors at compile time.

21: What does explain() show?

Answer: Logical and physical plans for query debugging.

22: Role of the cluster manager?

Answer: Allocates containers and resources for driver and executors.

23: spark.sql.shuffle.partitions?

Answer: Default partition count after a shuffle. Default is 200.

24: Off-heap memory?

Answer: Memory outside the JVM heap managed by Tungsten, avoiding GC pauses.

25: Persist vs recompute?

Answer: Persist when lineage is long and the result is reused. Recompute when memory is tight.

Middle Spark Developer Programming Interview Questions

API usage, configuration, and programming patterns. These questions check whether a developer can move beyond default settings and use the API intentionally. Topics include UDF registration, join strategies, and configuration knobs that affect production stability.

1: How do you register and use a UDF?

Bad answer: Call a regular function directly in SQL.

Good answer: Define the function, register with spark.udf.register(name, fn), then reference it by name in the SQL expression.
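A minimal registration sketch, assuming a local SparkSession; the function and view names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
import spark.implicits._

// Plain Scala function, registered under a SQL-visible name.
val toUpper: String => String = s => if (s == null) null else s.toUpperCase
spark.udf.register("to_upper", toUpper)

Seq("alice", "bob").toDF("name").createOrReplaceTempView("users")
spark.sql("SELECT to_upper(name) AS name FROM users").show()
```

Remember that UDFs are opaque to Catalyst, so prefer built-in functions when one exists.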

2: map vs mapPartitions?

Bad answer: mapPartitions is just faster map.

Good answer: map applies per element. mapPartitions passes the whole partition iterator, amortizing setup costs like DB connections.
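A short sketch of the amortization idea; the formatter stands in for any expensive per-partition resource such as a DB connection:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("map-partitions").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 4)

val formatted = rdd.mapPartitions { iter =>
  // Built once per partition, not once per element.
  val fmt = java.text.NumberFormat.getIntegerInstance
  iter.map(n => fmt.format(n.toLong))
}
```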

3: How does a broadcast join work?

Bad answer: Both tables are shuffled then joined.

Good answer: The smaller table is collected, broadcast to every executor, and joined locally without shuffling the larger side.
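The hint can be forced explicitly with the broadcast() function; table contents here are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("bcast-join").getOrCreate()
import spark.implicits._

val fact = Seq((1, 100.0), (2, 50.0), (1, 25.0)).toDF("customer_id", "amount")
val dim  = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")

// broadcast() ships `dim` to every executor; `fact` is joined locally, no shuffle.
val joined = fact.join(broadcast(dim), Seq("customer_id"))
joined.explain() // the physical plan should show BroadcastHashJoin
```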

4: When do you use a Window function?

Bad answer: Same as GROUP BY.

Good answer: Window functions compute a value per row relative to a partition window without collapsing rows.

5: How to handle schema evolution in Parquet?

Bad answer: Delete old files and rewrite.

Good answer: Set mergeSchema to true. New columns appear as null for older files. Delta Lake handles this transactionally.
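A sketch of the read option, assuming an existing SparkSession; the path is illustrative:

```scala
// Files written with older schemas surface null for columns they lack.
val events = spark.read
  .option("mergeSchema", "true")
  .parquet("/data/events")
```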

6: cache() vs persist()?

Answer: cache() equals persist(MEMORY_ONLY). persist() accepts other storage levels.

7: Read CSV with corrupt records?

Answer: PERMISSIVE mode with _corrupt_record column.

8: When to avoid schema inference?

Answer: In production: extra pass, possible type misidentification.

9: What is spark-submit?

Answer: Entry point for launching apps on the cluster with JARs and config.

10: How to pass runtime config?

Answer: --conf flags or SparkConf programmatically.

11: reduceByKey vs groupByKey?

Answer: reduceByKey combines locally before shuffle. groupByKey shuffles everything first.
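The contrast in one sketch; data and partition count are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rbk").getOrCreate()
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)

// reduceByKey combines map-side before the shuffle, moving only partial sums.
val sums = pairs.reduceByKey(_ + _)

// groupByKey ships every raw value across the network first.
val grouped = pairs.groupByKey().mapValues(_.sum)
```

Both produce the same result; the difference is how much data crosses the shuffle.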

12: Partition output by column?

Answer: partitionBy(column) on DataFrameWriter.

13: foreachBatch in streaming?

Answer: Gives a static DataFrame per micro-batch for arbitrary operations.

14: Monotonically increasing ID?

Answer: monotonically_increasing_id(). Unique but not consecutive.

15: repartition by column vs number?

Answer: By column hashes values to co-locate equal keys. By number distributes round-robin.

16: Broadcast explicitly?

Answer: sc.broadcast(value). Access via .value. Immutable once sent.

17: What is SparkSession?

Answer: Unified entry point for DataFrame, SQL, and Catalog APIs.

18: Tune executor memory split?

Answer: spark.memory.fraction for the execution+storage share. spark.memory.storageFraction for the storage floor.

19: What is a bucketed table?

Answer: Pre-shuffled data by column hash so later joins skip the shuffle.

20: Avoid serialization errors in closures?

Answer: Reference only serializable objects. Extract fields into local vals.

21: distinct vs dropDuplicates?

Answer: distinct checks all columns. dropDuplicates accepts a subset.

22: Control output file size?

Answer: maxRecordsPerFile or repartition before writing.

23: When is coalesce(1) bad?

Answer: On large data: one task handles everything.

24: Read JDBC efficiently?

Answer: Set partitionColumn, bounds, numPartitions to parallelize.

25: Custom Encoder?

Answer: Encoders.product with a case class mapping to supported types.

Middle Spark Developer Coding Interview Questions

These Spark intermediate developer interview questions focus on writing and reasoning about code. Candidates should demonstrate window functions, deduplication logic, and streaming patterns in working snippets. The goal is to verify that the developer can translate a requirement into efficient, readable pipeline code.

1: Running total with a Window function?

Bad answer: Sort and loop.

Good answer: sum(col).over(windowSpec) with rowsBetween from unboundedPreceding to currentRow.
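A minimal running-total sketch with illustrative column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("running-total").getOrCreate()
import spark.implicits._

val tx = Seq(("acc1", "2024-01-01", 10.0), ("acc1", "2024-01-02", 5.0), ("acc2", "2024-01-01", 7.0))
  .toDF("account", "date", "amount")

// Cumulative frame: everything from the start of the partition up to the current row.
val w = Window.partitionBy("account").orderBy("date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

tx.withColumn("running_total", sum("amount").over(w)).show()
```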

2: Deduplicate keeping the latest record per key?

Bad answer: Call distinct.

Good answer: row_number over Window partitioned by key, ordered by timestamp desc, filter row_number = 1.
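The pattern in code, assuming a session `spark` and a DataFrame `df` with columns `key` and `updated_at`:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Newest record per key ranks 1; everything else is filtered out.
val w = Window.partitionBy("key").orderBy(col("updated_at").desc)

val latest = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
```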

3: Implement SCD type 2 merge?

Bad answer: Overwrite the table each time.

Good answer: Delta MERGE INTO: match on business key, expire changed rows, insert new active records.

4: Pivot a long table to wide?

Bad answer: Join with itself per category.

Good answer: groupBy row key, pivot on category, aggregate with sum or first.
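A small pivot sketch; column names and data are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("pivot").getOrCreate()
import spark.implicits._

val long = Seq((1, "food", 10.0), (1, "fuel", 20.0), (2, "food", 5.0))
  .toDF("order_id", "category", "amount")

// One row per order_id, one column per distinct category value.
val wide = long.groupBy("order_id").pivot("category").agg(sum("amount"))
wide.show()
```

Passing the distinct values explicitly to pivot() avoids an extra pass over the data to collect them.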

5: Explode a nested array into rows?

Bad answer: Loop through the array.

Good answer: explode(col("items")).alias("item"). Each element becomes a row.

6: Median per group?

Answer: Window ordered by value, count rows, filter to middle position(s).

7: Fill forward nulls?

Answer: last(col, ignoreNulls=True).over(window) from unboundedPreceding to currentRow.
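The forward-fill pattern, assuming a session `spark` and a DataFrame `readings` with `sensor`, `ts`, and a nullable `value`:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val w = Window.partitionBy("sensor").orderBy("ts")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// ignoreNulls = true carries the most recent non-null value forward over gaps.
val filled = readings.withColumn("value", last("value", ignoreNulls = true).over(w))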

8: Flatten a struct?

Answer: col("struct.field") aliased to top-level names.

9: Union with different column order?

Answer: unionByName(allowMissingColumns=True).

10: Split column into multiple?

Answer: split into array, getItem for each element, alias.

11: UDF returning a struct?

Answer: Case class return type maps to StructType automatically.

12: Dynamic row-to-column transpose?

Answer: Collect distinct categories, pass to pivot, aggregate.

13: Count distinct per column?

Answer: List of countDistinct expressions passed to agg.

14: Rename all columns to snake_case?

Answer: Iterate df.columns with regex, use toDF with the new list.

15: Multi-line JSON?

Answer: option("multiLine", "true").

16: Duplicate column names after join?

Answer: Join on a list of common names or rename beforehand.

17: Custom Aggregator?

Answer: Extend Aggregator: zero, reduce, merge, finish. Register as UDAF.

18: Collect values into array per group?

Answer: collect_list (with duplicates) or collect_set (unique).

19: Conditional column update?

Answer: when(cond, val).otherwise(default).

20: Date range DataFrame?

Answer: sequence(start, end, interval) then explode.

21: Sample exactly N rows?

Answer: orderBy(rand()).limit(N).

22: Convert between time zones?

Answer: from_utc_timestamp or to_utc_timestamp.

23: Add a literal column?

Answer: withColumn("name", lit(value)).

24: Anti-join?

Answer: join(df2, on="key", how="left_anti").

25: Different aggs on different columns?

Answer: Dict to agg or named column expressions.

Practice-Based Middle Spark Developer Interview Questions

Real-world troubleshooting and production scenarios. Mid-level engineers are expected to own incidents and improve pipeline reliability. These questions simulate issues that surface in daily operations, from small-file problems to executor memory pressure.

1: Job writes 10,000 small files per run. Fix?

Bad answer: Buy more storage.

Good answer: Coalesce before writing. Use maxRecordsPerFile or Delta auto-optimize.

2: One stage takes 10x longer. First step?

Bad answer: Restart the cluster.

Good answer: Check task durations in the UI. Uneven durations point to skew. Salt the key or enable AQE skew handling.
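When AQE skew handling is not enough, manual salting spreads a hot key across sub-keys. A sketch, assuming a large DataFrame `skewed` joining a smaller `dim` on `key`:

```scala
import org.apache.spark.sql.functions.{array, explode, lit, rand}

val saltCount = 8

// Spread each key on the large side across `saltCount` random sub-keys...
val salted = skewed.withColumn("salt", (rand() * saltCount).cast("int"))

// ...and replicate every dim row once per salt value so the join still matches.
val dimSalted = dim.withColumn("salt", explode(array((0 until saltCount).map(lit): _*)))

val joined = salted.join(dimSalted, Seq("key", "salt")).drop("salt")
```

The trade-off is deliberate: the small side grows by a factor of saltCount in exchange for evenly sized tasks.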

3: Migrate batch pipeline to near-real-time?

Bad answer: Add a while loop around batch code.

Good answer: Replace file reads with a streaming source, use Structured Streaming with a trigger, add watermarks.

4: Executors OOM. First three knobs?

Bad answer: Increase driver memory only.

Good answer: Raise executor memory, reduce cores per executor, check for large broadcasts or collects.

5: How to test a transformation pipeline?

Bad answer: Run on production data and check visually.

Good answer: Create small deterministic DataFrames, run the transformation, assert against expected results in a local session.
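A sketch of a local-session test; `addTotal` is a hypothetical transformation under test, and in practice the assertion would live inside ScalaTest or MUnit:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("pipeline-test").getOrCreate()
import spark.implicits._

def addTotal(df: DataFrame): DataFrame =
  df.withColumn("total", col("qty") * col("price"))

// Small, deterministic input and expected output.
val input    = Seq((2, 5.0), (3, 1.5)).toDF("qty", "price")
val expected = Seq((2, 5.0, 10.0), (3, 1.5, 4.5)).toDF("qty", "price", "total")

assert(addTotal(input).collect().toSeq == expected.collect().toSeq)
```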

6: Join returns more rows than expected?

Answer: Non-unique key on both sides causes expansion. Deduplicate or use semi-join.

7: Late data in streaming?

Answer: withWatermark on event-time to bound state and drop late events.
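A sketch of a watermarked windowed aggregation, assuming a streaming DataFrame `events` with an `event_time` timestamp column:

```scala
import org.apache.spark.sql.functions.{col, window}

val counts = events
  .withWatermark("event_time", "10 minutes") // tolerate events up to 10 minutes late
  .groupBy(window(col("event_time"), "5 minutes"), col("user_id"))
  .count()                                   // state older than the watermark is dropped
```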

8: Daily job suddenly 2x slower?

Answer: Compare DAG and task metrics against the last good run.

9: Exactly-once writes to external DB?

Answer: Idempotent UPSERT plus checkpoint-based offset tracking.

10: Data does not fit in memory?

Answer: Repartition, MEMORY_AND_DISK storage, avoid driver collects.

11: SQL fast, DataFrame API slow for same query?

Answer: Plans may differ. Call explain() on both.

12: Roll back failed Delta write?

Answer: RESTORE TABLE to a previous version or timestamp.

13: Key streaming metrics?

Answer: Input vs processing rate, batch duration, state size.

14: Isolate heavy jobs on shared cluster?

Answer: YARN queues, Kubernetes namespaces, or scheduler pools.

15: Corrupt records in streaming?

Answer: Route to dead-letter sink with PERMISSIVE mode.

Tricky Middle Spark Developer Interview Questions

Subtle behaviour and gotchas that catch mid-level developers. Interviewers use these to probe whether a candidate has run into edge cases beyond textbook examples. Knowing why a count() changes or why a broadcast fails signals hands-on depth.

1: count() returns different results on the same DataFrame?

Bad answer: Bug in the engine.

Good answer: If the source is being updated between calls, results change. Caching materializes the data once.

2: Broadcasting a 5 GB table fails. Why?

Bad answer: Broadcast only handles tiny tables.

Good answer: The driver must collect everything first, causing OOM. Use sort-merge or shuffle-hash joins for large tables.

3: UDF with try/catch still fails on bad rows?

Bad answer: Try/catch does not work here.

Good answer: An expression before the UDF (e.g. a cast) can throw before the UDF runs. Move parsing logic inside the UDF.

4: Column from dropped DataFrame throws AnalysisException?

Bad answer: Dropped DataFrames lose data immediately.

Good answer: Column references point to the original logical plan. If that plan is invalidated, the reference cannot resolve.

5: Identical transformations produce different DAGs?

Bad answer: Plans are chosen randomly.

Good answer: AQE and varying data distributions can shift strategies and partition counts between runs.

6: show() instant but count() takes minutes?

Answer: show() scans only enough partitions to return its first rows. count() scans everything.

7: withColumn in a loop degrades performance?

Answer: Each call nests a projection. Use a single select instead.

8: Inner join returns zero despite matching keys?

Answer: Trailing whitespace, case mismatch, or null keys.

9: partitionBy on high-cardinality column?

Answer: Millions of values produce millions of tiny files.

10: Spill metric non-zero despite enough memory?

Answer: Spill limits are per-task, not per-executor. Fewer cores per executor give each task a larger share of execution memory.

Tips for Spark Interview Preparation for Middle Developers

Practical steps to prepare efficiently. Senior expectations start at the middle level, so preparation should reflect that. Focus on explain() output, real cluster experiments, and at least one production story you can walk through clearly.

  • Read explain() output on your own queries. Interviewers expect you to interpret physical plans.
  • Run experiments with AQE on and off. Compare the plans.
  • Debug a skewed join on a local cluster. This separates middles from juniors.
  • Prepare one clear example of a production issue you solved.
  • Review Delta Lake or Iceberg basics. Most teams use a lakehouse layer.

Technical Interview and Assessment Service for Middle Scala Developers with Spark Experience

Our platform focuses exclusively on Scala and related technologies. Candidates go through a live technical interview with senior Scala engineers who also evaluate distributed processing knowledge. This gives hiring companies a pre-vetted shortlist of middle developers whose skills have been verified against real coding and design scenarios. The dedicated Scala focus means deeper, more relevant evaluations than general job boards.

Why Submit Your Resume With Us

Submitting your profile through a specialized platform gives you access to companies that hire specifically for Scala and distributed data roles. Here is what you get:

  • Get assessed by engineers who work with Scala and distributed systems daily.
  • Receive structured feedback on strengths and areas to improve.
  • Appear on a pre-vetted list shared with companies hiring middle-level Scala developers.
  • Stand out through a verified, technology-specific evaluation.

Conclusion

Mid-level interviews reward developers who combine solid distributed processing knowledge with the ability to discuss real trade-offs. Use these 100 questions to surface gaps, tighten your reasoning, and walk into the technical round prepared.

Find the Right Scala Talent with Our Specialized Platform

Post a Job

The post 100 Spark Interview Questions and Answers for Middle Developers first appeared on Jobs With Scala.
