Hannah Usmedynska

100 Spark Interview Questions and Answers for Experienced Developers

Senior-level interviews move past definitions into architecture reasoning and failure recovery. A candidate who has prepared with Spark interview questions pitched at the 8-years-experience level can articulate trade-offs that production work alone does not surface. This set of Spark interview questions for experienced engineers covers that ground.

Getting Ready for a Senior Spark Developer Interview

Whether you are reviewing Apache Spark interview questions for experienced roles or building a senior hiring panel, a shared bank keeps evaluation consistent.

How Spark Interview Questions Help Recruiters Assess Seniors

Tough Spark interview questions show whether a candidate can reason about cluster behavior and query optimization beyond textbook level. A senior who debugged a shuffle-heavy pipeline in production will answer differently from someone who only read documentation. Targeted questions also reveal how a candidate approaches trade-offs between memory, CPU, and I/O. The depth of the response tells a recruiter whether this person can own architecture decisions on the team.

How Sample Spark Interview Questions Help Senior Developers Prepare for the Interview

Working through Spark complex interview questions forces a review of internals that daily work abstracts away. Revisiting entry level Spark questions through a senior lens helps too. Many experienced engineers rely on defaults that worked for smaller datasets and never revisit partitioning or memory tuning. Practicing with scenario-based prompts builds the habit of explaining not just what to do but why. That reasoning is exactly what panels look for at the senior level.

List of 100 Spark Interview Questions and Answers for Experienced

The list below is split into five sections. Each opens with five bad-and-good answer pairs; the remaining questions give correct answers only. The difficulty targets Spark interview questions for senior developer roles.

Basic Senior Spark Developer Interview Questions

These Spark Scala interview questions for experienced candidates test architecture knowledge and core engine internals. They cover Catalyst, Tungsten, AQE, and the memory model that seniors are expected to reason about fluently. Strong answers reference real production trade-offs, not just definitions.

1: How does Adaptive Query Execution change join planning at runtime?

Bad answer: It is just a config flag that speeds everything up.

Good answer: AQE re-optimizes the physical plan at stage boundaries using shuffle statistics, converting SortMergeJoin to BroadcastHashJoin and coalescing small partitions.

2: What internal format does Tungsten use for in-memory rows?

Bad answer: Regular Java objects on the JVM heap.

Good answer: UnsafeRow: compact binary layout with null bitmap, fixed-length values inline, variable-length by offset.

3: How does Catalyst resolve column references in an unresolved logical plan?

Bad answer: It looks up names in a dictionary.

Good answer: The Analyzer binds attributes via the Catalog, resolves aliases, expands stars, applies coercion.

4: What is whole-stage codegen and why does it matter?

Bad answer: A debugging feature that prints generated code.

Good answer: Fuses operators into one Java method, eliminating virtual calls and intermediate materialization.

5: When does speculative execution backfire?

Bad answer: Never, it always helps.

Good answer: When slowness comes from data skew the duplicate hits the same partition and doubles resource use.

6: Client mode vs cluster mode on YARN?

Answer: Client mode runs the driver on the submitting machine; cluster mode runs it inside the YARN ApplicationMaster container.

7: How does the external shuffle service help?

Answer: Persists shuffle files so lost executors skip upstream recomputation.

8: MEMORY_ONLY vs MEMORY_AND_DISK_SER?

Answer: First is fast, recomputes on eviction. Second spills to disk, costs CPU.

9: How does predicate pushdown differ across file formats?

Answer: Parquet and ORC push to row-group stats. CSV and JSON read everything.

10: What problem does bucketing solve?

Answer: Pre-partitions by join key at write time; later joins skip the shuffle.
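A write-time sketch of the pattern (table name and bucket count are illustrative; note that bucketBy only works with saveAsTable, not path-based writes):

```scala
// Bucket both join sides on the same column with the same bucket count;
// a later join on customer_id can then skip the shuffle entirely.
ordersDf.write
  .bucketBy(64, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("orders_bucketed")
```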

11: DAGScheduler vs TaskScheduler?

Answer: DAGScheduler splits the graph into stages; TaskScheduler assigns tasks to executors.

12: What happens when broadcast exceeds the threshold?

Answer: Falls back to SortMergeJoin. A forced hint risks OOM.

13: Stage retry after node decommission?

Answer: Scheduler resubmits failed tasks; external shuffle service preserves data.

14: Dynamic partition pruning?

Answer: Injects dimension-side keys into the fact scan, skipping unmatched partitions.

15: Multiple SparkSessions in one JVM?

Answer: Separate SQL configs and temp views, shared SparkContext.

16: How does Parquet partition pruning work?

Answer: Values encoded in directory paths; non-matching directories are skipped.

17: Map-side vs reduce-side aggregation?

Answer: Map-side combines locally first; reduce-side shuffles everything.

18: Tuning spark.sql.shuffle.partitions?

Answer: Default 200. Too few causes spills; too many adds overhead. AQE coalesces.

19: Why might streaming state grow unbounded?

Answer: Stateful ops without a watermark keep every key indefinitely.

20: Accumulators vs broadcasts in failure retries?

Answer: Broadcasts are immutable. Accumulators may double-count on retry.

21: What does the UnifiedMemoryManager do?

Answer: Splits memory between execution and storage; one borrows from the other.

22: What does --packages do internally?

Answer: Resolves Maven coordinates via Ivy, downloads JARs and distributes them.

23: Column pruning on nested Parquet structs?

Answer: Reads only requested leaf columns at I/O level.

24: Catalyst cost model for join strategy?

Answer: Compares sizes against broadcast threshold; falls back to SortMergeJoin.

25: Off-heap memory for Tungsten?

Answer: Enabled via config. Avoids GC but needs careful sizing.

Senior Spark Developer Programming Interview Questions

API mastery, performance-aware coding, and resilient pipeline design. These questions evaluate whether a senior can build production-grade pipelines that handle failures, evolving schemas, and backpressure. Expect topics around custom partitioners, streaming guarantees, and testable transformation libraries.

1: How do you implement a custom Partitioner?

Bad answer: Just increase shuffle partitions.

Good answer: Subclass Partitioner with numPartitions and getPartition(key) for domain-specific distribution.
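A minimal sketch of such a partitioner, assuming a single known hot key (the key value and partition count are illustrative):

```scala
import org.apache.spark.Partitioner

// Route one known hot key to a dedicated partition, hash everything else.
class HotKeyPartitioner(override val numPartitions: Int, hotKey: String)
    extends Partitioner {
  require(numPartitions > 1, "need at least one partition besides the hot one")

  override def getPartition(key: Any): Int = key match {
    case k: String if k == hotKey => 0 // hot key gets partition 0 to itself
    case k                        => 1 + math.abs(k.hashCode % (numPartitions - 1))
  }
}

// Usage on a pair RDD:
// rdd.partitionBy(new HotKeyPartitioner(16, "hot-customer-id"))
```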

2: Performance cost of a Scala UDF returning a struct?

Bad answer: No cost, same as a built-in function.

Good answer: UDFs disable codegen and pushdown. Prefer built-in functions.

3: How do you guarantee exactly-once delivery in a streaming pipeline end to end?

Bad answer: Set output mode to complete.

Good answer: Checkpointed offsets plus atomic commit on the sink give exactly-once.

4: How do you diagnose and fix data skew in a join?

Bad answer: Add more memory until it stops failing.

Good answer: Check the UI for uneven tasks. Salt the hot key, replicate the small side, join on key plus salt.

5: How do you tune a job that spends most time in GC?

Bad answer: Move to a bigger cluster.

Good answer: Switch from RDD to DataFrame for off-heap. Reduce partition sizes, enable G1GC.

6: Delta Lake time-travel read?

Answer: spark.read.format("delta").option("versionAsOf", n).load(path).

7: SCD Type 2 with DataFrames?

Answer: Join on business key, close changed rows, insert new active records.

8: Late data beyond the watermark?

Answer: Dropped by default. Extend watermark or route to a dead-letter topic.

9: Map-side join in Scala?

Answer: largeDf.join(broadcast(smallDf), Seq("key")). Verify with explain().

10: Custom Encoder for a case class?

Answer: import spark.implicits._ for automatic; ExpressionEncoder for explicit.

11: Parallel JDBC writes?

Answer: foreachPartition with batch inserts.

12: Backpressure from Kafka?

Answer: maxOffsetsPerTrigger caps records per micro-batch.

13: Unit-test a transformation chain?

Answer: Local SparkSession, small DataFrames, assert on expected output.
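A bare-bones sketch of that setup, with a plain assertion instead of any particular test framework (the transformation under test is inlined for brevity):

```scala
import org.apache.spark.sql.SparkSession

object TransformSpec {
  def main(args: Array[String]): Unit = {
    // Local session: no cluster needed for transformation tests.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("transform-spec")
      .getOrCreate()
    import spark.implicits._

    val input    = Seq(1, 2, 3).toDF("n")
    val actual   = input.selectExpr("n * 2 as doubled").as[Int].collect().toSeq
    val expected = Seq(2, 4, 6)

    assert(actual == expected, s"expected $expected, got $actual")
    spark.stop()
  }
}
```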

14: Serialization errors in closures?

Answer: Extract values into local vals; referencing this captures the enclosing class.

15: Chain dependent streaming queries?

Answer: Write to a durable sink, read in the next query, separate checkpoints.

16: Reusable transformation library in Scala?

Answer: DataFrame-in, DataFrame-out functions in a shared module.

17: Custom Catalyst rule?

Answer: Extend Rule[LogicalPlan], match patterns, register via session extensions.

18: Schema evolution in streaming Delta?

Answer: autoMerge adds columns; breaking changes need explicit mergeSchema.

19: Profile a job for CPU bottlenecks?

Answer: async-profiler on executor JVMs; flame graphs show hot methods.

20: DataSource V2 connector?

Answer: Implement TableProvider, ScanBuilder, WriteBuilder.

21: Sort data within partitioned Parquet output?

Answer: sortWithinPartitions before write improves pushdown.

22: Pass secrets to executors safely?

Answer: Use the Hadoop credential provider API; plain --conf values are visible in the web UI.

23: Retry logic inside foreachPartition?

Answer: Loop with exponential backoff and a max count.
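One way to sketch that helper (the name withRetry and its parameters are illustrative):

```scala
// Retry an operation with exponential backoff; rethrows once attempts run out.
def withRetry[T](maxAttempts: Int, delayMs: Long)(op: => T): T =
  try op
  catch {
    case e: Exception if maxAttempts > 1 =>
      Thread.sleep(delayMs)
      withRetry(maxAttempts - 1, delayMs * 2)(op) // double the delay each round
  }

// Inside the pipeline:
// df.foreachPartition { rows =>
//   withRetry(maxAttempts = 5, delayMs = 200) { writeBatch(rows) }
// }
```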

24: Multi-tenant driver with shared sessions?

Answer: newSession() per tenant, fair scheduler pools.

25: Test streaming end to end locally?

Answer: MemoryStream source, MemorySink output, processAllAvailable to advance.

Senior Spark Developer Coding Interview Questions

Production-grade coding and architecture reasoning for common Apache Spark interview questions at the senior level. Candidates should demonstrate end-to-end pipeline thinking, from ingestion through medallion layers to incremental upserts. The focus is on patterns that survive schema drift, data skew, and concurrent writes.

1: Deduplicate a stream of events using Structured Streaming?

Bad answer: dropDuplicates after collecting all data.

Good answer: withWatermark on event time, then dropDuplicates("eventId", "ts"). State stays bounded.
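A sketch of the bounded-state pattern, assuming a streaming DataFrame events with an event-time column ts (the 10-minute watermark is illustrative):

```scala
// Watermark bounds how long dedup state is retained; the dedup key
// must include the watermarked column for state to be dropped.
val deduped = events
  .withWatermark("ts", "10 minutes")
  .dropDuplicates("eventId", "ts")
```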

2: Write a salted join for skewed keys.

Bad answer: Repartition both sides to 2000.

Good answer: Salt the large side randomly, explode the small side with matching values, join on key plus salt.
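The same pattern as a sketch, assuming largeDf is the skewed side and both frames share a key column (the salt bucket count is illustrative):

```scala
import org.apache.spark.sql.functions._

val saltBuckets = 8 // size to the observed skew

// Large side: attach a random salt per row to spread the hot key.
val saltedLarge = largeDf.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Small side: replicate each row once per salt value so every bucket matches.
val saltedSmall = smallDf.withColumn(
  "salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

// Join on the original key plus the salt, then drop the helper column.
val joined = saltedLarge.join(saltedSmall, Seq("key", "salt")).drop("salt")
```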

3: Compact small files in a Parquet directory?

Bad answer: Delete everything and rewrite from source.

Good answer: Read, repartition to desired count, write to a staging path, swap atomically. Or run OPTIMIZE on Delta.

4: Sessionize a clickstream with window functions?

Bad answer: Group by user and sort in a loop.

Good answer: lag() for inter-event gap, flag gaps above a threshold, cumulative sum assigns session IDs.
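A sketch of the lag-plus-cumulative-sum pattern, assuming a clicks frame with userId and a timestamp column ts (the 30-minute gap is illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byUser        = Window.partitionBy("userId").orderBy("ts")
val sessionGapSec = 1800 // 30 minutes of inactivity ends a session

val sessions = clicks
  .withColumn("prevTs", lag("ts", 1).over(byUser))
  // Flag rows where the gap to the previous event exceeds the threshold.
  .withColumn("newSession",
    when(col("ts").cast("long") - col("prevTs").cast("long") > sessionGapSec, 1)
      .otherwise(0))
  // Running sum of the flags turns each gap into a new session ID.
  .withColumn("sessionId", sum("newSession").over(
    byUser.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
```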

5: Multi-hop medallion pipeline?

Bad answer: Three separate apps with no connection.

Good answer: Bronze appends raw data. Silver deduplicates and casts. Gold aggregates. Each layer is a Delta table.

6: Delta MERGE for incremental upserts?

Answer: deltaTable.merge(updates, condition).whenMatched().updateAll().whenNotMatched().insertAll().execute().
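Spelled out with the Delta Lake Scala API, assuming a Delta table at targetPath and an updates frame keyed by an id column (both names illustrative):

```scala
import io.delta.tables.DeltaTable

val target = DeltaTable.forPath(spark, targetPath)

// Upsert: update rows that match on id, insert the rest.
target.as("t")
  .merge(updates.as("u"), "t.id = u.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```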

7: Custom Aggregator in Scala?

Answer: Extend Aggregator[IN,BUF,OUT] with zero, reduce, merge, finish. Register with toColumn.
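A minimal typed average as a sketch, with IN = Double, BUF = (sum, count), OUT = Double:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

object TypedAvg extends Aggregator[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def reduce(b: (Double, Long), a: Double): (Double, Long) = (b._1 + a, b._2 + 1)
  def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) =
    (b1._1 + b2._1, b1._2 + b2._2)
  def finish(b: (Double, Long)): Double = if (b._2 == 0) 0.0 else b._1 / b._2
  def bufferEncoder: Encoder[(Double, Long)] =
    Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a Dataset[Double]:
// ds.select(TypedAvg.toColumn)
```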

8: Stream-static join with a changing dimension?

Answer: Reload the static DataFrame in foreachBatch every N batches for freshness.

9: Approx distinct count over a sliding window?

Answer: approx_count_distinct with window("ts", "1 hour", "15 min"). HyperLogLog trades accuracy for constant memory.

10: Data quality gate on null rates?

Answer: Compute null fraction per column. Halt the pipeline if any exceed the threshold.

11: Reliable streaming checkpoint to S3?

Answer: S3A committer avoids rename-based commits that can lose data.

12: Parallel JDBC reads with partitionColumn?

Answer: Provide column, lower, upper, numPartitions. Even distribution matters.

13: Pivot and unpivot in one pipeline?

Answer: groupBy.pivot.agg for pivot; stack(n, pairs) in selectExpr for unpivot.

14: Flatten a nested JSON column?

Answer: from_json into a struct, then select("parsed.*"). Explode arrays first if needed.
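A sketch with an illustrative schema, assuming the JSON lives in a payload string column of a raw frame:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// In practice derive the schema from a contract or a sampled inference pass.
val schema = new StructType()
  .add("user", new StructType().add("id", LongType).add("name", StringType))
  .add("tags", ArrayType(StringType))

val flattened = raw
  .withColumn("parsed", from_json(col("payload"), schema))
  .withColumn("tag", explode_outer(col("parsed.tags"))) // keep rows with empty arrays
  .select("parsed.user.*", "tag")
```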

15: Surrogate keys for a dimension?

Answer: monotonically_increasing_id() for unique. row_number() for sequential but partition-bound.

16: Metrics to Prometheus from streaming?

Answer: foreachBatch computes aggregates, pushes to Pushgateway.

17: Retry-safe S3 writer?

Answer: S3A client retries plus idempotent overwrite prevent duplicates.

18: CDC from a Kafka topic?

Answer: Parse events, keep latest per key, MERGE into Delta. DELETE events use whenMatchedDelete.

19: Moving average over last 7 days?

Answer: rangeBetween(-6, 0) window ordered by date, avg("sales").over(window).
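As a window-spec sketch, assuming day is a numeric day index (e.g. days since epoch) so the range frame spans calendar days rather than row counts; store and amount are illustrative columns:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Frame covers the current day and the six preceding days by value, not by row.
val w = Window.partitionBy("store")
  .orderBy("day")
  .rangeBetween(-6, 0)

val withAvg = sales.withColumn("avg7d", avg("amount").over(w))
```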

20: Dynamic repartition based on cardinality?

Answer: Compute distinct count, derive partition count from target file size, repartition before write.

21: Concurrent Delta writes?

Answer: Optimistic concurrency with conflict retry. Partition writes by different keys to reduce collisions.

22: Generic schema validation function?

Answer: Compare expected StructType against df.schema. Flag missing columns and type mismatches.
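One way to sketch such a function, returning a list of human-readable violations (empty means the frame conforms):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

def validateSchema(df: DataFrame, expected: StructType): Seq[String] = {
  val actual = df.schema.map(f => f.name -> f.dataType).toMap
  expected.flatMap { field =>
    actual.get(field.name) match {
      case None => Some(s"missing column: ${field.name}")
      case Some(dt) if dt != field.dataType =>
        Some(s"type mismatch for ${field.name}: expected ${field.dataType}, got $dt")
      case _ => None
    }
  }
}
```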

23: Read Avro with union types?

Answer: Union becomes a struct with one field per member. Extract non-null with coalesce.

24: Partition by date, limit files per partition?

Answer: repartition(n, col("date")).write.partitionBy("date").

25: Detect schema drift between batches?

Answer: Diff current StructType against stored metadata. Block or merge as policy dictates.

Practice-Based Senior Spark Developer Interview Questions

Production incidents and design decisions that Spark intermediate developer interview questions rarely reach. These scenarios test whether a candidate can diagnose failures under pressure and propose fixes that hold long term. Answers should show ownership of reliability, not just awareness of tools.

1: Input doubled and the nightly job OOMs. Steps?

Bad answer: Add memory and rerun.

Good answer: Check the UI for the failing stage. Uneven durations suggest skew. Increase shuffle partitions, enable AQE skew handling, or salt the key.

2: Streaming checkpoint corrupted after storage outage. Recovery plan?

Bad answer: Delete the checkpoint and restart.

Good answer: Restore from a storage snapshot. Otherwise restart from a known offset and deduplicate downstream.

3: Strategy for migrating a legacy RDD pipeline to DataFrames?

Bad answer: Rewrite everything in one PR.

Good answer: Map operations one-to-one: map to withColumn, reduceByKey to groupBy.agg. Migrate one stage at a time, diff outputs.

4: Design a multi-tenant platform on a shared cluster?

Bad answer: Give each team its own cluster.

Good answer: YARN queues or Kubernetes namespaces with resource guarantees and dynamic allocation limits per job.

5: Join between two large tables takes 3 hours. How to cut it?

Bad answer: Switch to a bigger cluster.

Good answer: Check explain(). Bucket both tables on the join key. Enable AQE and dynamic partition pruning.

6: Downstream database can’t keep up with write rate?

Answer: Buffer in a staging Delta table. Rate-limited writer sizes batches to DB capacity.

7: Roll back a bad Delta write?

Answer: RESTORE TABLE to a previous version. Vacuum afterward.

8: Same input, non-deterministic results?

Answer: Check for rand(), current_timestamp(), or mutable state in UDFs.

9: Blue-green deployment for streaming?

Answer: New version on a separate checkpoint, both consume the same topic, validate then swap.

10: Executor sizing for heavy aggregation?

Answer: Fewer large executors. Four cores, 8 GB is a starting point. Adjust by spill metrics.

11: Consistency across three Delta tables in one pipeline?

Answer: Write all three inside a single foreachBatch. Idempotent writes handle retries.

12: Partner CSV with changing column order?

Answer: Read with header=true, select in expected order, validate schema before processing.

13: Benchmark two query implementations?

Answer: Same cluster, data, and config. Compare wall-clock, shuffle bytes, and peak memory. Run three times.

14: Migrate from on-prem Hadoop to cloud object store?

Answer: Replace HDFS paths, switch committer to S3A, adjust timeouts for higher latency.

15: Column-level access control?

Answer: Views projecting permitted columns per role. Unity Catalog column masks for fine-grained enforcement.

Tricky Senior Spark Developer Interview Questions

These 10 Spark interview questions and answers for experienced engineers probe corner cases. They target subtle runtime behaviours that surface only after months of production operation. Candidates who answer well here have likely debugged similar issues firsthand.

1: Plenty of free memory but the sort still spills. Why?

Bad answer: A bug in the memory manager.

Good answer: Cached data occupies the storage pool, and execution cannot evict it below spark.memory.storageFraction. Unpersist unused caches or lower spark.memory.storageFraction.

2: mapPartitions allocates a large buffer per partition. Impact on memory?

Bad answer: The engine manages it automatically.

Good answer: On-heap allocations inside closures are invisible to the memory manager and compete with execution.

3: Re-reading a cached DataFrame triggers a full recompute. Why?

Bad answer: Caching is a no-op.

Good answer: Blocks can be evicted under storage pressure. Check the Storage tab for eviction events.

4: Two streaming queries writing to the same Delta table concurrently?

Bad answer: One overwrites the other.

Good answer: Optimistic concurrency lets both commit if they touch disjoint files. Conflicts trigger retry.

5: When can a broadcast join be slower than a shuffle join?

Bad answer: Never, broadcast is always faster.

Good answer: Large data saturates driver memory during collection. Deserialization on hundreds of executors adds latency.

6: Same code, different results after a library upgrade?

Answer: New Catalyst rules may reorder joins or change null handling. Pin dependencies and test.

7: count() to verify cached data correctness?

Answer: Confirms row count, not quality. Corrupt records in PERMISSIVE mode load silently.

8: Dynamic allocation too slow for bursty workloads?

Answer: Container provisioning takes time. Pre-warm a minimum executor count and lower spark.dynamicAllocation.schedulerBacklogTimeout.

9: monotonically_increasing_id in a retried pipeline?

Answer: IDs depend on partition index. Retries can produce different IDs for the same rows.

10: Small predicate change, huge execution time difference?

Answer: Stale statistics cause a poor plan. ANALYZE TABLE refreshes them.

Tips for Spark Interview Preparation for Senior Developers

Senior readiness goes beyond memorizing answers. Interviewers expect you to narrate real decisions, explain trade-offs, and walk through debugging workflows live. The habits below build that confidence faster than passive review.

  • Reproduce a skew scenario locally and fix it with salting.
  • Read explain() output on every query for a week.
  • Set up a streaming pipeline with Kafka and checkpointing. Break the checkpoint and recover.
  • Profile a real job with async-profiler and build a flame graph.
  • Review Spark intermediate developer interview questions to keep fundamentals sharp.

Technical Interview & Assessment Service for Senior Scala Developers with Spark Experience

Our platform runs a dedicated technical assessment built around Scala. Senior candidates with production experience on the distributed processing engine go through a live evaluation with engineers who work in the same stack daily. Because the platform focuses exclusively on Scala, the questions reach deeper into language idioms, type-safe API usage, and cluster tuning than any generalist board can. Candidates receive structured feedback on both Scala proficiency and distributed computing depth. Hiring companies get pre-vetted profiles with granular scores, cutting weeks from the senior screening pipeline.

Why Submit Your Resume With Us

A dedicated evaluation gives hiring managers a clear signal about your senior-level depth before the first call. These are the advantages of going through the process.

  • Get evaluated by engineers who ship Scala and distributed data code in production.
  • Receive detailed feedback on language proficiency and pipeline design.
  • Join a vetted talent pool shared directly with hiring companies.
  • Stand out through a verified, technology-specific evaluation.

Conclusion

These 100 questions span architecture internals, performance tuning, production coding, incident response, and runtime edge cases. Use them to stress-test preparation before a senior round.

Find the Right Scala Talent with Our Specialized Platform

Post a Job

The post 100 Spark Interview Questions and Answers for Experienced Developers first appeared on Jobs With Scala.
