DataDriven
What Spark Interviews Actually Test (Based on 189 Real Interview Reports)

We scraped thousands of data engineering interview reports from across the internet. 189 of them mentioned Spark. We tagged every question, tracked every outcome, and found patterns that contradicted most of the advice we see online.

This is what the data says.

Spark Shows Up Less Than You Think

Across all the reports we collected, Spark appeared in 6.7% of them. SQL appeared in 22.8%. Python in 16%.

That ratio matters. If you have 4 weeks to prep and you spend 2 of them grinding Spark internals, you've made a bad bet. SQL is 3.4x more likely to show up. Python is 2.4x more likely.

But here's the catch: when Spark does show up, it shows up hard. It's rarely one question in a round. It tends to be the entire round. And the failure rate is brutal.

The Question Changes Completely By Level

Most people prep for Spark interviews as if there's one test. There isn't. The question changes shape depending on what level you're interviewing for, and the jump between levels is steeper than people expect.

At L3/L4, interviewers test whether you can explain the basics. "What is a DAG?" "Why is a shuffle expensive?" "Tell me about your PySpark projects." One candidate interviewing at Nasdaq described the round as "Python, Pandas, PySpark, Databricks, Linux commands, my projects in Python." Conceptual. Vocabulary. Can you talk about this stuff without stumbling?

At L5, the entire format flips. The interviewer hands you a Spark UI screenshot and says "this job was meeting SLA for six months and now it's 10x slower. Nothing in the code changed. Walk me through your diagnosis." A TikTok L5 round combined "complex SQL problems, Spark architecture, and performance optimization questions, including indexing strategies, partitioning, query tuning, and resource management in distributed data processing systems" into a single session. You're not explaining what Spark is. You're fixing something that broke at 3am.

At L6, the scope widens again. One candidate at Booking.com was rejected because their system design choices were wrong: "feedback centered on tool choices (Flink vs Spark despite prompt asking for low latency; Redis vs Cassandra)." The question isn't "fix this job." It's "design the memory layout for a system that caches 100GB of reference data while running a 500GB sort-merge join." You're sizing executors, reasoning about GC pressure past 30GB of heap, deciding between MEMORY_AND_DISK and recomputation.

At L7, it's organizational. "How would you design a Spark application that processes 100+ PB across a shared multi-tenant cluster?" The bottleneck isn't compute anymore, it's resource isolation between 50 competing teams.

Same topic at every level. Completely different test. One Databricks candidate went through a 7-round process over 60 days that included a take-home with 15 hands-on Spark questions, followed by a live grilling where a lead engineer dug into their optimization choices. They called the whole experience "disappointing after almost 2 months."

The prep that gets you through an L3 round won't even register as relevant at L5.

68.8% of "Spark Interviews" Are Really SQL-at-Scale Interviews

This one surprised me the most. We tagged every technical topic mentioned in the 189 Spark interview reports. The breakdown:

Topic                 % of Spark Interviews
SQL optimization      68.8%
Performance tuning    11.1%
Window functions       6.3%
Joins                  5.3%
Partitioning           3.2%
Data skew              2.6%
Memory management      2.6%

Nearly 7 out of 10 "Spark interviews" are really about running SQL efficiently at distributed scale. Not RDD transformations. Not Catalyst internals. SQL.

The typical question sounds like this (from a real TikTok L5 interview): "...Discussed Spark architecture, and answered performance optimization questions, including indexing strategies, partitioning, query tuning, and resource management in distributed data processing systems."

SQL is the entry point. Spark is the context. The question is whether you understand what happens to your SQL query after you hit enter on a 500-executor cluster.
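You can watch that happen yourself: `explain()` on any DataFrame prints the plan Catalyst produced before a single task runs. A minimal sketch, assuming a live SparkSession named `spark` and a hypothetical registered table `orders`:

```python
# What Spark will actually do with your SQL, before you run it.
# Requires a live SparkSession; `orders` is a hypothetical table.
plan = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
""")

# "formatted" mode (Spark 3.0+) prints a readable physical plan.
plan.explain(mode="formatted")
```

Every `Exchange` node in that output is a shuffle, and the join node tells you whether Spark chose broadcast or sort-merge. That's the vocabulary the interview is conducted in.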

Nobody Asks About RDDs Anymore

Zero interviews in the dataset asked about RDDs as a primary topic. Zero asked about GC tuning directly.

That doesn't mean these concepts are irrelevant. It means interviewers have stopped asking "what is an RDD" and started asking questions where RDD knowledge helps you reason about the answer. The question is "why is this job slow?" and the ability to think in terms of lineage, partitions, and shuffle boundaries is what separates a good answer from a textbook recitation.

If you're spending prep time memorizing the difference between map and flatMap on RDDs, stop. That time is better spent learning to read a Spark UI.

What Companies Actually Ask

Here are real questions from real interviews, pulled directly from the reports:

Databricks (37 interviews, 46% rejection rate): "Length, breadth, height, depth on Spark core, DLT, Unity Catalog, code optimization, scenarios, your project issues and how they were resolved." Their process runs 7+ rounds over 50-60 days. One candidate reported being rejected after the presentation round "because of less Databricks knowledge" despite the hiring manager saying Databricks knowledge wouldn't be needed. The bar is inconsistent and the process is long.

Finance companies (multi-round, structured): "Explanation of Spark architecture in detail and different optimization techniques if any Spark job is taking long to run." These tend to be 4-round processes: PySpark coding, Spark optimization, system design, then a techno-managerial round.

TikTok (L5, 25% rejection): Complex SQL + Spark architecture + performance optimization in a single round. They test breadth.

BNSF Railway (100% rejection in dataset): "Multi-round process with system design, SQL, PySpark, and a deep technical discussion with leadership. The interviews were very challenging and focused heavily on real-world trade-offs, especially around data architecture and streaming concepts." When a railroad company rejects every Spark candidate in your dataset, they're not messing around.

QuantumBlack (McKinsey's data arm, 71% rejection): "What makes PySpark great? How do you debug PySpark?" Then a 45-minute coding test with 3 problems solvable in either Pandas or PySpark.

The Five Failure Patterns That Keep Showing Up

After tagging all the technical content across interview reports, challenge databases, and company-specific prep guides, five production failure patterns dominate what senior interviews test:

1. Data skew on power-law keys. One partition holds 320M rows while the others hold 3-4M. Task 199 runs for 7,140 seconds while the other 199 tasks each finish in 22 seconds. The interviewer wants you to identify the skew from the Spark UI, explain why adding more executors won't help (the bottleneck is one partition, not total parallelism), and apply the right fix (broadcast the small table, or salt the key if both tables are large).
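The mechanics are easy to see without a cluster. This is a toy simulation in plain Python, not Spark code: it hashes a power-law key set into 200 "partitions" the way a shuffle would, then shows how salting a hypothetical hot key spreads it out.

```python
import hashlib
from collections import Counter

def partition(key: str, num_partitions: int = 200) -> int:
    # Stable hash -> partition id, mimicking how a shuffle assigns rows.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_partitions

# Power-law keys: one hot key dominates the dataset.
rows = ["hot_user"] * 100_000 + [f"user_{i}" for i in range(100_000)]

skewed = Counter(partition(k) for k in rows)
print(max(skewed.values()))   # one partition holds all 100k hot rows

# Salting: append a suffix in [0, SALT) so the hot key spreads across
# up to SALT partitions. (In a real join, the other side must be
# exploded with all SALT suffixes to keep the join correct.)
SALT = 16
salted = Counter(partition(f"{k}#{i % SALT}") for i, k in enumerate(rows))
print(max(salted.values()))   # hot rows now spread across many partitions
```

Same data, same hash function; the only change is the key. That's why "add more executors" doesn't fix skew but salting does.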

2. Broadcast overflow. A dimension table that was 8MB a year ago grew past the 10MB autoBroadcastJoinThreshold silently. Spark switched from BroadcastHashJoin to SortMergeJoin without anyone noticing. Runtime went from 8 minutes to 2 hours. The fix is one line of code. The interview tests whether you can find that line.
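That one line can go two ways. A hedged sketch, assuming a live SparkSession named `spark` and hypothetical DataFrames `facts` and `dim` standing in for the fact table and the dimension table that outgrew the threshold:

```python
from pyspark.sql.functions import broadcast

# Option 1: force the broadcast per-join, regardless of size estimates.
result = facts.join(broadcast(dim), on="dim_id")

# Option 2: raise the threshold cluster-wide. Value is in bytes;
# setting it to -1 disables automatic broadcast selection entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 64 * 1024 * 1024)
```

The hint is the safer fix in an interview answer: it pins the decision to the one join you diagnosed instead of changing planner behavior for every query on the cluster.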

3. Shuffle explosion. Someone added a repartition() before a join, thinking more partitions would speed things up. It multiplied shuffle volume by 50x. Network saturated. The interviewer wants you to explain why repartition before a join is almost always wrong and what to do instead.
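Concretely: `repartition()` is itself a full shuffle, stacked on top of the shuffle the join performs on the join key anyway. A hedged sketch of the anti-pattern and the usual alternatives, again assuming hypothetical `facts`/`dim` DataFrames and a live session:

```python
# Anti-pattern: round-robin repartition adds a complete extra shuffle
# of the fact table *before* the join shuffles it again on the key.
bad = facts.repartition(5000).join(dim, on="dim_id")

# Usually better: tune the join's own shuffle instead of adding one...
spark.conf.set("spark.sql.shuffle.partitions", 2000)
# ...or let adaptive query execution coalesce partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
good = facts.join(dim, on="dim_id")
```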

4. Executor OOM from cached data. A 100GB reference table is cached. A 500GB sort-merge join needs execution memory. Both compete for the unified pool (by default 60% of the heap, after a 300MB reserve). Spark's unified memory model lets execution evict cached blocks, but at 100GB the eviction churn destroys throughput. The interview tests whether you understand spark.memory.fraction, spark.memory.storageFraction, and the tradeoff between cache hit rate and execution headroom.
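The arithmetic is worth being able to do on a whiteboard. A back-of-envelope helper using the Spark 3.x defaults (0.6 memory fraction, 0.5 storage fraction, 300MB reserved); this is plain Python, not a Spark API:

```python
def unified_memory(heap_gb: float,
                   memory_fraction: float = 0.6,
                   storage_fraction: float = 0.5):
    """Rough split of executor heap under the unified memory model."""
    reserved_gb = 0.3                        # 300MB reserved for internals
    pool = (heap_gb - reserved_gb) * memory_fraction  # execution + storage
    storage_floor = pool * storage_fraction  # cache protected from eviction
    execution_min = pool - storage_floor     # guaranteed execution headroom
    return pool, storage_floor, execution_min

# The 28GB-heap executor from the skew example: only ~16.6GB of unified
# pool -- nowhere near enough to hold a 100GB cache and feed a 500GB
# sort-merge join at the same time.
pool, floor, exec_min = unified_memory(28)
print(f"pool={pool:.1f}GB storage_floor={floor:.1f}GB "
      f"execution_min={exec_min:.1f}GB")
```

Run the numbers before the interview and the "why is this executor OOMing" question becomes a subtraction problem.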

5. Catalyst plan regression from stale statistics. CBO statistics went stale after a table doubled in size. Spark picked sort-merge instead of broadcast. Nobody changed any code. The job just got slower. The interviewer wants you to explain how Catalyst's cost-based optimizer works and why ANALYZE TABLE ... COMPUTE STATISTICS matters.
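The remediation is short enough to say out loud. The ANALYZE TABLE statements are real Spark SQL; `dim` and `dim_id` are hypothetical names, and a live SparkSession is assumed:

```python
# CBO only uses statistics if it's enabled.
spark.conf.set("spark.sql.cbo.enabled", "true")

# Refresh table-level and column-level statistics so the cost-based
# optimizer sees the table's current size, not last year's.
spark.sql("ANALYZE TABLE dim COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE dim COMPUTE STATISTICS FOR COLUMNS dim_id")

# Sanity check: what does the optimizer think the table looks like now?
spark.sql("DESCRIBE EXTENDED dim").show(truncate=False)
```

The stronger interview answer also mentions prevention: schedule the ANALYZE after large loads, or lean on AQE, which re-plans from runtime statistics instead of stored ones.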

These five patterns cover what I'd estimate is 80%+ of production Spark incidents. They're also what separates L3 answers ("it's slow because of the data") from L6 answers ("task 199 is reading 15.8GB of shuffle data because the top 1% of user_ids hash to the same partition, and the executor is at 78% GC overhead because it's trying to sort 320M rows in 28GB of heap").

The Real Signal: Can You Read the Spark UI?

Every pattern above comes down to one skill: reading the Spark UI and reasoning about what you see.

Stages, tasks, shuffle read/write, GC time, executor memory. That's the entire diagnostic surface. If you can look at a Spark UI screenshot and say "task 199 has 100x the shuffle read of every other task, the executor is at 98% heap, and the physical plan shows SortMergeJoin when this should be a broadcast" then you pass. If you can't, you recite textbook answers and the interviewer can tell.

This is the skill that most Spark prep resources skip entirely. They teach you what a broadcast join is. They don't teach you to recognize when a missing broadcast join is the reason your 3am pager went off.
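That first read of a stage page can be caricatured in plain Python. This is a toy, not anything from Spark's API: a function over per-task metrics that flags the same two signals, skew and memory pressure, described in the failure patterns above.

```python
def triage(tasks, gc_overhead_pct, heap_used_pct):
    """Toy first-pass triage over per-task metrics from a stage page.
    Each task is a (duration_seconds, shuffle_read_gb) tuple."""
    durations = sorted(d for d, _ in tasks)
    reads = [r for _, r in tasks]
    median_duration = durations[len(durations) // 2] or 1
    findings = []
    if durations[-1] > 10 * median_duration:
        findings.append("skew: one task dwarfs the median duration")
    if max(reads) > 10 * (sum(reads) / len(reads)):
        findings.append("skew: one task reads far more shuffle data")
    if gc_overhead_pct > 10:
        findings.append("memory pressure: GC overhead above 10%")
    if heap_used_pct > 90:
        findings.append("memory pressure: executor heap nearly full")
    return findings

# 199 healthy tasks plus the stuck task 199 from the skew pattern:
tasks = [(22, 0.08)] * 199 + [(7140, 15.8)]
print(triage(tasks, gc_overhead_pct=78, heap_used_pct=98))
```

The thresholds are made up; the habit is the point. Compare the worst task to the median, not the average, and read shuffle size and GC together before touching any code.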

Practice This Before Your Interview

I built a free Spark mock interview that simulates exactly this. You get paged. You see real Spark UI evidence: task durations, shuffle sizes, GC overhead, executor memory, the physical plan. You diagnose, write the fix in PySpark or Scala, run your code in the browser, then an AI interviewer grills you on tradeoffs and edge cases.

Four phases:

  • Think (5 min): Read the Spark UI. Diagnose before you touch code.
  • Code (15 min): Write and run your PySpark or Scala fix in a hosted IDE.
  • Discuss (10 min): AI interviewer asks follow-ups one at a time. "What happens when the table doubles?" "Why not just add more executors?"
  • Verdict: Scored across 5 dimensions (problem solving, technical execution, communication, verification, requirements understanding). Calibrated from L3 to L7.

No paywall, no trial, no credit card.

Try it here: Spark Mock Interview


Data sourced from thousands of interview reports scraped across the internet, covering 945+ companies.
