Most interview prep teaches you what to know. Not how to think.
That's a problem, because data engineering interviews don't fail candidates on knowledge gaps as often as people assume. They fail candidates who know the answer but can't show their work.
I watched "Data & AI Guy" solve five real interview questions live on camera using DataDriven.io. SQL, Python, Spark, data modeling, pipeline architecture. One problem per domain, full reasoning narrated out loud. It's a useful model for how to actually behave during an interview. Not just what to code, but how to move through the problem from prompt to solution to edge cases.
SQL: Say the Why Out Loud
The problem: return a deduplicated list of regions from an infrastructure nodes table.
The answer is SELECT DISTINCT region FROM infra_nodes. One line. Correct.
Most people write it and wait. He writes it and immediately explains why DISTINCT over GROUP BY. Both work here. But DISTINCT signals intent. You're not aggregating, you're deduplicating. That's what DISTINCT is for. GROUP BY is a more powerful tool being used as a weaker one.
Then he raises the null edge case without being asked.
If region is nullable, DISTINCT returns NULL as a value. Is that correct? Depends on the business context. He doesn't assume. He flags it and asks.
That's the habit worth building. Not just solving the problem, but immediately asking: where does this solution make an assumption it shouldn't? The NULL case takes ten seconds to raise. It tells the interviewer you've worked with real production data, where nullable columns are the default, not the exception.
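You can see the behavior in a few lines with an in-memory SQLite table (sample data invented for illustration; only the table and column names come from the problem):

```python
import sqlite3

# Hypothetical sample rows; node 4 has no region assigned
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE infra_nodes (node_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO infra_nodes VALUES (?, ?)",
    [(1, "us-east-1"), (2, "us-east-1"), (3, "eu-west-1"), (4, None)],
)

# DISTINCT treats NULL as a value: it comes back as its own row
rows = conn.execute("SELECT DISTINCT region FROM infra_nodes").fetchall()
print(rows)  # e.g. [('us-east-1',), ('eu-west-1',), (None,)] -- order not guaranteed
```

Is that (None,) row a valid region? That's exactly the question to put to the interviewer.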
The move: After every solution, ask yourself "what breaks here?" out loud. Interviewers are watching for that step.
Python: Decompose Before You Code
The problem: given a list of integers and container type names, group by distinct value, sort ascending, round-robin assign each group to a container, format output differently per container type (set gets deduped and sorted descending, list and tuple keep original order).
It's tricky. Multi-step, several interacting requirements.
Before touching code he restates the problem in his own words, slowly, and traces through a concrete example by hand.
"I'm grouping every occurrence of each distinct value together. So the ones go in one bucket, threes in another. Then I order those groups by distinct value ascending. Then round-robin assign each group to a container type."
This is not stalling. This is the move that prevents you from writing fifteen minutes of code that solves the wrong problem.
Once the decomposition is right the code is almost mechanical. defaultdict for grouping, sorted for ordering, enumerate with modulo for round-robin, conditional formatting per container type.
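A sketch of what that mechanical version can look like. This is one plausible reading of the spec, not his code; the function name and output shape are assumptions:

```python
from collections import defaultdict

def distribute(values, containers):
    # Group every occurrence of each distinct value together
    groups = defaultdict(list)
    for v in values:
        groups[v].append(v)

    # Round-robin assign groups, ordered by distinct value ascending
    assigned = defaultdict(list)
    for i, key in enumerate(sorted(groups)):
        container = containers[i % len(containers)]  # ZeroDivisionError if containers == []
        assigned[container].extend(groups[key])

    # Format per container type
    result = {}
    for container, items in assigned.items():
        if container == "set":
            result[container] = sorted(set(items), reverse=True)  # dedupe, sort descending
        elif container == "tuple":
            result[container] = tuple(items)  # original order
        else:
            result[container] = list(items)  # unknown names silently behave like "list"
    return result
```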
Then three edge cases:
Empty input: returns empty dict, works correctly
Empty containers list: ZeroDivisionError from modulo, worth flagging upfront
Unknown container name: falls through to else branch, silently treated like a list. If the interviewer wants strict validation you'd add an assert or raise
The third one is subtle and most candidates don't get there. The interviewer who wants to stress test you will hand you exactly that input.
The move: On any problem with multiple interacting requirements, restate out loud before coding. Trace one example by hand. List the edge cases you can already see. Do this even when the answer feels obvious.
Spark: The Expected Output Is Telling You Something
The problem: return authors who deployed to both dev and prod.
Looks simple. Then he looks at the expected output.
Alice and alice are separate rows. Different people as far as this query is concerned. But DEV and dev should match. Case sensitivity applies to authors, not environments.
He catches this before writing a line.
This is the Spark round in a nutshell. The prompt is often deliberately vague or ambiguous. The expected output encodes decisions the prompt doesn't state explicitly. Candidates who read the prompt and start coding miss it. Candidates who look at the expected output first and ask "what is this output telling me?" catch it.
His pipeline: normalize environment name to lowercase in a new column, filter to dev/prod using the normalized column, group by author, count DISTINCT environments (not deploys), filter to count equals two, sort alphabetically.
The DISTINCT matters. Without it, an author who deployed to dev five times and prod zero times gets count five and passes the filter. You want distinct environments hit, not total deploys.
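A minimal PySpark version of that pipeline, assuming a deploys DataFrame with author and environment columns (the column names are my guess, since the prompt isn't shown):

```python
from pyspark.sql import functions as F

both_envs = (
    deploys
    .withColumn("env_norm", F.lower(F.col("environment")))  # DEV == dev
    .filter(F.col("env_norm").isin("dev", "prod"))
    .groupBy("author")  # case-sensitive: Alice and alice stay separate
    .agg(F.countDistinct("env_norm").alias("env_count"))  # environments, not deploys
    .filter(F.col("env_count") == 2)
    .select("author")
    .orderBy("author")  # alphabetical
)
```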
He also shows an alternative self-join solution that another user submitted. That's worth noting. There's usually more than one way to solve these and being able to discuss tradeoffs between approaches is what senior loops are actually testing.
The move: On any problem with expected output shown, read the output before the prompt. Ask what decisions the output is making that the prompt left unstated.
Data Modeling: Name the Grain First
The problem: track employee application usage, flag anyone spending more than ten hours a day in a single app.
His answer is a star schema. Two dims, one fact, daily grain.
dim_employee: employee_id, full_name, city, department
dim_application: application_id, app_name, category
fact_application_usage: usage_id, employee_id (FK), application_id (FK), usage_date, hours_used, over_ten_hour_flag
He names the grain explicitly before drawing anything. One row equals one employee using one application on one day. That grain is forced by the prompt. The ten-hour threshold is daily. The summaries are per employee per application. There's no other grain that fits.
This is the skill most data modeling rounds are actually testing. Not whether you know what a star schema is. Whether you can derive the grain from the business requirements and then explain why that grain is the only one that works.
The over_ten_hour_flag lives on the fact table rather than the BI layer. Reasonable call. If HR is querying it constantly, materializing it once is cleaner than recomputing it in every downstream report.
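As a sketch, that materialization is one line at build time, assuming a daily_usage frame (hypothetical name) already at the one-employee-one-app-one-day grain:

```python
from pyspark.sql import functions as F

# Compute the flag once when the fact table is built,
# instead of in every downstream HR report
fact_application_usage = daily_usage.withColumn(
    "over_ten_hour_flag", F.col("hours_used") > 10
)
```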
The move: Before drawing any schema, answer three questions out loud: who is the actor, what is the event, what is the time granularity. The schema follows from those answers.
Pipeline Architecture: Constraints Drive the Design
The last problem. Greenfield build. Six sources. Three requirements:
Dashboards ready by 8am on weekdays
One canonical definition of MRR and NRR
Finance cannot see raw card data
He builds a medallion architecture: sources into Kafka/Fivetran ingestion layers, into raw bronze Delta tables, through dbt/DLT into cleaned silver tables, into gold star schema marts, with Unity Catalog and Airflow orchestrating across all of it.
The constraints are the answer.
The 8am SLA means batch-first. Maybe one streaming path for event data. But you cannot promise 8am dashboard delivery on a streaming-first architecture without a lot of complexity that isn't justified here.
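Sketched as an Airflow schedule (hypothetical DAG and task names), the SLA shows up as a cron line with deliberate headroom:

```python
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Weekday batch run timed to land well before the 8am dashboard SLA
with DAG(
    dag_id="morning_marts",
    schedule="0 5 * * 1-5",  # 05:00 Mon-Fri leaves hours of buffer
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest_to_bronze")
    transform = EmptyOperator(task_id="transform_silver_to_gold")
    publish = EmptyOperator(task_id="refresh_dashboards")
    ingest >> transform >> publish
```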
The canonical MRR/NRR requirement means a semantic layer. Not just clean gold tables. If the metric definition lives in five BI tools separately, you have five definitions of MRR inside a year. The semantic layer is what makes "one canonical definition" actually mean something.
Finance not seeing raw card data means catalog-level security. Table-level access controls fail the moment someone grants the wrong table permission. Column masking and dynamic views in Unity Catalog enforce the control at the platform level, rather than depending on every access grant being correct.
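The dynamic-view pattern looks roughly like this (hypothetical table, column, and group names; is_account_group_member is the Databricks function for this kind of check):

```python
# Finance queries the view; only the payments group sees raw card numbers
spark.sql("""
    CREATE OR REPLACE VIEW gold.payments_for_finance AS
    SELECT
        payment_id,
        amount,
        CASE
            WHEN is_account_group_member('payments_eng') THEN card_number
            ELSE '****-masked'
        END AS card_number
    FROM silver.payments
""")
```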
The candidates who fail architecture rounds can draw this diagram. They just can't explain why it's the right diagram for these constraints. The boxes are decoration. The constraints are the substance.
The move: Before drawing any architecture, write down the constraints and what each one forces. The diagram should be a consequence of the constraints, not the starting point.
The Meta-Skill Across All Five
Restate the problem. Trace an example. Narrate your reasoning. Raise the edge cases.
None of those are syntax. They're production instincts. The interviewer is trying to figure out if you've actually shipped things that broke at 2am. The reasoning is how you show that you have.
Most people cram answers. This video is a useful model for training the thing that actually gets you hired: intuition.
Where to Practice
DataDriven.io is where all five problems in the video come from. It covers six domains: SQL, Python, Spark, AI Coding, data modeling, and pipeline architecture. Most platforms stop at SQL.
Multiple solutions per problem, community submissions, and it's free. No trial, no credit card, no pricing page.