Every data engineer preparing for interviews hits the same moment of confusion. You search for Python interview questions, get a list of reverse-a-linked-list and two-sum problems, grind them for two weeks, walk into your first data engineering loop, and get asked to deduplicate a 10-million-row event stream while preserving the latest record per composite key.
None of those LeetCode problems prepared you for that question. And the Python round in a data engineering interview is not going to get easier until you realize the questions are a different species from the ones that show up in a software engineering loop.
I have run over 250 interview loops at Google, Meta, LinkedIn, and Netflix. The Python portion of a data engineering loop does not look like a Python backend or frontend loop. It looks like a pipeline-correctness loop wearing a Python costume. Below is the full taxonomy of the six questions that actually get asked, how they differ from the SWE canon, and which patterns you have to internalize to pass.
The Core Difference In One Sentence
The SWE Python round tests whether you can write correct code on data that fits in memory. The data engineering Python round tests whether you can reason about data correctness, grain, idempotency, and scale on data that usually does not.
That is the entire gap. Every other difference flows from it.
A SWE Python problem gives you a list of integers and asks you to do something clever with it. The list has 10 elements. The test cases have 10 elements. The expected behavior is obvious. The skill tested is algorithms.
A data engineering Python problem gives you an iterator over events that might have 10 million elements, might have duplicates from retries, might have late-arriving data, might have schema drift between rows, and asks you to produce a deduplicated, ordered, grouped output without loading it all into memory. The test cases will have 5 elements. The skill tested is production instinct.
Candidates who prepped for SWE Python walk in confident and freeze the moment the input becomes an iterator instead of a list.
Question 1: Streaming Aggregation Over an Iterator
This is the single most common Python question I have given and received in a data engineering loop across four FAANG companies.
Setup: you are handed an iterator that yields event dictionaries one at a time. Each event has user_id, event_type, ts, and a few other fields. Compute the count of each event type per user without loading the full iterator into a list.
A SWE candidate types this:
def count_events(events):
    events_list = list(events)
    result = {}
    for e in events_list:
        key = (e["user_id"], e["event_type"])
        result[key] = result.get(key, 0) + 1
    return result
That solution works on the 5 test events the interviewer gave you. It also kills the pipeline when a real day's worth of events arrives. The list(events) call materializes the whole iterator. In production that is 40 GB of dictionaries in memory for no reason.
The data engineering answer never materializes:
from collections import defaultdict

def count_events(events):
    counts = defaultdict(int)
    for e in events:
        counts[(e["user_id"], e["event_type"])] += 1
    return counts
Same logic, different relationship with memory. An interviewer running this round is watching for whether you call list() on an iterator. If you do, you have told them you think like a SWE, not a data engineer. Half the battle in Python data engineering interviews is showing you know the difference between an iterator and a list and that you default to iterators.
The follow-up is always: what if the iterator is too large to even hold the counts dict in memory? Now you are in sketch-aggregation territory (HyperLogLog, Count-Min Sketch) or you partition by a hash of the key. If you have never heard of those, the senior bar just evaporated.
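If the counts themselves overflow memory, the answer interviewers usually accept before you reach for sketches is hash partitioning: spill events into N partition files by a hash of the key, then aggregate one partition at a time. A rough sketch of that idea, assuming local disk is available as spill space (the file layout and names here are illustrative, not part of the prompt):

import hashlib
import json
import os
import tempfile
from collections import defaultdict

def _partition(key, num_partitions):
    # Stable hash so the same composite key always routes to the same partition file.
    digest = hashlib.md5(repr(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def count_events_partitioned(events, num_partitions=16):
    # Pass 1: route each event to a partition file keyed by hash(user_id, event_type),
    # never holding the full stream in memory.
    tmp_dir = tempfile.mkdtemp(prefix="event_counts_")
    paths = [os.path.join(tmp_dir, f"part_{i}.jsonl") for i in range(num_partitions)]
    handles = [open(p, "w") for p in paths]
    try:
        for e in events:
            key = (e["user_id"], e["event_type"])
            handles[_partition(key, num_partitions)].write(json.dumps(e) + "\n")
    finally:
        for h in handles:
            h.close()
    # Pass 2: aggregate one partition at a time, so only ~1/num_partitions of the
    # distinct keys are ever in memory together.
    for p in paths:
        counts = defaultdict(int)
        with open(p) as f:
            for line in f:
                e = json.loads(line)
                counts[(e["user_id"], e["event_type"])] += 1
        yield from counts.items()

On the whiteboard, naming the idea matters more than the file handling; it is the same trick a partitioned shuffle does for you in Spark.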
Question 2: Deduplication With a Tiebreaker
Every data engineering loop has a dedup question. The SWE version is "remove duplicates from a list." The DE version is "here is an event stream with retries. Each event has an event_id, an ingested_at, and an updated_by that is sometimes null. Keep one row per event_id, preferring the latest ingested_at, breaking ties by preferring a non-null updated_by."
A SWE candidate reaches for a set. Sets do not have tiebreakers. Sets discard information you need.
The DE answer iterates, keeps a dict keyed by event_id, and compares against the current best:
def dedupe(events):
    best = {}
    for e in events:
        key = e["event_id"]
        current = best.get(key)
        if current is None:
            best[key] = e
            continue
        if e["ingested_at"] > current["ingested_at"]:
            best[key] = e
        elif e["ingested_at"] == current["ingested_at"]:
            if current["updated_by"] is None and e["updated_by"] is not None:
                best[key] = e
    return best.values()
What interviewers are testing here is not the algorithm. The algorithm is trivial. They are testing whether you ask "what defines a duplicate" before typing, whether you handle the null case in the tiebreaker explicitly, and whether you return an iterator-compatible view instead of a list.
The follow-up is always: what if the input is sorted by ingested_at already? Can you do this in constant additional memory? That is where a streaming groupby pattern comes in, and if you can sketch it on the fly you are clearing a senior bar.
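For reference, the constant-memory version only works when the stream arrives grouped by the dedup key (for example sorted by event_id first, ingested_at second); then itertools.groupby lets you hold a single group of duplicates at a time. A sketch under that assumption:

from itertools import groupby
from operator import itemgetter

def dedupe_sorted(events):
    # Assumes the stream is grouped by event_id; only one group of duplicates
    # is held in memory at any point.
    for _, group in groupby(events, key=itemgetter("event_id")):
        best = next(group)
        for e in group:
            if e["ingested_at"] > best["ingested_at"]:
                best = e
            elif (
                e["ingested_at"] == best["ingested_at"]
                and best["updated_by"] is None
                and e["updated_by"] is not None
            ):
                best = e
        yield best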
Question 3: Schema-Tolerant Parsing
This is the one SWE prep completely ignores and DE interviews lean on heavily.
Setup: you are given a list of dictionaries representing events from a log. Some events are missing fields. Some have extra fields. Some have the right field names but wrong types. Write a function that produces a clean, typed output and a quarantine list for rows that cannot be parsed.
SWE Python does not train this muscle. The LeetCode problem gives you a clean input every time. The DE interviewer is watching whether you:
- Validate required fields explicitly before touching them
- Cast types with an explicit try/except around each cast
- Never let one bad row kill the whole batch
- Separate "valid" from "invalid" without discarding the invalid rows silently
A reasonable answer:
def parse_events(rows):
    valid, quarantine = [], []
    for row in rows:
        try:
            if "user_id" not in row or "ts" not in row:
                raise ValueError("missing required field")
            parsed = {
                "user_id": int(row["user_id"]),
                "ts": int(row["ts"]),
                "event_type": str(row.get("event_type", "unknown")),
                "amount": float(row["amount"]) if "amount" in row else None,
            }
            valid.append(parsed)
        except (ValueError, TypeError, KeyError) as err:
            quarantine.append({"row": row, "error": str(err)})
    return valid, quarantine
The trap interviewers plant is a row where user_id is the string "null" instead of the Python None. int("null") raises. A candidate who wraps the whole loop in one big try/except instead of handling each row loses not just the bad row but every row after it, which is a pipeline bug.
If you have never written parsing code in production, the instinct to quarantine bad rows instead of crashing on them is foreign. It is the single most senior-signaling habit in a DE Python interview.
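A quick sanity check of that trap, using made-up rows rather than anything from the prompt:

rows = [
    {"user_id": "42", "ts": "1700000000", "event_type": "click", "amount": "3.50"},
    {"user_id": "null", "ts": "1700000001"},  # the trap: int("null") raises ValueError
    {"ts": "1700000002"},                     # missing required field
]
valid, quarantine = parse_events(rows)
print(len(valid), len(quarantine))  # 1 2 -- the batch survives the bad rows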
Question 4: Window and Session Logic Without SQL
When the Python round asks a question that maps to a SQL window function, the SQL-only candidates freeze. The prompt sounds like: "Here is a sorted list of events per user. Group them into sessions where a session ends after 30 minutes of inactivity. Return a list of sessions with start_ts, end_ts, and event_count."
This is the sessionization pattern the SQL round tests, ported to Python. The interviewer wants to see whether you can implement in Python what you would write as a LAG + running SUM in SQL.
def sessionize(events, gap_seconds=1800):
    sessions, current = [], None
    for e in events:
        if current is None or e["ts"] - current["end_ts"] > gap_seconds:
            if current is not None:
                sessions.append(current)
            current = {"start_ts": e["ts"], "end_ts": e["ts"], "event_count": 1}
        else:
            current["end_ts"] = e["ts"]
            current["event_count"] += 1
    if current is not None:
        sessions.append(current)
    return sessions
The SWE candidate tries to use a library. The DE candidate writes the two-pointer/rolling-state loop. This is the clearest example of DE Python being closer to state-machine code than to algorithmic code.
Follow-ups are always about edge cases. What if two events share a timestamp? What if the input is not sorted? What if the iterator is chunked across hour boundaries and you have to support resuming? Each one is a real pipeline concern, not an algorithmic concern.
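For the resuming follow-up, one reasonable move is to pass the open session in and out as explicit state so the caller can carry it across chunks. A sketch of that variation; the parameter name and return shape are my own choices, not part of the prompt:

def sessionize_chunk(events, gap_seconds=1800, open_session=None):
    # `open_session` is the unfinished session carried over from the previous
    # chunk. Returns (closed_sessions, still_open_session) so the caller can
    # feed the open one into the next chunk and flush it after the last chunk.
    sessions, current = [], open_session
    for e in events:
        if current is None or e["ts"] - current["end_ts"] > gap_seconds:
            if current is not None:
                sessions.append(current)
            current = {"start_ts": e["ts"], "end_ts": e["ts"], "event_count": 1}
        else:
            current["end_ts"] = e["ts"]
            current["event_count"] += 1
    return sessions, current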
Question 5: Backfill-Safe Incremental Logic
This is the question that most distinguishes senior data engineers from mid-level candidates.
Setup: you have a function that processes yesterday's events. Rewrite it so that running it on today's date, on a date from last week, or on a date range over the last month all produce correct output without double-counting or dropping data.
The SWE candidate does not realize they were asked a design question. They write a function that filters by date and returns aggregates. The DE candidate writes code that is idempotent, deterministic on input date range, and safe against partial re-runs.
The moves interviewers are watching for:
- Filter on ingested_at (processing time), not on event_ts (event time), when you want to catch late data
- Produce output keyed by (partition, primary_key) so re-running overwrites instead of appends
- Take the date range as an argument, not as datetime.now()
- Emit a result that is the same shape whether you process one day or thirty
A minimal answer:
def process_range(events_iter, start_date, end_date):
    result = {}
    for e in events_iter:
        if not (start_date <= e["ingested_date"] < end_date):
            continue
        key = (e["ingested_date"], e["user_id"])
        if key not in result:
            result[key] = {"user_id": e["user_id"], "date": e["ingested_date"], "n": 0}
        result[key]["n"] += 1
    return list(result.values())
The interviewer's follow-up is "what happens if this fails halfway through?" If you do not immediately say "the same inputs produce the same outputs, so re-running it is safe," you have not understood why you were asked this question. Idempotency is the senior signal.
Question 6: PySpark DataFrame Logic
If the role is Spark-heavy, and many data engineering roles in 2026 are, the Python portion of the interview is really a PySpark portion. The questions are the same five patterns above, expressed in DataFrame API calls instead of pure Python.
The specific patterns that recur:
A join on multiple columns, phrased as "here are two DataFrames, produce one row per customer with their latest order and their primary payment method." Candidates who do not know the join(other, on=["customer_id", "region"], how="left") syntax lose time to syntax, not logic.
A window function in PySpark. Same sessionization prompt, but now you write Window.partitionBy("user_id").orderBy("ts") and use F.lag("ts").over(w) to compute gaps. This is the direct translation of the SQL pattern and the pure-Python pattern, and interviewers love it because it tests whether you have touched all three.
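A rough PySpark sketch of that translation, assuming a time-ordered events DataFrame with user_id and ts columns; the tiny sample frame is only for illustration:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.createDataFrame(
    [("u1", 100), ("u1", 200), ("u1", 5000), ("u2", 300)],
    ["user_id", "ts"],
)

w = Window.partitionBy("user_id").orderBy("ts")
sessions = (
    events
    .withColumn("prev_ts", F.lag("ts").over(w))
    # A gap over the threshold (or the first event per user) starts a new session.
    .withColumn(
        "is_new_session",
        (F.col("prev_ts").isNull() | ((F.col("ts") - F.col("prev_ts")) > 1800)).cast("int"),
    )
    # Running sum of session breaks gives each event a session number per user.
    .withColumn("session_id", F.sum("is_new_session").over(w))
)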
A broadcast-join decision. Interviewers describe two tables of wildly different sizes and ask how you would join them. If you say "regular join," you fail. If you say "broadcast the small one," you pass the first level. If you say "broadcast the small one, but only if it fits in the executor memory budget, otherwise repartition and sort-merge," you pass the senior level.
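A minimal sketch of the first-level answer, with throwaway sample frames standing in for the real fact and dimension tables:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
facts = spark.createDataFrame([("c1", 100.0), ("c2", 55.0)], ["customer_id", "amount"])
dims = spark.createDataFrame([("c1", "US"), ("c2", "DE")], ["customer_id", "region"])

# broadcast() ships the small table to every executor, so the large table is
# never shuffled. Only safe while the small side fits in the executor memory
# budget; otherwise fall back to a repartitioned sort-merge join.
joined = facts.join(broadcast(dims), on="customer_id", how="left")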
A partitioning and skew question. A DataFrame has 10 million rows, 90% of which share the same user_id (a bot). The interviewer asks what happens when you groupBy("user_id"). The answer involves salting, two-stage aggregation, or adaptive query execution. This is not a SWE question. It is a pipeline-performance question dressed as code.
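A sketch of the salting, two-stage aggregation idea; the salt width of 10 is an arbitrary choice for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.createDataFrame(
    [("bot", "click")] * 9 + [("u2", "view")],
    ["user_id", "event_type"],
)

# Stage 1: spread the hot key across 10 salted sub-keys so no single task
# receives the bulk of the rows.
salted = events.withColumn("salt", (F.rand() * 10).cast("int"))
partial = salted.groupBy("user_id", "salt").agg(F.count("*").alias("partial_n"))

# Stage 2: collapse the salted partials back to one row per user_id.
final = partial.groupBy("user_id").agg(F.sum("partial_n").alias("n"))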
What the SWE Round Never Asks That the DE Round Always Asks
If you only prep SWE-style Python you will never see these coming in a data engineering loop:
- "What is the memory footprint of this solution on 100 million rows?"
- "What happens to this code if the input is an iterator instead of a list?"
- "How do you make this idempotent?"
- "What does this return when a field is missing or null?"
- "How does this behave if you run it twice?"
- "If this was running in production and failed midway, what would the next run see?"
- "What is the grain of the output, and does that match what the downstream consumer expects?"
Every one of those is a pipeline-correctness question, not an algorithm question. Every one of them is what separates a DE hire from a SWE hire in the same company.
How To Practice In Four Weeks
Week one, iterator-first coding. Rewrite every LeetCode-style problem you have solved so that it accepts an iterator and returns an iterator. Use itertools.groupby, itertools.islice, generator expressions. Stop reaching for lists.
Week two, the five recurring patterns above. Streaming aggregation, dedup with tiebreakers, schema-tolerant parsing, sessionization, and backfill-safe incremental logic. Write each one on paper before running it.
Week three, PySpark DataFrame fluency. Joins with multiple keys, window functions, broadcast hints, skew handling. Read one real PySpark job from an open-source repository end to end. The muscle memory for DataFrame syntax only comes from reading real jobs.
Week four, edge cases. Null handling, duplicate keys, out-of-order inputs, idempotency, late-arriving data. Most DE interview rejections happen on the edge-case follow-up, not on the main question. Budget more time here than feels right.
The Meta-Skill
None of the patterns above are syntactically hard. The hard part is that the DE Python interview is testing a worldview. You see the question "count events per user" and the SWE worldview asks "what data structure." The DE worldview asks "what is the grain, what is the scale, what is the recovery story."
Candidates who cross over from SWE Python to DE Python successfully are the ones who rewire the question first. Before typing: what is the grain, what is the scale, is the input an iterator, how does this behave on a re-run, how does it handle nulls and missing fields. Once those questions are automatic, the syntax is trivial.
The DE Python interview is closer to a code review than to a coding round. You are not being asked if you can write the code. You are being asked if you would approve this code at 3 AM on a pipeline you own.
One Last Thing
If you are preparing for a data engineering loop with a LeetCode-style practice routine, stop. The patterns are wrong. The input shapes are wrong. The follow-ups will catch you off guard, and you will lose offers you should have won.
Practice the six patterns above. Think about grain, scale, and idempotency every time you type. Make your default input an iterator and your default concern production safety. Do that for four weeks and the Python round stops being scary.
If you want to practice for your upcoming data engineer interview, www.DataDriven.io is free. No trial, no credit card. Built because the gap between Python SWE prep and Python DE prep is costing good data engineers jobs that they would otherwise get.