Gowtham Potureddi

Google Data Engineering Interview Questions & Prep Guide

Google data engineering interview questions lean on two stacks at once: SQL (GROUP BY with subqueries, HAVING, self-joins for reporting chains, and recursive CTEs for skip-level hierarchies) and Python (hash maps and Counter for frequency counts, pandas clean-cast-aggregate, decorators stacked for logging and retry, and file I/O for log parsing). The bar is fluency—naming grain, NULL behavior, and edge cases out loud while you type.

This guide walks through the nine topic clusters you will actually see, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix reflects the topics that appear most often in Google data engineering loops (7 easy, 7 medium across SQL and Python) so you can move from reading to typing without losing the thread.



Top Google data engineering interview topics

From the Google data engineering practice set, the nine numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–9) | Why it shows up at Google |
|---|---|---|
| 1 | Math and arithmetic in Python | Phone-screen filter: palindrome integer, missing number, average, arithmetic-formula evaluator. |
| 2 | Array and string manipulation in Python | Vertical-scan algorithms (longest common prefix), pattern printing—careful index handling. |
| 3 | Hash maps and Counter for frequency counting | Most-common words with tie-breaking, O(1) membership tests. |
| 4 | Dynamic typing and decorators in Python | Type dispatch, @wraps, stacked decorators for cross-cutting concerns (logging + retry). |
| 5 | File I/O and log aggregation in Python | Read line-by-line, parse log records, aggregate with defaultdict(int). |
| 6 | Pandas: clean, cast, and aggregate | Real ETL prompts—dirty CSV in, per-customer totals out. |
| 7 | SQL aggregation, GROUP BY and subqueries | "Above-average salary per department" with a scalar subquery vs JOIN to a grouped aggregate. |
| 8 | SQL self-joins for reporting hierarchies | Employee → manager via manager_id, including CEO with no manager. |
| 9 | Recursive CTEs for multi-level hierarchies | Skip-level reporting chains; anchor + recursive UNION ALL, depth tracking, termination. |

SQL evaluation order (mental model): FROM / joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → windows → SELECT / ORDER BY. Filter a sum or count of a group? That belongs in HAVING, not WHERE.
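The evaluation-order rule is easy to demo end to end; a minimal sqlite3 sketch (table name, columns, and numbers are illustrative, not from the problem set):

```python
import sqlite3

# Illustrative schema: per-customer order amounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", 50), ("a", 70), ("b", 30), ("b", 10), ("c", 200)])

rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 20           -- row filter, runs before grouping
    GROUP BY customer
    HAVING SUM(amount) > 100    -- group filter, runs after aggregation
    ORDER BY customer
""").fetchall()
print(rows)  # [('a', 120), ('c', 200)] -- b survives WHERE but fails HAVING
```

Moving the `SUM(amount) > 100` condition into WHERE is rejected by the engine, because aggregates do not exist yet at row-filter time.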


1. Math and arithmetic concepts in data engineering

Math and number primitives in Python for data engineering

Google opens many phone screens with a small math problem—not because data engineers solve contest math, but because the cleanest solutions force you to verbalize invariants. An invariant is a property that does not change as the algorithm runs: the digit-sum of a number across reversal, the closed-form sum of 0..n, the count of non-null values inside an aggregate. Once you name the invariant, you usually skip the obvious O(n²) or O(n log n) approach and land on O(n) or O(1).

The four primitives below cover the math problems on the Google practice set: palindrome integer (#82), missing number (#83), average of integers (#89), and arithmetic-formula evaluator (#273). All four reduce to two ideas: integer arithmetic with % 10 and // 10, and replacing a loop-and-check with a closed-form expression. Memorize those two and the section is muscle memory.

Palindrome check using integer reversal

Detailed explanation. A palindromic integer reads the same forwards and backwards (121, 1221). The standard trick avoids converting to a string: pull the last digit with n % 10, fold it into a running reversed integer with rev = rev * 10 + last, then drop the last digit with n //= 10. When n reaches 0, rev holds the reversed integer. Comparing rev == original resolves the question. The whole loop is O(d) where d is the digit count (log₁₀ n)—constant for any practical 64-bit integer.

Two edge cases trip people up. Negative numbers are not palindromes by convention because the leading minus sign breaks symmetry; return False immediately. Trailing zeros (10, 100) are also not palindromes: 10 reverses to 01 = 1, and the final rev == original comparison catches it because 1 != 10. Zero itself works naturally; the while n > 0 loop never runs, rev stays 0, and 0 == 0 correctly reports a palindrome.

Worked example. n = 121.

| step | n | rev |
|---|---|---|
| start | 121 | 0 |
| iter 1 | 12 | 1 |
| iter 2 | 1 | 12 |
| iter 3 | 0 | 121 |

Result: rev (121) == original (121) → palindrome.

Worked-example solution.

def is_palindrome(n: int) -> bool:
    if n < 0:
        return False
    original, rev = n, 0
    while n > 0:
        rev = rev * 10 + n % 10
        n //= 10
    return rev == original

Missing number using the Gauss formula

Detailed explanation. Given an array of distinct integers from 0..n with exactly one value missing, the brute-force "for each candidate, scan the array" runs in O(n²). The hash-set approach drops it to O(n) time but O(n) memory. The Gauss-summation approach hits O(n) time and O(1) memory: the sum of 0..n is n*(n+1)/2, so missing = expected_sum − actual_sum. This is your invariant—one missing number perturbs the sum by exactly that number.

A second technique uses XOR: 0 ⊕ 1 ⊕ … ⊕ n ⊕ a₀ ⊕ a₁ ⊕ … ⊕ a_{n-1} cancels every value except the missing one, because x ⊕ x = 0 and x ⊕ 0 = x. Slightly more numerically robust than summation if n is huge (no overflow concern in Python, but worth knowing).
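The XOR identity, as a drop-in alternative with the same contract (missing_number_xor is a name introduced here):

```python
def missing_number_xor(nums: list[int]) -> int:
    # XOR every index 0..n-1 and every array value into one accumulator,
    # seeded with n. Pairs cancel (x ^ x = 0), leaving the missing number.
    acc = len(nums)  # contributes n, since enumerate only covers 0..n-1
    for i, v in enumerate(nums):
        acc ^= i ^ v
    return acc

missing_number_xor([3, 0, 1])  # 2
```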

Worked example. nums = [3, 0, 1] → n = 3, expected sum 0+1+2+3 = 6, actual sum 3+0+1 = 4, missing = 6 − 4 = 2.

Worked-example solution.

def missing_number(nums: list[int]) -> int:
    n = len(nums)
    return n * (n + 1) // 2 - sum(nums)

Average of integers (overflow-safe)

Detailed explanation. "Average" sounds trivial—until the prompt asks for floor of the average, rounding of the average, or for a guarantee against integer overflow on a sum that could exceed 2³¹. In Python, integers are arbitrary-precision, so overflow is not a concern, but the prompt clarification still matters: int(total / len(nums)) truncates toward zero, while total // len(nums) is floor division. They agree for positive numbers and disagree for negatives. State which one the prompt wants.
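A two-line demo of the disagreement (values are illustrative):

```python
total, count = -7, 2
print(int(total / count))  # -3: true division yields -3.5, int() truncates toward zero
print(total // count)      # -4: floor division rounds toward negative infinity
```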

Worked example. [1, 2, 3, 4, 5] → sum 15, len 5, average 3.0.

Worked-example solution.

def average(nums: list[int]) -> float:
    if not nums:
        raise ValueError("empty input")
    return sum(nums) / len(nums)

Arithmetic formula evaluation with two stacks

Detailed explanation. Evaluating an expression like 3 + 5 * 2 is a parsing-plus-arithmetic problem. The classic two-stack solution: one stack holds numbers, one holds operators. Walk the expression token by token. For each operator, before pushing, apply any operator already on top whose precedence is greater than or equal to the new one (left-associative). At end of input, drain remaining operators.

The invariant is that the operator stack is always sorted by precedence from bottom to top: a higher-precedence operator can sit on top of a lower one (because it will resolve first), but a lower one cannot sit on top of a higher one (we resolve the higher one first, then push the lower). Parentheses are handled by treating ( as a marker that no operator on top can resolve until the matching ) arrives.

Worked example. 3 + 5 * 2.

| token | nums | ops | action |
|---|---|---|---|
| 3 | [3] | [] | push number |
| + | [3] | [+] | push op |
| 5 | [3, 5] | [+] | push number |
| * | [3, 5] | [+, *] | * outranks +, push |
| 2 | [3, 5, 2] | [+, *] | push number |
| end | [3, 10] → [13] | [+] → [] | resolve * (5*2=10), then + (3+10=13) |

Final answer: 13.

Worked-example solution.

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def evaluate(expr: str) -> int:
    nums: list[int] = []
    ops: list[str] = []

    def apply():
        b, a, op = nums.pop(), nums.pop(), ops.pop()
        nums.append({"+": a + b, "-": a - b, "*": a * b, "/": a // b}[op])

    tokens = expr.replace(" ", "")
    i = 0
    while i < len(tokens):
        ch = tokens[i]
        if ch.isdigit():
            j = i
            while j < len(tokens) and tokens[j].isdigit():
                j += 1
            nums.append(int(tokens[i:j]))
            i = j
            continue
        while ops and PRECEDENCE.get(ops[-1], 0) >= PRECEDENCE[ch]:
            apply()
        ops.append(ch)
        i += 1
    while ops:
        apply()
    return nums[0]
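The evaluator above deliberately skips parentheses; a hedged sketch of the extension described earlier, where '(' acts as a floor that no operator can resolve past (evaluate_parens is a name introduced here, not from the problem set):

```python
def evaluate_parens(expr: str) -> int:
    # Same two-stack evaluator, extended: '(' is pushed with precedence 0 so
    # nothing resolves past it; ')' drains operators back to the matching '('.
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}
    nums: list[int] = []
    ops: list[str] = []

    def apply():
        b, a, op = nums.pop(), nums.pop(), ops.pop()
        nums.append({"+": a + b, "-": a - b, "*": a * b, "/": a // b}[op])

    tokens = expr.replace(" ", "")
    i = 0
    while i < len(tokens):
        ch = tokens[i]
        if ch.isdigit():
            j = i
            while j < len(tokens) and tokens[j].isdigit():
                j += 1
            nums.append(int(tokens[i:j]))
            i = j
            continue
        if ch == "(":
            ops.append(ch)
        elif ch == ")":
            while ops[-1] != "(":
                apply()
            ops.pop()  # discard the matching '('
        else:
            while ops and prec.get(ops[-1], 0) >= prec[ch]:
                apply()
            ops.append(ch)
        i += 1
    while ops:
        apply()
    return nums[0]
```

evaluate_parens("(3 + 5) * 2") returns 16, while the unparenthesized "3 + 5 * 2" still returns 13.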

Common beginner mistakes

  • Forgetting negative-number handling for palindrome (returning True for -121).
  • Using sum(range(n)) for missing-number on a 0-indexed array (off-by-one—range(n) skips n).
  • Doing total / len(nums) and getting a float when the prompt wanted floor division.
  • Reading the arithmetic expression left to right without precedence (3 + 5 * 2 = 16 is wrong).

Python interview question on math and arithmetic

An array nums of distinct integers contains every value in 0..n except one. Find the missing value in O(n) time and O(1) space without sorting.

Solution using the Gauss summation identity

def missing_number(nums: list[int]) -> int:
    n = len(nums)
    return n * (n + 1) // 2 - sum(nums)

Why this works: The closed-form n*(n+1)/2 gives the sum of 0..n in constant time. Subtracting the actual sum collapses the search to a single linear scan, beating the obvious set membership approach on memory (O(1) vs O(n)) and the sort approach on time (O(n) vs O(n log n)). The invariant is that one missing value perturbs the total by exactly that value—nothing else changes.



2. Array and string manipulation in Python

Strings, slicing, and vertical scans for data engineering

Once the math warm-up is done, Google moves to small string problems that test careful index handling and edge-case thinking: empty input, single-element input, mixed lengths. The two patterns to recognize: vertical scans (compare across strings column by column, used for longest common prefix and grid-style problems) and printable-pattern loops (nested loops where the inner range encodes the shape).

The Google practice set uses these primitives directly: longest common prefix (#88), star triangle pattern (#84), and most-common-words string handling (#87). Beyond syntax, the interviewer is watching whether you state the invariant out loud—"the answer cannot be longer than the shortest string"—because that single sentence proves you saw the problem at the right altitude.

String iteration and slicing

Detailed explanation. Python strings are immutable sequences: indexable like lists (s[0], s[-1]), sliceable (s[1:4], s[::-1] for reverse), but you cannot mutate them in place (s[0] = 'x' raises TypeError). Slices are copies: s[a:b] builds a new string in O(b−a) time. Concatenation with += inside a loop is O(n²) because each concat copies the whole prefix; build a list of pieces and join once for O(n).

The cheap operations to memorize: len(s), s.startswith(prefix), s.endswith(suffix), min(strs, key=len) to find the shortest string in a list, and enumerate(s) to iterate with indices. The expensive operations to avoid in loops: s += ch, ''.join(parts) inside the inner loop instead of after, re.search with a recompiled pattern.
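The join-once pattern in miniature (shout is a throwaway name for illustration):

```python
def shout(words: list[str]) -> str:
    # Build a list of pieces and join once: O(n) total work.
    # The anti-pattern, s += piece inside the loop, copies the whole
    # prefix on every iteration and degrades to O(n**2).
    parts = [w.upper() for w in words]
    return " ".join(parts)

shout(["big", "data"])  # 'BIG DATA'
```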

Worked example. s = "flower". s[:2] is "fl"; s[-3:] is "wer"; s[::-1] is "rewolf"; len(s) is 6.

Worked-example solution.

def reverse_with_slice(s: str) -> str:
    return s[::-1]

Vertical-scan algorithm for longest common prefix

Detailed explanation. The longest common prefix (LCP) of a list of strings is the longest string that all of them start with. Two algorithms compete: horizontal scan (compute LCP of the first two strings, then LCP of that with the third, and so on; total O(S) where S = total characters) and vertical scan (walk the columns of the strings, stop at the first mismatch). Vertical scan wins on average because it can exit very early—after column 0 if the first characters differ.

The invariant for vertical scan is: the answer is at most as long as the shortest string in the list. Anchor on min(strs, key=len) to bound the outer loop. For each column i, check that every string's i-th character matches the anchor's. The first mismatch returns anchor[:i]. If the loop completes, the entire anchor is the answer.

Worked example. ["flower", "flow", "flight"].

| col | flower[i] | flow[i] | flight[i] | match? |
|---|---|---|---|---|
| 0 | f | f | f | ✓ |
| 1 | l | l | l | ✓ |
| 2 | o | o | i | ✗ → stop |
Return "fl" (anchor sliced to [:2]).

Worked-example solution.

def longest_common_prefix(strs: list[str]) -> str:
    if not strs:
        return ""
    shortest = min(strs, key=len)
    for i, ch in enumerate(shortest):
        for s in strs:
            if s[i] != ch:
                return shortest[:i]
    return shortest

Nested loops for printable patterns

Detailed explanation. Google asks pattern-printing questions to test loop control. The recipe: the outer loop iterates rows, the inner loop iterates columns, and the inner loop's range is the row number. For a left-aligned star triangle with rows rows, row i has i stars. For a centered diamond, row i has 2*i − 1 stars and rows − i leading spaces. The pattern lives in the inner range, not in special cases.

The invariant: rows are independent—nothing one row prints affects another. So once you have the inner expression right, scaling to any row count is free. State this out loud; interviewers note candidates who see independence and avoid coupling.

Worked example — 4-row star triangle.

*
* *
* * *
* * * *

Row i (1-indexed) prints i star tokens separated by spaces.

Worked-example solution.

def star_triangle(rows: int) -> None:
    for i in range(1, rows + 1):
        print(" ".join("*" * i))
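The centered-diamond formula from the explanation above, sketched as a function that returns lines so the shape is testable (diamond is not one of the numbered practice problems):

```python
def diamond(rows: int) -> list[str]:
    # Row i (1-indexed) gets rows - i leading spaces and 2*i - 1 stars;
    # the bottom half mirrors the top, sharing the widest row.
    top = [" " * (rows - i) + "*" * (2 * i - 1) for i in range(1, rows + 1)]
    return top + top[-2::-1]
```

For rows=3 the lines are "  *", " ***", "*****", " ***", "  *".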

Common beginner mistakes

  • Returning None (or raising) instead of "" for empty input list in LCP.
  • Iterating with range(len(strs[0])) and forgetting that other strings may be shorter—first index access raises IndexError.
  • Mutating the string with += inside a hot loop (quadratic time).
  • Off-by-one on the row count (0-indexed vs 1-indexed)—triangle of rows=4 should print 4 rows, not 3 or 5.

Python interview question on string manipulation

Given a list of strings, return the longest common prefix shared by all of them. If there is no common prefix, return "". Optimize for the common case where the answer is short.

Solution using vertical scan

def longest_common_prefix(strs: list[str]) -> str:
    if not strs:
        return ""
    shortest = min(strs, key=len)
    for i, ch in enumerate(shortest):
        for s in strs:
            if s[i] != ch:
                return shortest[:i]
    return shortest

Why this works: The answer cannot be longer than the shortest string (proven by definition of "common prefix"), so anchoring on min(strs, key=len) bounds the outer loop in O(min_len). The vertical scan exits at the first column mismatch, which is typically very early in real data—total work is O(min_len * len(strs)) worst case, often closer to O(answer_len * len(strs)). Returning "" for an empty list is a defined contract, not an error.



3. Hash maps and Counter for frequency counting in Python

Frequency counting and hash-map fluency in Python for data processing

Once the easy warm-ups are done, Google's interviewers expect you to reach for collections.Counter without prompting. Frequency counting shows up everywhere in data engineering—word counts, top-K queries, deduplication, set membership for streaming joins—so fluency with the standard library is the difference between a clean answer and a 30-line manual loop.

A hash map (Python dict) gives you average O(1) insert and lookup, in exchange for O(n) memory. A Counter is a dict subclass specialized for counts: it auto-zeros missing keys (no KeyError), supports arithmetic (c1 + c2 merges, c1 - c2 subtracts and clamps to zero), and exposes .most_common(k) for top-K. The difference between a clean Google answer and a sloppy one is usually which container you reach for.

dict vs defaultdict vs Counter

Detailed explanation. Three containers, three trade-offs:

  • dict — Generic key→value map; raises KeyError on missing keys unless you use .get(k, default). Use when keys are heterogeneous and you write the access pattern by hand.
  • defaultdict(factory) — On missing key access, calls factory() and inserts the result. defaultdict(int) auto-zeros, perfect for counting. defaultdict(list) auto-creates an empty list, perfect for grouping. The factory is called once per missing key.
  • Counter(iterable) — Built specifically for frequencies. Constructed from any iterable; gives .most_common(k), +, -, & (intersection: min counts), | (union: max counts) for free.

The invariant: Counter is the only one that already knows its values are integers, so it composes with arithmetic and .most_common. Reach for it whenever the problem says "count" or "frequency."
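The Counter arithmetic mentioned above, at a glance:

```python
from collections import Counter

a = Counter("aab")   # Counter({'a': 2, 'b': 1})
b = Counter("abc")   # Counter({'a': 1, 'b': 1, 'c': 1})

print(a + b)   # merge counts:             Counter({'a': 3, 'b': 2, 'c': 1})
print(a - b)   # subtract, clamp at zero:  Counter({'a': 1})
print(a & b)   # intersection, min counts: Counter({'a': 1, 'b': 1})
print(a | b)   # union, max counts:        Counter({'a': 2, 'b': 1, 'c': 1})
```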

Worked example. Counting words in ["the", "cat", "the", "dog", "cat"].

| container | construction | result |
|---|---|---|
| dict | for w in words: d[w] = d.get(w, 0) + 1 | {'the': 2, 'cat': 2, 'dog': 1} |
| defaultdict(int) | for w in words: d[w] += 1 | same |
| Counter | Counter(words) | Counter({'the': 2, 'cat': 2, 'dog': 1}) |

Worked-example solution.

from collections import Counter

def word_counts(words: list[str]) -> dict[str, int]:
    return dict(Counter(words))

Counter.most_common(k) and tie-breaking

Detailed explanation. c.most_common() returns a list of (key, count) pairs sorted by count descending. c.most_common(k) returns the top k. When two keys tie, insertion order breaks the tie—not alphabetical, not lexicographic. This is a Python 3.7+ guarantee (dict preserves insertion order), but it is rarely what the interviewer wants.

If the prompt says "tie-break alphabetically," override the sort: sorted(c.items(), key=lambda x: (-x[1], x[0])). The key tuple (-count, word) sorts by count descending (the negation flips ascending into descending) and then by word ascending. This pattern is so common it should be a reflex—negate primary, ascend secondary.

Worked example. Counting ["the", "cat", "the", "dog", "cat"]. Counter.most_common(2) could return [('the', 2), ('cat', 2)] (insertion order: the was seen first), but if the prompt asks for alphabetical tie-break we want [('cat', 2), ('the', 2)].

Worked-example solution.

from collections import Counter

def top_k_alpha(words: list[str], k: int) -> list[str]:
    c = Counter(words)
    ordered = sorted(c.items(), key=lambda x: (-x[1], x[0]))
    return [w for w, _ in ordered[:k]]

Hash-map lookup for O(1) membership

Detailed explanation. When a brute-force solution is O(n²) because of a for…in list lookup, replace the list with a set or dict and the lookup drops to average O(1). Worst case is still O(n) (hash collisions), but in practice Python's hash table keeps it constant. The trade-off is memory: a set of 1M ints uses roughly 30–60 MB, a list uses 8 MB. For bounded data this is free; for unbounded it can OOM.

The invariant: if you find yourself writing if x in big_list, you almost certainly want if x in big_set. State the rewrite even when the input is already a list—convert once with s = set(big_list) and your subsequent membership tests are O(1).

Worked example. nums = [1, 2, 3, …, 1_000_000]. 42 in nums scans up to 1M elements; after a one-time s = set(nums) conversion, 42 in s is one hash plus one bucket check.

Worked-example solution.

def has_pair_with_sum(nums: list[int], target: int) -> bool:
    seen: set[int] = set()
    for n in nums:
        if target - n in seen:
            return True
        seen.add(n)
    return False

Common beginner mistakes

  • Using a plain dict with += 1 and crashing with KeyError on the first unseen key.
  • Trusting Counter.most_common to break ties alphabetically (it does not).
  • Mutating a Counter while iterating over .items() (RuntimeError: dictionary changed size during iteration).
  • Building a Counter from an unsplit string—Counter("the cat") counts characters, not words.

Python interview question on frequency and tie-breaking

Given a list of words, return the top K most frequent words. When two words tie, the lexicographically smaller word comes first.

Solution using Counter and a custom sort

from collections import Counter

def top_k_words(words: list[str], k: int) -> list[str]:
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
    return [w for w, _ in ordered[:k]]

Why this works: Counter(words) builds the frequency map in one O(n) pass. The sort key (-count, word) orders by count descending (negation flips ascending into descending) and then by word ascending—exactly the prompt's tie-break rule. Slicing [:k] bounds output to k items, and the overall complexity is O(n + u log u) where u = len(counts). For typical text data u ≪ n, so this is effectively linear with a tiny log factor.



4. Dynamic typing and decorators in Python

Python idioms for ETL: type dispatch and decorator stacks

Google leans into Python's dynamic typing and decorator facilities for data engineering rounds because real ETL code mixes types (a column is sometimes int, sometimes str) and needs cross-cutting concerns like retries and logging without polluting business logic. A clean answer separates "what the code does" from "how it is wrapped"—business logic in the function body, retry/log/timing in decorators around it.

The mental model: a decorator is a function that takes a function and returns a function. Stacking two decorators means the closer-to-def one runs innermost. Order matters more than people expect: @retry then @log makes one log per call (regardless of retries); @log then @retry makes one log per attempt. Interviewers ask explicitly which behavior the prompt wants.

type() vs isinstance (and why isinstance wins)

Detailed explanation. Two ways to ask "is x an int?": type(x) is int and isinstance(x, int). They are not equivalent. type(x) is int checks exact identity and returns False for any subclass. isinstance(x, int) returns True for int and any subclass. Because bool is a subclass of int in Python (True == 1, False == 0), isinstance(True, int) returns True, but type(True) is int evaluates to False.

The rule: prefer isinstance unless you specifically need to reject subclasses. A function that takes "an int" should accept booleans by default. A function that wants to reject booleans (e.g. for serialization to a strict-int column) should check explicitly: isinstance(x, int) and not isinstance(x, bool). State this distinction in the interview—it shows you understand Python's type hierarchy.

Worked example.

| call | result | reason |
|---|---|---|
| isinstance(True, int) | True | bool is a subclass of int |
| type(True) is int | False | exact identity, bool ≠ int |
| isinstance(1, (int, float)) | True | tuple of types accepted |
| isinstance("1", int) | False | str is not a subclass of int |

Worked-example solution.

def is_strict_int(x) -> bool:
    return isinstance(x, int) and not isinstance(x, bool)

Dispatch table for dynamic types

Detailed explanation. When a function must do different things for different input types, the dispatch table pattern beats a long if/elif chain on readability and extensibility. Build a dict mapping type → handler; lookup is O(1); adding a new type is one line. The fallback case (unknown type) becomes explicit.

The invariant: dispatch tables encode an open-set assumption. Adding a handler for a new type does not require editing the dispatcher itself, only registering a new entry. This is the same idea as functools.singledispatch (which is the more idiomatic way once a project commits to it), but a hand-built dict is fine for interview pace.
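For comparison, a functools.singledispatch version of the same dispatcher; unlike a dict keyed on type(), it follows the MRO, so subclasses (including bool for int) hit the registered handler:

```python
from functools import singledispatch

@singledispatch
def process(value):
    # Fallback for unregistered types.
    raise TypeError(f"unsupported type: {type(value).__name__}")

@process.register
def _(value: int):
    return value * 2

@process.register
def _(value: str):
    return value.upper()

@process.register
def _(value: list):
    return list(reversed(value))
```

process(True) returns 2 here, whereas a HANDLERS.get(type(True)) lookup would miss, because type(True) is bool.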

Worked example. Process 42, "hello", [1, 2, 3].

| input | type | handler | output |
|---|---|---|---|
| 42 | int | lambda x: x * 2 | 84 |
| "hello" | str | lambda x: x.upper() | "HELLO" |
| [1, 2, 3] | list | lambda x: list(reversed(x)) | [3, 2, 1] |

Worked-example solution.

HANDLERS = {
    int: lambda x: x * 2,
    str: lambda x: x.upper(),
    list: lambda x: list(reversed(x)),
}

def process(value):
    handler = HANDLERS.get(type(value))
    if handler is None:
        raise TypeError(f"unsupported type: {type(value).__name__}")
    return handler(value)

Decorator basics with functools.wraps

Detailed explanation. A decorator is sugar for f = decorator(f). Without @wraps, the wrapper function replaces the original's __name__, __doc__, and __qualname__, breaking debugging, logging, and help(). functools.wraps(fn) copies those attributes from fn to the wrapper. Always use @wraps—it is one line and saves an hour of "why does my logger say wrapper instead of fetch_user?" debugging.

A second subtlety: a decorator with arguments (@retry(times=3)) is actually a factory that returns the real decorator. So you have three layers: factory → decorator → wrapper. Each layer captures variables in its closure. Get the levels straight in your head before writing any complex decorator.

Worked example. Wrap add(a, b) so each call prints "call add" first.

add(2, 3)
# call add
# 5

Worked-example solution.

from functools import wraps

def log_calls(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        print(f"call {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

@log_calls
def add(a: int, b: int) -> int:
    return a + b

Stacking decorators: logging + retry

Detailed explanation. When two decorators stack, the one closest to def is innermost—so it runs first on call and last on return. With @log over @retry over def f, the call chain is log_wrapper → retry_wrapper → f; logging fires once at the outer layer, retries are internal to the retry layer. Flip the order, and log_wrapper becomes innermost, so each retry attempt is logged separately.

For a Google interview, the prompt usually says "log every call once and retry on failure" (logs outside, retries inside) or "log every retry attempt" (logs inside, retries outside). Read carefully and state which order you chose before writing code.

Worked example. Function fetch_user(uid) that fails twice then succeeds. Decorated as @log_calls (outer) over @retry(times=3) (inner). Output:

call fetch_user
{'id': 42, 'name': 'Ada'}

(One log line, three internal attempts the user does not see.)

Worked-example solution.

from functools import wraps
import time

def retry(times: int):
    def deco(fn):
        @wraps(fn)
        def wrapper(*a, **kw):
            for attempt in range(1, times + 1):
                try:
                    return fn(*a, **kw)
                except Exception:
                    if attempt == times:
                        raise
                    time.sleep(0.1)
        return wrapper
    return deco

def log_calls(fn):
    @wraps(fn)
    def wrapper(*a, **kw):
        print(f"call {fn.__name__}")
        return fn(*a, **kw)
    return wrapper

@log_calls
@retry(times=3)
def fetch_user(uid: int) -> dict:
    ...

Common beginner mistakes

  • Forgetting @wraps—the wrapped function loses its name, breaking logs, tracebacks, and help().
  • Using type(x) == int and missing bool (or any subclass) silently.
  • Stacking decorators in the wrong order so retries are logged twice (or not at all).
  • Retrying a non-idempotent operation (a payment, a non-idempotent INSERT) without an idempotency key.

Python interview question on decorator stacks

Write a retry(times, exceptions) decorator that retries a function up to times times, only when one of the listed exceptions is raised, with exponential backoff. Stack it with a log_calls decorator so every first attempt is logged once.

Solution using functools.wraps and exception filtering

from functools import wraps
import time

def retry(times: int, exceptions: tuple = (Exception,)):
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise
                    time.sleep(0.1 * (2 ** (attempt - 1)))
        return wrapper
    return deco

def log_calls(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        print(f"call {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

@log_calls
@retry(times=3, exceptions=(TimeoutError,))
def fetch_user(user_id: int) -> dict:
    ...

Why this works: Retry is the inner decorator, so each call logs once (at the outer layer) regardless of how many internal retries happen. Exponential backoff (0.1 * 2^(attempt-1)) prevents thundering-herd retries against a struggling downstream. The exceptions tuple lets the caller scope retries to recoverable failures (timeouts) without swallowing real bugs (KeyError, TypeError). functools.wraps preserves the function's identity so logs and tracebacks remain readable.



5. File I/O and log aggregation in Python

Streaming file reads and log parsing for data engineering

The closest thing to real DE work in a Google interview is the log aggregator: read a file line-by-line, parse each line, and emit per-key totals. Three ideas matter—stream, do not slurp; isolate parsing from aggregation; handle malformed lines without crashing the whole pipeline.

Streaming matters because production log files can be tens of gigabytes. Loading the file into memory (f.read().splitlines()) is fine for 1 MB and fatal for 50 GB. Iterating for line in f: reads one line at a time. Parsing isolation matters because parsing logic is the part most likely to break on a corrupt line; if parsing is a separate function returning None on failure, the aggregator can skip-and-count without aborting. This is exactly the structure interviewers grade.

Reading files line-by-line with a context manager

Detailed explanation. with open(path) as f: creates a context-managed file handle. The with block ensures f.close() runs even if the body raises—the file descriptor leak is one of the most common bugs in DE code that does not use with. Inside the block, for line in f: produces one line at a time, including the trailing newline. Use line.rstrip("\n") if you do not want it.

For binary or unusual encodings, pass mode="rb" or encoding="utf-8" (default on most systems but worth being explicit). For very large files where you need to seek, switch to f.seek(offset) instead of iteration. For multi-process reading, split the file into byte ranges and assign one to each worker.
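A hedged sketch of the byte-range split for multi-process reading; boundaries are snapped forward to the next newline so no worker starts mid-record (byte_ranges is an illustrative helper, not a standard API):

```python
import os

def byte_ranges(path: str, workers: int) -> list[tuple[int, int]]:
    # Split the file into roughly equal byte ranges, then move each internal
    # boundary forward past the partial line it landed in.
    size = os.path.getsize(path)
    step = max(1, size // workers)
    cuts = [0]
    with open(path, "rb") as f:
        for k in range(1, workers):
            f.seek(k * step)
            f.readline()          # skip the partial line at the raw boundary
            pos = f.tell()
            if cuts[-1] < pos < size:
                cuts.append(pos)  # drop duplicate or end-of-file boundaries
    cuts.append(size)
    return list(zip(cuts, cuts[1:]))
```

Each returned (start, end) pair can then go to one worker, which reads only its slice.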

Worked example. Print every line of access.log with the leading line number.

Worked-example solution.

def print_with_lineno(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            print(f"{i}: {line.rstrip()}")

Parsing log lines: split vs regex

Detailed explanation. Two parsers, two trade-offs. str.split is fast but brittle: it works when the format is fixed and whitespace-separated, and breaks when fields are quoted or contain spaces. re.match / re.search is slower but flexible: it handles optional fields, quoted strings, and structured logs. For interview pace, default to split and switch to regex only when the prompt has irregular formats.

Compile regex once at module scope, never inside the loop. re.compile(r"…") returns a pattern object; reusing it is much cheaper than recompiling per line. For very high-throughput log parsing, the compiled pattern plus match.groupdict() for named captures is the cleanest pattern.

Worked example. The line "2026-04-27 INFO user=42 action=login" split with split() → ['2026-04-27', 'INFO', 'user=42', 'action=login']. Then dict(p.split('=', 1) for p in tokens if '=' in p) → {'user': '42', 'action': 'login'}.

Worked-example solution.

def parse_line(line: str) -> dict[str, str] | None:
    tokens = line.strip().split()
    if len(tokens) < 4:
        return None
    fields = dict(t.split("=", 1) for t in tokens if "=" in t)
    if "user" not in fields or "action" not in fields:
        return None
    return fields

Aggregating with defaultdict(int)

Detailed explanation. Once parsing is isolated, aggregation is trivial. The aggregator only ever sees (key, value) pairs and increments a counter. defaultdict(int) auto-zeros missing keys; reading a missing key creates it with the default, which is occasionally surprising (so use .get(k, 0) if you only want to read without inserting).

The invariant: the aggregator should not raise on bad data, because parsing has already filtered it. If parsing returns None, the aggregator skips. This separation lets you log a count of skipped lines at the end (skipped = total − valid), which interviewers love.
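The read-inserts-a-key behavior mentioned above is easy to demonstrate in isolation:

```python
from collections import defaultdict

counts = defaultdict(int)
_ = counts["user_42"]            # a plain read inserts the key with value 0

print("user_42" in counts)       # True — the read created the entry
print(counts.get("user_99", 0))  # 0, and nothing is inserted
print("user_99" in counts)       # False
```

This is why membership checks on a defaultdict should come before any bracket access, or use .get.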

Worked example. Three log lines, action counts per user.

2026-04-27 INFO user=42 action=login
2026-04-27 INFO user=99 action=login
2026-04-27 INFO user=42 action=logout

{'42': {'login': 1, 'logout': 1}, '99': {'login': 1}} (keys are strings, because parse_line never casts them).

Worked-example solution.

from collections import defaultdict

def aggregate_actions(path: str) -> dict[str, dict[str, int]]:
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    with open(path) as f:
        for line in f:
            fields = parse_line(line)
            if fields is None:
                continue
            counts[fields["user"]][fields["action"]] += 1
    return {u: dict(v) for u, v in counts.items()}

Common beginner mistakes

  • f.read().splitlines()—loads the whole file. Fine for 1 MB, fatal for 50 GB.
  • Catching every exception silently and returning {} on a malformed line, hiding real data corruption.
  • Compiling a regex inside the loop (recompile per line is slow); compile once at module scope.
  • Forgetting that defaultdict inserts the default on read—after counts[user] runs, if user not in counts is always False.

Python interview question on log aggregation

Given a log file with lines of the form "<date> <level> user=<id> action=<verb>", count actions per user. The file is too large to fit in memory; stream it and emit a sorted summary.

Solution using defaultdict and a streaming read

from collections import defaultdict

def aggregate_actions(path: str) -> list[tuple[str, str, int]]:
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    with open(path) as f:
        for raw in f:
            tokens = raw.strip().split()
            fields = dict(t.split("=", 1) for t in tokens if "=" in t)
            u, a = fields.get("user"), fields.get("action")
            if u and a:
                counts[u][a] += 1
    return sorted(
        ((u, a, n) for u, actions in counts.items() for a, n in actions.items()),
        key=lambda r: (r[0], -r[2]),
    )

Why this works: Iterating f directly streams lines—memory is bounded by the size of the per-user counter map, not the file. Parsing is isolated to two dict comprehensions, so a malformed line is skipped without crashing the whole run. The final sort returns deterministic output ordered by user, then by action frequency descending. Total time is O(L + N log N) where L is line count and N is unique (user, action) pairs.

PYTHON
Topic — file I/O
File I/O problems

Practice →

PYTHON
Topic — frequency analysis
Frequency analysis problems

Practice →


6. Pandas: clean, cast, and aggregate

Pandas patterns in Python for data engineering

A typical Google DE prompt: "Here is an order stream as a CSV. Some amount cells are blank or 'N/A'. Produce per-customer totals." That tests type casting, NaN handling, and groupby aggregation in one shot. The clean answer follows three explicit phases—clean (drop or coerce malformed rows, with a count of how many you dropped), cast (pd.to_numeric(errors='coerce') before astype), aggregate (named aggregations with explicit column names). Saying these three words out loud changes how the interviewer scores the round.

Three-step pandas ETL diagram showing a dirty CSV row, a to_numeric coerce step, and a grouped totals table.

pd.to_numeric with errors='coerce' for dirty data

Detailed explanation. pd.to_numeric(series, errors=…) converts a series to numeric. The errors argument controls failure mode: "raise" (default — raises ValueError on the first bad value), "ignore" (returns the input unchanged on failure, dangerous because the column type does not change), and "coerce" (turns bad values into NaN). The right answer 95% of the time is "coerce"—it surfaces dirt without crashing.

After coercion, NaN is the universal "missing" sentinel. Use .isna() / .notna() to filter, .dropna(subset=[col]) to drop rows where a specific column is NaN, or .fillna(value) to impute. Imputing with 0 is rarely correct; it skews mean and sum. Drop or carry NaN forward.

Worked example. Mixed-type column.

order_id amount (raw) after to_numeric(errors='coerce')
101 "12.5" 12.5
102 "N/A" NaN
103 "abc" NaN
104 15 15.0

Worked-example solution.

import pandas as pd

def coerce_amount(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df

astype vs to_numeric

Detailed explanation. df["x"].astype(int) is the strict caster: it raises ValueError on the first row that cannot become an int. Use it once you know the column is clean. pd.to_numeric(df["x"], errors="coerce") is the tolerant caster: it produces NaN on bad rows and returns a float64 series (because NaN is a float). The pipeline order is therefore: to_numeric first to surface dirt, drop or impute NaN, astype second if you need a strict integer column for downstream serialization.

A subtle gotcha: astype(float) succeeds on "NaN" strings and parses them to NaN, while astype(int) raises. Always cast to float (or use to_numeric) before considering int.
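A minimal sketch of the two paths on a three-value dirty series (the sample values are an assumption for illustration):

```python
import pandas as pd

s = pd.Series(["1", "2", "NaN"])

# astype(float) parses the "NaN" string into a real NaN...
as_float = s.astype(float)

# ...but astype(int) on the raw strings raises on that same cell.
try:
    s.astype(int)
except ValueError:
    pass  # expected: "NaN" has no integer parse

# Clean route: float first, drop the dirt, then strict int.
clean = as_float.dropna().astype(int)
print(clean.tolist())  # [1, 2]
```

The try/except is only there to show the failure mode; in a pipeline you would reach for to_numeric instead of catching.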

Worked example. Same dirty data.

step column dtype behavior
raw object mixed strings
astype(int) ValueError on row 102
to_numeric(errors='coerce') float64 NaN for rows 102, 103
dropna() then astype(int) int64 clean integer column

Worked-example solution.

import pandas as pd

def to_clean_int(s: pd.Series) -> pd.Series:
    coerced = pd.to_numeric(s, errors="coerce")
    return coerced.dropna().astype(int)

groupby().agg() with named aggregations

Detailed explanation. df.groupby(key).agg(...) is the workhorse for per-group metrics. The named-aggregation form (pd.NamedAgg or output_name=(input_col, func)) is the modern API and produces clean output column names without the multi-level index pain of older syntax. After grouping, .reset_index() flattens the group key back to a column—forget this and downstream merges break.

Multi-metric aggregations are common: "per customer, give me total spend, order count, and last order date." Encode them as separate named aggregations in one call. For custom aggregations (e.g. "fraction of orders that were paid"), pass a lambda: share=("status", lambda s: (s == "paid").mean()).

Worked example. Orders dataframe with multiple metrics per customer.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "amount": [12.5, None, 30.0, "N/A", 5.0],
})
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])
out = df.groupby("customer_id").agg(
    total=("amount", "sum"),
    orders=("amount", "count"),
).reset_index()

customer_id total orders
1 12.5 1
2 30.0 1
3 5.0 1

Worked-example solution.

import pandas as pd

def per_customer_metrics(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return (
        df.dropna(subset=["amount"])
          .groupby("customer_id")
          .agg(total=("amount", "sum"), orders=("amount", "count"))
          .reset_index()
    )

Common beginner mistakes

  • Calling astype(float) before cleaning—one bad row blows up the whole pipeline.
  • Letting NaN propagate into mean() silently and not reporting how many rows were dropped.
  • Forgetting .reset_index() after groupby and then merging on a multi-index later.
  • Using df["amount"].fillna(0) blindly when the prompt clearly says "exclude bad rows."

Python interview question on pandas ETL

Given an order-stream CSV with columns customer_id, amount, status where amount may be malformed and only status == "paid" rows count, produce a dataframe with one row per customer, columns total (sum of amounts) and orders (count of paid orders), sorted by total descending.

Solution using to_numeric, filtering, and groupby().agg()

import pandas as pd

def per_customer_totals(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    paid = df[(df["status"] == "paid") & df["amount"].notna()]
    out = (
        paid.groupby("customer_id")
            .agg(total=("amount", "sum"), orders=("amount", "count"))
            .reset_index()
            .sort_values("total", ascending=False, kind="stable")
    )
    return out

Why this works: to_numeric(errors="coerce") turns malformed amounts into NaN without raising, which the boolean mask then drops alongside non-paid rows. Named aggregation produces clean output column names so downstream consumers do not depend on tuple-style multi-level column names. kind="stable" keeps tied totals in customer-id order for deterministic output—important if downstream tests compare row order. Total complexity is O(N) for the coerce + filter and O(N log N) for the sort.

Drill SQL aggregation patterns →

PYTHON
Topic — type casting
Type casting problems

Practice →

PYTHON
Topic — aggregation
Aggregation problems

Practice →


7. SQL aggregation, GROUP BY and subqueries

Aggregation and subqueries in SQL for data engineering

The most common Google SQL pattern: "Find employees whose salary is above the average." Sounds trivial—and it tests WHERE vs HAVING, NULL semantics in AVG, and whether you reach for a scalar subquery or a JOIN to a pre-aggregated CTE. The shape of your answer reveals whether you understand SQL evaluation order or just wrote a query that happens to run.

The four sub-skills below show up in order of difficulty: aggregate semantics with NULL → WHERE vs HAVING → scalar subquery → correlated subquery vs CTE+JOIN. Internalize each one and the rest of SQL aggregation feels easy.

AVG, SUM, COUNT on NULL

Detailed explanation. SQL aggregate functions all share one rule: NULL is skipped, not zeroed. SUM(col) sums non-null values; if every value in the group is NULL, the result is NULL, not 0. AVG(col) divides sum-of-non-null by count-of-non-null. COUNT(*) counts rows in the group regardless of NULL; COUNT(col) counts rows where col is not NULL; COUNT(DISTINCT col) counts unique non-null values.

The interview-grade restatement: AVG is biased when NULLs are common, because the denominator changes per group. If the prompt says "average salary" and some salaries are NULL, ask whether NULL means "no salary recorded" (skip) or "zero salary" (replace with 0). The answer changes the query.

Worked example. One group, three rows, salaries 100, NULL, 200.

aggregate value reasoning
SUM(salary) 300 NULL skipped
AVG(salary) 150 sum 300 / count 2
COUNT(*) 3 rows
COUNT(salary) 2 non-null salaries
COUNT(DISTINCT salary) 2 distinct non-null values

Worked-example solution.

SELECT
  SUM(salary)                AS total,
  AVG(salary)                AS avg_salary,
  COUNT(*)                   AS rows,
  COUNT(salary)              AS non_null_salaries,
  COUNT(DISTINCT salary)     AS distinct_salaries
FROM employees;
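The same rules can be spot-checked with the stdlib sqlite3 module, whose NULL handling for these aggregates matches PostgreSQL (a sketch, using an in-memory table):

```python
import sqlite3

# Three salaries, one NULL — mirrors the worked example above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (salary INTEGER)")
con.executemany("INSERT INTO employees VALUES (?)", [(100,), (None,), (200,)])

row = con.execute(
    "SELECT SUM(salary), AVG(salary), COUNT(*), "
    "COUNT(salary), COUNT(DISTINCT salary) FROM employees"
).fetchone()
print(row)  # (300, 150.0, 3, 2, 2) — NULL skipped everywhere except COUNT(*)
```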

WHERE vs HAVING

Detailed explanation. The single most-confused pair in SQL. Their roles map to two different stages of evaluation:

Step Predicate evaluated Operates on
WHERE Per-row predicate Source rows before grouping
GROUP BY (groups rows) Bundles rows into groups
Aggregates SUM, AVG, COUNT, etc. Compute one value per group
HAVING Per-group predicate Groups after aggregates exist

Pitfall: writing WHERE AVG(salary) > 100000 is invalid—SQL has not computed the aggregate yet. The query rewrites to HAVING AVG(salary) > 100000. Conversely, filtering on a per-row column (like salary > 50000) belongs in WHERE, not HAVING, because it can prune rows before grouping, which is faster.

Worked example. Two departments.

dept_id emp_id salary
A 1 100
A 2 200
B 3 50
  • WHERE salary > 75 removes employee 3 (per-row filter).
  • GROUP BY dept_id then HAVING SUM(salary) > 200 returns only dept A (sum 300); dept B (sum 50) drops out.

Worked-example solution.

SELECT dept_id, SUM(salary) AS total
FROM employees
WHERE salary > 75
GROUP BY dept_id
HAVING SUM(salary) > 200;
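Running the same query against the three-row sample with stdlib sqlite3 confirms the two filters fire at different stages (a sketch):

```python
import sqlite3

# In-memory copy of the two-department sample from the worked example.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (dept_id TEXT, emp_id INTEGER, salary INTEGER)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("A", 1, 100), ("A", 2, 200), ("B", 3, 50)],
)

rows = con.execute("""
    SELECT dept_id, SUM(salary) AS total
    FROM employees
    WHERE salary > 75          -- per-row: drops employee 3 before grouping
    GROUP BY dept_id
    HAVING SUM(salary) > 200   -- per-group: drops dept B after aggregation
""").fetchall()
print(rows)  # [('A', 300)]
```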

Scalar subquery for "above average"

Detailed explanation. A scalar subquery is a SELECT that returns exactly one row and one column. It can stand in any expression slot a constant can. The classic Google use: comparing each row to a global statistic. (SELECT AVG(salary) FROM employees) returns a single number; the outer query compares each row's salary to it.

PostgreSQL evaluates the scalar subquery once (or caches it), so the outer query is still O(N). If the subquery returns more than one row, you get a cardinality-violation error—wrap it in MAX(...) or MIN(...) to force one value, or replace it with a JOIN if you need per-group statistics.

Worked example. Five employees: salaries 60, 80, 100, 120, 140. Average = 100. "Above average" rows: 120, 140.

Worked-example solution.

SELECT employee_id, name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
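A quick stdlib sqlite3 check of the scalar-subquery pattern on the five-salary sample (a sketch; the names are placeholders):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (employee_id INTEGER, name TEXT, salary INTEGER)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "a", 60), (2, "b", 80), (3, "c", 100), (4, "d", 120), (5, "e", 140)],
)

# The scalar subquery stands in an expression slot like a constant.
rows = con.execute("""
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY salary
""").fetchall()
print(rows)  # [('d', 120), ('e', 140)] — the two above-average salaries
```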

Correlated subquery vs CTE alternative

Detailed explanation. A correlated subquery references the outer row and re-evaluates once per outer row. It is easy to write—WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id)—and slow at scale: for N employees in K departments, the planner often runs the subquery N times. Total cost approaches O(N × group_size).

The CTE + JOIN alternative aggregates once per group in a CTE, then joins each row to its group's aggregate exactly once. Cost is O(N + K) for grouping + O(N) for the join. On large tables this is the difference between minutes and seconds. Interviewers grade this distinction; mention it before writing.

Worked example. Same employees, but per-department average. The correlated subquery and the CTE+JOIN return the same rows, but the CTE+JOIN is typically an order of magnitude faster on large tables.

Worked-example solution.

WITH avg_by_dept AS (
  SELECT dept_id, AVG(salary) AS avg_salary
  FROM employees
  GROUP BY dept_id
)
SELECT e.employee_id, e.name, e.salary
FROM employees e
JOIN avg_by_dept a ON a.dept_id = e.dept_id
WHERE e.salary > a.avg_salary;

Common beginner mistakes

  • Filtering SUM/AVG in WHERE instead of HAVING.
  • Forgetting that AVG ignores NULLs—if salary is NULL for half the rows, the average is biased.
  • Using COUNT(salary) when the prompt asked "how many employees" (rows where salary IS NULL get dropped).
  • Letting a correlated subquery run N×K when a JOIN to a grouped aggregate runs in O(N + K).

SQL interview question on aggregation and subqueries

Table employees(employee_id, name, dept_id, salary). Return all employees whose salary is strictly greater than the average salary in their own department, sorted by dept_id, then by salary descending.

Solution using a CTE and a single join

WITH avg_by_dept AS (
  SELECT dept_id, AVG(salary) AS dept_avg
  FROM employees
  GROUP BY dept_id
)
SELECT e.employee_id, e.name, e.dept_id, e.salary
FROM employees e
JOIN avg_by_dept a ON a.dept_id = e.dept_id
WHERE e.salary > a.dept_avg
ORDER BY e.dept_id, e.salary DESC;

Why this works: Aggregation runs once per department in the CTE—N rows in, K groups out, where K ≪ N. The single JOIN then attaches each employee to their department's average in one pass, beating the correlated subquery's N×K runtime. Strict > excludes employees who exactly match the average, matching the prompt's wording. The ORDER BY ties the output to a deterministic shape so downstream tests compare row sequences reliably.

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →

SQL
Topic — subqueries
Subquery problems

Practice →


8. SQL self-joins for reporting hierarchies

Self-joins in SQL for data engineering

The classic Google self-join question: "Print every employee with their manager's name." It uses one table aliased twice—employees on the left, managers on the right—joined on manager_id. Two pitfalls trip people up: the CEO has no manager (so INNER JOIN drops them), and column ambiguity without explicit aliases. Both are interview-grade signal: a candidate who handles them without prompting is a level above one who needs a hint.

The mental model: a self-join is a logical second copy of the same physical table. The database does not duplicate data; the query planner just allows two row-pointer roles for the same rows. Aliases (e and m) make the roles explicit. Without aliases, SELECT name FROM employees JOIN employees ON … raises ambiguous column reference.

Self-join basics: aliasing the same table twice

Detailed explanation. The JOIN syntax requires two row sources. To self-join, alias the same table twice in the FROM clause, treating each alias as a logical copy. Decide which side represents which role—conventionally e for employee, m for manager. The join condition links them: m.employee_id = e.manager_id reads as "the manager's employee_id equals the employee's manager_id."

The invariant: the join direction is asymmetric. e.manager_id = m.employee_id (employee's manager_id matches a manager's employee_id) is correct. Reverse it and you get nonsense pairs. Type the direction out loud as a sentence before writing the SQL.

Worked example. Four employees in a flat hierarchy.

employee_id name manager_id
1 Alice (CEO) NULL
2 Bob 1
3 Carol 2
4 Dan 2

INNER JOIN on m.employee_id = e.manager_id yields:

employee manager
Bob Alice
Carol Bob
Dan Bob

Alice is missing because her manager_id is NULL.

Worked-example solution.

SELECT e.name AS employee, m.name AS manager
FROM employees e
JOIN employees m ON m.employee_id = e.manager_id;

INNER vs LEFT self-join (CEO with no manager)

Detailed explanation. INNER JOIN returns rows where the join condition matches on both sides. For employees with manager_id IS NULL, no manager row matches, so they drop out—including the CEO. LEFT JOIN keeps every left row; columns from the right side become NULL when there is no match. The CEO appears with manager = NULL.

The choice depends on the prompt. "List every direct-report relationship" → INNER JOIN (the CEO has no relationship to list). "List every employee with their manager (CEO has no manager)" → LEFT JOIN. Read the prompt twice; the verb often signals which one.

Worked example. Same four-row table, LEFT JOIN.

employee manager
Alice NULL
Bob Alice
Carol Bob
Dan Bob

Now Alice appears with NULL.

Worked-example solution.

SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id;
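The LEFT-join behavior can be verified against the four-row table with stdlib sqlite3 (a sketch):

```python
import sqlite3

# The four-row hierarchy from the worked example; Alice is the CEO.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (employee_id INTEGER, name TEXT, manager_id INTEGER)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Alice", None), (2, "Bob", 1), (3, "Carol", 2), (4, "Dan", 2)],
)

rows = con.execute("""
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    LEFT JOIN employees m ON m.employee_id = e.manager_id
    ORDER BY e.employee_id
""").fetchall()
print(rows)  # [('Alice', None), ('Bob', 'Alice'), ('Carol', 'Bob'), ('Dan', 'Bob')]
```

Swap LEFT JOIN for JOIN and the ('Alice', None) row disappears—exactly the CEO pitfall from the previous sub-section.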

Projecting both names + chain depth = 1

Detailed explanation. A self-join with one hop is a direct-report report: depth = 1. To go beyond one hop (employee → manager → grand-manager), you need either a second self-join for each additional level (verbose, hard to scale) or a recursive CTE (next section). Most Google interviewers escalate from one self-join to a recursive CTE in the same conversation—it is the natural follow-up.

The invariant: a single self-join expresses exactly one parent edge per row. Two parent edges per row need two self-joins. N parent edges need N self-joins or one recursion.

Worked example. Adding the manager's manager (grand-manager).

SELECT
  e.name AS employee,
  m.name AS manager,
  gm.name AS grand_manager
FROM employees e
LEFT JOIN employees m  ON m.employee_id  = e.manager_id
LEFT JOIN employees gm ON gm.employee_id = m.manager_id;

For Carol the chain is Carol → Bob → Alice, so the row is ('Carol', 'Bob', 'Alice'); for Bob, the grand-manager column is NULL.

Worked-example solution. Same as above; for chains deeper than 2, prefer a recursive CTE (§9).

Common beginner mistakes

  • Forgetting to alias—SELECT name FROM employees JOIN employees ON … is ambiguous.
  • Using INNER JOIN and then panicking that the CEO is missing.
  • Joining on e.employee_id = m.manager_id (reversed direction) and getting nonsense pairs.
  • Pulling all columns with SELECT * so the manager's manager_id and the employee's manager_id collide silently.

SQL interview question on self-joins

Table employees(employee_id, name, manager_id). Return one row per employee with columns employee and manager (the manager's name). The CEO must appear with manager = NULL. Order by employee.

Solution using a LEFT JOIN with two aliases

SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id
ORDER BY e.name;

Why this works: The LEFT JOIN keeps all employees—including the CEO whose manager_id is NULL. Aliasing e (employee) and m (manager) makes the join direction explicit and removes column ambiguity. The single hop m.employee_id = e.manager_id produces exactly one row per employee. The ORDER BY e.name enforces deterministic output.

SQL
Topic — self-join
Self-join problems

Practice →

SQL
Topic — joins
Join problems

Practice →


9. Recursive CTEs for multi-level hierarchies

Recursive CTEs in SQL for data engineering

When the question becomes "show every ancestor in the chain, not just the direct manager," a single self-join no longer suffices. You need a recursive CTE: one query that references itself to walk the tree until termination. Google's harder DE SQL question (#613 on the practice set) is exactly this—skip-level reporting: for every employee, list every ancestor up to the CEO.

The mental model: a recursive CTE has two members joined by UNION ALL. The anchor is the base case (one row per starting node). The recursive member joins the CTE's current contents to the base table to walk one more hop. Each iteration adds new rows; the recursion stops when the recursive member returns zero rows. PostgreSQL handles termination automatically when there are no cycles; with cycles, you must add a depth cap or path tracking.

Diagram showing a four-level employee reporting chain on the left with arrows, and a SQL recursive CTE code card on the right.

Recursive CTE anatomy: anchor + recursive UNION ALL

Detailed explanation. A recursive CTE has the shape:

WITH RECURSIVE name AS (
  -- anchor: base case
  SELECT … FROM source WHERE base_predicate
  UNION ALL
  -- recursive: one hop further
  SELECT … FROM source s JOIN name n ON s.parent = n.id
)
SELECT … FROM name;

Three rules. First, the keyword RECURSIVE is required in PostgreSQL (some dialects make it optional; do not rely on that). Second, UNION ALL is the conventional choice—PostgreSQL also accepts UNION, but it deduplicates the working set on every iteration, which costs extra (it is occasionally used deliberately to stop cycles). Third, the anchor and recursive members must have the same column count and types.

The invariant: the recursive member adds rows that the anchor's invariant lets it generate. If the anchor returns "depth-1 ancestors," the recursive member must return "depth-N ancestors given depth-(N-1) rows." Whatever shape the anchor produces, the recursive member must extend.

Worked example. Walk from CEO down to all employees, tracking depth.

Worked-example solution.

WITH RECURSIVE chain AS (
  SELECT employee_id, name, manager_id, 1 AS level
  FROM employees
  WHERE manager_id IS NULL
  UNION ALL
  SELECT e.employee_id, e.name, e.manager_id, c.level + 1
  FROM employees e
  JOIN chain c ON e.manager_id = c.employee_id
)
SELECT * FROM chain ORDER BY level, name;

Tracking depth with level + 1

Detailed explanation. Carry an integer column that increments at each recursive step. The anchor sets 1 (or 0, depending on convention); the recursive member adds c.level + 1. This level column is your loop counter—it tells you how many hops from the start each row is. It is also your safety net: capping level < 50 in the recursive WHERE prevents infinite recursion on cyclic data.

The invariant: level is monotonically non-decreasing across iterations. The first iteration has level = 1, the second level = 2, and so on. If you see a row with level = 1 after the first iteration, your recursive join condition is wrong.

Worked example. 4-level chain Alice → Bob → Carol → Dan, walking down from CEO.

name level
Alice 1
Bob 2
Carol 3
Dan 4

Each row's level equals its depth from the CEO.

Worked-example solution.

WITH RECURSIVE chain AS (
  SELECT employee_id, name, 1 AS level
  FROM employees
  WHERE manager_id IS NULL
  UNION ALL
  SELECT e.employee_id, e.name, c.level + 1
  FROM employees e
  JOIN chain c ON e.manager_id = c.employee_id
)
SELECT * FROM chain ORDER BY level, name;
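SQLite also supports WITH RECURSIVE, so the depth-tracking query can be spot-checked with the stdlib sqlite3 module (a sketch over the four-level Alice → Bob → Carol → Dan chain):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (employee_id INTEGER, name TEXT, manager_id INTEGER)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Alice", None), (2, "Bob", 1), (3, "Carol", 2), (4, "Dan", 3)],
)

rows = con.execute("""
    WITH RECURSIVE chain AS (
      SELECT employee_id, name, 1 AS level
      FROM employees
      WHERE manager_id IS NULL
      UNION ALL
      SELECT e.employee_id, e.name, c.level + 1
      FROM employees e
      JOIN chain c ON e.manager_id = c.employee_id
    )
    SELECT name, level FROM chain ORDER BY level, name
""").fetchall()
print(rows)  # [('Alice', 1), ('Bob', 2), ('Carol', 3), ('Dan', 4)]
```

Each row's level matches its depth from the CEO, which is exactly the monotonicity invariant described above.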

Termination and infinite-loop guards

Detailed explanation. PostgreSQL terminates a recursive CTE automatically when the recursive member returns zero rows. For acyclic trees (a typical org chart) this happens naturally—when you reach the leaf level, no employee has a parent in the current chain set, so the next iteration is empty and recursion stops. For cyclic data (employee A reports to B, B reports to A), the recursion never terminates without a guard.

Two guards. Cheap: cap depth in the recursive member with WHERE c.level < 50. The query terminates after at most 50 hops. Expensive but correct: track the visited path as an array column and check membership before recursing—WHERE NOT (e.employee_id = ANY(c.path)). Use the cheap guard for interview pace; mention the path-tracking version as the production-grade fix.

Worked example. Suppose manager_id is corrupted so that Bob and Alice manage each other (a cycle). If the recursion ever reaches the cycle—for instance, when walking up from each employee—it runs forever without a guard. With WHERE c.level < 50, the query stops after at most 50 hops, and the capped rows signal the issue.

Worked-example solution.

WITH RECURSIVE chain AS (
  SELECT employee_id, name, 1 AS level
  FROM employees
  WHERE manager_id IS NULL
  UNION ALL
  SELECT e.employee_id, e.name, c.level + 1
  FROM employees e
  JOIN chain c ON e.manager_id = c.employee_id
  WHERE c.level < 50
)
SELECT * FROM chain;

Common beginner mistakes

  • Using UNION when you meant UNION ALL—the per-iteration deduplication is slow and usually unnecessary.
  • Forgetting the RECURSIVE keyword—Postgres requires it explicitly.
  • Mismatched column count or types between anchor and recursive members.
  • No depth cap in cyclic data → query hangs.
  • Joining the recursive member back to itself in the wrong direction (c.manager_id = e.employee_id instead of e.manager_id = c.employee_id).

SQL interview question on recursive CTEs

Table employees(employee_id, name, manager_id). Return every (employee, ancestor, depth) triple, where ancestor is any manager up the chain (direct manager has depth 1, their manager has depth 2, and so on). Order by employee_id, then depth.

Solution using a recursive CTE

WITH RECURSIVE chain AS (
  SELECT employee_id, manager_id, 1 AS depth
  FROM employees
  WHERE manager_id IS NOT NULL
  UNION ALL
  SELECT c.employee_id, e.manager_id, c.depth + 1
  FROM chain c
  JOIN employees e ON e.employee_id = c.manager_id
  WHERE e.manager_id IS NOT NULL
)
SELECT employee_id, manager_id AS ancestor_id, depth
FROM chain
ORDER BY employee_id, depth;

Why this works: The anchor member emits each (employee, direct_manager, 1) pair, so the CTE starts populated for every non-CEO employee. The recursive member walks up one ancestor per iteration by joining chain.manager_id to employees.employee_id and incrementing depth; it filters out the CEO with WHERE e.manager_id IS NOT NULL so we do not try to recurse from a row whose ancestor is null. Termination is automatic: when the next ancestor's manager_id is NULL (we have reached the CEO), the join produces zero rows and recursion stops.

Optimized solution using a depth guard

For deep or potentially cyclic data, add AND c.depth < 50 to the recursive member's WHERE. The query now runs in bounded time even if the data has corrupted manager pointers, and you can flag rows where depth = 50 for investigation. This single line is the difference between a query that hangs production and one that fails fast.

SQL
Topic — CTE
CTE problems

Practice →

SQL
Topic — tree traversal
Tree traversal problems

Practice →


Tips to crack Google data engineering interviews

These are habits that move the needle in real Google loops—not a re-statement of the topics above.

SQL preparation: drill the Google patterns

Spend most of your SQL prep on GROUP BY + subquery (above-average filters), self-joins (employee/manager), and recursive CTEs (skip-level reporting). Type the queries; do not just read them. The aggregation topic page, self-join topic page, and CTE topic page cover the three patterns directly.

Python preparation: stdlib fluency over framework knowledge

Google interviewers care more about clean Counter / defaultdict use than whether you know the latest pandas API. Prioritize: collections, functools.wraps, itertools.groupby, context managers for file I/O. Pull problems from the hash table topic page and the file I/O topic page.

Data pipeline thinking: clean → cast → aggregate

When the prompt says "here is a CSV," walk through three explicit phases: clean (drop or coerce malformed rows, with a count of how many you dropped), cast (pd.to_numeric(errors='coerce') before astype), aggregate (named aggregations with explicit column names). Saying these three words out loud changes how the interviewer scores the round.

Where to practice on PipeCode

| Skill lane | Practice path |
| --- | --- |
| Curated Google practice set | /explore/practice/company/google |
| SQL aggregation + subquery | /explore/practice/topic/aggregation |
| SQL self-join | /explore/practice/topic/self-join |
| SQL recursive CTE | /explore/practice/topic/cte |
| Hash table + frequency | /explore/practice/topic/hash-table |
| File I/O + log parsing | /explore/practice/topic/file-io |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |

Communication under time pressure

State assumptions before typing: "I'll assume manager_id is null only for the CEO and that there are no cycles." State grain: "One row per employee per ancestor." State edge cases: "If the recursion has a cycle, my depth guard caps at 50." Interviewers reward clear reasoning over silent perfection.


Frequently Asked Questions

What is the Google data engineering interview process like?

The Google data engineering interview typically includes a phone screen (math + Python warm-up), one or two coding rounds focused on SQL and Python data manipulation, a system-design conversation around pipelines and warehousing, and behavioral interviews. The curated 14-problem Google practice set on PipeCode mirrors what you will see on the technical rounds.

What SQL topics does Google test for data engineers?

Google emphasizes GROUP BY with subqueries (above-average patterns), self-joins for reporting hierarchies, and recursive CTEs for skip-level reporting. Window functions and date arithmetic show up in onsite rounds. Drill these on the aggregation, self-join, and CTE topic pages before your screen.

How important is Python for a Google data engineering interview?

Python is roughly half the interview at Google—expect frequency counting with Counter, file/log parsing, decorator stacks for retry+log, and pandas for clean-cast-aggregate. Stdlib fluency (collections, functools, itertools) matters more than knowing the latest framework.
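A minimal sketch of the retry + log decorator stack mentioned above, assuming a hypothetical `flaky()` function that fails twice before succeeding; `functools.wraps` keeps the wrapped function's metadata intact.

```python
import functools

def log(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

def retry(times):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last = None
            for _ in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    last = exc
            raise last
        return wrapper
    return decorator

attempts = {"n": 0}

@log            # outermost: logs one call; retries happen inside
@retry(times=3)
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

print(flaky())          # succeeds on the third internal attempt
print(flaky.__name__)   # 'flaky' — @wraps preserved the name
```

Stacking order is the usual follow-up question: `@log` wraps the retrying function, so one log line covers all three attempts; swapping the order would log each attempt separately.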

Are there pandas questions in the Google data engineering interview?

Yes—particularly in the data-manipulation round. Expect a dirty CSV in, per-customer or per-key aggregates out, with pd.to_numeric(errors='coerce') to handle malformed rows. The type casting and aggregation practice topics cover these patterns.

How should I prepare for Google's recursive CTE and self-join questions?

Build the mental model first: a self-join is one hop; a recursive CTE walks the tree. Type out the anchor + recursive UNION ALL shape from memory ten times before your interview. Add a depth guard for cyclic data. The CTE and self-join topic pages have problems matching the Google patterns directly.

How many Google practice problems should I solve before the interview?

Aim for 30–50 problems spanning all nine topic clusters above—not 100 of the same kind. Solve every problem in the Google-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.


Start practicing Google data engineering problems

Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Google problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Google asks—without the noise of generic algorithm prep.

Pipecode.ai is Leetcode for Data Engineering.

Browse Google practice →
View all practice →
