Google data engineering interview questions lean on two stacks at once: SQL (GROUP BY with subqueries, HAVING, self-joins for reporting chains, and recursive CTEs for skip-level hierarchies) and Python (hash maps and Counter for frequency counts, pandas clean-cast-aggregate, decorators stacked for logging and retry, and file I/O for log parsing). The bar is fluency—naming grain, NULL behavior, and edge cases out loud while you type.
This guide walks through the nine topic clusters you will actually see, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix reflects the topics that appear most often in Google data engineering loops (7 easy, 7 medium across SQL and Python) so you can move from reading to typing without losing the thread.
Top Google data engineering interview topics
The nine numbered sections below are drawn from the Google data engineering practice set and follow this topic map (one table row per section):
| # | Topic (sections 1–9) | Why it shows up at Google |
|---|---|---|
| 1 | Math and arithmetic in Python | Phone-screen filter: palindrome integer, missing number, average, arithmetic-formula evaluator. |
| 2 | Array and string manipulation in Python | Vertical-scan algorithms (longest common prefix), pattern printing—careful index handling. |
| 3 | Hash maps and Counter for frequency counting | Most-common words with tie-breaking, O(1) membership tests. |
| 4 | Dynamic typing and decorators in Python | Type dispatch, @wraps, stacked decorators for cross-cutting concerns (logging + retry). |
| 5 | File I/O and log aggregation in Python | Read line-by-line, parse log records, aggregate with defaultdict(int). |
| 6 | Pandas: clean, cast, and aggregate | Real ETL prompts—dirty CSV in, per-customer totals out. |
| 7 | SQL aggregation, GROUP BY and subqueries | "Above-average salary per department" with a scalar subquery vs JOIN to a grouped aggregate. |
| 8 | SQL self-joins for reporting hierarchies | Employee → manager via manager_id, including CEO with no manager. |
| 9 | Recursive CTEs for multi-level hierarchies | Skip-level reporting chains; anchor + recursive UNION ALL, depth tracking, termination. |
SQL evaluation order (mental model):
FROM/joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → windows → SELECT/ORDER BY. Filter on a sum or count of a group? That belongs in HAVING, not WHERE.
1. Math and arithmetic concepts in data engineering
Math and number primitives in Python for data engineering
Google opens many phone screens with a small math problem—not because data engineers solve contest math, but because the cleanest solutions force you to verbalize invariants. An invariant is a property that does not change as the algorithm runs: the digit-sum of a number across reversal, the closed-form sum of 0..n, the count of non-null values inside an aggregate. Once you name the invariant, you usually skip the obvious O(n²) or O(n log n) approach and land on O(n) or O(1).
The four primitives below cover the math problems on the Google practice set: palindrome integer (#82), missing number (#83), average of integers (#89), and arithmetic-formula evaluator (#273). All four reduce to two ideas: integer arithmetic with % 10 and // 10, and replacing a loop-and-check with a closed-form expression. Memorize those two and the section is muscle memory.
Palindrome check using integer reversal
Detailed explanation. A palindromic integer reads the same forwards and backwards (121, 1221). The standard trick avoids converting to a string: pull the last digit with n % 10, fold it into a running reversed integer with rev = rev * 10 + last, then drop the last digit with n //= 10. When n reaches 0, rev holds the reversed integer. Comparing rev == original resolves the question. The whole loop is O(d) where d is the digit count (log₁₀ n)—constant for any practical 64-bit integer.
Two edge cases trip people up. Negative numbers are not palindromes by convention because the leading minus sign breaks symmetry; return False immediately. Trailing zeros (10, 100) are also not palindromes—10 reverses to 01 = 1, and the final comparison rejects it naturally. Zero itself is a palindrome, and the code below handles it correctly because the loop never runs and rev stays 0, equal to the original.
Worked example. n = 121.
| step | n | rev |
|---|---|---|
| start | 121 | 0 |
| iter 1 | 12 | 1 |
| iter 2 | 1 | 12 |
| iter 3 | 0 | 121 |
Result: rev (121) == original (121) → palindrome.
Worked-example solution.
def is_palindrome(n: int) -> bool:
if n < 0:
return False
original, rev = n, 0
while n > 0:
rev = rev * 10 + n % 10
n //= 10
return rev == original
Missing number using the Gauss formula
Detailed explanation. Given an array of distinct integers from 0..n with exactly one value missing, the brute-force "for each candidate, scan the array" runs in O(n²). The hash-set approach drops it to O(n) time but O(n) memory. The Gauss-summation approach hits O(n) time and O(1) memory: the sum of 0..n is n*(n+1)/2, so missing = expected_sum − actual_sum. This is your invariant—one missing number perturbs the sum by exactly that number.
A second technique uses XOR: 0 ⊕ 1 ⊕ … ⊕ n ⊕ a₀ ⊕ a₁ ⊕ … ⊕ a_{n-1} cancels every value except the missing one, because x ⊕ x = 0 and x ⊕ 0 = x. Slightly more numerically robust than summation if n is huge (no overflow concern in Python, but worth knowing).
Worked example. nums = [3, 0, 1] → n = 3, expected sum 0+1+2+3 = 6, actual sum 3+0+1 = 4, missing = 6 − 4 = 2.
Worked-example solution.
def missing_number(nums: list[int]) -> int:
n = len(nums)
return n * (n + 1) // 2 - sum(nums)
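The XOR alternative mentioned above, as a minimal sketch (the function name is mine; each present value cancels against its index, leaving only the missing value):

```python
def missing_number_xor(nums: list[int]) -> int:
    # XOR every index 0..n-1 with every array value, then fold in n itself;
    # pairs of equal values cancel, so only the missing number survives.
    acc = 0
    for i, v in enumerate(nums):
        acc ^= i ^ v
    return acc ^ len(nums)
```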
Average of integers (overflow-safe)
Detailed explanation. "Average" sounds trivial—until the prompt asks for floor of the average, rounding of the average, or for a guarantee against integer overflow on a sum that could exceed 2³¹. In Python, integers are arbitrary-precision, so overflow is not a concern, but the prompt clarification matters: int(total / len) returns a truncated float-converted integer, while total // len returns floor division. They agree for positive numbers and disagree for negatives. State which one the prompt wants.
Worked example. [1, 2, 3, 4, 5] → sum 15, len 5, average 3.0.
Worked-example solution.
def average(nums: list[int]) -> float:
if not nums:
raise ValueError("empty input")
return sum(nums) / len(nums)
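A quick sketch of the floor-vs-truncation distinction mentioned above; the two only disagree when the result is negative:

```python
total, count = -7, 2
int(total / count)   # -3: true division, then truncation toward zero
total // count       # -4: floor division, toward negative infinity
```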
Arithmetic formula evaluation with two stacks
Detailed explanation. Evaluating an expression like 3 + 5 * 2 is a parsing-plus-arithmetic problem. The classic two-stack solution: one stack holds numbers, one holds operators. Walk the expression token by token. For each operator, before pushing, apply any operator already on top whose precedence is greater than or equal to the new one (left-associative). At end of input, drain remaining operators.
The invariant is that the operator stack is always sorted by precedence from bottom to top: a higher-precedence operator can sit on top of a lower one (because it will resolve first), but a lower one cannot sit on top of a higher one (we resolve the higher one first, then push the lower). Parentheses are handled by treating ( as a marker that no operator on top can resolve until the matching ) arrives.
Worked example. 3 + 5 * 2.
| token | nums | ops | action |
|---|---|---|---|
| 3 | [3] | [] | push number |
| + | [3] | [+] | push op |
| 5 | [3, 5] | [+] | push number |
| * | [3, 5] | [+, *] | * outranks +, push |
| 2 | [3, 5, 2] | [+, *] | push number |
| end | [3, 10] → [13] | [+] → [] | resolve * (5*2=10), then + (3+10=13) |
Final answer: 13.
Worked-example solution.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
def evaluate(expr: str) -> int:
nums: list[int] = []
ops: list[str] = []
def apply():
b, a, op = nums.pop(), nums.pop(), ops.pop()
nums.append({"+": a + b, "-": a - b, "*": a * b, "/": a // b}[op])
tokens = expr.replace(" ", "")
i = 0
while i < len(tokens):
ch = tokens[i]
if ch.isdigit():
j = i
while j < len(tokens) and tokens[j].isdigit():
j += 1
nums.append(int(tokens[i:j]))
i = j
continue
while ops and PRECEDENCE.get(ops[-1], 0) >= PRECEDENCE[ch]:
apply()
ops.append(ch)
i += 1
while ops:
apply()
return nums[0]
Common beginner mistakes
- Forgetting negative-number handling for palindrome (returning True for -121).
- Using sum(range(n)) for missing-number on a 0-indexed array (off-by-one—range(n) skips n).
- Doing total / len(nums) and getting a float when the prompt wanted floor division.
- Reading the arithmetic expression left to right without precedence (3 + 5 * 2 = 16 is wrong).
Python interview question on math and arithmetic
Given a list nums of distinct integers from 0..n with exactly one value missing, find the missing value in O(n) time and O(1) space without sorting.
Solution using the Gauss summation identity
def missing_number(nums: list[int]) -> int:
n = len(nums)
return n * (n + 1) // 2 - sum(nums)
Why this works: The closed-form n*(n+1)/2 gives the sum of 0..n in constant time. Subtracting the actual sum collapses the search to a single linear scan, beating the obvious set membership approach on memory (O(1) vs O(n)) and the sort approach on time (O(n) vs O(n log n)). The invariant is that one missing value perturbs the total by exactly that value—nothing else changes.
2. Array and string manipulation in Python
Strings, slicing, and vertical scans for data engineering
Once the math warm-up is done, Google moves to small string problems that test careful index handling and edge-case thinking: empty input, single-element input, mixed lengths. The two patterns to recognize: vertical scans (compare across strings column by column, used for longest common prefix and grid-style problems) and printable-pattern loops (nested loops where the inner range encodes the shape).
The Google practice set uses these primitives directly: longest common prefix (#88), star triangle pattern (#84), and most-common-words string handling (#87). Beyond syntax, the interviewer is watching whether you state the invariant out loud—"the answer cannot be longer than the shortest string"—because that single sentence proves you saw the problem at the right altitude.
String iteration and slicing
Detailed explanation. Python strings are immutable sequences: indexable like lists (s[0], s[-1]), sliceable (s[1:4], s[::-1] for reverse), but you cannot mutate them in place (s[0] = 'x' raises TypeError). Slices are copies—s[a:b] builds a new string in O(b−a) time. Concatenation with += inside a loop is O(n²) because each concat copies the whole prefix; build a list of pieces and join once for O(n).
The cheap operations to memorize: len(s), s.startswith(prefix), s.endswith(suffix), min(strs, key=len) to find the shortest string in a list, and enumerate(s) to iterate with indices. The expensive operations to avoid in loops: s += ch, ''.join(parts) inside the inner loop instead of after, re.search with a recompiled pattern.
Worked example. s = "flower" → s[:2] is "fl"; s[-3:] is "wer"; s[::-1] is "rewolf". len(s) is 6.
Worked-example solution.
def reverse_with_slice(s: str) -> str:
return s[::-1]
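A small sketch of the build-a-list-and-join pattern mentioned above, which replaces quadratic += concatenation with a single O(n) join (the function name is illustrative):

```python
def normalize(tokens: list[str]) -> str:
    # Collect the pieces in a list and join once at the end: O(n) total,
    # instead of O(n^2) from growing a string with += inside the loop.
    parts = []
    for t in tokens:
        parts.append(t.strip().lower())
    return " ".join(parts)
```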
Vertical-scan algorithm for longest common prefix
Detailed explanation. The longest common prefix (LCP) of a list of strings is the longest string that all of them start with. Two algorithms compete: horizontal scan (compute LCP of the first two strings, then LCP of that with the third, and so on; total O(S) where S = total characters) and vertical scan (walk the columns of the strings, stop at the first mismatch). Vertical scan wins on average because it can exit very early—after column 0 if the first characters differ.
The invariant for vertical scan is: the answer is at most as long as the shortest string in the list. Anchor on min(strs, key=len) to bound the outer loop. For each column i, check that every string's i-th character matches the anchor's. The first mismatch returns anchor[:i]. If the loop completes, the entire anchor is the answer.
Worked example. ["flower", "flow", "flight"].
| col | flower[i] | flow[i] | flight[i] | match? |
|---|---|---|---|---|
| 0 | f | f | f | ✓ |
| 1 | l | l | l | ✓ |
| 2 | o | o | i | ✗ → stop |
Return "fl" (anchor sliced to [:2]).
Worked-example solution.
def longest_common_prefix(strs: list[str]) -> str:
if not strs:
return ""
shortest = min(strs, key=len)
for i, ch in enumerate(shortest):
for s in strs:
if s[i] != ch:
return shortest[:i]
return shortest
Nested loops for printable patterns
Detailed explanation. Google asks pattern-printing questions to test loop control. The recipe: outer loop iterates rows, inner loop iterates columns; the inner loop's range is the row number. For a left-aligned star triangle of rows rows, row i has i stars. For a centered diamond, row i has 2*i − 1 stars and rows − i leading spaces. The pattern is in the inner range, not in special cases.
The invariant: rows are independent—nothing one row prints affects another. So once you have the inner expression right, scaling to any row count is free. State this out loud; interviewers note candidates who see independence and avoid coupling.
Worked example — 4-row star triangle.
*
* *
* * *
* * * *
Row i (1-indexed) prints i star tokens separated by spaces.
Worked-example solution.
def star_triangle(rows: int) -> None:
for i in range(1, rows + 1):
print(" ".join("*" * i))
Common beginner mistakes
- Returning None (or raising) instead of "" for an empty input list in LCP.
- Iterating with range(len(strs[0])) and forgetting that other strings may be shorter—the first out-of-range index access raises IndexError.
- Mutating the string with += inside a hot loop (quadratic time).
- Off-by-one on the row count (0-indexed vs 1-indexed)—a triangle with rows=4 should print 4 rows, not 3 or 5.
Python interview question on string manipulation
Given a list of strings, return the longest common prefix shared by all of them. If there is no common prefix, return "". Optimize for the common case where the answer is short.
Solution using vertical scan
def longest_common_prefix(strs: list[str]) -> str:
if not strs:
return ""
shortest = min(strs, key=len)
for i, ch in enumerate(shortest):
for s in strs:
if s[i] != ch:
return shortest[:i]
return shortest
Why this works: The answer cannot be longer than the shortest string (proven by definition of "common prefix"), so anchoring on min(strs, key=len) bounds the outer loop in O(min_len). The vertical scan exits at the first column mismatch, which is typically very early in real data—total work is O(min_len * len(strs)) worst case, often closer to O(answer_len * len(strs)). Returning "" for an empty list is a defined contract, not an error.
3. Hash maps and Counter for frequency counting in Python
Frequency counting and hash-map fluency in Python for data processing
Once the easy warm-ups are done, Google's interviewers expect you to reach for collections.Counter without prompting. Frequency counting shows up everywhere in data engineering—word counts, top-K queries, deduplication, set membership for streaming joins—so fluency with the standard library is the difference between a clean answer and a 30-line manual loop.
A hash map (Python dict) gives you average O(1) insert and lookup, in exchange for O(n) memory. A Counter is a dict subclass specialized for counts: it auto-zeros missing keys (no KeyError), supports arithmetic (c1 + c2 merges, c1 - c2 subtracts and clamps to zero), and exposes .most_common(k) for top-K. The difference between a clean Google answer and a sloppy one is usually which container you reach for.
dict vs defaultdict vs Counter
Detailed explanation. Three containers, three trade-offs:
- dict — Generic key→value map; raises KeyError on missing keys unless you use .get(k, default). Use when keys are heterogeneous and you write the access pattern by hand.
- defaultdict(factory) — On missing-key access, calls factory() and inserts the result. defaultdict(int) auto-zeros, perfect for counting. defaultdict(list) auto-creates an empty list, perfect for grouping. The factory is called once per missing key.
- Counter(iterable) — Built specifically for frequencies. Constructed from any iterable; gives .most_common(k), +, -, & (intersection: min counts), | (union: max counts) for free.
The invariant: Counter is the only one that already knows its values are integers, so it composes with arithmetic and .most_common. Reach for it whenever the problem says "count" or "frequency."
Worked example. Counting words in ["the", "cat", "the", "dog", "cat"].
| container | construction | result |
|---|---|---|
| dict | for w in words: d[w] = d.get(w, 0) + 1 | {'the': 2, 'cat': 2, 'dog': 1} |
| defaultdict(int) | for w in words: d[w] += 1 | same |
| Counter | Counter(words) | Counter({'the': 2, 'cat': 2, 'dog': 1}) |
Worked-example solution.
from collections import Counter
def word_counts(words: list[str]) -> dict[str, int]:
return dict(Counter(words))
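The Counter arithmetic mentioned above, as a quick sketch (the counts are illustrative):

```python
from collections import Counter

yesterday = Counter({"login": 3, "logout": 1})
today = Counter({"login": 2, "error": 4})

yesterday + today   # Counter({'login': 5, 'error': 4, 'logout': 1})  merge counts
today - yesterday   # Counter({'error': 4})                           subtract, clamp at zero
yesterday & today   # Counter({'login': 2})                           min of each count
yesterday | today   # Counter({'error': 4, 'login': 3, 'logout': 1})  max of each count
```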
Counter.most_common(k) and tie-breaking
Detailed explanation. c.most_common() returns a list of (key, count) pairs sorted by count descending. c.most_common(k) returns the top k. When two keys tie, insertion order breaks the tie—not alphabetical, not lexicographic. This is a Python 3.7+ guarantee (dict preserves insertion order), but it is rarely what the interviewer wants.
If the prompt says "tie-break alphabetically," override the sort: sorted(c.items(), key=lambda x: (-x[1], x[0])). The key tuple (-count, word) sorts by count descending (the negation flips ascending into descending) and then by word ascending. This pattern is so common it should be a reflex—negate primary, ascend secondary.
Worked example. Counting ["the", "cat", "the", "dog", "cat"]. Counter.most_common(2) could return [('the', 2), ('cat', 2)] (insertion order: the was seen first), but if the prompt asks for alphabetical tie-break we want [('cat', 2), ('the', 2)].
Worked-example solution.
from collections import Counter
def top_k_alpha(words: list[str], k: int) -> list[str]:
c = Counter(words)
ordered = sorted(c.items(), key=lambda x: (-x[1], x[0]))
return [w for w, _ in ordered[:k]]
Hash-map lookup for O(1) membership
Detailed explanation. When a brute-force solution is O(n²) because of a for…in list lookup, replace the list with a set or dict and the lookup drops to average O(1). Worst case is still O(n) (hash collisions), but in practice Python's hash table keeps it constant. The trade-off is memory: a set of 1M ints uses roughly 30–60 MB, a list uses 8 MB. For bounded data this is free; for unbounded it can OOM.
The invariant: if you find yourself writing if x in big_list, you almost certainly want if x in big_set. State the rewrite even when the input is already a list—convert once with s = set(big_list) and your subsequent membership tests are O(1).
Worked example. nums = [1, 2, 3, …, 1_000_000]. 42 in nums scans up to 1M elements; 42 in set(nums) is one hash plus one bucket check once the set is built.
Worked-example solution.
def has_pair_with_sum(nums: list[int], target: int) -> bool:
seen: set[int] = set()
for n in nums:
if target - n in seen:
return True
seen.add(n)
return False
Common beginner mistakes
- Using a plain dict with += 1 and crashing with KeyError on the first unseen key.
- Trusting Counter.most_common to break ties alphabetically (it does not).
- Mutating a Counter while iterating over .items() (RuntimeError: dictionary changed size during iteration).
- Building a Counter from an unsplit string—Counter("the cat") counts characters, not words.
Python interview question on frequency and tie-breaking
Given a list of words, return the top K most frequent words. When two words tie, the lexicographically smaller word comes first.
Solution using Counter and a custom sort
from collections import Counter
def top_k_words(words: list[str], k: int) -> list[str]:
counts = Counter(words)
ordered = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
return [w for w, _ in ordered[:k]]
Why this works: Counter(words) builds the frequency map in one O(n) pass. The sort key (-count, word) orders by count descending (negation flips ascending into descending) and then by word ascending—exactly the prompt's tie-break rule. Slicing [:k] bounds output to k items, and the overall complexity is O(n + u log u) where u = len(counts). For typical text data u ≪ n, so this is effectively linear with a tiny log factor.
4. Dynamic typing and decorators in Python
Python idioms for ETL: type dispatch and decorator stacks
Google leans into Python's dynamic typing and decorator facilities for data engineering rounds because real ETL code mixes types (a column is sometimes int, sometimes str) and needs cross-cutting concerns like retries and logging without polluting business logic. A clean answer separates "what the code does" from "how it is wrapped"—business logic in the function body, retry/log/timing in decorators around it.
The mental model: a decorator is a function that takes a function and returns a function. Stacking two decorators means the closer-to-def one runs innermost. Order matters more than people expect: @retry then @log makes one log per call (regardless of retries); @log then @retry makes one log per attempt. Interviewers ask explicitly which behavior the prompt wants.
type() vs isinstance (and why isinstance wins)
Detailed explanation. Two ways to ask "is x an int?" — type(x) is int and isinstance(x, int). They are not equivalent. type(x) is int checks exact identity—it returns False for any subclass. isinstance(x, int) returns True for int and any subclass. Because bool is a subclass of int in Python (True == 1, False == 0), isinstance(True, int) is True but type(True) is int is False.
The rule: prefer isinstance unless you specifically need to reject subclasses. A function that takes "an int" should accept booleans by default. A function that wants to reject booleans (e.g. for serialization to a strict-int column) should check explicitly: isinstance(x, int) and not isinstance(x, bool). State this distinction in the interview—it shows you understand Python's type hierarchy.
Worked example.
| call | result | reason |
|---|---|---|
| isinstance(True, int) | True | bool is a subclass of int |
| type(True) is int | False | exact identity, bool ≠ int |
| isinstance(1, (int, float)) | True | tuple of types accepted |
| isinstance("1", int) | False | str is not a subclass of int |
Worked-example solution.
def is_strict_int(x) -> bool:
return isinstance(x, int) and not isinstance(x, bool)
Dispatch table for dynamic types
Detailed explanation. When a function must do different things for different input types, the dispatch table pattern beats a long if/elif chain on readability and extensibility. Build a dict mapping type → handler; lookup is O(1); adding a new type is one line. The fallback case (unknown type) becomes explicit.
The invariant: dispatch tables encode an open-set assumption. Adding a handler for a new type does not require editing the dispatcher itself, only registering a new entry. This is the same idea as functools.singledispatch (which is the more idiomatic way once a project commits to it), but a hand-built dict is fine for interview pace.
Worked example. Process 42, "hello", [1, 2, 3].
| input | type | handler | output |
|---|---|---|---|
| 42 | int | lambda x: x * 2 | 84 |
| "hello" | str | lambda x: x.upper() | "HELLO" |
| [1, 2, 3] | list | lambda x: list(reversed(x)) | [3, 2, 1] |
Worked-example solution.
HANDLERS = {
int: lambda x: x * 2,
str: lambda x: x.upper(),
list: lambda x: list(reversed(x)),
}
def process(value):
handler = HANDLERS.get(type(value))
if handler is None:
raise TypeError(f"unsupported type: {type(value).__name__}")
return handler(value)
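The functools.singledispatch variant mentioned above, as a sketch. Unlike the type(value) dict lookup, it dispatches with isinstance semantics, so subclasses (including bool) fall through to the int handler:

```python
from functools import singledispatch

@singledispatch
def process(value):
    # Fallback for unregistered types.
    raise TypeError(f"unsupported type: {type(value).__name__}")

@process.register
def _(value: int):
    return value * 2

@process.register
def _(value: str):
    return value.upper()

@process.register
def _(value: list):
    return list(reversed(value))
```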
Decorator basics with functools.wraps
Detailed explanation. A decorator is sugar for f = decorator(f). Without @wraps, the wrapper function replaces the original's __name__, __doc__, and __qualname__, breaking debugging, logging, and help(). functools.wraps(fn) copies those attributes from fn to the wrapper. Always use @wraps—it is one line and saves an hour of "why does my logger say wrapper instead of fetch_user?" debugging.
A second subtlety: a decorator with arguments (@retry(times=3)) is actually a factory that returns the real decorator. So you have three layers: factory → decorator → wrapper. Each layer captures variables in its closure. Get the levels straight in your head before writing any complex decorator.
Worked example. Wrap add(a, b) so each call prints "call add" first.
add(2, 3)
# call add
# 5
Worked-example solution.
from functools import wraps
def log_calls(fn):
@wraps(fn)
def wrapper(*args, **kwargs):
print(f"call {fn.__name__}")
return fn(*args, **kwargs)
return wrapper
@log_calls
def add(a: int, b: int) -> int:
return a + b
Stacking decorators: logging + retry
Detailed explanation. When two decorators stack, the one closest to def is innermost—so it runs first on call and last on return. With @log over @retry over def f, the call chain is log_wrapper → retry_wrapper → f; logging fires once at the outer layer, retries are internal to the retry layer. Flip the order, and log_wrapper becomes innermost, so each retry attempt is logged separately.
For a Google interview, the prompt usually says "log every call once and retry on failure" (logs outside, retries inside) or "log every retry attempt" (logs inside, retries outside). Read carefully and state which order you chose before writing code.
Worked example. Function fetch_user(uid) that fails twice then succeeds. Decorated as @log_calls (outer) over @retry(times=3) (inner). Output:
call fetch_user
{'id': 42, 'name': 'Ada'}
(One log line, three internal attempts the user does not see.)
Worked-example solution.
from functools import wraps
import time
def retry(times: int):
def deco(fn):
@wraps(fn)
def wrapper(*a, **kw):
for attempt in range(1, times + 1):
try:
return fn(*a, **kw)
except Exception:
if attempt == times:
raise
time.sleep(0.1)
return wrapper
return deco
def log_calls(fn):
@wraps(fn)
def wrapper(*a, **kw):
print(f"call {fn.__name__}")
return fn(*a, **kw)
return wrapper
@log_calls
@retry(times=3)
def fetch_user(uid: int) -> dict:
...
Common beginner mistakes
- Forgetting @wraps—the wrapped function loses its name, breaking logs, tracebacks, and help().
- Using type(x) == int and missing bool (or any subclass) silently.
- Stacking decorators in the wrong order so retries are logged twice (or not at all).
- Retrying a non-idempotent operation (a payment, a non-idempotent INSERT) without an idempotency key.
Python interview question on decorator stacks
Write a retry(times, exceptions) decorator that retries a function up to times times, only when one of the listed exceptions is raised, with exponential backoff. Stack it with a log_calls decorator so each call is logged exactly once, no matter how many retry attempts happen.
Solution using functools.wraps and exception filtering
from functools import wraps
import time
def retry(times: int, exceptions: tuple = (Exception,)):
def deco(fn):
@wraps(fn)
def wrapper(*args, **kwargs):
for attempt in range(1, times + 1):
try:
return fn(*args, **kwargs)
except exceptions:
if attempt == times:
raise
time.sleep(0.1 * (2 ** (attempt - 1)))
return wrapper
return deco
def log_calls(fn):
@wraps(fn)
def wrapper(*args, **kwargs):
print(f"call {fn.__name__}")
return fn(*args, **kwargs)
return wrapper
@log_calls
@retry(times=3, exceptions=(TimeoutError,))
def fetch_user(user_id: int) -> dict:
...
Why this works: Retry is the inner decorator, so each call logs once (at the outer layer) regardless of how many internal retries happen. Exponential backoff (0.1 * 2^(attempt-1)) prevents thundering-herd retries against a struggling downstream. The exceptions tuple lets the caller scope retries to recoverable failures (timeouts) without swallowing real bugs (KeyError, TypeError). functools.wraps preserves the function's identity so logs and tracebacks remain readable.
Practice Google-style Python problems →
5. File I/O and log aggregation in Python
Streaming file reads and log parsing for data engineering
The closest thing to real DE work in a Google interview is the log aggregator: read a file line-by-line, parse each line, and emit per-key totals. Three ideas matter—stream, do not slurp; isolate parsing from aggregation; handle malformed lines without crashing the whole pipeline.
Streaming matters because production log files can be tens of gigabytes. Loading the file into memory (f.read().splitlines()) is fine for 1 MB and fatal for 50 GB. Iterating for line in f: reads one line at a time. Parsing isolation matters because parsing logic is the part most likely to break on a corrupt line; if parsing is a separate function returning None on failure, the aggregator can skip-and-count without aborting. This is exactly the structure interviewers grade.
Reading files line-by-line with a context manager
Detailed explanation. with open(path) as f: creates a context-managed file handle. The with block ensures f.close() runs even if the body raises—the file descriptor leak is one of the most common bugs in DE code that does not use with. Inside the block, for line in f: produces one line at a time, including the trailing newline. Use line.rstrip("\n") if you do not want it.
For binary or unusual encodings, pass mode="rb" or encoding="utf-8" (default on most systems but worth being explicit). For very large files where you need to seek, switch to f.seek(offset) instead of iteration. For multi-process reading, split the file into byte ranges and assign one to each worker.
Worked example. Print every line of access.log with the leading line number.
Worked-example solution.
def print_with_lineno(path: str) -> None:
with open(path, encoding="utf-8") as f:
for i, line in enumerate(f, start=1):
print(f"{i}: {line.rstrip()}")
Parsing log lines: split vs regex
Detailed explanation. Two parsers, two trade-offs. str.split is fast and brittle: it works when the format is fixed and whitespace-separated, breaks when fields are quoted or contain spaces. re.match / re.search is slower and flexible: it handles optional fields, quoted strings, and structured logs. For interview pace, default to split and switch to regex only when the prompt has irregular formats.
Compile regex once at module scope, never inside the loop. re.compile(r"…") returns a pattern object; reusing it is much cheaper than recompiling per line. For very high-throughput log parsing, the compiled pattern plus match.groupdict() for named captures is the cleanest pattern.
Worked example. Line "2026-04-27 INFO user=42 action=login" parsed with split() → ['2026-04-27', 'INFO', 'user=42', 'action=login']. Then dict(p.split('=', 1) for p in tokens if '=' in p) → {'user': '42', 'action': 'login'}.
Worked-example solution.
def parse_line(line: str) -> dict[str, str] | None:
tokens = line.strip().split()
if len(tokens) < 4:
return None
fields = dict(t.split("=", 1) for t in tokens if "=" in t)
if "user" not in fields or "action" not in fields:
return None
return fields
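The regex alternative described above, compiled once at module scope with named captures, as a hedged sketch for the same "date level user=… action=…" format:

```python
import re

# Compiled once at import time, never inside the read loop.
LOG_PATTERN = re.compile(
    r"^(?P<date>\S+)\s+(?P<level>\S+)\s+user=(?P<user>\S+)\s+action=(?P<action>\S+)"
)

def parse_line_re(line: str) -> dict[str, str] | None:
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
```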
Aggregating with defaultdict(int)
Detailed explanation. Once parsing is isolated, aggregation is trivial. The aggregator only ever sees (key, value) pairs and increments a counter. defaultdict(int) auto-zeros missing keys; reading a missing key creates it with the default, which is occasionally surprising (so use .get(k, 0) if you only want to read without inserting).
The invariant: the aggregator should not raise on bad data, because parsing has already filtered it. If parsing returns None, the aggregator skips. This separation lets you log a count of skipped lines at the end (skipped = total − valid), which interviewers love.
Worked example. Three log lines, action counts per user.
2026-04-27 INFO user=42 action=login
2026-04-27 INFO user=99 action=login
2026-04-27 INFO user=42 action=logout
→ {'42': {'login': 1, 'logout': 1}, '99': {'login': 1}} (user IDs stay strings because the parser never casts them).
Worked-example solution.
from collections import defaultdict
def aggregate_actions(path: str) -> dict[str, dict[str, int]]:
counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
with open(path) as f:
for line in f:
fields = parse_line(line)
if fields is None:
continue
counts[fields["user"]][fields["action"]] += 1
return {u: dict(v) for u, v in counts.items()}
Common beginner mistakes
- f.read().splitlines()—loads the whole file. Fine for 1 MB, fatal for 50 GB.
- Catching every exception silently and returning {} on a malformed line, hiding real data corruption.
- Compiling a regex inside the loop (recompiling per line is slow); compile once at module scope.
- Forgetting that defaultdict inserts the default on read, so if user not in counts is always False if you check after a read.
Python interview question on log aggregation
Given a log file with lines of the form "<date> <level> user=<id> action=<verb>", count actions per user. The file is too large to fit in memory; stream it and emit a sorted summary.
Solution using defaultdict and a streaming read
from collections import defaultdict
def aggregate_actions(path: str) -> list[tuple[str, str, int]]:
counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
with open(path) as f:
for raw in f:
tokens = raw.strip().split()
fields = dict(t.split("=", 1) for t in tokens if "=" in t)
u, a = fields.get("user"), fields.get("action")
if u and a:
counts[u][a] += 1
return sorted(
((u, a, n) for u, actions in counts.items() for a, n in actions.items()),
key=lambda r: (r[0], -r[2]),
)
Why this works: Iterating f directly streams lines—memory is bounded by the size of the per-user counter map, not the file. Parsing is isolated to two dict comprehensions, so a malformed line is skipped without crashing the whole run. The final sort returns deterministic output ordered by user, then by action frequency descending. Total time is O(L + N log N) where L is line count and N is unique (user, action) pairs.
6. Pandas: clean, cast, and aggregate
Pandas patterns in Python for data engineering
A typical Google DE prompt: "Here is an order stream as a CSV. Some amount cells are blank or 'N/A'. Produce per-customer totals." That tests type casting, NaN handling, and groupby aggregation in one shot. The clean answer follows three explicit phases—clean (drop or coerce malformed rows, with a count of how many you dropped), cast (pd.to_numeric(errors='coerce') before astype), aggregate (named aggregations with explicit column names). Saying these three words out loud changes how the interviewer scores the round.
pd.to_numeric with errors='coerce' for dirty data
Detailed explanation. pd.to_numeric(series, errors=…) converts a series to numeric. The errors argument controls failure mode: "raise" (default — raises ValueError on the first bad value), "ignore" (returns the input unchanged on failure, dangerous because the column type does not change), and "coerce" (turns bad values into NaN). The right answer 95% of the time is "coerce"—it surfaces dirt without crashing.
After coercion, NaN is the universal "missing" sentinel. Use .isna() / .notna() to filter, .dropna(subset=[col]) to drop rows where a specific column is NaN, or .fillna(value) to impute. Imputing with 0 is rarely correct; it skews mean and sum. Drop or carry NaN forward.
Worked example. Mixed-type column.
| order_id | amount (raw) | after to_numeric(errors='coerce') |
|---|---|---|
| 101 | "12.5" | 12.5 |
| 102 | "N/A" | NaN |
| 103 | "abc" | NaN |
| 104 | 15 | 15.0 |
Worked-example solution.
import pandas as pd
def coerce_amount(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
return df
astype vs to_numeric
Detailed explanation. df["x"].astype(int) is the strict caster: it raises ValueError on the first row that cannot become an int. Use it once you know the column is clean. pd.to_numeric(df["x"], errors="coerce") is the tolerant caster: it produces NaN on bad rows and returns a float64 series (because NaN is a float). The pipeline order is therefore: to_numeric first to surface dirt, drop or impute NaN, astype second if you need a strict integer column for downstream serialization.
A subtle gotcha: astype(float) succeeds on "NaN" strings and parses them to NaN, while astype(int) raises. Always cast to float (or use to_numeric) before considering int.
Worked example. Same dirty data.
| step | column dtype | behavior |
|---|---|---|
| raw | object | mixed strings |
| astype(int) | — | ValueError on row 102 |
| to_numeric(errors='coerce') | float64 | NaN for rows 102, 103 |
| dropna() then astype(int) | int64 | clean integer column |
Worked-example solution.
import pandas as pd
def to_clean_int(s: pd.Series) -> pd.Series:
coerced = pd.to_numeric(s, errors="coerce")
return coerced.dropna().astype(int)
groupby().agg() with named aggregations
Detailed explanation. df.groupby(key).agg(...) is the workhorse for per-group metrics. The named-aggregation form (pd.NamedAgg or output_name=(input_col, func)) is the modern API and produces clean output column names without the multi-level index pain of older syntax. After grouping, .reset_index() flattens the group key back to a column—forget this and downstream merges break.
Multi-metric aggregations are common: "per customer, give me total spend, order count, and last order date." Encode them as separate named aggregations in one call. For custom aggregations (e.g. "fraction of orders that were paid"), pass a lambda: share=("status", lambda s: (s == "paid").mean()).
Worked example. Orders dataframe with multiple metrics per customer.
import pandas as pd
df = pd.DataFrame({
"customer_id": [1, 1, 2, 2, 3],
"amount": [12.5, None, 30.0, "N/A", 5.0],
})
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])
out = df.groupby("customer_id").agg(
total=("amount", "sum"),
orders=("amount", "count"),
).reset_index()
→
| customer_id | total | orders |
|---|---|---|
| 1 | 12.5 | 1 |
| 2 | 30.0 | 1 |
| 3 | 5.0 | 1 |
Worked-example solution.
import pandas as pd
def per_customer_metrics(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
return (
df.dropna(subset=["amount"])
.groupby("customer_id")
.agg(total=("amount", "sum"), orders=("amount", "count"))
.reset_index()
)
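The custom-lambda aggregation mentioned above ("fraction of orders that were paid"), as a sketch; it assumes a status column alongside customer_id:

```python
import pandas as pd

def paid_share(df: pd.DataFrame) -> pd.DataFrame:
    # One named aggregation per output column; the lambda computes the paid fraction.
    return (
        df.groupby("customer_id")
        .agg(
            orders=("status", "count"),
            paid_share=("status", lambda s: (s == "paid").mean()),
        )
        .reset_index()
    )
```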
Common beginner mistakes
- Calling astype(float) before cleaning—one bad row blows up the whole pipeline.
- Letting NaN propagate into mean() silently and not reporting how many rows were dropped.
- Forgetting .reset_index() after groupby and then merging on a multi-index later.
- Using df["amount"].fillna(0) blindly when the prompt clearly says "exclude bad rows."
Python interview question on pandas ETL
Given an order-stream CSV with columns customer_id, amount, status where amount may be malformed and only status == "paid" rows count, produce a dataframe with one row per customer, columns total (sum of amounts) and orders (count of paid orders), sorted by total descending.
Solution using to_numeric, filtering, and groupby().agg()
import pandas as pd
def per_customer_totals(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
paid = df[(df["status"] == "paid") & df["amount"].notna()]
out = (
paid.groupby("customer_id")
.agg(total=("amount", "sum"), orders=("amount", "count"))
.reset_index()
.sort_values("total", ascending=False, kind="stable")
)
return out
Why this works: to_numeric(errors="coerce") turns malformed amounts into NaN without raising, which the boolean mask then drops alongside non-paid rows. Named aggregation produces clean output column names so downstream consumers do not depend on tuple-style multi-level column names. kind="stable" keeps tied totals in customer-id order for deterministic output—important if downstream tests compare row order. Total complexity is O(N) for the coerce + filter and O(N log N) for the sort.
Drill SQL aggregation patterns →
7. SQL aggregation, GROUP BY and subqueries
Aggregation and subqueries in SQL for data engineering
The most common Google SQL pattern: "Find employees whose salary is above the average." Sounds trivial—and it tests WHERE vs HAVING, NULL semantics in AVG, and whether you reach for a scalar subquery or a JOIN to a pre-aggregated CTE. The shape of your answer reveals whether you understand SQL evaluation order or just wrote a query that happens to run.
The four sub-skills below show up in order of difficulty: aggregate semantics with NULL → WHERE vs HAVING → scalar subquery → correlated subquery vs CTE+JOIN. Internalize each one and the rest of SQL aggregation feels easy.
AVG, SUM, COUNT on NULL
Detailed explanation. SQL aggregate functions all share one rule: NULL is skipped, not zeroed. SUM(col) sums non-null values; if every value in the group is NULL, the result is NULL, not 0. AVG(col) divides sum-of-non-null by count-of-non-null. COUNT(*) counts rows in the group regardless of NULL; COUNT(col) counts rows where col is not NULL; COUNT(DISTINCT col) counts unique non-null values.
The interview-grade restatement: AVG is biased when NULLs are common, because the denominator changes per group. If the prompt says "average salary" and some salaries are NULL, ask whether NULL means "no salary recorded" (skip) or "zero salary" (replace with 0). The answer changes the query.
Worked example. One group, three rows, salaries 100, NULL, 200.
| aggregate | value | reasoning |
|---|---|---|
| SUM(salary) | 300 | NULL skipped |
| AVG(salary) | 150 | sum 300 / count 2 |
| COUNT(*) | 3 | all rows |
| COUNT(salary) | 2 | non-null salaries |
| COUNT(DISTINCT salary) | 2 | distinct non-null values |
Worked-example solution.
SELECT
SUM(salary) AS total,
AVG(salary) AS avg_salary,
COUNT(*) AS rows,
COUNT(salary) AS non_null_salaries,
COUNT(DISTINCT salary) AS distinct_salaries
FROM employees;
WHERE vs HAVING
Detailed explanation. The single most-confused pair in SQL. Their roles map to two different stages of evaluation:
| Step | Predicate evaluated | Operates on |
|---|---|---|
| WHERE | Per-row predicate | Source rows before grouping |
| GROUP BY | (groups rows) | Bundles rows into groups |
| Aggregates | SUM, AVG, COUNT, etc. | Compute one value per group |
| HAVING | Per-group predicate | Groups after aggregates exist |
Pitfall: writing WHERE AVG(salary) > 100000 is invalid—SQL has not computed the aggregate yet. The query rewrites to HAVING AVG(salary) > 100000. Conversely, filtering on a per-row column (like salary > 50000) belongs in WHERE, not HAVING, because it can prune rows before grouping, which is faster.
Worked example. Two departments.
| dept_id | emp_id | salary |
|---|---|---|
| A | 1 | 100 |
| A | 2 | 200 |
| B | 3 | 50 |
- WHERE salary > 75 removes employee 3 (per-row filter).
- GROUP BY dept_id then HAVING SUM(salary) > 200 keeps only dept A (sum 300); dept B drops out (its only row was already removed by the WHERE).
Worked-example solution.
SELECT dept_id, SUM(salary) AS total
FROM employees
WHERE salary > 75
GROUP BY dept_id
HAVING SUM(salary) > 200;
Scalar subquery for "above average"
Detailed explanation. A scalar subquery is a SELECT that returns exactly one row and one column. It can stand in any expression slot a constant can. The classic Google use: comparing each row to a global statistic. (SELECT AVG(salary) FROM employees) returns a single number; the outer query compares each row's salary to it.
PostgreSQL evaluates the scalar subquery once (or caches it), so the outer query is still O(N). If the subquery returns more than one row, you get cardinality violation—wrap it in MAX(...) or MIN(...) to force one value, or replace with a JOIN if you need per-group statistics.
Worked example. Five employees: salaries 60, 80, 100, 120, 140. Average = 100. "Above average" rows: 120, 140.
Worked-example solution.
SELECT employee_id, name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
Correlated subquery vs CTE alternative
Detailed explanation. A correlated subquery references the outer row and re-evaluates once per outer row. It is easy to write—WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id)—and slow at scale: for N employees in K departments, the planner often runs the subquery N times. Total cost approaches O(N × group_size).
The CTE + JOIN alternative aggregates once per group in a CTE, then joins each row to its group's aggregate exactly once. Cost is O(N + K) for grouping + O(N) for the join. On large tables this is the difference between minutes and seconds. Interviewers grade this distinction; mention it before writing.
Worked example. Same employees, but per-department average. The correlated subquery and the CTE+JOIN return the same rows, but the CTE+JOIN is one order of magnitude faster on large tables.
Worked-example solution.
WITH avg_by_dept AS (
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
)
SELECT e.employee_id, e.name, e.salary
FROM employees e
JOIN avg_by_dept a ON a.dept_id = e.dept_id
WHERE e.salary > a.avg_salary;
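For comparison, the correlated-subquery form described above returns the same rows but re-evaluates the inner SELECT once per outer row:

```sql
SELECT e.employee_id, e.name, e.salary
FROM employees e
WHERE e.salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE dept_id = e.dept_id
);
```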
Common beginner mistakes
- Filtering SUM/AVG in WHERE instead of HAVING.
- Forgetting that AVG ignores NULLs—if salary is NULL for half the rows, the average is biased.
- Using COUNT(salary) when the prompt asked "how many employees" (rows where salary IS NULL get dropped).
- Letting a correlated subquery run N×K when a JOIN to a grouped aggregate runs in O(N + K).
SQL interview question on aggregation and subqueries
Table employees(employee_id, name, dept_id, salary). Return all employees whose salary is strictly greater than the average salary in their own department, sorted by dept_id, then by salary descending.
Solution using a CTE and a single join
WITH avg_by_dept AS (
SELECT dept_id, AVG(salary) AS dept_avg
FROM employees
GROUP BY dept_id
)
SELECT e.employee_id, e.name, e.dept_id, e.salary
FROM employees e
JOIN avg_by_dept a ON a.dept_id = e.dept_id
WHERE e.salary > a.dept_avg
ORDER BY e.dept_id, e.salary DESC;
Why this works: Aggregation runs once per department in the CTE—N rows in, K groups out, where K ≪ N. The single JOIN then attaches each employee to their department's average in one pass, beating the correlated subquery's N×K runtime. Strict > excludes employees who exactly match the average, matching the prompt's wording. The ORDER BY ties the output to a deterministic shape so downstream tests compare row sequences reliably.
8. SQL self-joins for reporting hierarchies
Self-joins in SQL for data engineering
The classic Google self-join question: "Print every employee with their manager's name." It uses one table aliased twice—employees on the left, managers on the right—joined on manager_id. Two pitfalls trip people up: the CEO has no manager (so INNER JOIN drops them), and column ambiguity without explicit aliases. Both are interview-grade signal: a candidate who handles them without prompting is a level above one who needs a hint.
The mental model: a self-join is a logical second copy of the same physical table. The database does not duplicate data; the query planner just allows two row-pointer roles for the same rows. Aliases (e and m) make the roles explicit. Without aliases, SELECT name FROM employees JOIN employees ON … raises ambiguous column reference.
Self-join basics: aliasing the same table twice
Detailed explanation. The JOIN syntax requires two row sources. To self-join, alias the same table twice in the FROM clause, treating each alias as a logical copy. Decide which side represents which role—conventionally e for employee, m for manager. The join condition links them: m.employee_id = e.manager_id reads as "the manager's employee_id equals the employee's manager_id."
The invariant: the join direction is asymmetric. e.manager_id = m.employee_id (employee's manager_id matches a manager's employee_id) is correct. Reverse it and you get nonsense pairs. Type the direction out loud as a sentence before writing the SQL.
Worked example. Four employees in a flat hierarchy.
| employee_id | name | manager_id |
|---|---|---|
| 1 | Alice (CEO) | NULL |
| 2 | Bob | 1 |
| 3 | Carol | 2 |
| 4 | Dan | 2 |
INNER JOIN on m.employee_id = e.manager_id yields:
| employee | manager |
|---|---|
| Bob | Alice |
| Carol | Bob |
| Dan | Bob |
Alice is missing because her manager_id is NULL.
Worked-example solution.
SELECT e.name AS employee, m.name AS manager
FROM employees e
JOIN employees m ON m.employee_id = e.manager_id;
INNER vs LEFT self-join (CEO with no manager)
Detailed explanation. INNER JOIN returns rows where the join condition matches on both sides. For employees with manager_id IS NULL, no manager row matches, so they drop out—including the CEO. LEFT JOIN keeps every left row; columns from the right side become NULL when there is no match. The CEO appears with manager = NULL.
The choice depends on the prompt. "List every direct-report relationship" → INNER JOIN (the CEO has no relationship to list). "List every employee with their manager (CEO has no manager)" → LEFT JOIN. Read the prompt twice; the verb often signals which one.
Worked example. Same four-row table, LEFT JOIN.
| employee | manager |
|---|---|
| Alice | NULL |
| Bob | Alice |
| Carol | Bob |
| Dan | Bob |
Now Alice appears with NULL.
Worked-example solution.
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id;
Projecting both names + chain depth = 1
Detailed explanation. A self-join with one hop is a direct-report report: depth = 1. To go beyond one hop (employee → manager → grand-manager), you need either a second self-join for each additional level (verbose, hard to scale) or a recursive CTE (next section). Most Google interviewers escalate from one self-join to a recursive CTE in the same conversation—it is the natural follow-up.
The invariant: a single self-join expresses exactly one parent edge per row. Two parent edges per row need two self-joins. N parent edges need N self-joins or one recursion.
Worked example. Adding the manager's manager (grand-manager).
SELECT
e.name AS employee,
m.name AS manager,
gm.name AS grand_manager
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id
LEFT JOIN employees gm ON gm.employee_id = m.manager_id;
For Carol → Bob → Alice → (no grand): row is ('Carol', 'Bob', 'Alice').
Worked-example solution. Same as above; for chains deeper than 2, prefer a recursive CTE (§9).
Common beginner mistakes
- Forgetting to alias—SELECT name FROM employees JOIN employees ON … is ambiguous.
- Using INNER JOIN and then panicking that the CEO is missing.
- Joining on e.employee_id = m.manager_id (reversed direction) and getting nonsense pairs.
- Pulling all columns with SELECT * so the manager's manager_id and the employee's manager_id collide silently.
SQL interview question on self-joins
Table employees(employee_id, name, manager_id). Return one row per employee with columns employee and manager (the manager's name). The CEO must appear with manager = NULL. Order by employee.
Solution using a LEFT JOIN with two aliases
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id
ORDER BY e.name;
Why this works: The LEFT JOIN keeps all employees—including the CEO whose manager_id is NULL. Aliasing e (employee) and m (manager) makes the join direction explicit and removes column ambiguity. The single hop m.employee_id = e.manager_id produces exactly one row per employee. The ORDER BY e.name enforces deterministic output.
9. Recursive CTEs for multi-level hierarchies
Recursive CTEs in SQL for data engineering
When the question becomes "show every ancestor in the chain, not just the direct manager," a single self-join no longer suffices. You need a recursive CTE: one query that references itself to walk the tree until termination. Google's harder DE SQL question (#613 on the practice set) is exactly this—skip-level reporting: for every employee, list every ancestor up to the CEO.
The mental model: a recursive CTE has two members joined by UNION ALL. The anchor is the base case (one row per starting node). The recursive member joins the CTE's current contents to the base table to walk one more hop. Each iteration adds new rows; the recursion stops when the recursive member returns zero rows. PostgreSQL handles termination automatically when there are no cycles; with cycles, you must add a depth cap or path tracking.
Recursive CTE anatomy: anchor + recursive UNION ALL
Detailed explanation. A recursive CTE has the shape:
WITH RECURSIVE name AS (
-- anchor: base case
SELECT … FROM source WHERE base_predicate
UNION ALL
-- recursive: one hop further
SELECT … FROM source s JOIN name n ON s.parent = n.id
)
SELECT … FROM name;
Three rules. First, the keyword RECURSIVE is required in PostgreSQL (some dialects omit it; do not rely on that). Second, prefer UNION ALL—UNION also works in Postgres, but it deduplicates the accumulated result set on every iteration, which adds cost (although the dedup can help cyclic data terminate). Third, the anchor and recursive members must have the same column count and compatible types.
The invariant: the recursive member adds rows that the anchor's invariant lets it generate. If the anchor returns "depth-1 ancestors," the recursive member must return "depth-N ancestors given depth-(N-1) rows." Whatever shape the anchor produces, the recursive member must extend.
Worked example. Walk from CEO down to all employees, tracking depth.
Worked-example solution.
WITH RECURSIVE chain AS (
SELECT employee_id, name, manager_id, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.name, e.manager_id, c.level + 1
FROM employees e
JOIN chain c ON e.manager_id = c.employee_id
)
SELECT * FROM chain ORDER BY level, name;
Tracking depth with level + 1
Detailed explanation. Carry an integer column that increments at each recursive step. The anchor sets 1 (or 0, depending on convention); the recursive member adds c.level + 1. This level column is your loop counter—it tells you how many hops from the start each row is. It is also your safety net: capping level < 50 in the recursive WHERE prevents infinite recursion on cyclic data.
The invariant: level is monotonically non-decreasing across iterations. The first iteration has level = 1, the second level = 2, and so on. If you see a row with level = 1 after the first iteration, your recursive join condition is wrong.
Worked example. 4-level chain Alice → Bob → Carol → Dan, walking down from CEO.
| name | level |
|---|---|
| Alice | 1 |
| Bob | 2 |
| Carol | 3 |
| Dan | 4 |
Each row's level equals its depth from the CEO.
Worked-example solution.
WITH RECURSIVE chain AS (
SELECT employee_id, name, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.name, c.level + 1
FROM employees e
JOIN chain c ON e.manager_id = c.employee_id
)
SELECT * FROM chain ORDER BY level, name;
Termination and infinite-loop guards
Detailed explanation. PostgreSQL terminates a recursive CTE automatically when the recursive member returns zero rows. For acyclic trees (a typical org chart) this happens naturally—when you reach the leaf level, no employee has a parent in the current chain set, so the next iteration is empty and recursion stops. For cyclic data (employee A reports to B, B reports to A), the recursion never terminates without a guard.
Two guards. Cheap: cap depth in the recursive member with WHERE c.level < 50. The query terminates after at most 50 hops. Expensive but correct: track the visited path as an array column and check membership before recursing—WHERE NOT (e.employee_id = ANY(c.path)). Use the cheap guard for interview pace; mention the path-tracking version as the production-grade fix.
Worked example. Suppose manager_id is corrupted: Bob → Alice → Bob (cycle). Without a guard, recursion runs forever. With WHERE c.level < 50, the query returns 50 rows (all in the cycle) and stops, signaling the issue.
Worked-example solution.
WITH RECURSIVE chain AS (
SELECT employee_id, name, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.name, c.level + 1
FROM employees e
JOIN chain c ON e.manager_id = c.employee_id
WHERE c.level < 50
)
SELECT * FROM chain;
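The path-tracking guard mentioned above, as a sketch in PostgreSQL array syntax; it refuses to revisit any employee already on the current path:

```sql
WITH RECURSIVE chain AS (
    SELECT employee_id, name, 1 AS level,
           ARRAY[employee_id] AS path
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.name, c.level + 1,
           c.path || e.employee_id
    FROM employees e
    JOIN chain c ON e.manager_id = c.employee_id
    WHERE NOT (e.employee_id = ANY(c.path))  -- skip nodes already visited on this path
)
SELECT * FROM chain ORDER BY level, name;
```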
Common beginner mistakes
- Using UNION instead of UNION ALL without realizing it deduplicates the result set on every iteration (different output, extra cost).
- Forgetting the RECURSIVE keyword—Postgres requires it explicitly.
- Mismatched column count or types between anchor and recursive members.
- No depth cap in cyclic data → query hangs.
- Joining the recursive member back to itself in the wrong direction (c.manager_id = e.employee_id instead of e.manager_id = c.employee_id).
SQL interview question on recursive CTEs
Table employees(employee_id, name, manager_id). Return every (employee, ancestor, depth) triple, where ancestor is any manager up the chain (direct manager has depth 1, their manager has depth 2, and so on). Order by employee_id, then depth.
Solution using a recursive CTE
WITH RECURSIVE chain AS (
SELECT employee_id, manager_id, 1 AS depth
FROM employees
WHERE manager_id IS NOT NULL
UNION ALL
SELECT c.employee_id, e.manager_id, c.depth + 1
FROM chain c
JOIN employees e ON e.employee_id = c.manager_id
WHERE e.manager_id IS NOT NULL
)
SELECT employee_id, manager_id AS ancestor_id, depth
FROM chain
ORDER BY employee_id, depth;
Why this works: The anchor member emits each (employee, direct_manager, 1) pair, so the CTE starts populated for every non-CEO employee. The recursive member walks up one ancestor per iteration by joining chain.manager_id to employees.employee_id and incrementing depth; it filters out the CEO with WHERE e.manager_id IS NOT NULL so we do not try to recurse from a row whose ancestor is null. Termination is automatic: when the next ancestor's manager_id is NULL (we have reached the CEO), the join produces zero rows and recursion stops.
Optimized solution using a depth guard
For deep or potentially cyclic data, add AND c.depth < 50 to the recursive member's WHERE. The query now runs in bounded time even if the data has corrupted manager pointers, and you can flag rows where depth = 50 for investigation. This single line is the difference between a query that hangs production and one that fails fast.
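A sketch of the interview query above with that guard applied:

```sql
WITH RECURSIVE chain AS (
    SELECT employee_id, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NOT NULL
    UNION ALL
    SELECT c.employee_id, e.manager_id, c.depth + 1
    FROM chain c
    JOIN employees e ON e.employee_id = c.manager_id
    WHERE e.manager_id IS NOT NULL
      AND c.depth < 50  -- bounds recursion even if manager pointers form a cycle
)
SELECT employee_id, manager_id AS ancestor_id, depth
FROM chain
ORDER BY employee_id, depth;
```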
Tips to crack Google data engineering interviews
These are habits that move the needle in real Google loops—not a re-statement of the topics above.
SQL preparation: drill the Google patterns
Spend most of your SQL prep on GROUP BY + subquery (above-average filters), self-joins (employee/manager), and recursive CTEs (skip-level reporting). Type the queries; do not just read them. The aggregation topic page, self-join topic page, and CTE topic page cover the three patterns directly.
Python preparation: stdlib fluency over framework knowledge
Google interviewers care more about clean Counter / defaultdict use than whether you know the latest pandas API. Prioritize: collections, functools.wraps, itertools.groupby, context managers for file I/O. Pull problems from the hash table topic page and the file I/O topic page.
Data pipeline thinking: clean → cast → aggregate
When the prompt says "here is a CSV," walk through three explicit phases: clean (drop or coerce malformed rows, with a count of how many you dropped), cast (pd.to_numeric(errors='coerce') before astype), aggregate (named aggregations with explicit column names). Saying these three words out loud changes how the interviewer scores the round.
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Google practice set | /explore/practice/company/google |
| SQL aggregation + subquery | /explore/practice/topic/aggregation |
| SQL self-join | /explore/practice/topic/self-join |
| SQL recursive CTE | /explore/practice/topic/cte |
| Hash table + frequency | /explore/practice/topic/hash-table |
| File I/O + log parsing | /explore/practice/topic/file-io |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
State assumptions before typing: "I'll assume manager_id is null only for the CEO and that there are no cycles." State grain: "One row per employee per ancestor." State edge cases: "If the recursion has a cycle, my depth guard caps at 50." Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Google data engineering interview process like?
The Google data engineering interview typically includes a phone screen (math + Python warm-up), one or two coding rounds focused on SQL and Python data manipulation, a system-design conversation around pipelines and warehousing, and behavioral interviews. The curated 14-problem Google practice set on PipeCode mirrors what you will see on the technical rounds.
What SQL topics does Google test for data engineers?
Google emphasizes GROUP BY with subqueries (above-average patterns), self-joins for reporting hierarchies, and recursive CTEs for skip-level reporting. Window functions and date arithmetic show up in onsite rounds. Drill these on the aggregation, self-join, and CTE topic pages before your screen.
How important is Python for a Google data engineering interview?
Python is roughly half the interview at Google—expect frequency counting with Counter, file/log parsing, decorator stacks for retry+log, and pandas for clean-cast-aggregate. Stdlib fluency (collections, functools, itertools) matters more than knowing the latest framework.
Are there pandas questions in the Google data engineering interview?
Yes—particularly in the data-manipulation round. Expect a dirty CSV in, per-customer or per-key aggregates out, with pd.to_numeric(errors='coerce') to handle malformed rows. The type casting and aggregation practice topics cover these patterns.
How should I prepare for Google's recursive CTE and self-join questions?
Build the mental model first: a self-join is one hop; a recursive CTE walks the tree. Type out the anchor + recursive UNION ALL shape from memory ten times before your interview. Add a depth guard for cyclic data. The CTE and self-join topic pages have problems matching the Google patterns directly.
How many Google practice problems should I solve before the interview?
Aim for 30–50 problems spanning all nine topic clusters above—not 100 of the same kind. Solve every problem in the Google-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.
Start practicing Google data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Google problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Google asks—without the noise of generic algorithm prep.
Pipecode.ai is Leetcode for Data Engineering.



Top comments (0)