Techpath data engineering interview questions are the cleanest fundamentals + production-pattern Python loop you'll see in a DE interview. Expect mostly straightforward Python: list-as-queue simulations, tuples and indexing for structured records, aggregation over tuples, hash-table set operations for inventory reconciliation, conditionals and comparison operators for priority classification, nested conditionals with membership testing for routing, functions with tuple-based return values, and CSV processing + exception handling + file I/O + log aggregation for production-flavored data loading. No SQL, no graph algorithms, no dynamic programming—just Python idioms and production patterns.
This guide walks through the eight topic clusters Techpath actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 9-problem Techpath set (3 easy, 6 medium, 0 hard)—the most fundamentals-friendly company hub covered so far, with no Hard tier. If you're early in your DE prep journey, this is the right hub to start with.
Top Techpath data engineering interview topics
From the Techpath data engineering practice set, the eight numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–8) | Why it shows up at Techpath |
|---|---|---|
| 1 | Lists and queue simulation in Python | Order Queue Manager—FIFO simulation with list or collections.deque. |
| 2 | Tuples and indexing for structured records in Python | Get Order Details—pack record fields into tuples, index by primary key. |
| 3 | Aggregation and data analysis on tuples in Python | Analyze Order Batch—sum, min, max, group-by-key over a list of tuples. |
| 4 | Hash tables for set operations: intersection and difference in Python | Inventory Reconciler—set intersection (&), difference (-), symmetric difference (^). |
| 5 | Conditionals and comparison operators in Python | Order Priority Classifier—chained comparisons, if/elif/else decision trees. |
| 6 | Conditionals with membership testing for routing in Python | Order Router—nested ifs with in operator on set for O(1) routing. |
| 7 | Functions, returns, and tuple-based outputs in Python | Order Summary Generator—multiple return values via tuple unpacking. |
| 8 | CSV processing, error handling, file I/O, and log aggregation in Python | Fault-Tolerant Data Loader + Log File Aggregator—production patterns for parsing dirty CSVs and aggregating log lines. |
Fundamentals + production framing: Techpath's prompts dress general Python idioms in operational data—orders, inventory, logs. The interviewer is grading whether you reach for the right Python primitive on each prompt: `deque` for FIFO, `tuple` for record packing, `set` for membership, `defaultdict` for counting, `try`/`except` for fault tolerance, generator-based file iteration for streaming. State the primitive choice out loud before coding.
1. Lists and Queue Simulation in Python
Lists as queues for order simulation in Python for data engineering
The Python list is the right primitive for stacks (append and pop from the right are O(1)) but the wrong primitive for queues: list.pop(0) and list.insert(0, x) are O(n) because every element shifts. For FIFO order processing, reach for collections.deque—O(1) on both ends—and state the choice out loud.
Pro tip: "I'll use `collections.deque` because `list.pop(0)` is O(n) and we need O(1) dequeue" earns immediate credit. The list-as-queue antipattern is the most-graded Python performance bug in DE screens.
list.append / list.pop(0) vs collections.deque
list is a dynamic array (contiguous memory); left-end inserts/pops are O(n). deque is a doubly-linked list of memory blocks; both ends are O(1). The decision rule: list for stacks (LIFO), deque for queues (FIFO).
- `list.append` / `list.pop()` — O(1) right end, the stack workload.
- `list.insert(0, x)` / `list.pop(0)` — O(n), the queue antipattern.
- `deque.append` / `deque.popleft` — O(1) both ends, the FIFO workload.
- `queue.Queue` — thread-safe but slower; use only with concurrent producers/consumers.
Worked example: operation cost comparison.
| operation | `list` | `deque` |
|---|---|---|
| `append(x)` | O(1) | O(1) |
| `pop()` (right) | O(1) | O(1) |
| `insert(0, x)` / `appendleft(x)` | O(n) | O(1) |
| `pop(0)` / `popleft()` | O(n) | O(1) |
| `dq[i]` (random index) | O(1) | O(n) |
```python
from collections import deque

orders: deque = deque()
orders.append("O1")       # enqueue right — O(1)
orders.append("O2")
first = orders.popleft()  # dequeue left — O(1) → 'O1'
```
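To make the asymptotic gap concrete, here is a minimal timing sketch using only the stdlib's `timeit`. The helper names and the absolute numbers are illustrative; the ratio widens as `n` grows:

```python
from collections import deque
import timeit

def drain_list(n: int) -> None:
    q = list(range(n))
    while q:
        q.pop(0)  # O(n) per pop: quadratic total

def drain_deque(n: int) -> None:
    q = deque(range(n))
    while q:
        q.popleft()  # O(1) per pop: linear total

# Absolute numbers vary by machine; the ratio widens as n grows.
print("list :", timeit.timeit(lambda: drain_list(10_000), number=10))
print("deque:", timeit.timeit(lambda: drain_deque(10_000), number=10))
```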
Simulating an order queue with FIFO semantics
A FIFO queue exposes enqueue (push back), dequeue (pop front), and peek (front without removing). The invariant: insertion order is preserved—first in, first out.
- Raise `IndexError` on empty — matches Python's sequence convention; never silently return `None`.
- Implement `__len__` — so `if queue:` truthiness works correctly.
- Leading underscore on internal deque — signals "private," callers shouldn't reach in.
Worked example: push three orders, peek, dequeue twice.
| op | queue state | return |
|---|---|---|
| `enqueue('O1')` | `[O1]` | — |
| `enqueue('O2')` | `[O1, O2]` | — |
| `enqueue('O3')` | `[O1, O2, O3]` | — |
| `peek()` | `[O1, O2, O3]` | `'O1'` |
| `dequeue()` | `[O2, O3]` | `'O1'` |
| `dequeue()` | `[O3]` | `'O2'` |
```python
from collections import deque

class OrderQueue:
    def __init__(self):
        self._q: deque = deque()

    def enqueue(self, order):
        self._q.append(order)

    def dequeue(self):
        if not self._q:
            raise IndexError("dequeue from empty queue")
        return self._q.popleft()

    def peek(self):
        if not self._q:
            raise IndexError("peek on empty queue")
        return self._q[0]

    def __len__(self):
        return len(self._q)
```
State updates and bounded queues
Real order queues need bounds—either silent eviction or explicit rejection.
- `deque(maxlen=K)` — auto-evicts the oldest on overflow; right for sliding-window queues.
- Manual capacity check — raise or return `False` on overflow; right for backpressure where orders are valuable.
Techpath framing usually wants explicit rejection—orders shouldn't silently disappear.
```python
from collections import deque

class BoundedOrderQueue:
    def __init__(self, capacity: int):
        self._q: deque = deque()
        self._cap = capacity

    def enqueue(self, order) -> bool:
        if len(self._q) >= self._cap:
            return False  # rejected, caller must retry
        self._q.append(order)
        return True
```
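For contrast, a minimal sketch of the `deque(maxlen=K)` eviction behavior described in the bullets above:

```python
from collections import deque

window = deque(maxlen=3)
for order in ["O1", "O2", "O3", "O4"]:
    window.append(order)  # once full, the oldest entry is evicted silently
print(window)             # deque(['O2', 'O3', 'O4'], maxlen=3)
```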
Common beginner mistakes
- Using `list.pop(0)` for dequeue—O(n) per operation, painfully slow on large queues.
- Returning `None` on empty dequeue instead of raising—silently masks "queue underflow" bugs.
- Forgetting `__len__`—without it, `if queue:` doesn't truthiness-check correctly.
- Using `queue.Queue` for single-threaded code—thread-safety overhead with no benefit.
- Setting `maxlen` when the prompt requires explicit rejection—silent eviction loses orders.
Practice list and array problems →
Python interview question on lists and queue simulation
Implement an OrderQueue class with enqueue(order), dequeue(), peek(), and __len__() methods. All operations must be O(1). Empty dequeue and peek should raise IndexError.
Solution using collections.deque
```python
from collections import deque

class OrderQueue:
    def __init__(self):
        self._q: deque = deque()

    def enqueue(self, order):
        self._q.append(order)

    def dequeue(self):
        if not self._q:
            raise IndexError("dequeue from empty queue")
        return self._q.popleft()

    def peek(self):
        if not self._q:
            raise IndexError("peek on empty queue")
        return self._q[0]

    def __len__(self):
        return len(self._q)
```
Why this works: collections.deque provides O(1) append (enqueue at back) and O(1) popleft (dequeue from front)—exactly the FIFO contract we need, with no O(n) shifts. dq[0] for peek is also O(1) because the deque exposes its leftmost block directly. The IndexError on empty matches Python's convention for sequence operations and signals "queue underflow" loudly. __len__ lets if queue: work correctly, and exposes size to callers without leaking the internal deque.
2. Tuples and Indexing for Structured Records in Python
Tuples and indexing for structured records in Python for data engineering
A tuple is Python's lightweight record: immutable, fixed-length, smallest memory, hashable (so it can be a dict key). The downside is integer-indexed access (order[0] is opaque); the upgrade is collections.namedtuple, which keeps tuple performance and adds attribute access. For "build once, query many" workloads, layer a dict index on top so lookups are O(1).
Pro tip: State the uniqueness of the primary key out loud. `{o.order_id: o for o in orders}` silently drops duplicates—use `defaultdict(list)` if the key isn't unique.
Tuples vs lists vs dataclasses
Four contenders for representing a record, with a clear cost ladder.
- `tuple` — immutable, smallest, hashable. Fastest construction; integer index access.
- `list` — mutable, not hashable, slightly larger. Same indexing as tuple.
- `namedtuple` — tuple performance + attribute access (`order.id`). Best balance for DE records.
- `@dataclass` — class with named fields, auto `__init__`/`__repr__`/`__eq__`; mutable by default. Slowest construction; clearest for ergonomic code.
Decision rule: `tuple` for hot loops, `namedtuple` for hot loops that still need readable field access, `dataclass` for ergonomic code with methods.
| representation | construction | access | size |
|---|---|---|---|
| tuple | `(1, "Alice", 99.99, "paid")` | `order[0]` | smallest |
| namedtuple | `Order(1, "Alice", 99.99, "paid")` | `order.id` | same as tuple |
| list | `[1, "Alice", 99.99, "paid"]` | `order[0]` | slightly larger |
| dataclass | `Order(id=1, customer="Alice", amount=99.99, status="paid")` | `order.id` | larger (dict-backed) |
```python
from collections import namedtuple

Order = namedtuple("Order", ["order_id", "customer", "amount", "status"])
o = Order(1, "Alice", 99.99, "paid")
print(o.order_id, o.customer)  # 1 Alice
```
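For completeness, a sketch of the table's `@dataclass` row, with field names mirroring the namedtuple above (an illustration, not from the Techpath set):

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: int
    customer: str
    amount: float
    status: str

o = Order(1, "Alice", 99.99, "paid")
o.status = "shipped"  # mutable by default, unlike tuple/namedtuple
print(o)              # auto __repr__: Order(order_id=1, customer='Alice', ...)
```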
Tuple unpacking and named indexing via namedtuple
Tuples support unpacking (a, b, c = tup), which exposes named locals in one line. namedtuple adds attribute access on top, plus _replace(field=value) for copy-with-change and _asdict() for serialization.
- Unpacking — fastest, cleanest for one-time expansion of fields.
- `namedtuple` attribute access — about 2× slower than raw indexing in CPython, negligible for most code.
- `_replace` — required for "modify" since tuples are immutable.
order = (1, "Alice", 99.99, "paid")
order_id, customer, amount, status = order # unpacking
from collections import namedtuple
Order = namedtuple("Order", ["order_id", "customer", "amount", "status"])
no = Order(*order) # construct from tuple
print(no.customer, no.amount) # named attribute access
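A short, self-contained sketch of `_replace` and `_asdict` from the bullets above:

```python
from collections import namedtuple

Order = namedtuple("Order", ["order_id", "customer", "amount", "status"])
o = Order(1, "Alice", 99.99, "paid")

updated = o._replace(status="shipped")  # copy-with-change; o is untouched
print(o.status, updated.status)         # paid shipped
print(updated._asdict())                # plain dict, ready for serialization
```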
Indexing a list of tuples by primary key
For "build once, query many" workloads, build a dict index in one O(n) pass; subsequent queries are O(1). For 1M records and 100K queries, this is ~1.1M ops vs ~10¹¹ for linear scan—five orders of magnitude. If the key isn't unique, use defaultdict(list) to avoid silent overwrite.
Worked example: index 3 orders by order_id.
| input | dict index |
|---|---|
[(1, "A"), (2, "B"), (3, "C")] |
{1: (1, "A"), 2: (2, "B"), 3: (3, "C")} |
Query idx[2] → (2, "B") in O(1).
```python
def build_index(orders: list[tuple]) -> dict:
    return {o[0]: o for o in orders}  # assumes unique primary key

def get_order(idx: dict, order_id: int) -> tuple | None:
    return idx.get(order_id)
```
Common beginner mistakes
- Using a list and linear scan when a dict index gives O(1) lookup at small construction cost.
- Building a dict index from a list with non-unique keys—silent overwrite of duplicates.
- Reaching for `dataclass` when `namedtuple` is sufficient and faster.
- Using integer indexing throughout the body of a long function—unreadable, error-prone on field reorderings.
- Forgetting that tuples are immutable—`o[0] = 99` raises `TypeError`.
Python interview question on tuples and indexing
Given a list of (order_id, customer, amount, status) tuples, build a structure that, for any order_id, returns the matching tuple in O(1) time. Handle the case where the same order_id appears in multiple records (return all matches as a list).
Solution using defaultdict(list) keyed by order_id
```python
from collections import defaultdict

def build_multi_index(orders: list[tuple]) -> dict[int, list[tuple]]:
    idx: dict[int, list[tuple]] = defaultdict(list)
    for o in orders:
        idx[o[0]].append(o)
    return dict(idx)

def get_orders_by_id(idx: dict[int, list[tuple]], order_id: int) -> list[tuple]:
    return idx.get(order_id, [])
```
Why this works: defaultdict(list) auto-creates an empty list on first access for each order_id, so idx[o[0]].append(o) works without an explicit "if key not in" guard. Construction is O(n) for the n input tuples. Each query is O(1) average for the dict lookup, plus O(k) to return the k matching records. Returning [] for missing keys (instead of raising KeyError) gives the caller a uniform iteration API. Total: O(n) build, O(1 + k) query.
3. Aggregation and Data Analysis on Tuples in Python
Aggregation on tuples in Python for data engineering
Aggregation problems split into two shapes: flat aggregates (one number across all rows—sum, mean, min, max) and grouped aggregates (one number per group—defaultdict or Counter). Build the flat one-liner first; then nest with defaultdict for the grouped version.
Pro tip: Generator expressions (`sum(o[2] for o in orders)`) are lazy—O(1) memory regardless of input size. Reach for `Counter` for frequency aggregates and `defaultdict(int|float)` for per-key totals.
sum, min, max, statistics.mean over tuple fields
The built-in flat aggregates are sum, min, max, len, plus statistics.mean / median / stdev. All accept any iterable, so a generator expression that pulls one field per tuple keeps memory O(1).
- `sum(gen)` — the start value defaults to integer `0`; pass a typed start (`Decimal(0)`) when summing non-int types.
- `statistics.mean(gen)` — raises `StatisticsError` on empty input; guard with `if orders:`.
- `min`/`max` — single-pass, O(n).
Worked example: aggregates over 4 orders.
| order_id | amount |
|---|---|
| 1 | 10.0 |
| 2 | 20.0 |
| 3 | 30.0 |
| 4 | 40.0 |
| metric | value |
|---|---|
| `sum` | 100.0 |
| `min` | 10.0 |
| `max` | 40.0 |
| `mean` | 25.0 |
| count | 4 |
```python
import statistics

def order_stats(orders: list[tuple]) -> dict:
    return {
        "total": sum(o[2] for o in orders),
        "min": min(o[2] for o in orders),
        "max": max(o[2] for o in orders),
        "mean": statistics.mean(o[2] for o in orders),
        "count": len(orders),
    }
```
Group-by-key aggregation with defaultdict(list) then reduce
For per-group aggregates ("total per customer," "count per status"), two patterns: two-pass (group with defaultdict(list), then reduce each group) or single-pass (running aggregate folded into one loop).
- Two-pass — O(N) extra memory for intermediate lists; cleaner for multi-metric outputs.
- Single-pass — O(K) memory for K groups; preferred for very large N.
Worked example: per-status totals over 4 orders → {paid: 70, pending: 5}.
| order_id | status | amount |
|---|---|---|
| 1 | paid | 10 |
| 2 | paid | 20 |
| 3 | pending | 5 |
| 4 | paid | 40 |
```python
from collections import defaultdict

def total_per_status(orders: list[tuple]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for _, _, amount, status in orders:
        totals[status] += amount
    return dict(totals)
```
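For comparison, a sketch of the two-pass shape (the name `status_metrics_two_pass` is illustrative), showing why it is cleaner when you need several metrics per group:

```python
from collections import defaultdict

def status_metrics_two_pass(orders: list[tuple]) -> dict[str, dict]:
    # Pass 1: group raw amounts per status.
    groups: dict[str, list[float]] = defaultdict(list)
    for _, _, amount, status in orders:
        groups[status].append(amount)
    # Pass 2: reduce each group; extra metrics come almost for free.
    return {
        status: {"total": sum(a), "min": min(a), "max": max(a)}
        for status, a in groups.items()
    }
```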
itertools.groupby for sorted group-by
itertools.groupby only groups consecutive equal keys, so unsorted input produces fragmented groups—sort first, then groupby. Cost: O(N log N) sort + O(N) pass. Slower than defaultdict(list) but emits groups lazily, so it works on memory-constrained or already-sorted streams.
- Use `groupby` — input already sorted (e.g. database `ORDER BY`), memory-constrained, or processing in sorted-key order.
- Use `defaultdict(list)` — unsorted input that fits in memory; O(N) total.
Worked example: sorted input groups cleanly.
| order_id | status |
|---|---|
| 1 | paid |
| 2 | paid |
| 3 | pending |
| 4 | shipped |
`groupby(orders, key=lambda o: o[1])` → `[("paid", [1, 2]), ("pending", [3]), ("shipped", [4])]`.
```python
from itertools import groupby

def groups_by_status(orders: list[tuple]) -> dict[str, list[tuple]]:
    sorted_orders = sorted(orders, key=lambda o: o[3])
    return {
        status: list(group)
        for status, group in groupby(sorted_orders, key=lambda o: o[3])
    }
```
Common beginner mistakes
- Materializing intermediate lists (`list(generator)`) when `sum`/`min`/`max` already consume iterables lazily.
- Using `itertools.groupby` on unsorted input—silent fragmented groups.
- Mixing `Decimal` and `float` in `sum`—`TypeError` (`Decimal` mixes fine with `int`, not with `float`).
- Using a regular `dict` with `+= 1` and crashing on the first unseen key—use `defaultdict(int)` or `Counter`.
- Computing each metric in a separate pass (4 passes for total + min + max + mean) when one pass with running aggregates is enough.
Python interview question on aggregation over tuples
Given a list of (order_id, customer, amount, status) tuples, return a dictionary with: total_amount, mean_amount, count, count_by_status (per-status count), and total_by_customer (per-customer total).
Solution using a single pass with defaultdict and Counter
```python
from collections import Counter, defaultdict
import statistics

def analyze_batch(orders: list[tuple]) -> dict:
    if not orders:
        return {"total_amount": 0, "mean_amount": 0, "count": 0,
                "count_by_status": {}, "total_by_customer": {}}
    amounts = [o[2] for o in orders]
    count_by_status: Counter = Counter(o[3] for o in orders)
    total_by_customer: dict[str, float] = defaultdict(float)
    for _, customer, amount, _ in orders:
        total_by_customer[customer] += amount
    return {
        "total_amount": sum(amounts),
        "mean_amount": statistics.mean(amounts),
        "count": len(orders),
        "count_by_status": dict(count_by_status),
        "total_by_customer": dict(total_by_customer),
    }
```
Why this works: Building amounts once and reusing it for sum, mean, and len avoids three separate passes over orders. Counter is the canonical one-pass frequency aggregate. defaultdict(float) builds per-customer totals in a single pass without explicit key initialization. The empty-input guard returns a structurally consistent result so downstream code doesn't need to special-case empty batches. Total cost: O(N) time, O(N + K) memory where K is unique statuses + customers.
4. Hash Tables for Set Operations: Intersection and Difference in Python
Set operations for inventory reconciliation in Python for data engineering
"What's the difference between two collections" is set algebra: convert each list to a set, then use & (intersection), - (difference), | (union), ^ (symmetric difference). Total cost is O(N + M)—the conversion pays for itself the moment you avoid one nested loop. For 100K items, this beats nested-loop O(N*M) by five orders of magnitude.
Pro tip: Whenever you reach for `if item in list:` inside a loop over another list, stop—convert to `set` instead. Membership testing on `set` is O(1); on `list` it's O(N).
set constructor + &, -, |
set(iterable) deduplicates into a hash table. Elements must be hashable (tuples of immutables work; lists and dicts don't). Operation costs:
- `a & b` — intersection, O(min(|a|, |b|)).
- `a | b` — union, O(|a| + |b|).
- `a - b` — difference, O(|a|).
- `a ^ b` — symmetric difference, O(|a| + |b|).
- `x in a` — membership, O(1) average.
Worked example: two SKU lists.
| set | contents |
|---|---|
| `warehouse` | {SKU-101, SKU-102, SKU-200} |
| `store` | {SKU-200, SKU-301} |

| operation | result |
|---|---|
| `warehouse & store` | {SKU-200} (in both) |
| `warehouse - store` | {SKU-101, SKU-102} (only in warehouse) |
| `store - warehouse` | {SKU-301} (only in store) |
| `warehouse ^ store` | {SKU-101, SKU-102, SKU-301} (mismatched) |
```python
def reconcile(warehouse: list[str], store: list[str]) -> dict:
    w, s = set(warehouse), set(store)
    return {
        "in_both": w & s,
        "only_warehouse": w - s,
        "only_store": s - w,
        "mismatched": w ^ s,
    }
```
Symmetric difference for "what changed both ways"
a ^ b returns items in exactly one of the two sets—the union of the two non-overlapping Venn regions. Use cases: "what changed between yesterday and today," "which test cases differ between runs."
`a ^ b == (a | b) - (a & b) == (a - b) | (b - a)` — three equivalent expressions; the operator form is shortest.
Worked example: `warehouse ^ store` = `{SKU-101, SKU-102, SKU-301}` — three SKUs that don't match between inventories.
```python
def mismatches(warehouse: list[str], store: list[str]) -> set:
    return set(warehouse) ^ set(store)
```
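A quick sanity check of the three equivalent forms, reusing the SKU sets from the worked example (set iteration order is arbitrary):

```python
a = {"SKU-101", "SKU-102", "SKU-200"}
b = {"SKU-200", "SKU-301"}
assert a ^ b == (a | b) - (a & b) == (a - b) | (b - a)
print(a ^ b)  # {'SKU-101', 'SKU-102', 'SKU-301'} (iteration order arbitrary)
```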
dict comparison via keys() & keys() for keyed-store reconciliation
When inventories are keyed stores (SKU → quantity), the keys tell you what's in both; the values tell you whether they match. dict.keys() is set-like—use &, -, |, ^ directly without converting to set.
Worked example: same SKUs but different quantities.
```python
warehouse = {"SKU-101": 10, "SKU-200": 5}
store = {"SKU-200": 3, "SKU-301": 7}
in_both_skus = warehouse.keys() & store.keys()  # {"SKU-200"}
# qty mismatch: warehouse=5, store=3

def reconcile_keyed(
    warehouse: dict[str, int], store: dict[str, int]
) -> dict:
    in_both = warehouse.keys() & store.keys()
    return {
        "only_warehouse": warehouse.keys() - store.keys(),
        "only_store": store.keys() - warehouse.keys(),
        "qty_mismatch": {
            sku: (warehouse[sku], store[sku])
            for sku in in_both
            if warehouse[sku] != store[sku]
        },
    }
```
Common beginner mistakes
- Using `if x in list:` inside a nested loop (O(N*M)) instead of `set(list)` for O(1) membership checks.
- Forgetting that `set` elements must be hashable—lists and dicts can't be set elements.
- Assuming `==` on two sets of tuples depends on element order—it doesn't; sets compare by membership.
- Using `set(dict.keys())` when `dict.keys()` is already set-like.
- Reaching for nested `for` loops when set algebra expresses the same thing in one line.
Python interview question on set operations
Given two lists of SKU codes (warehouse_inventory, store_inventory), return three sets: items only in warehouse, items only in store, and items in both. Use O(N + M) time.
Solution using set algebra
```python
def reconcile_inventory(
    warehouse_inventory: list[str], store_inventory: list[str]
) -> dict[str, set[str]]:
    w, s = set(warehouse_inventory), set(store_inventory)
    return {
        "only_warehouse": w - s,
        "only_store": s - w,
        "in_both": w & s,
    }
```
Why this works: Converting each list to a set is O(N) and O(M)—deduplication plus hash-table construction. The three set operations (`-`, `-`, `&`) are each linear in the size of their operands on average. Total cost: O(N + M), much faster than the naive O(N * M) nested-loop approach. The result is structurally clean: three named sets, no quantity ambiguity, easy to consume downstream.
5. Conditionals and Comparison Operators in Python
Conditionals and comparison operators in Python for data engineering
Conditionals are graded on cleanliness, exhaustiveness, and idioms—not on whether you know if exists. Two Python-specific idioms separate fluent code from naive code: chained comparisons (100 < amount <= 1000) for range checks, and short-circuiting and/or for guards and defaults.
Pro tip: `if 100 < amount <= 1000:` is faster and clearer than `if amount > 100 and amount <= 1000:`—the middle term is evaluated once. State this in interviews; it shows Python fluency.
Comparison operators and chained comparisons
The six operators (`<`, `<=`, `>`, `>=`, `==`, `!=`) return `bool` and combine with `and`/`or`/`not`. Python's unique feature: chained comparisons evaluate the middle term once.
- Chained comparison — `a < b < c` is `(a < b) and (b < c)` with `b` computed once.
- Type strictness — `1 < "1"` raises `TypeError` in Python 3 (no implicit numeric/string coercion).
- `==` vs `is` — `==` is value equality; `is` is identity. Use `is` for `None`/`True`/`False`, `==` for everything else.
Worked example: range checks.
| comparison | meaning |
|---|---|
| `0 < x <= 100` | x is in (0, 100] |
| `0 <= x < 100` | x is in [0, 100) |
| `a == b == c` | all three equal |
| `a < b > c` | b greater than both (allowed but rare) |
```python
def in_range(x: float, low: float, high: float) -> bool:
    return low <= x < high  # half-open range
```
if/elif/else decision trees vs dispatch tables
if/elif/else is the right shape for range checks and complex conditions. A dispatch table (dict from key to value/callable) is the right shape for equality-only routing: O(1) lookup, code reads as data.
- `if`/`elif` — order matters when conditions overlap; the first match wins.
- Dispatch table — `actions = {"a": handle_a, "b": handle_b}; actions[key]()`.
- Decision rule — `if`/`elif` for ranges/compound, dispatch for ≥5 equality-only branches.
Worked example: decision tree for order priority.
| amount | priority |
|---|---|
| ≤ 10 | low |
| 10 < a ≤ 100 | medium |
| > 100 | high |
```python
def priority(amount: float) -> str:
    if amount <= 10:
        return "low"
    elif amount <= 100:
        return "medium"
    else:
        return "high"
```
Boolean short-circuiting (and / or) for guarded checks
a and b evaluates b only if a is truthy; a or b evaluates b only if a is falsy. Two uses: skip expensive checks (if items and items[0] > 0) and provide defaults (value = arg or "default").
- Truthiness gotcha — `0`, `""`, `[]`, `{}`, `None` are all falsy; `0 or "default"` returns `"default"`.
- Explicit None check — `arg if arg is not None else "default"` when zero/empty is a valid input.
- Return value — `and` returns the first falsy operand (or the last); `or` returns the first truthy operand (or the last).
Worked example: short-circuit behavior.
| expression | evaluates `b`? | result |
|---|---|---|
| `True and b` | yes | `b` |
| `False and b` | no | `False` |
| `True or b` | no | `True` |
| `False or b` | yes | `b` |
| `0 or "default"` | yes | `"default"` |
| `"" or "default"` | yes | `"default"` |
| `5 or "default"` | no | `5` |
```python
def safe_first_amount(orders: list[tuple]) -> float:
    return orders[0][2] if orders else 0.0
```
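To make the truthiness gotcha concrete, a minimal sketch contrasting an `or` default with an explicit `None` check (`display_name` and `qty_or_one` are illustrative names):

```python
def display_name(name: str | None) -> str:
    # `or` swallows every falsy value, including the empty string.
    return name or "unknown"

def qty_or_one(qty: int | None) -> int:
    # Explicit None check: 0 is a valid quantity and must survive.
    return qty if qty is not None else 1

print(display_name(""))  # unknown ("" is falsy)
print(qty_or_one(0))     # 0 (preserved by the explicit check)
```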
Common beginner mistakes
- Using `is` for general value comparison—identity is correct for `None`/`True`/`False` but breaks for other objects that define `__eq__`.
- Reversing `if`/`elif` order so an earlier branch swallows what should hit a later one.
- Building a 50-line `if`/`elif` chain when a dispatch dict would be 5 lines.
- Forgetting that `0`, `""`, `[]`, `{}`, `None` are all falsy—`if x:` rejects all of them.
- Using `==` to compare to `None`—works most of the time but breaks for custom `__eq__`.
Python interview question on conditionals
Given an order with amount (float) and customer_tier (one of "gold", "silver", "bronze"), return its priority: "high" for amount > 1000 OR gold tier; "medium" for amount > 100 OR silver tier; "low" otherwise.
Solution using chained comparisons and short-circuiting
```python
def classify_priority(amount: float, tier: str) -> str:
    if amount > 1000 or tier == "gold":
        return "high"
    if amount > 100 or tier == "silver":
        return "medium"
    return "low"
```
Why this works: Each branch combines an amount check and a tier check with or—if either is true, the priority assigns. The branches are evaluated top-down; the first match wins, so a gold-tier customer with a low amount still gets "high" priority. The implicit else at the bottom catches everything that didn't match the upper two branches. No elif needed because each branch ends with return. Total cost: O(1) per call, two comparisons in the worst case.
6. Conditionals with Membership Testing for Routing in Python
Membership testing and routing in Python for data engineering
Routing is "given input X, decide where to send it." Two performance levers: in on set or dict is O(1) while in on list is O(N), and dispatch tables beat nested if/elif for equality-only routing (O(1) lookup, code reads as data). Convert routing tables to sets the moment you exceed ~10 entries.
Pro tip: State the choice in interviews: "I'll use a dispatch dict because adding a new region is one line and the lookup stays O(1) regardless of table size."
Membership testing with in on list, set, dict
The in operator's cost depends on container type. Pick by data size and access pattern.
- `x in list` — linear scan, O(N).
- `x in set` — hash lookup, O(1) average; element must be hashable.
- `x in dict` — checks keys, not values; O(1) average.
- `x in str` — substring check, O(N*M) worst case.
- NaN gotcha — `float('nan') == float('nan')` is `False`; `if x in [NaN, ...]` silently fails. Use `math.isnan(x)`.
Worked example: membership cost.
| container | cost of `x in container` |
|---|---|
| `list` (10 items) | O(10) — fast but linear |
| `set` (10 items) | O(1) average |
| `list` (10K items) | O(10K) — slow in a hot loop |
| `set` (10K items) | O(1) average |
```python
WEST_STATES = {"CA", "OR", "WA", "NV", "AZ"}  # set, not list

def is_west(state: str) -> bool:
    return state in WEST_STATES  # O(1)
```
Nested conditionals vs flat dispatch tables
For 5+ equality-only branches, replace if/elif chains with a dict keyed by the input value. The dict can map to constants or handler callables.
- `if`/`elif` chain — O(N) in branch count; required for range/compound conditions.
- Dispatch table — O(1) lookup; one-line additions; equality-only.
- Decision rule — dispatch for ≥5 equality routes, conditionals for ≤4 routes or compound conditions.
Worked example: region → warehouse mapping.
| region | handler |
|---|---|
"west" |
send_west_warehouse |
"east" |
send_east_warehouse |
"central" |
send_central_warehouse |
"unknown" |
None → raise |
```python
# Dispatch table form — handler callables assumed defined elsewhere
ROUTES = {
    "west": send_west_warehouse,
    "east": send_east_warehouse,
    "central": send_central_warehouse,
}

def route_dispatch(region: str):
    handler = ROUTES.get(region)
    if handler is None:
        raise ValueError(f"unknown region: {region}")
    return handler()
```
Guard clauses for early returns
A guard clause is an early return or raise that exits on edge cases before the main logic runs—flattening nested conditionals so the happy path lives at the top level of indentation. The invariant: guards exit early on invalid state; the body assumes valid state.
- Nested form — happy path is buried 3–4 indents deep.
- Guarded form — each guard handles one edge case; happy path is one line at the bottom.
Worked example: compare nesting depth.
| version | max nesting | happy path indent |
|---|---|---|
| nested | 4 | 4 |
| guards | 1 | 1 |
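For contrast, a sketch of the nested form; `ship`, `queue_for_payment`, and `reject_zero_amount` are the same illustrative stand-ins used in the guarded version below:

```python
# Nested form: the happy path is buried at the deepest indent (antipattern)
def process_nested(order):
    if order is not None:
        if order.amount > 0:
            if order.status == "paid":
                return ship(order)  # happy path, four levels deep
            else:
                return queue_for_payment(order)
        else:
            return reject_zero_amount(order)
    return None
```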
```python
# With guard clauses — flat
def process(order):
    if order is None:
        return None
    if order.amount <= 0:
        return reject_zero_amount(order)
    if order.status != "paid":
        return queue_for_payment(order)
    return ship(order)
```
Common beginner mistakes
- Using `in list` for routing tables with 100+ entries—linear scan, slow.
- Building a 50-line `if`/`elif` chain for equality routing instead of a dispatch dict.
- Deeply nested conditionals when guards would flatten the code.
- Forgetting that `dict.get(k)` returns `None` on a missing key—handle it explicitly.
- Rebuilding the routing table inside the function on every call instead of defining it as a module-level constant—O(N) rebuild overhead per call.
Python interview question on conditional routing
You have orders with a region field (one of "west", "east", "central", "international"). Build a router that, given an order, returns the warehouse code: WH-W for west, WH-E for east, WH-C for central, WH-INT for international. Unknown regions raise ValueError. Use O(1) routing.
Solution using a dispatch table
```python
ROUTES = {
    "west": "WH-W",
    "east": "WH-E",
    "central": "WH-C",
    "international": "WH-INT",
}

def route_order(region: str) -> str:
    warehouse = ROUTES.get(region)
    if warehouse is None:
        raise ValueError(f"unknown region: {region!r}")
    return warehouse
```
Why this works: The dispatch dict is built once at module load; subsequent lookups are O(1) average. dict.get(region) returns None on missing key (no exception), so we can produce a custom error message including the offending value via the !r format spec. Adding a new region is one line in the dict—no code-flow changes. The function's body is 3 lines, regardless of how many regions exist.
7. Functions, Returns, and Tuple-Based Outputs in Python
Functions and tuple-based return values in Python for data engineering
Python functions are first-class: return a, b, c implicitly constructs a tuple, and callers can unpack at the call site (a, b, c = fn()). For ad-hoc returns, raw tuples are fine; for values that flow through many functions, upgrade to namedtuple or @dataclass so fields are self-documenting.
Pro tip: Comprehensions (`[fn(x) for x in iterable]`) are usually faster and more Pythonic than `map`/`filter` in CPython. Reserve `functools.reduce` for genuinely custom aggregations—`sum`/`max`/`min` are clearer.
Multiple return values via tuple unpacking
return a, b, c returns the tuple (a, b, c) (parentheses optional). Callers unpack with a, b, c = fn() or use first, *rest = fn() for partial unpacking.
- Tuple constructed implicitly — no special multi-return syntax.
- Unpack count must match — a wrong count raises `ValueError`.
- Partial unpacking — `first, *rest = fn()` collects the tail into a list.
```python
def order_summary(orders: list[tuple]) -> tuple[float, float, int]:
    if not orders:
        return 0.0, 0.0, 0
    total = sum(o[2] for o in orders)
    return total, total / len(orders), len(orders)

# my_orders: a list of (order_id, customer, amount, status) tuples
total, avg, count = order_summary(my_orders)
```
Named outputs via namedtuple or dataclass
For values that flow through multiple functions (passed around, serialized, logged), use a named record so call sites read cleanly.
- `namedtuple` — tuple-shaped + attribute-named; backward-compatible with unpacking (`total, avg, count = result`) AND supports `result.total`.
- `@dataclass` — class-shaped + auto-generated `__init__`/`__repr__`/`__eq__`; mutable by default; supports defaults.
- Decision rule — `namedtuple` for most DE records, `@dataclass` when the value has methods or default values.

Worked example: `Summary(100, 25, 4)` — `result.total` is 100, `result[0]` is also 100.
```python
from collections import namedtuple

Summary = namedtuple("Summary", ["total", "average", "count"])

def order_summary(orders: list[tuple]) -> Summary:
    if not orders:
        return Summary(0.0, 0.0, 0)
    total = sum(o[2] for o in orders)
    return Summary(total, total / len(orders), len(orders))
```
Higher-order functions: map, filter, functools.reduce
`map(fn, it)` and `filter(pred, it)` return lazy iterators; `functools.reduce(fn, it, init)` eagerly folds an iterable down to a single value. Modern Python prefers comprehensions for map/filter; `reduce` has no comprehension equivalent.
- `map`/`filter` — usually replaced by comprehensions/generators, which are faster in CPython.
- `reduce` — fold-style aggregation; reserve it for custom folds (`sum`/`max`/`min` are clearer for the common ones).
- Comprehensions — `[fn(x) for x in it]` and `[x for x in it if pred(x)]` are the idiomatic forms.
Worked example: comparing approaches.
| approach | code | result |
|---|---|---|
| `map` | `list(map(lambda x: x*2, [1,2,3]))` | `[2, 4, 6]` |
| comprehension | `[x*2 for x in [1,2,3]]` | `[2, 4, 6]` |
| `filter` | `list(filter(lambda x: x>1, [1,2,3]))` | `[2, 3]` |
| comprehension | `[x for x in [1,2,3] if x>1]` | `[2, 3]` |
| `reduce` | `reduce(operator.add, [1,2,3], 0)` | `6` |
| built-in | `sum([1,2,3])` | `6` |
```python
def double_amounts(orders: list[tuple]) -> list[float]:
    return [o[2] * 2 for o in orders]  # comprehension over `map`

def expensive_orders(orders: list[tuple], threshold: float) -> list[tuple]:
    return [o for o in orders if o[2] > threshold]
```
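One sketch of a fold that genuinely earns `reduce`: merging per-batch count dicts (`merge_status_counts` is an illustrative name; `collections.Counter` can do this too, so treat it as a shape demo):

```python
from functools import reduce

def merge_status_counts(batches: list[dict[str, int]]) -> dict[str, int]:
    # A custom fold: merge per-batch status counters into one dict.
    def merge(acc: dict[str, int], batch: dict[str, int]) -> dict[str, int]:
        for status, n in batch.items():
            acc[status] = acc.get(status, 0) + n
        return acc
    return reduce(merge, batches, {})

print(merge_status_counts([{"paid": 2}, {"paid": 1, "pending": 3}]))
# {'paid': 3, 'pending': 3}
```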
Common beginner mistakes
- Returning a tuple of mixed types (some `int`, some `float`)—the unpacking caller may not realize.
- Using `map`/`filter` when a comprehension would be more Pythonic and slightly faster.
- Using `functools.reduce` when `sum`/`max`/`min` already does the job.
- Forgetting that `namedtuple` instances are immutable—`s.total = 100` raises `AttributeError`.
- Returning `None` from some branches and a tuple from others—callers can't unpack uniformly.
Python interview question on functions and tuple returns
Write a function order_summary(orders) that returns a namedtuple Summary(total, average, count, max_amount) for a list of order tuples. Handle empty input gracefully (return zeros).
Solution using namedtuple and a single pass
```python
from collections import namedtuple

Summary = namedtuple("Summary", ["total", "average", "count", "max_amount"])

def order_summary(orders: list[tuple]) -> Summary:
    if not orders:
        return Summary(0.0, 0.0, 0, 0.0)
    total = 0.0
    max_amount = float("-inf")
    for o in orders:
        amount = o[2]
        total += amount
        if amount > max_amount:
            max_amount = amount
    count = len(orders)
    return Summary(total, total / count, count, max_amount)
```
Why this works: A single pass computes total and max_amount simultaneously, avoiding two separate sum/max calls (which would each iterate independently). The namedtuple return is unpackable AND attribute-accessible at the call site. The empty-input guard returns a structurally consistent Summary with zeros, so downstream code can result.total/result.average uniformly without a None check. Total: O(N) time, O(1) extra memory.
8. CSV Processing, Error Handling, File I/O, and Log Aggregation in Python
Production patterns for fault-tolerant data loading and log aggregation in Python for data engineering
Production loading uses four primitives together: csv.DictReader for typed row access, try/except per row for fault tolerance, for line in f: for memory-bounded streaming, and defaultdict(int) for streaming aggregation. A candidate who reads with f.read().split("\n") signals no production experience; for line in f: signals the opposite.
Pro tip: State the framing out loud: "I'll stream the file in case it exceeds memory; I'll wrap each row in `try`/`except` so one bad row doesn't fail the whole load; I'll quarantine bad rows for later inspection."
csv.reader / csv.DictReader for typed row access
The csv module handles delimiters, quote escaping, and embedded newlines correctly—line.split(",") breaks the moment a field contains a comma. Every value the reader returns is a string; numeric coercion must be explicit and fault-tolerant.
- `csv.reader(f)` — yields each row as a list of strings; integer-indexed access.
- `csv.DictReader(f)` — yields each row as a dict keyed by header columns; self-documenting access (`row["amount"]`).
- String values — `int(row["qty"])` / `float(row["amount"])` may raise `ValueError`; wrap in `try`/`except`.
Worked example: reading a dirty CSV.
```
order_id,qty,amount
101,5,49.99
102,3,29.99
103,abc,9.99
```
| approach | code | row 3 result |
|---|---|---|
| `csv.reader` | `for row in csv.reader(f):` | `["103", "abc", "9.99"]` (strings) |
| `csv.DictReader` | `for row in csv.DictReader(f):` | `{"order_id": "103", "qty": "abc", "amount": "9.99"}` |
```python
import csv

def read_orders(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```
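For contrast, a minimal `csv.reader` sketch (`read_orders_positional` is an illustrative name); note the explicit header skip and positional string access:

```python
import csv

def read_orders_positional(path: str) -> list[list[str]]:
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)   # skip the header row explicitly
        return list(reader)  # each row is a list of strings; index access only
```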
try/except around per-row parsing for fault tolerance
The fault-tolerant loader pattern: wrap each row's coercion in try/except, push valid rows to one list and quarantined rows (with the error reason) to another. Never crash on a single bad row; never silently drop bad rows.
- Catch specific exceptions — `ValueError` for `int`/`float` failures, `KeyError` for missing columns; never bare `Exception`.
- Capture the error message — quarantine `(row, str(e))` so debugging has the reason.
- Return both lists — let the caller decide whether to log, retry, or alert.
Worked example: loading the dirty CSV above.
| row | result |
|---|---|
{order_id: "101", qty: "5", amount: "49.99"} |
valid → {order_id: 101, qty: 5, amount: 49.99}
|
{order_id: "102", qty: "3", amount: "29.99"} |
valid → {order_id: 102, qty: 3, amount: 29.99}
|
{order_id: "103", qty: "abc", amount: "9.99"} |
quarantine → (row, "invalid literal for int() ...")
|
```python
import csv

def load_with_quarantine(path: str) -> tuple[list[dict], list[tuple]]:
    valid: list[dict] = []
    quarantine: list[tuple[dict, str]] = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                parsed = {
                    "order_id": int(row["order_id"]),
                    "qty": int(row["qty"]),
                    "amount": float(row["amount"]),
                }
                valid.append(parsed)
            except (ValueError, KeyError) as e:
                quarantine.append((row, str(e)))
    return valid, quarantine
```
Streaming file reads with for line in open(...)
for line in f: reads one line at a time with O(1) memory regardless of file size—the only viable pattern for multi-GB logs. Wrap with with open(...) so the descriptor is always closed.
- Anti-pattern: `f.read()` — loads the whole file into a single string; OOM on large files.
- Anti-pattern: `f.readlines()` — loads all lines into a list; same OOM risk.
- Trailing newline — `for line in f:` keeps `"\n"`; strip with `line.rstrip()` before parsing.
Worked example: streaming a log file with bounded memory.
with open("access.log") as f:
for line in f:
process(line.rstrip()) # strip trailing \n
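To package the streaming read as a reusable, composable unit, a generator sketch (the filename and `stream_clean_lines` are illustrative):

```python
from collections.abc import Iterator

def stream_clean_lines(path: str) -> Iterator[str]:
    # Generator: one stripped, non-blank line at a time; O(1) memory.
    with open(path) as f:
        for line in f:
            stripped = line.rstrip("\n")
            if stripped:
                yield stripped

# Consumers compose lazily; nothing is read until iteration starts.
error_lines = (ln for ln in stream_clean_lines("access.log") if " 500 " in ln)
```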
defaultdict(int) for log line aggregation
Log aggregation is the canonical streaming workload: parse each line, increment a per-key counter, finish in O(N) time and O(K) memory for K unique keys. Combine streaming reads, per-line parsing with a malformed-line guard, and defaultdict(int) for auto-zero counters.
- Streaming read — O(1) memory per line.
- Malformed-line guard — `if len(parts) < 9: continue` keeps the aggregator running.
- `defaultdict(int)` — no `if key not in` check; first access auto-zeros.
Worked example: aggregating 4 log lines.
| line | status code | counter after |
|---|---|---|
| `... 200 ...` | 200 | `{200: 1}` |
| `... 200 ...` | 200 | `{200: 2}` |
| `... 404 ...` | 404 | `{200: 2, 404: 1}` |
| `... 500 ...` | 500 | `{200: 2, 404: 1, 500: 1}` |
```python
from collections import defaultdict

def aggregate_status_codes(log_path: str) -> dict[str, int]:
    counts: dict[str, int] = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 9:
                continue  # malformed line; skip
            status_code = parts[8]
            counts[status_code] += 1
    return dict(counts)
```
Common beginner mistakes
- Using `f.read().split("\n")` to load a large file—OOM on multi-GB inputs.
- Catching `Exception` instead of specific types (`ValueError`, `KeyError`)—hides real bugs.
- Forgetting `with open(...)` and leaking file descriptors.
- Forgetting to strip the trailing newline before parsing—`line == "expected"` fails silently.
- Using `csv.reader` for keyed access then mapping integer indices to column names manually—use `DictReader` instead.
Python interview question on fault-tolerant CSV loading
Implement load_orders(csv_path) that streams a CSV with columns order_id,qty,amount,status and returns (valid_orders, quarantined_rows). Each valid row is a dict with parsed types (int, int, float, str); each quarantined row is (original_dict, error_message). Use O(1) memory regardless of file size.
Solution using csv.DictReader and per-row try/except
```python
import csv

def load_orders(csv_path: str) -> tuple[list[dict], list[tuple[dict, str]]]:
    valid: list[dict] = []
    quarantine: list[tuple[dict, str]] = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for raw in reader:
            try:
                parsed = {
                    "order_id": int(raw["order_id"]),
                    "qty": int(raw["qty"]),
                    "amount": float(raw["amount"]),
                    "status": raw["status"].strip(),
                }
                valid.append(parsed)
            except (ValueError, KeyError) as e:
                quarantine.append((dict(raw), str(e)))
    return valid, quarantine
```
Why this works: csv.DictReader streams rows one at a time—O(1) memory per row. Each row is wrapped in try/except catching only the specific exceptions we expect (ValueError for int/float parse failures, KeyError for missing columns). Bad rows go to quarantine with their error reason; the loader never crashes. The with open(...) context manager handles file cleanup even on early returns. Total: O(N) time, O(N) memory for the result lists (unavoidable since we return them), but streaming memory (the in-flight processing) is O(1).
Tips to crack Techpath data engineering interviews
These are habits that move the needle in real Techpath loops—not a re-statement of the topics above.
Python fundamentals preparation
Spend most of your prep on stdlib fluency: collections.deque, collections.defaultdict, collections.Counter, collections.namedtuple, itertools.groupby, csv.DictReader, functools.reduce. Type the patterns; do not just read them. The array, hash table, and conditionals topic pages cover the bulk.
Production-pattern preparation
Drill the four production primitives: streaming file reads (for line in f:), fault-tolerant parsing (try/except per row), CSV with type coercion (csv.DictReader + int(...) / float(...)), and log aggregation (defaultdict(int) over streamed lines). The CSV processing, exception handling, and file I/O topic pages have problems matching these patterns.
Order/inventory framing
Techpath's prompts dress general Python primitives in operational data: order queues, inventory reconciliation, log aggregation. The interviewer is grading whether you map the framing to the algorithm correctly. State the mapping out loud: "this is FIFO simulation, use deque"; "this is set difference, use set algebra"; "this is conditional routing, use a dispatch dict"; "this is fault-tolerant CSV loading, use try/except per row." Mapping framings to families is the meta-skill.
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Techpath practice set | /explore/practice/company/techpath |
| Array | /explore/practice/topic/array |
| Hash table | /explore/practice/topic/hash-table |
| Set operations | /explore/practice/topic/set-operations |
| Conditionals | /explore/practice/topic/conditionals |
| CSV processing | /explore/practice/topic/csv-processing |
| Exception handling | /explore/practice/topic/exception-handling |
| File I/O | /explore/practice/topic/file-io |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
State assumptions before typing: "I'll assume the CSV has a header row"; "I'll assume order_ids are unique"; "I'll assume the file may exceed memory, so I'll stream." State invariants after key code blocks. State complexity: "this is O(N) for the streaming pass, O(K) memory for the aggregate dict." Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Techpath data engineering interview process like?
The Techpath data engineering interview typically includes a phone screen (Python warm-up around lists, tuples, or hash tables), one or two coding rounds focused on Python fundamentals and production patterns (CSV loaders, log aggregators), a system-design conversation around pipelines and data workflows, and behavioral interviews. The curated 9-problem Techpath practice set on PipeCode mirrors what you will see on the technical rounds.
Does Techpath test SQL in their data engineering interviews?
The curated Techpath practice set is 100% Python—no SQL problems among the nine. Other Techpath interviewers may bring SQL in ad-hoc rounds, but the published company set is fundamentals-and-production-pattern Python. Prepare for SQL separately if your role calls for it; the curated set will not drill it.
How important is Python for a Techpath data engineering interview?
Python is essentially the entire technical interview at Techpath—Python fundamentals, stdlib fluency, and idiomatic patterns. Memorize: collections.deque, defaultdict, Counter, namedtuple, itertools.groupby, csv.DictReader, try/except, for line in f:. Stdlib fluency separates a clean answer from a 30-line manual loop.
How hard are Techpath data engineering interview questions?
Techpath's curated set has 3 easy + 6 medium + 0 hard—no Hard tier. This is the most fundamentals-friendly company hub covered in PipeCode's company guides. If you're early in your DE prep journey, this is the right hub to start with; if you're a senior candidate prepping for FAANG, it's a quick refresher rather than a stretch.
What Python topics should I prioritize for Techpath?
In rough order: (1) lists vs deque for queue simulation, (2) tuples and indexing for structured records, (3) defaultdict(int) and Counter for aggregation, (4) set operations (&, -, |, ^) for reconciliation, (5) conditionals + dispatch tables for routing, (6) try/except for fault-tolerant parsing, (7) streaming file reads with for line in f:, (8) csv.DictReader + per-row error handling. The array, hash table, and CSV processing topic pages cover the spread.
How many Techpath practice problems should I solve before the interview?
Aim for 30–50 problems spanning all eight topic clusters above—not 100 of the same kind. Solve every problem in the Techpath-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.
Start practicing Techpath data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Techpath problems with tests, AI feedback, and a coding environment so you can drill the exact Python fundamentals and production patterns Techpath asks—without the noise of generic SQL prep that doesn't apply to this loop.
Pipecode.ai is Leetcode for Data Engineering.



Top comments (0)