Gowtham Potureddi
Techpath Data Engineering Interview Questions

Techpath data engineering interview questions are the cleanest fundamentals + production-pattern Python loop you'll see in a DE interview. Expect mostly straightforward Python: list-as-queue simulations, tuples and indexing for structured records, aggregation over tuples, hash-table set operations for inventory reconciliation, conditionals and comparison operators for priority classification, nested conditionals with membership testing for routing, functions with tuple-based return values, and CSV processing + exception handling + file I/O + log aggregation for production-flavored data loading. No SQL, no graph algorithms, no dynamic programming—just Python idioms and production patterns.

This guide walks through the eight topic clusters Techpath actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 9-problem Techpath set (3 easy, 6 medium, 0 hard)—the most fundamentals-friendly company hub covered so far, with no Hard tier. If you're early in your DE prep journey, this is the right hub to start with.



Top Techpath data engineering interview topics

From the Techpath data engineering practice set, the eight numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–8) | Why it shows up at Techpath |
|---|---|---|
| 1 | Lists and queue simulation in Python | Order Queue Manager—FIFO simulation with list or collections.deque. |
| 2 | Tuples and indexing for structured records in Python | Get Order Details—pack record fields into tuples, index by primary key. |
| 3 | Aggregation and data analysis on tuples in Python | Analyze Order Batch—sum, min, max, group-by-key over a list of tuples. |
| 4 | Hash tables for set operations: intersection and difference in Python | Inventory Reconciler—set intersection (&), difference (-), symmetric difference (^). |
| 5 | Conditionals and comparison operators in Python | Order Priority Classifier—chained comparisons, if/elif/else decision trees. |
| 6 | Conditionals with membership testing for routing in Python | Order Router—nested ifs with the in operator on a set for O(1) routing. |
| 7 | Functions, returns, and tuple-based outputs in Python | Order Summary Generator—multiple return values via tuple unpacking. |
| 8 | CSV processing, error handling, file I/O, and log aggregation in Python | Fault-Tolerant Data Loader + Log File Aggregator—production patterns for parsing dirty CSVs and aggregating log lines. |

Fundamentals + production framing: Techpath's prompts dress general Python idioms in operational data—orders, inventory, logs. The interviewer is grading whether you reach for the right Python primitive on each prompt: list for FIFO, tuple for record packing, set for membership, defaultdict for counting, try/except for fault tolerance, generator-based file iteration for streaming. State the primitive choice out loud before coding.


1. Lists and Queue Simulation in Python

Lists as queues for order simulation in Python for data engineering

The Python list is the right primitive for stacks (append and pop from the right are O(1)) but the wrong primitive for queues: list.pop(0) and list.insert(0, x) are O(n) because every element shifts. For FIFO order processing, reach for collections.deque—O(1) on both ends—and state the choice out loud.

Pro tip: "I'll use collections.deque because list.pop(0) is O(n) and we need O(1) dequeue" earns immediate credit. The list-as-queue antipattern is the most-graded Python performance bug in DE screens.
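The cost gap is easy to demonstrate without a benchmark. A minimal sketch draining the same five orders through both structures—identical FIFO output, very different per-operation cost:

```python
from collections import deque

orders = [f"O{i}" for i in range(5)]

# list-as-queue: pop(0) shifts every remaining element — O(n) per dequeue
lq = list(orders)
list_out = []
while lq:
    list_out.append(lq.pop(0))

# deque: popleft touches only the left end — O(1) per dequeue
dq = deque(orders)
deque_out = []
while dq:
    deque_out.append(dq.popleft())

assert list_out == deque_out == orders  # same FIFO order either way
```

On five items the difference is invisible; on a million-item queue the list version does ~10¹² element shifts while the deque version does ~10⁶ O(1) pops.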

list.append / list.pop(0) vs collections.deque

list is a dynamic array (contiguous memory); left-end inserts/pops are O(n). deque is a doubly-linked list of memory blocks; both ends are O(1). The decision rule: list for stacks (LIFO), deque for queues (FIFO).

  • list.append / list.pop() — O(1) right end, the stack workload.
  • list.insert(0, x) / list.pop(0) — O(n), the queue antipattern.
  • deque.append / deque.popleft — O(1) both ends, the FIFO workload.
  • queue.Queue — thread-safe but slower; use only with concurrent producers/consumers.

Worked example: operation cost comparison.

| operation | list | deque |
|---|---|---|
| append(x) | O(1) | O(1) |
| pop() (right) | O(1) | O(1) |
| insert(0, x) / appendleft(x) | O(n) | O(1) |
| pop(0) / popleft() | O(n) | O(1) |
| dq[i] (random index) | O(1) | O(n) |

from collections import deque

orders: deque = deque()
orders.append("O1")        # enqueue right — O(1)
orders.append("O2")
first = orders.popleft()   # dequeue left — O(1) → 'O1'

Simulating an order queue with FIFO semantics

A FIFO queue exposes enqueue (push back), dequeue (pop front), and peek (front without removing). The invariant: insertion order is preserved—first in, first out.

  • Raise IndexError on empty — matches Python's sequence convention; never silently return None.
  • Implement __len__ — so if queue: truthiness works correctly.
  • Leading underscore on internal deque — signals "private," callers shouldn't reach in.

Worked example: push three orders, peek, dequeue twice.

| op | queue state | return |
|---|---|---|
| enqueue('O1') | [O1] | |
| enqueue('O2') | [O1, O2] | |
| enqueue('O3') | [O1, O2, O3] | |
| peek() | [O1, O2, O3] | 'O1' |
| dequeue() | [O2, O3] | 'O1' |
| dequeue() | [O3] | 'O2' |

class OrderQueue:
    def __init__(self):
        self._q: deque = deque()

    def enqueue(self, order):
        self._q.append(order)

    def dequeue(self):
        if not self._q:
            raise IndexError("dequeue from empty queue")
        return self._q.popleft()

    def peek(self):
        if not self._q:
            raise IndexError("peek on empty queue")
        return self._q[0]

    def __len__(self):
        return len(self._q)

State updates and bounded queues

Real order queues need bounds—either silent eviction or explicit rejection.

  • deque(maxlen=K) — auto-evicts the oldest on overflow; right for sliding-window queues.
  • Manual capacity check — raise or return False on overflow; right for backpressure where orders are valuable.

Techpath framing usually wants explicit rejection—orders shouldn't silently disappear.

class BoundedOrderQueue:
    def __init__(self, capacity: int):
        self._q: deque = deque()
        self._cap = capacity

    def enqueue(self, order) -> bool:
        if len(self._q) >= self._cap:
            return False  # rejected, caller must retry
        self._q.append(order)
        return True
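For contrast, the silent-eviction variant from the bullet above—deque(maxlen=K)—drops the oldest item automatically instead of rejecting the new one. A minimal sketch:

```python
from collections import deque

# maxlen=3: appending a 4th order silently evicts the oldest ("O1")
window = deque(maxlen=3)
for o in ["O1", "O2", "O3", "O4"]:
    window.append(o)

print(list(window))  # ['O2', 'O3', 'O4'] — no error, O1 is simply gone
```

Right for sliding windows, wrong when every order must be accounted for—which is why the BoundedOrderQueue above returns False instead.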

Common beginner mistakes

  • Using list.pop(0) for dequeue—O(n) per operation, painfully slow on large queues.
  • Returning None on empty dequeue instead of raising—silently masks "queue underflow" bugs.
  • Forgetting __len__ — without it, if queue: doesn't truthiness-check correctly.
  • Using queue.Queue for single-threaded code—thread-safety overhead with no benefit.
  • Setting maxlen when the prompt requires explicit rejection—silent eviction loses orders.

Practice list and array problems →

Python interview question on lists and queue simulation

Implement an OrderQueue class with enqueue(order), dequeue(), peek(), and __len__() methods. All operations must be O(1). Empty dequeue and peek should raise IndexError.

Solution using collections.deque

from collections import deque

class OrderQueue:
    def __init__(self):
        self._q: deque = deque()

    def enqueue(self, order):
        self._q.append(order)

    def dequeue(self):
        if not self._q:
            raise IndexError("dequeue from empty queue")
        return self._q.popleft()

    def peek(self):
        if not self._q:
            raise IndexError("peek on empty queue")
        return self._q[0]

    def __len__(self):
        return len(self._q)

Why this works: collections.deque provides O(1) append (enqueue at back) and O(1) popleft (dequeue from front)—exactly the FIFO contract we need, with no O(n) shifts. dq[0] for peek is also O(1) because the deque exposes its leftmost block directly. The IndexError on empty matches Python's convention for sequence operations and signals "queue underflow" loudly. __len__ lets if queue: work correctly, and exposes size to callers without leaking the internal deque.



2. Tuples and Indexing for Structured Records in Python

Tuples and indexing for structured records in Python for data engineering

A tuple is Python's lightweight record: immutable, fixed-length, smallest memory footprint, hashable (so it can be a dict key). The downside is integer-indexed access (order[0] is opaque); the upgrade is collections.namedtuple, which keeps tuple performance and adds attribute access. For "build once, query many" workloads, layer a dict index on top so lookups are O(1).

Pro tip: State the uniqueness of the primary key out loud. {o.order_id: o for o in orders} silently drops duplicates—use defaultdict(list) if the key isn't unique.
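A quick sketch of the silent drop the pro tip warns about, using hypothetical 2-field records for brevity:

```python
from collections import defaultdict

orders = [(1, "Alice"), (2, "Bob"), (1, "Alice-duplicate")]

# dict comprehension: the last record per key silently wins
unique_idx = {o[0]: o for o in orders}
assert unique_idx[1] == (1, "Alice-duplicate")  # the first (1, "Alice") was dropped

# defaultdict(list): every record survives
multi_idx = defaultdict(list)
for o in orders:
    multi_idx[o[0]].append(o)
assert len(multi_idx[1]) == 2
```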

Tuples vs lists vs dataclasses

Four contenders for representing a record, with a clear cost ladder.

  • tuple — immutable, smallest, hashable. Fastest construction; integer index access.
  • list — mutable, not hashable, slightly larger. Same indexing as tuple.
  • namedtuple — tuple performance + attribute access (order.id). Best balance for DE records.
  • @dataclass — class with named fields, auto __init__ / __repr__ / __eq__; mutable by default. Slowest construction; clearest for ergonomic code.

Decision rule: tuple for hot loops, namedtuple for slightly readable hot loops, dataclass for ergonomic code with methods.

| representation | construction | access | size |
|---|---|---|---|
| tuple | (1, "Alice", 99.99, "paid") | order[0] | smallest |
| namedtuple | Order(1, "Alice", 99.99, "paid") | order.id | same as tuple |
| list | [1, "Alice", 99.99, "paid"] | order[0] | slightly larger |
| dataclass | Order(id=1, customer="Alice", amount=99.99, status="paid") | order.id | larger (dict-backed) |

from collections import namedtuple

Order = namedtuple("Order", ["order_id", "customer", "amount", "status"])
o = Order(1, "Alice", 99.99, "paid")
print(o.order_id, o.customer)  # 1 Alice

Tuple unpacking and named indexing via namedtuple

Tuples support unpacking (a, b, c = tup), which exposes named locals in one line. namedtuple adds attribute access on top, plus _replace(field=value) for copy-with-change and _asdict() for serialization.

  • Unpacking — fastest, cleanest for one-time expansion of fields.
  • namedtuple attribute access — about 2× slower than raw indexing in CPython, negligible for most code.
  • _replace — required for "modify" since tuples are immutable.
order = (1, "Alice", 99.99, "paid")
order_id, customer, amount, status = order   # unpacking

from collections import namedtuple
Order = namedtuple("Order", ["order_id", "customer", "amount", "status"])
no = Order(*order)                            # construct from tuple
print(no.customer, no.amount)                 # named attribute access
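The _replace and _asdict helpers mentioned above round out the namedtuple record workflow; a short sketch:

```python
from collections import namedtuple

Order = namedtuple("Order", ["order_id", "customer", "amount", "status"])
o = Order(1, "Alice", 99.99, "paid")

# tuples are immutable, so "modify" means copy-with-change via _replace
refunded = o._replace(status="refunded")
assert refunded.status == "refunded" and o.status == "paid"  # original untouched

# _asdict yields a plain dict — handy for JSON serialization
assert refunded._asdict()["customer"] == "Alice"
```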

Indexing a list of tuples by primary key

For "build once, query many" workloads, build a dict index in one O(n) pass; subsequent queries are O(1). For 1M records and 100K queries, this is ~1.1M ops vs ~10¹¹ for linear scan—five orders of magnitude. If the key isn't unique, use defaultdict(list) to avoid silent overwrite.

Worked example: index 3 orders by order_id.

| input | dict index |
|---|---|
| [(1, "A"), (2, "B"), (3, "C")] | {1: (1, "A"), 2: (2, "B"), 3: (3, "C")} |

Query idx[2] → (2, "B") in O(1).

def build_index(orders: list[tuple]) -> dict:
    return {o[0]: o for o in orders}  # assumes unique primary key

def get_order(idx: dict, order_id: int) -> tuple | None:
    return idx.get(order_id)

Common beginner mistakes

  • Using a list and linear scan when a dict index gives O(1) lookup at small construction cost.
  • Building a dict index from a list with non-unique keys—silent overwrite of duplicates.
  • Reaching for dataclass when namedtuple is sufficient and faster.
  • Using integer indexing throughout the body of a long function—unreadable, error-prone on field reorderings.
  • Forgetting that tuples are immutable—o[0] = 99 raises TypeError.

Python interview question on tuples and indexing

Given a list of (order_id, customer, amount, status) tuples, build a structure that, for any order_id, returns the matching tuple in O(1) time. Handle the case where the same order_id appears in multiple records (return all matches as a list).

Solution using defaultdict(list) keyed by order_id

from collections import defaultdict

def build_multi_index(orders: list[tuple]) -> dict[int, list[tuple]]:
    idx: dict[int, list[tuple]] = defaultdict(list)
    for o in orders:
        idx[o[0]].append(o)
    return dict(idx)

def get_orders_by_id(idx: dict[int, list[tuple]], order_id: int) -> list[tuple]:
    return idx.get(order_id, [])

Why this works: defaultdict(list) auto-creates an empty list on first access for each order_id, so idx[o[0]].append(o) works without an explicit "if key not in" guard. Construction is O(n) for the n input tuples. Each query is O(1) average for the dict lookup, plus O(k) to return the k matching records. Returning [] for missing keys (instead of raising KeyError) gives the caller a uniform iteration API. Total: O(n) build, O(1 + k) query.



3. Aggregation and Data Analysis on Tuples in Python

Aggregation on tuples in Python for data engineering

Aggregation problems split into two shapes: flat aggregates (one number across all rows—sum, mean, min, max) and grouped aggregates (one number per group—defaultdict or Counter). Build the flat one-liner first; then nest with defaultdict for the grouped version.

Pro tip: Generator expressions (sum(o[2] for o in orders)) are lazy—O(1) memory regardless of input size. Reach for Counter for frequency aggregates and defaultdict(int|float) for per-key totals.

sum, min, max, statistics.mean over tuple fields

The built-in flat aggregates are sum, min, max, len, plus statistics.mean / median / stdev. All accept any iterable, so a generator expression that pulls one field per tuple keeps memory O(1).

  • sum(gen) — defaults to start 0 (an int); pass an explicit start such as sum(gen, Decimal(0)) when summing non-int element types like Decimal.
  • statistics.mean(gen) — handles empty input by raising StatisticsError; guard with if orders:.
  • min / max — single-pass, O(n).

Worked example: aggregates over 4 orders.

| order_id | amount |
|---|---|
| 1 | 10.0 |
| 2 | 20.0 |
| 3 | 30.0 |
| 4 | 40.0 |

| metric | value |
|---|---|
| sum | 100.0 |
| min | 10.0 |
| max | 40.0 |
| mean | 25.0 |
| count | 4 |

import statistics

def order_stats(orders: list[tuple]) -> dict:
    return {
        "total": sum(o[2] for o in orders),
        "min": min(o[2] for o in orders),
        "max": max(o[2] for o in orders),
        "mean": statistics.mean(o[2] for o in orders),
        "count": len(orders),
    }

Group-by-key aggregation with defaultdict(list) then reduce

For per-group aggregates ("total per customer," "count per status"), two patterns: two-pass (group with defaultdict(list), then reduce each group) or single-pass (running aggregate folded into one loop).

  • Two-pass — O(N) extra memory for intermediate lists; cleaner for multi-metric outputs.
  • Single-pass — O(K) memory for K groups; preferred for very large N.

Worked example: per-status totals over 4 orders → {paid: 70, pending: 5}.

| order_id | status | amount |
|---|---|---|
| 1 | paid | 10 |
| 2 | paid | 20 |
| 3 | pending | 5 |
| 4 | paid | 40 |

from collections import defaultdict

def total_per_status(orders: list[tuple]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for _, _, amount, status in orders:
        totals[status] += amount
    return dict(totals)
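The two-pass pattern named in this sub-topic's heading—group with defaultdict(list), then reduce each group—pays off when you need several metrics per group. A sketch (stats_per_status is an illustrative name):

```python
from collections import defaultdict
import statistics

def stats_per_status(orders: list[tuple]) -> dict[str, dict]:
    # pass 1: collect amounts under each status key
    groups: dict[str, list[float]] = defaultdict(list)
    for _, _, amount, status in orders:
        groups[status].append(amount)
    # pass 2: reduce each group to multiple metrics at once
    return {
        status: {"total": sum(a), "mean": statistics.mean(a), "count": len(a)}
        for status, a in groups.items()
    }

orders = [(1, "A", 10.0, "paid"), (2, "B", 20.0, "paid"), (3, "C", 5.0, "pending")]
print(stats_per_status(orders))
# {'paid': {'total': 30.0, 'mean': 15.0, 'count': 2}, 'pending': {'total': 5.0, 'mean': 5.0, 'count': 1}}
```

The intermediate lists cost O(N) extra memory—the trade the bullet list above describes against the single-pass running-total form.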

itertools.groupby for sorted group-by

itertools.groupby only groups consecutive equal keys, so unsorted input produces fragmented groups—sort first, then groupby. Cost: O(N log N) sort + O(N) pass. Slower than defaultdict(list) but emits groups lazily, so it works on memory-constrained or already-sorted streams.

  • Use groupby — input already sorted (e.g. database ORDER BY), memory-constrained, or processing in sorted-key order.
  • Use defaultdict(list) — unsorted input that fits in memory; O(N) total.

Worked example: sorted input groups cleanly.

| order_id | status |
|---|---|
| 1 | paid |
| 2 | paid |
| 3 | pending |
| 4 | shipped |

groupby(orders, key=lambda o: o[1]) → [("paid", [1, 2]), ("pending", [3]), ("shipped", [4])].

from itertools import groupby

def groups_by_status(orders: list[tuple]) -> dict[str, list[tuple]]:
    sorted_orders = sorted(orders, key=lambda o: o[3])
    return {
        status: list(group) for status, group in groupby(sorted_orders, key=lambda o: o[3])
    }

Common beginner mistakes

  • Materializing intermediate lists (list(generator)) when sum/min/max already consume iterables lazily.
  • Using itertools.groupby on unsorted input—silent fragmented groups.
  • Mixing Decimal and float in the same sum—TypeError; keep amounts in one numeric type or convert explicitly.
  • Using a regular dict with += 1 and crashing on the first unseen key—use defaultdict(int) or Counter.
  • Computing each metric in a separate pass (4 passes for total + min + max + mean) when one pass with running aggregates is enough.

Python interview question on aggregation over tuples

Given a list of (order_id, customer, amount, status) tuples, return a dictionary with: total_amount, mean_amount, count, count_by_status (per-status count), and total_by_customer (per-customer total).

Solution using a single pass with defaultdict and Counter

from collections import Counter, defaultdict
import statistics

def analyze_batch(orders: list[tuple]) -> dict:
    if not orders:
        return {"total_amount": 0, "mean_amount": 0, "count": 0, "count_by_status": {}, "total_by_customer": {}}
    amounts = [o[2] for o in orders]
    count_by_status: Counter = Counter(o[3] for o in orders)
    total_by_customer: dict[str, float] = defaultdict(float)
    for _, customer, amount, _ in orders:
        total_by_customer[customer] += amount
    return {
        "total_amount": sum(amounts),
        "mean_amount": statistics.mean(amounts),
        "count": len(orders),
        "count_by_status": dict(count_by_status),
        "total_by_customer": dict(total_by_customer),
    }

Why this works: Building amounts once and reusing it for sum, mean, and len avoids three separate passes over orders. Counter is the canonical one-pass frequency aggregate. defaultdict(float) builds per-customer totals in a single pass without explicit key initialization. The empty-input guard returns a structurally consistent result so downstream code doesn't need to special-case empty batches. Total cost: O(N) time, O(N + K) memory where K is unique statuses + customers.



4. Hash Tables for Set Operations: Intersection and Difference in Python

Set operations for inventory reconciliation in Python for data engineering

"What's the difference between two collections" is set algebra: convert each list to a set, then use & (intersection), - (difference), | (union), ^ (symmetric difference). Total cost is O(N + M)—the conversion pays for itself the moment you avoid one nested loop. For 100K items, this beats nested-loop O(N*M) by five orders of magnitude.

Two-circle Venn diagram showing set intersection, left difference, and right difference for inventory reconciliation.

Pro tip: Whenever you reach for if item in list: inside a loop over another list, stop—convert to set instead. Membership testing on set is O(1); on list it's O(N).
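A minimal sketch of the conversion the pro tip prescribes (SKU names are made up):

```python
VALID_SKUS_LIST = [f"SKU-{i}" for i in range(10_000)]
VALID_SKUS_SET = set(VALID_SKUS_LIST)   # one-time O(N) conversion

incoming = ["SKU-42", "SKU-9999", "SKU-BAD"]

# O(M * N): each `in` scans the whole list
slow = [sku for sku in incoming if sku in VALID_SKUS_LIST]

# O(M): each `in` is a hash lookup
fast = [sku for sku in incoming if sku in VALID_SKUS_SET]

assert slow == fast == ["SKU-42", "SKU-9999"]
```

Same result, but the set version stays O(1) per check no matter how large the valid-SKU table grows.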

Drill hash-table problems →

set constructor + &, -, |

set(iterable) deduplicates into a hash table. Elements must be hashable (tuples of immutables work; lists and dicts don't). Operation costs:

  • a & b — intersection, O(min(|a|, |b|)).
  • a | b — union, O(|a| + |b|).
  • a - b — difference, O(|a|).
  • a ^ b — symmetric difference, O(|a| + |b|).
  • x in a — membership, O(1) average.

Worked example: two SKU lists.

| set | contents |
|---|---|
| warehouse | {SKU-101, SKU-102, SKU-200} |
| store | {SKU-200, SKU-301} |

| operation | result |
|---|---|
| warehouse & store | {SKU-200} (in both) |
| warehouse - store | {SKU-101, SKU-102} (only in warehouse) |
| store - warehouse | {SKU-301} (only in store) |
| warehouse ^ store | {SKU-101, SKU-102, SKU-301} (mismatched) |

def reconcile(warehouse: list[str], store: list[str]) -> dict:
    w, s = set(warehouse), set(store)
    return {
        "in_both": w & s,
        "only_warehouse": w - s,
        "only_store": s - w,
        "mismatched": w ^ s,
    }

Symmetric difference for "what changed both ways"

a ^ b returns items in exactly one of the two sets—the union of the two non-overlapping Venn regions. Use cases: "what changed between yesterday and today," "which test cases differ between runs."

a ^ b == (a | b) - (a & b) == (a - b) | (b - a)—three equivalent expressions; the operator form is shortest.
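The three equivalent expressions can be checked directly on the worked-example SKUs:

```python
a = {"SKU-101", "SKU-102", "SKU-200"}
b = {"SKU-200", "SKU-301"}

# all three formulations yield the same mismatch set
assert a ^ b == (a | b) - (a & b) == (a - b) | (b - a)
print(sorted(a ^ b))  # ['SKU-101', 'SKU-102', 'SKU-301']
```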

Worked example: warehouse ^ store = {SKU-101, SKU-102, SKU-301} — three SKUs that don't match between inventories.

def mismatches(warehouse: list[str], store: list[str]) -> set:
    return set(warehouse) ^ set(store)

dict comparison via keys() & keys() for keyed-store reconciliation

When inventories are keyed stores (SKU → quantity), the keys tell you what's in both; the values tell you whether they match. dict.keys() is set-like—use &, -, |, ^ directly without converting to set.

Worked example: same SKUs but different quantities.

warehouse = {"SKU-101": 10, "SKU-200": 5}
store = {"SKU-200": 3, "SKU-301": 7}
in_both_skus = warehouse.keys() & store.keys()    # {"SKU-200"}
# qty mismatch: warehouse=5, store=3
def reconcile_keyed(
    warehouse: dict[str, int], store: dict[str, int]
) -> dict:
    in_both = warehouse.keys() & store.keys()
    return {
        "only_warehouse": warehouse.keys() - store.keys(),
        "only_store": store.keys() - warehouse.keys(),
        "qty_mismatch": {
            sku: (warehouse[sku], store[sku])
            for sku in in_both
            if warehouse[sku] != store[sku]
        },
    }

Common beginner mistakes

  • Using if x in list: inside a nested loop (O(N*M)) instead of set(list) for O(1) membership checks.
  • Forgetting that set elements must be hashable—lists and dicts can't be set elements.
  • Expecting set iteration order to be stable—== compares sets by membership (deterministic), but iteration order is arbitrary and can differ between runs.
  • Using set(dict.keys()) when dict.keys() is already set-like.
  • Reaching for nested for loops when set algebra expresses the same thing in one line.

Python interview question on set operations

Given two lists of SKU codes (warehouse_inventory, store_inventory), return three sets: items only in warehouse, items only in store, and items in both. Use O(N + M) time.

Solution using set algebra

def reconcile_inventory(
    warehouse_inventory: list[str], store_inventory: list[str]
) -> dict[str, set[str]]:
    w, s = set(warehouse_inventory), set(store_inventory)
    return {
        "only_warehouse": w - s,
        "only_store": s - w,
        "in_both": w & s,
    }

Why this works: Converting each list to a set is O(N) and O(M)—deduplication plus hash-table construction. The three set operations (-, -, &) are each O(min size) on average. Total cost: O(N + M), much faster than the naive O(N * M) nested-loop approach. The result is structurally clean: three named sets, no quantity ambiguity, easy to consume downstream.



5. Conditionals and Comparison Operators in Python

Conditionals and comparison operators in Python for data engineering

Conditionals are graded on cleanliness, exhaustiveness, and idioms—not on whether you know if exists. Two Python-specific idioms separate fluent code from naive code: chained comparisons (100 < amount <= 1000) for range checks, and short-circuiting and/or for guards and defaults.

Pro tip: if 100 < amount <= 1000: is faster and clearer than if amount > 100 and amount <= 1000:—the middle term is evaluated once. State this in interviews; it shows Python fluency.

Comparison operators and chained comparisons

The six operators (<, <=, >, >=, ==, !=) return bool and chain with and/or/not. Python's unique feature: chained comparisons evaluate the middle term once.

  • Chained comparison — a < b < c is (a < b) and (b < c) with b computed once.
  • Type strictness — 1 < "1" raises TypeError in Python 3 (no implicit numeric/string coercion).
  • == vs is — == is value equality; is is identity. Use is for None / True / False, == for everything else.

Worked example: range checks.

| comparison | meaning |
|---|---|
| 0 < x <= 100 | x is in (0, 100] |
| 0 <= x < 100 | x is in [0, 100) |
| a == b == c | all three equal |
| a < b > c | b greater than both (allowed but rare) |

def in_range(x: float, low: float, high: float) -> bool:
    return low <= x < high  # half-open range

if/elif/else decision trees vs dispatch tables

if/elif/else is the right shape for range checks and complex conditions. A dispatch table (dict from key to value/callable) is the right shape for equality-only routing: O(1) lookup, code reads as data.

  • if/elif — order matters when conditions overlap; the first match wins.
  • Dispatch table — actions = {"a": handle_a, "b": handle_b}; actions[key]().
  • Decision rule — if/elif for ranges/compound conditions, dispatch for ≥5 equality-only branches.

Worked example: decision tree for order priority.

| amount | priority |
|---|---|
| ≤ 10 | low |
| 10 < a ≤ 100 | medium |
| > 100 | high |

def priority(amount: float) -> str:
    if amount <= 10:
        return "low"
    elif amount <= 100:
        return "medium"
    else:
        return "high"

Boolean short-circuiting (and / or) for guarded checks

a and b evaluates b only if a is truthy; a or b evaluates b only if a is falsy. Two uses: skip expensive checks (if items and items[0] > 0) and provide defaults (value = arg or "default").

  • Truthiness gotcha — 0, "", [], {}, None are all falsy; 0 or "default" returns "default".
  • Explicit None check — arg if arg is not None else "default" when zero/empty is a valid input.
  • Return value — and returns the first falsy operand (or the last); or returns the first truthy operand (or the last).

Worked example: short-circuit behavior.

| expression | evaluates b? | result |
|---|---|---|
| True and b | yes | b |
| False and b | no | False |
| True or b | no | True |
| False or b | yes | b |
| 0 or "default" | yes | "default" |
| "" or "default" | yes | "default" |
| 5 or "default" | no | 5 |

def safe_first_amount(orders: list[tuple]) -> float:
    return orders[0][2] if orders else 0.0
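The guard and default idioms above, plus the falsy-zero gotcha, in one runnable sketch:

```python
# guard: the right side runs only when the left is truthy, so empty lists never index
orders: list[tuple] = []
has_big_first = bool(orders and orders[0][2] > 100)
assert has_big_first is False  # orders is empty; orders[0] was never evaluated

# default via `or` — fine for config strings, wrong when 0 or "" are valid values
label = "" or "unlabeled"
assert label == "unlabeled"

# when zero/empty is valid input, test None explicitly instead
qty = 0
safe_qty = qty if qty is not None else 1
assert safe_qty == 0  # `qty or 1` would wrongly give 1 here
```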

Common beginner mistakes

  • Using is to compare arbitrary objects—is checks identity, not equality; reserve it for None / True / False and use == elsewhere.
  • Reversing if/elif order so an earlier branch swallows what should hit a later one.
  • Building a 50-line if/elif chain when a dispatch dict would be 5 lines.
  • Forgetting that 0, "", [], {}, None are all falsy—if x: rejects all of them.
  • Using == to compare to None—works most of the time but breaks for custom __eq__.

Python interview question on conditionals

Given an order with amount (float) and customer_tier (one of "gold", "silver", "bronze"), return its priority: "high" for amount > 1000 OR gold tier; "medium" for amount > 100 OR silver tier; "low" otherwise.

Solution using chained comparisons and short-circuiting

def classify_priority(amount: float, tier: str) -> str:
    if amount > 1000 or tier == "gold":
        return "high"
    if amount > 100 or tier == "silver":
        return "medium"
    return "low"

Why this works: Each branch combines an amount check and a tier check with or—if either is true, the priority assigns. The branches are evaluated top-down; the first match wins, so a gold-tier customer with a low amount still gets "high" priority. The implicit else at the bottom catches everything that didn't match the upper two branches. No elif needed because each branch ends with return. Total cost: O(1) per call, two comparisons in the worst case.



6. Conditionals with Membership Testing for Routing in Python

Membership testing and routing in Python for data engineering

Routing is "given input X, decide where to send it." Two performance levers: in on set or dict is O(1) while in on list is O(N), and dispatch tables beat nested if/elif for equality-only routing (O(1) lookup, code reads as data). Convert routing tables to sets the moment you exceed ~10 entries.

Pro tip: State the choice in interviews: "I'll use a dispatch dict because adding a new region is one line and the lookup stays O(1) regardless of table size."

Membership testing with in on list, set, dict

The in operator's cost depends on container type. Pick by data size and access pattern.

  • x in list — linear scan, O(N).
  • x in set — hash lookup, O(1) average; element must be hashable.
  • x in dict — checks keys, not values; O(1) average.
  • x in str — substring check, O(N*M) worst case.
  • NaN gotcha — float('nan') == float('nan') is False; if x in [NaN, ...] silently fails. Use math.isnan(x).

Worked example: membership cost.

| container | cost of x in container |
|---|---|
| list (10 items) | O(10) — fast but linear |
| set (10 items) | O(1) average |
| list (10K items) | O(10K) — slow in a hot loop |
| set (10K items) | O(1) average |

WEST_STATES = {"CA", "OR", "WA", "NV", "AZ"}  # set, not list

def is_west(state: str) -> bool:
    return state in WEST_STATES  # O(1)
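The NaN gotcha from the bullet list is worth seeing concretely—in checks identity before equality, so membership results depend on object identity:

```python
import math

nan_a = float("nan")
nan_b = float("nan")

assert nan_a != nan_b            # NaN never compares equal by value
assert nan_a in [nan_a]          # `in` checks identity first, so the same object matches
assert nan_b not in [nan_a]      # a different NaN object silently fails membership
assert math.isnan(nan_b)         # the reliable check
```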

Nested conditionals vs flat dispatch tables

For 5+ equality-only branches, replace if/elif chains with a dict keyed by the input value. The dict can map to constants or handler callables.

  • if/elif chain — O(N) in branch count; required for range/compound conditions.
  • Dispatch table — O(1) lookup; one-line additions; equality-only.
  • Decision rule — dispatch for ≥5 equality routes, conditionals for ≤4 routes or compound conditions.

Worked example: region → warehouse mapping.

| region | handler |
|---|---|
| "west" | send_west_warehouse |
| "east" | send_east_warehouse |
| "central" | send_central_warehouse |
| "unknown" | None → raise |

# Dispatch table form
ROUTES = {
    "west": send_west_warehouse,
    "east": send_east_warehouse,
    "central": send_central_warehouse,
}

def route_dispatch(region: str):
    handler = ROUTES.get(region)
    if handler is None:
        raise ValueError(f"unknown region: {region}")
    return handler()

Guard clauses for early returns

A guard clause is an early return or raise that exits on edge cases before the main logic runs—flattening nested conditionals so the happy path lives at the top level of indentation. The invariant: guards exit early on invalid state; the body assumes valid state.

  • Nested form — happy path is buried 3–4 indents deep.
  • Guarded form — each guard handles one edge case; happy path is one line at the bottom.

Worked example: compare nesting depth.

| version | max nesting | happy path indent |
| --- | --- | --- |
| nested | 4 | 4 |
| guards | 1 | 1 |
# With guard clauses — flat
def process(order):
    if order is None:
        return None
    if order.amount <= 0:
        return reject_zero_amount(order)
    if order.status != "paid":
        return queue_for_payment(order)
    return ship(order)
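For contrast, here is a runnable sketch of the nested form the table above measures. The Order class and the stub handlers (ship, queue_for_payment, reject_zero_amount) are hypothetical stand-ins so the sketch runs standalone:

```python
from dataclasses import dataclass

@dataclass
class Order:
    amount: float
    status: str

# Hypothetical stub handlers so the sketch runs standalone
def ship(order): return "shipped"
def queue_for_payment(order): return "queued"
def reject_zero_amount(order): return "rejected"

# Nested form — the happy path (ship) sits three indents deep
def process_nested(order):
    if order is not None:
        if order.amount > 0:
            if order.status == "paid":
                return ship(order)
            else:
                return queue_for_payment(order)
        else:
            return reject_zero_amount(order)
    else:
        return None
```

Both forms return the same results; the guarded version simply reads top to bottom with no else branches to track.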

Common beginner mistakes

  • Using in list for routing tables with 100+ entries—linear scan, slow.
  • Building a 50-line if/elif chain for equality routing instead of a dispatch dict.
  • Deeply nested conditionals when guards would flatten the code.
  • Forgetting dict.get(k) returns None on missing key—must explicitly handle.
  • Rebuilding the routing dict inside the function on every call instead of defining it once as a module-level constant—the per-call rebuild adds O(N) overhead.

Python interview question on conditional routing

You have orders with a region field (one of "west", "east", "central", "international"). Build a router that, given an order, returns the warehouse code: WH-W for west, WH-E for east, WH-C for central, WH-INT for international. Unknown regions raise ValueError. Use O(1) routing.

Solution using a dispatch table

ROUTES = {
    "west": "WH-W",
    "east": "WH-E",
    "central": "WH-C",
    "international": "WH-INT",
}

def route_order(region: str) -> str:
    warehouse = ROUTES.get(region)
    if warehouse is None:
        raise ValueError(f"unknown region: {region!r}")
    return warehouse

Why this works: The dispatch dict is built once at module load; subsequent lookups are O(1) average. dict.get(region) returns None on missing key (no exception), so we can produce a custom error message including the offending value via the !r format spec. Adding a new region is one line in the dict—no code-flow changes. The function body stays the same few lines no matter how many regions exist.

PYTHON
Topic — conditional logic
Conditional logic problems

Practice →

PYTHON
Topic — conditionals
Conditional problems

Practice →


7. Functions, Returns, and Tuple-Based Outputs in Python

Functions and tuple-based return values in Python for data engineering

Python functions are first-class: return a, b, c implicitly constructs a tuple, and callers can unpack at the call site (a, b, c = fn()). For ad-hoc returns, raw tuples are fine; for values that flow through many functions, upgrade to namedtuple or @dataclass so fields are self-documenting.

Pro tip: Comprehensions ([fn(x) for x in iterable]) are usually faster and more Pythonic than map/filter in CPython. Reserve functools.reduce for genuinely custom aggregations—sum/max/min are clearer.

Multiple return values via tuple unpacking

return a, b, c returns the tuple (a, b, c) (parentheses optional). Callers unpack with a, b, c = fn() or use first, *rest = fn() for partial unpacking.

  • Tuple constructed implicitly — no special multi-return syntax.
  • Unpack count must match — wrong count raises ValueError.
  • Partial unpacking — first, *rest = fn() collects the tail into a list.
def order_summary(orders: list[tuple]) -> tuple[float, float, int]:
    if not orders:
        return 0.0, 0.0, 0
    total = sum(o[2] for o in orders)
    return total, total / len(orders), len(orders)

total, avg, count = order_summary(my_orders)
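The partial-unpacking form from the bullets, as a minimal sketch (order_summary_demo is a hypothetical stand-in returning total, average, count):

```python
# Hypothetical summary returning (total, average, count)
def order_summary_demo():
    return 120.0, 30.0, 4

# Starred name collects the tail into a list
first, *rest = order_summary_demo()
# first is 120.0; rest is the list [30.0, 4]
```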

Named outputs via namedtuple or dataclass

For values that flow through multiple functions (passed around, serialized, logged), use a named record so call sites read cleanly.

  • namedtuple — tuple-shaped + attribute-named; backward-compatible with unpacking (total, avg, count = result) AND supports result.total.
  • @dataclass — class-shaped + auto-generated __init__ / __repr__ / __eq__; mutable by default; supports defaults.
  • Decision rule — namedtuple for most DE records, @dataclass when the value has methods or default values.

Worked example: for Summary(100, 25, 4), result.total is 100 and result[0] is also 100.

from collections import namedtuple

Summary = namedtuple("Summary", ["total", "average", "count"])

def order_summary(orders: list[tuple]) -> Summary:
    if not orders:
        return Summary(0.0, 0.0, 0)
    total = sum(o[2] for o in orders)
    return Summary(total, total / len(orders), len(orders))
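For comparison, here is a sketch of the same summary as a @dataclass — the form to reach for when the record needs field defaults (SummaryDC and order_summary_dc are illustrative names):

```python
from dataclasses import dataclass

@dataclass
class SummaryDC:
    total: float = 0.0
    average: float = 0.0
    count: int = 0

def order_summary_dc(orders: list[tuple]) -> SummaryDC:
    if not orders:
        return SummaryDC()  # field defaults cover the empty case
    total = sum(o[2] for o in orders)  # amount lives at index 2
    return SummaryDC(total, total / len(orders), len(orders))
```

Unlike the namedtuple, fields here are mutable and defaults come for free; the trade-off is that a dataclass instance no longer supports tuple-style unpacking at the call site.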

Higher-order functions: map, filter, functools.reduce

map(fn, it) and filter(pred, it) return lazy iterators; functools.reduce(fn, it, init) eagerly folds the iterable into a single value. All three take callables. Modern Python prefers comprehensions over map/filter; reduce has no comprehension equivalent.

  • map / filter — usually replaced by comprehensions/generators, which are faster in CPython.
  • reduce — fold-style aggregation; reserved for custom folds (sum/max/min are clearer for the common ones).
  • Comprehensions — [fn(x) for x in it] and [x for x in it if pred(x)] are the idiomatic forms.

Worked example: comparing approaches.

| approach | code | result |
| --- | --- | --- |
| map | list(map(lambda x: x*2, [1,2,3])) | [2, 4, 6] |
| comprehension | [x*2 for x in [1,2,3]] | [2, 4, 6] |
| filter | list(filter(lambda x: x>1, [1,2,3])) | [2, 3] |
| comprehension | [x for x in [1,2,3] if x>1] | [2, 3] |
| reduce | reduce(operator.add, [1,2,3], 0) | 6 |
| built-in | sum([1,2,3]) | 6 |
def double_amounts(orders: list[tuple]) -> list[float]:
    return [o[2] * 2 for o in orders]  # comprehension over `map`

def expensive_orders(orders: list[tuple], threshold: float) -> list[tuple]:
    return [o for o in orders if o[2] > threshold]
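A sketch of when reduce earns its keep — a custom fold with no one-line built-in equivalent, here merging per-region count dicts (merge_counts and the batch data are illustrative):

```python
from functools import reduce

def merge_counts(acc: dict, batch: dict) -> dict:
    # Custom fold: add each region's count into the accumulator
    for region, n in batch.items():
        acc[region] = acc.get(region, 0) + n
    return acc

batches = [{"west": 2}, {"east": 1, "west": 3}, {"central": 4}]
merged = reduce(merge_counts, batches, {})
# merged == {"west": 5, "east": 1, "central": 4}
```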

Common beginner mistakes

  • Returning a tuple of mixed types (some int, some float)—callers who unpack may not notice the mismatch.
  • Using map/filter when a comprehension would be more Pythonic and slightly faster.
  • Using functools.reduce when sum / max / min already does the job.
  • Forgetting that namedtuple instances are immutable—s.total = 100 raises AttributeError.
  • Returning None from some branches and a tuple from others—callers can't unpack uniformly.
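The immutability gotcha from the list above, shown directly — _replace is the idiomatic way to get a modified copy of a namedtuple:

```python
from collections import namedtuple

Summary = namedtuple("Summary", ["total", "average", "count"])
s = Summary(100.0, 25.0, 4)

try:
    s.total = 200.0  # namedtuple fields are read-only
except AttributeError:
    pass  # raises AttributeError: can't set attribute

s2 = s._replace(total=200.0)  # returns a new Summary; s is untouched
```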

Python interview question on functions and tuple returns

Write a function order_summary(orders) that returns a namedtuple Summary(total, average, count, max_amount) for a list of order tuples. Handle empty input gracefully (return zeros).

Solution using namedtuple and a single pass

from collections import namedtuple

Summary = namedtuple("Summary", ["total", "average", "count", "max_amount"])

def order_summary(orders: list[tuple]) -> Summary:
    if not orders:
        return Summary(0.0, 0.0, 0, 0.0)
    total = 0.0
    max_amount = float("-inf")
    for o in orders:
        amount = o[2]
        total += amount
        if amount > max_amount:
            max_amount = amount
    count = len(orders)
    return Summary(total, total / count, count, max_amount)

Why this works: A single pass computes total and max_amount simultaneously, avoiding two separate sum/max calls (which would each iterate independently). The namedtuple return is unpackable AND attribute-accessible at the call site. The empty-input guard returns a structurally consistent Summary with zeros, so downstream code can result.total/result.average uniformly without a None check. Total: O(N) time, O(1) extra memory.

PYTHON
Topic — higher-order functions
Higher-order function problems

Practice →

PYTHON
Topic — list comprehension
List comprehension problems

Practice →


8. CSV Processing, Error Handling, File I/O, and Log Aggregation in Python

Production patterns for fault-tolerant data loading and log aggregation in Python for data engineering

Production loading uses four primitives together: csv.DictReader for typed row access, try/except per row for fault tolerance, for line in f: for memory-bounded streaming, and defaultdict(int) for streaming aggregation. A candidate who reads with f.read().split("\n") signals no production experience; for line in f: signals the opposite.

Three-step diagram showing dirty CSV rows on the left, a try/except per-row parser arrow in the middle, and valid rows plus quarantined rows output cards on the right.

Pro tip: State the framing out loud: "I'll stream the file in case it exceeds memory; I'll wrap each row in try/except so one bad row doesn't fail the whole load; I'll quarantine bad rows for later inspection."

See more file I/O problems →

csv.reader / csv.DictReader for typed row access

The csv module handles delimiters, quote escaping, and embedded newlines correctly—line.split(",") breaks the moment a field contains a comma. Every value the reader returns is a string; numeric coercion must be explicit and fault-tolerant.

  • csv.reader(f) — yields each row as a list of strings; integer-indexed access.
  • csv.DictReader(f) — yields each row as a dict keyed by header columns; self-documenting access (row["amount"]).
  • String values — int(row["qty"]) / float(row["amount"]) may raise ValueError; wrap in try/except.

Worked example: reading a dirty CSV.

order_id,qty,amount
101,5,49.99
102,3,29.99
103,abc,9.99
| approach | code | row 3 result |
| --- | --- | --- |
| csv.reader | for row in csv.reader(f): | ["103", "abc", "9.99"] (strings) |
| csv.DictReader | for row in csv.DictReader(f): | {"order_id": "103", "qty": "abc", "amount": "9.99"} |
import csv

def read_orders(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

try/except around per-row parsing for fault tolerance

The fault-tolerant loader pattern: wrap each row's coercion in try/except, push valid rows to one list and quarantined rows (with the error reason) to another. Never crash on a single bad row; never silently drop bad rows.

  • Catch specific exceptions — ValueError for int/float failures, KeyError for missing columns; never bare Exception.
  • Capture the error message — quarantine (row, str(e)) so debugging has the reason.
  • Return both lists — let the caller decide whether to log, retry, or alert.

Worked example: loading the dirty CSV above.

| row | result |
| --- | --- |
| {order_id: "101", qty: "5", amount: "49.99"} | valid → {order_id: 101, qty: 5, amount: 49.99} |
| {order_id: "102", qty: "3", amount: "29.99"} | valid → {order_id: 102, qty: 3, amount: 29.99} |
| {order_id: "103", qty: "abc", amount: "9.99"} | quarantine → (row, "invalid literal for int() ...") |
import csv

def load_with_quarantine(path: str) -> tuple[list[dict], list[tuple]]:
    valid: list[dict] = []
    quarantine: list[tuple[dict, str]] = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                parsed = {
                    "order_id": int(row["order_id"]),
                    "qty": int(row["qty"]),
                    "amount": float(row["amount"]),
                }
                valid.append(parsed)
            except (ValueError, KeyError) as e:
                quarantine.append((row, str(e)))
    return valid, quarantine

Streaming file reads with for line in open(...)

for line in f: reads one line at a time with O(1) memory regardless of file size—the only viable pattern for multi-GB logs. Wrap with with open(...) so the descriptor is always closed.

  • Anti-pattern: f.read() — loads the whole file into a single string; OOM on large files.
  • Anti-pattern: f.readlines() — loads all lines into a list; same OOM risk.
  • Trailing newline — for line in f: keeps "\n"; strip with line.rstrip() before parsing.

Worked example: streaming a log file with bounded memory.

with open("access.log") as f:
    for line in f:
        process(line.rstrip())  # strip trailing \n

defaultdict(int) for log line aggregation

Log aggregation is the canonical streaming workload: parse each line, increment a per-key counter, finish in O(N) time and O(K) memory for K unique keys. Combine streaming reads, per-line parsing with a malformed-line guard, and defaultdict(int) for auto-zero counters.

  • Streaming read — O(1) memory per line.
  • Malformed-line guard — if len(parts) < 9: continue keeps the aggregator running.
  • defaultdict(int) — no if key not in check; first access auto-zeros.

Worked example: aggregating 4 log lines.

| line | status code | counter after |
| --- | --- | --- |
| ... 200 ... | 200 | {200: 1} |
| ... 200 ... | 200 | {200: 2} |
| ... 404 ... | 404 | {200: 2, 404: 1} |
| ... 500 ... | 500 | {200: 2, 404: 1, 500: 1} |
from collections import defaultdict

def aggregate_status_codes(log_path: str) -> dict[str, int]:
    counts: dict[str, int] = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 9:
                continue  # malformed line; skip
            status_code = parts[8]
            counts[status_code] += 1
    return dict(counts)

Common beginner mistakes

  • Using f.read().split("\n") to load a large file—OOM on multi-GB inputs.
  • Catching Exception instead of specific types (ValueError, KeyError)—hides real bugs.
  • Forgetting with open(...) and leaking file descriptors.
  • Forgetting to strip the trailing newline before parsing—line == "expected" fails silently.
  • Using csv.reader for keyed access then mapping integer indices to column names manually—use DictReader instead.

Python interview question on fault-tolerant CSV loading

Implement load_orders(csv_path) that streams a CSV with columns order_id,qty,amount,status and returns (valid_orders, quarantined_rows). Each valid row is a dict with parsed types (int, int, float, str); each quarantined row is (original_dict, error_message). Use O(1) memory regardless of file size.

Solution using csv.DictReader and per-row try/except

import csv

def load_orders(csv_path: str) -> tuple[list[dict], list[tuple[dict, str]]]:
    valid: list[dict] = []
    quarantine: list[tuple[dict, str]] = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for raw in reader:
            try:
                parsed = {
                    "order_id": int(raw["order_id"]),
                    "qty": int(raw["qty"]),
                    "amount": float(raw["amount"]),
                    "status": raw["status"].strip(),
                }
                valid.append(parsed)
            except (ValueError, KeyError) as e:
                quarantine.append((dict(raw), str(e)))
    return valid, quarantine

Why this works: csv.DictReader streams rows one at a time—O(1) memory per row. Each row is wrapped in try/except catching only the specific exceptions we expect (ValueError for int/float parse failures, KeyError for missing columns). Bad rows go to quarantine with their error reason; the loader never crashes. The with open(...) context manager handles file cleanup even on early returns. Total: O(N) time, O(N) memory for the result lists (unavoidable since we return them), but streaming memory (the in-flight processing) is O(1).

PYTHON
Topic — CSV processing
CSV processing problems

Practice →

PYTHON
Topic — exception handling
Exception handling problems

Practice →


Tips to crack Techpath data engineering interviews

These are habits that move the needle in real Techpath loops—not a re-statement of the topics above.

Python fundamentals preparation

Spend most of your prep on stdlib fluency: collections.deque, collections.defaultdict, collections.Counter, collections.namedtuple, itertools.groupby, csv.DictReader, functools.reduce. Type the patterns; do not just read them. The array, hash table, and conditionals topic pages cover the bulk.
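Two of the listed utilities that don't appear elsewhere in this guide, sketched on toy order data — Counter for frequency counts, and itertools.groupby (which only groups adjacent keys, so sort first) for group-by-key totals:

```python
from collections import Counter
from itertools import groupby

# Counter: frequency counts in one call
statuses = ["paid", "pending", "paid", "shipped", "paid"]
counts = Counter(statuses)
# counts["paid"] == 3

# groupby: per-region totals — sort by the SAME key first,
# because groupby only groups adjacent equal keys
orders = [("west", 10.0), ("east", 5.0), ("west", 20.0)]
orders.sort(key=lambda o: o[0])
totals = {region: sum(amount for _, amount in group)
          for region, group in groupby(orders, key=lambda o: o[0])}
# totals == {"east": 5.0, "west": 30.0}
```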

Production-pattern preparation

Drill the four production primitives: streaming file reads (for line in f:), fault-tolerant parsing (try/except per row), CSV with type coercion (csv.DictReader + int(...) / float(...)), and log aggregation (defaultdict(int) over streamed lines). The CSV processing, exception handling, and file I/O topic pages have problems matching these patterns.

Order/inventory framing

Techpath's prompts dress general Python primitives in operational data: order queues, inventory reconciliation, log aggregation. The interviewer is grading whether you map the framing to the algorithm correctly. State the mapping out loud: "this is FIFO simulation, use deque"; "this is set difference, use set algebra"; "this is conditional routing, use a dispatch dict"; "this is fault-tolerant CSV loading, use try/except per row." Mapping framings to families is the meta-skill.


Communication under time pressure

State assumptions before typing: "I'll assume the CSV has a header row"; "I'll assume order_ids are unique"; "I'll assume the file may exceed memory, so I'll stream." State invariants after key code blocks. State complexity: "this is O(N) for the streaming pass, O(K) memory for the aggregate dict." Interviewers grade clear reasoning above silent-and-perfect.


Frequently Asked Questions

What is the Techpath data engineering interview process like?

The Techpath data engineering interview typically includes a phone screen (Python warm-up around lists, tuples, or hash tables), one or two coding rounds focused on Python fundamentals and production patterns (CSV loaders, log aggregators), a system-design conversation around pipelines and data workflows, and behavioral interviews. The curated 9-problem Techpath practice set on PipeCode mirrors what you will see on the technical rounds.

Does Techpath test SQL in their data engineering interviews?

The curated Techpath practice set is 100% Python—no SQL problems among the nine. Other Techpath interviewers may bring SQL in ad-hoc rounds, but the published company set is fundamentals-and-production-pattern Python. Prepare for SQL separately if your role calls for it; the curated set will not drill it.

How important is Python for a Techpath data engineering interview?

Python is essentially the entire technical interview at Techpath—Python fundamentals, stdlib fluency, and idiomatic patterns. Memorize: collections.deque, defaultdict, Counter, namedtuple, itertools.groupby, csv.DictReader, try/except, for line in f:. Stdlib fluency separates a clean answer from a 30-line manual loop.

How hard are Techpath data engineering interview questions?

Techpath's curated set has 3 easy + 6 medium + 0 hard—no Hard tier. This is the most fundamentals-friendly company hub covered in PipeCode's company guides. If you're early in your DE prep journey, this is the right hub to start with; if you're a senior candidate prepping for FAANG, it's a quick refresher rather than a stretch.

What Python topics should I prioritize for Techpath?

In rough order: (1) lists vs deque for queue simulation, (2) tuples and indexing for structured records, (3) defaultdict(int) and Counter for aggregation, (4) set operations (&, -, |, ^) for reconciliation, (5) conditionals + dispatch tables for routing, (6) try/except for fault-tolerant parsing, (7) streaming file reads with for line in f:, (8) csv.DictReader + per-row error handling. The array, hash table, and CSV processing topic pages cover the spread.
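The four set operators from point (4), sketched on a toy reconciliation (the inventory IDs are hypothetical):

```python
warehouse = {"A1", "A2", "B1", "C3"}  # IDs physically on hand
catalog = {"A1", "B1", "B2"}          # IDs the system expects

both = warehouse & catalog                   # intersection: in both
missing_from_catalog = warehouse - catalog   # difference: on hand, not listed
all_ids = warehouse | catalog                # union: every ID seen anywhere
mismatched = warehouse ^ catalog             # symmetric difference: in exactly one
```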

How many Techpath practice problems should I solve before the interview?

Aim for 30–50 problems spanning all eight topic clusters above—not 100 of the same kind. Solve every problem in the Techpath-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.


Start practicing Techpath data engineering problems

Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Techpath problems with tests, AI feedback, and a coding environment so you can drill the exact Python fundamentals and production patterns Techpath asks—without the noise of generic SQL prep that doesn't apply to this loop.

Pipecode.ai is Leetcode for Data Engineering.

Browse Techpath practice →
View all practice →

Top comments (0)