DEV Community

Cover image for Python for Data Engineering: A Complete Beginner's Guide
Gowtham Potureddi
Gowtham Potureddi

Posted on

Python for Data Engineering: A Complete Beginner's Guide

python for data engineering is the single highest-impact skill stack a junior data engineer can learn: Python data structures (lists, dicts, sets, tuples) to model rows and aggregates, file I/O and CSV / JSON parsing to ingest source data, list comprehensions and generators for memory-efficient transforms, pandas DataFrames for the bulk of ETL work, API and database clients (requests, SQLAlchemy, psycopg2) to move data across systems, and error handling (try / except, retries, logging) to make pipelines production-safe. Together these six areas form the python data engineering foundation that every data engineering interview questions loop circles back to once the SQL conversation winds down.

This guide walks through every concept in the learn python for data engineering curriculum that reviewers love to test in data engineering interview questions: the Python basics any DE engineer should know cold, the data structures that map naturally to rows and aggregates, file io in python patterns for CSV / JSON / Parquet, list comprehension and generator expression patterns that beat naïve loops, pandas for data engineering workflows (read → transform → write), python sqlalchemy and psycopg2 for database integration, python rest api patterns with requests, and the seven gotchas (UnicodeDecodeError, mutable default args, dict.get defaults, generator exhaustion) that fail most candidates. Every section ends as python interview questions with answers: a runnable Python snippet, a traced execution, a sample output, and a concept-by-concept why this works breakdown — the exact shape python for data engineers rounds reward when Python comes up.

PipeCode blog header for a Python data engineering tutorial — bold white headline 'Python · Data Engineering' with subtitle 'a complete beginner's guide' and a minimal Python code snippet on a dark gradient with purple, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse Python practice library →, drill data-manipulation problems →, sharpen list comprehension drills →, rehearse CSV processing patterns →, reinforce ETL Python drills →, or widen coverage on the full Python practice library →.


On this page


1. Why Python is the default language of data engineering

Six Python skills cover 90% of junior DE work — that's the whole map

The one-sentence invariant: python for data engineering is dominant because Python's data ecosystem (pandas, pyarrow, sqlalchemy, requests, pyspark, airflow, dbt) covers every step of the ingest-transform-load pipeline, while the language stays approachable for SQL-first engineers learning to script. Once you know the six Python skills below, every data engineering interview questions prompt that asks "show me a Python script that …" becomes mechanical.

The six Python skill areas a DE engineer must know cold.

  • Core data structures — lists, dicts, sets, tuples; the building blocks of every row, aggregate, lookup table.
  • File I/O — CSV, JSON, Parquet; the universal data-on-disk formats.
  • List comprehensions + generators — memory-efficient transforms; the Python-idiomatic alternative to for loops.
  • pandas DataFrames — the bulk of every Python ETL script lives here.
  • Database + API clientsSQLAlchemy, psycopg2, requests; moving data across system boundaries.
  • Error handling + retriestry / except, exponential backoff, logging; the production-safety toolkit.

Why Python won the DE language wars.

  • pandas made ad-hoc analytics 10× faster than SQL-only workflows for many shapes; the language's data ecosystem is unmatched.
  • pyspark brought distributed compute to the same syntax — engineers who know pandas can pick up Spark DataFrames in days.
  • airflow made Python the orchestration default — DAGs are Python files; sensors and operators are Python classes.
  • dbt reuses Python for testing and macros even though SQL is the surface API.
  • The "import → process → export" loop is shorter in Python than in Scala / Java / Go for typical DE work.

What interviewers listen for.

  • Do you reach for dict instead of nested if chains for lookup logic? — basic-but-tested Python fluency.
  • Do you prefer list comprehensions over append-in-a-loop when transforming a list? — Pythonic style signal.
  • Do you mention generators (yield, iter) when the dataset is too large to fit in memory? — senior signal.
  • Do you wrap external calls in try / except with retries? — production-safety signal.

Worked example — a 12-line Python script that touches all six skills

Detailed explanation. Realistic DE scripts blend multiple skills. The snippet below reads a CSV, parses each row with a dict comprehension, filters by predicate, aggregates with defaultdict, handles errors at the row level, and writes the result to JSON — touching every one of the six skill areas in 12 functional lines.

Question. From orders.csv (columns: order_id, customer_id, region, amount, status), compute total paid revenue per region and write the result to revenue_by_region.json. Skip rows that fail to parse without aborting.

Code.

import csv
import json
from collections import defaultdict

revenue = defaultdict(float)

with open("orders.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        try:
            if row["status"] == "paid":
                revenue[row["region"]] += float(row["amount"])
        except (ValueError, KeyError) as e:
            print(f"Skipping bad row: {row} -> {e}")

with open("revenue_by_region.json", "w", encoding="utf-8") as f:
    json.dump(dict(revenue), f, indent=2)

print(f"Wrote {len(revenue)} regions to revenue_by_region.json")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. csv.DictReader parses each line into a dict keyed by the header row.
  2. defaultdict(float) lazily initialises any new region key to 0.0 on first access.
  3. The try / except block catches malformed amounts or missing columns per row, logs the bad row, and continues.
  4. revenue[row["region"]] += float(row["amount"]) is the aggregation — one line per group.
  5. json.dump(dict(revenue), f, indent=2) writes the result; dict(...) converts defaultdict to a plain dict for clean JSON.

Sample output (revenue_by_region.json).

{
  "US": 12500.0,
  "EU": 8750.0,
  "APAC": 5400.0
}
Enter fullscreen mode Exit fullscreen mode

Rule of thumb: every real DE script combines I/O + parsing + dict-based aggregation + error handling; recognising this shape is the senior signal interviewers chase.

Python
Topic — data-manipulation
Python DE drills

Practice →

Python
Topic — etl
ETL Python practice

Practice →


2. Python data structures — lists, dicts, sets, tuples for DE work

Diagram of the four core Python data structures — four side-by-side cards (list, dict, set, tuple) each with a small literal example and a one-line caption about typical DE usage, on a light PipeCode card.

python data structures — four containers that model every row, aggregate, and lookup

Before any ETL pattern, the mental model: every row is a dict, every batch is a list of dicts, every unique-key set is a set, and every immutable fixed-shape record is a tuple. Get fluent with the four core containers and the rest of python for data engineering is mechanical.

list — ordered, mutable, duplicates allowed.

  • Literal[1, 2, 3]; indexinglst[0], lst[-1]; slicinglst[1:3].
  • Typical DE use — a batch of rows, a column of values, a queue of files to process.
  • appendlst.append(x) is O(1) amortised.
  • extendlst.extend(other) adds every element; lst + other returns a new list.
  • Iterationfor item in lst:.

dict — key→value mapping; the workhorse of DE Python.

  • Literal{"region": "US", "amount": 100}; lookupd["region"]; safe lookupd.get("region", "unknown").
  • Typical DE use — one row of data, a lookup table (country_code → region), an aggregator (region → total).
  • d.setdefault(key, default) — set if missing; return existing value otherwise.
  • d.update(other) — merge another dict in place; {**a, **b} returns a new dict.
  • Iterationfor k, v in d.items():.

set — unordered, no duplicates.

  • Literal{1, 2, 3}; emptyset() (NOT {}, which is an empty dict).
  • Typical DE use — unique values (e.g. distinct customer IDs), set algebra for diffs (new_ids - old_ids).
  • Operationsunion (|), intersection (&), difference (-), symmetric_difference (^).
  • in membershipO(1) average; way faster than in list for large collections.

tuple — ordered, immutable, fixed-shape records.

  • Literal(1, "US", 100) or just 1, "US", 100.
  • Typical DE use — composite dict keys ({(region, year): total}), immutable rows, function returns of multiple values.
  • Unpackingregion, year, total = tup works in for loops too.
  • namedtupleRow = namedtuple("Row", ["id", "region", "amount"]) adds field-access syntax.

Choice rules of thumb.

  • Need to look something up by key?dict.
  • Need uniqueness or fast membership?set.
  • Need order + duplicates?list.
  • Need an immutable fixed-shape record or a composite key?tuple.
  • Need a row with named fields without the overhead of a class?namedtuple or @dataclass.

collections module — the senior toolkit.

  • defaultdict(int) — auto-initialises missing keys; perfect for counters and aggregators.
  • CounterCounter(words) returns a frequency map in one call.
  • OrderedDict — maintains insertion order (less needed since Python 3.7+ where dict does too).
  • namedtuple — lightweight record type with named fields.
  • deque — double-ended queue; O(1) append / popleft.

The dict vs list of tuples decision.

  • Use dict when you do lookups by key more than once.
  • Use list of tuples when you mostly iterate in order and never look up.
  • dict is O(1) lookup; list is O(n) lookup — for large data this is the difference between sub-millisecond and timeout.

Python
Topic — data-structures
Python data structures drills

Practice →

Python
Topic — hash-table
Hash-table Python practice

Practice →


3. File I/O — CSV, JSON, and Parquet ingestion patterns

file io in python — the three on-disk formats every DE script touches

file io in python is the second skill area you'll use in every DE script. The three formats that cover 95% of ingest work: CSV (universal but stringy), JSON (nested-friendly), Parquet (columnar, compressed, fast). Knowing how to read and write each is non-negotiable.

csv.DictReader and csv.DictWriter — the standard-library pattern.

import csv

# Read
with open("orders.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    rows = list(reader)   # each row is a dict keyed by the header

# Write
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "region", "amount"])
    writer.writeheader()
    writer.writerows(rows)
Enter fullscreen mode Exit fullscreen mode
  • newline="" — required on Windows to avoid extra blank lines.
  • encoding="utf-8" — explicit; default varies by platform.
  • csv.DictReader parses each row into a dict; csv.reader returns a list per row.
  • The string trap — every CSV cell is a string; cast int(row["amount"]) or float(...) before arithmetic.

json — parse and dump nested structures.

import json

# Load
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

# Dump
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, default=str)
Enter fullscreen mode Exit fullscreen mode
  • json.load(f) — read from a file-like object.
  • json.loads(s) — parse a string.
  • json.dump(data, f, indent=2) — pretty-print.
  • default=str — fallback to string for non-serialisable types (datetimes, Decimal).
  • The Decimal / datetime trap — neither serialises by default; either pre-convert or pass a default= callable.

pyarrow.parquet — columnar storage for analytical workloads.

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Write
df = pd.DataFrame({"region": ["US", "EU"], "amount": [100, 80]})
table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet", compression="snappy")

# Read
table = pq.read_table("orders.parquet")
df = table.to_pandas()
Enter fullscreen mode Exit fullscreen mode
  • Columnar — Parquet stores data column-by-column, not row-by-row; perfect for analytical scans.
  • Compression — Snappy (fast) / GZIP (smaller) / ZSTD (modern); always compress in production.
  • Schema evolution — Parquet handles added / removed columns gracefully.
  • The DE production format — Spark, Snowflake, BigQuery, Athena all read Parquet natively.

with context managers — always close file handles.

  • with open(...) as f: — automatically closes the file on block exit, even if an exception fires.
  • Avoid bare open() without with — leaked file handles, harder to debug crashes.
  • Multiple fileswith open("a") as fa, open("b") as fb: opens both in one statement.

Pathlib — modern path handling.

from pathlib import Path

data_dir = Path("/data/2026/05")
for csv_file in data_dir.glob("*.csv"):
    # process each
    pass

output = data_dir / "summary.json"
output.write_text(json.dumps(data, indent=2), encoding="utf-8")
Enter fullscreen mode Exit fullscreen mode
  • Path objects replace string-based os.path calls.
  • / operatordata_dir / "summary.json" joins paths cleanly across platforms.
  • .glob("*.csv") — pattern matching files in a directory.
  • .read_text() / .write_text() — file content as a string in one line.

Reading large files in chunks — when memory matters.

  • CSVfor chunk in pd.read_csv("big.csv", chunksize=10000): processes 10k rows at a time.
  • Line-oriented JSON (JSONL) — read line-by-line: for line in f: rec = json.loads(line).
  • Generatorsyield rows instead of returning a list (see §4).

Python
Topic — file-io
File I/O Python drills

Practice →

Python
Topic — csv-processing
CSV processing Python practice

Practice →


4. List comprehensions and generators — memory-efficient transforms

Diagram of list comprehensions vs generators — left card shows a list comprehension producing a full list in memory, right card shows a generator expression producing values lazily one at a time, with annotations about memory footprint and use cases, on a light PipeCode card.

list comprehension python and generators — the Pythonic way to transform data

list comprehension python and generator expressions are the Python-idiomatic way to transform a sequence: shorter, often faster, and definitely more readable than a for loop with append. Knowing both — and when to reach for a generator vs a list — is the cleanest senior-vs-junior signal in any Python interview.

List comprehension — [expr for x in iter if cond].

# Filter + transform
amounts_above_100 = [r["amount"] for r in rows if r["amount"] > 100]

# Dict from a list
region_totals = {r["region"]: r["amount"] for r in rows}

# Set of distinct values
distinct_regions = {r["region"] for r in rows}
Enter fullscreen mode Exit fullscreen mode
  • Three parts — output expression, source iterable, optional filter predicate.
  • Faster than the equivalent for-loop + append by ~2× in CPython.
  • Always materialises the entire list — uses memory proportional to the result size.
  • Use case — when the result list is small enough to fit in memory.

Generator expression — (expr for x in iter if cond).

# Same shape, but lazy
amounts_gen = (r["amount"] for r in rows if r["amount"] > 100)

# Use with sum / any / max / etc.
total = sum(r["amount"] for r in rows if r["status"] == "paid")

# Iterate manually
for amount in amounts_gen:
    process(amount)
Enter fullscreen mode Exit fullscreen mode
  • Round parentheses, not square brackets — that's the only syntactic difference from a list comprehension.
  • Produces values lazily — no list is built; values are yielded one at a time.
  • Memory-constant — perfect for streaming over millions of rows.
  • Single-use — once exhausted, the generator is empty; re-create it to iterate again.

yield — generator functions.

def read_rows(path):
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row   # produce one row at a time

# Use it
for row in read_rows("big_file.csv"):
    process(row)        # only one row in memory at a time
Enter fullscreen mode Exit fullscreen mode
  • yield turns a function into a generator — execution pauses at yield and resumes on the next iteration.
  • Memory footprint — one row at a time, regardless of file size.
  • Composable — generators can chain: chunks = (chunk for chunk in stream()).
  • yield from — delegates to another iterable; useful for flattening nested generators.

enumerate, zip, map, filter — built-in iterator helpers.

  • enumerate(seq, start=0)for i, x in enumerate(rows): … gives row index + value.
  • zip(a, b, c)for x, y, z in zip(list_a, list_b, list_c): … walks multiple iterables in parallel.
  • map(func, seq) — applies func to each element; generator-style.
  • filter(pred, seq) — yields elements where pred(x) is true; generator-style.
  • itertools modulechain, groupby, islice, accumulate; the senior-toolkit of iteration patterns.

Common comprehension patterns for DE.

  • Flatten nested lists[item for sublist in lists for item in sublist].
  • Transposelist(zip(*matrix)) swaps rows and columns.
  • Conditional value[x if x > 0 else 0 for x in values]if/else goes before the for.
  • Multiple filters[x for x in seq if cond1(x) if cond2(x)] — chained ifs.
  • Cross-product[(a, b) for a in A for b in B] — nested fors.

When to reach for a generator over a list comprehension.

  • Data is too large to fit in memory — millions of rows from a database or file.
  • Only iterating once — no need to materialise.
  • Composing pipelines — generator → filter → map → sink.
  • Streaming I/Ofor line in open(...) is already a generator.

Python
Topic — list-comprehension
List comprehension drills

Practice →

Python
Topic — data-manipulation
Iteration + generator patterns

Practice →


5. pandas for data engineering — read, transform, write

Diagram of the canonical pandas ETL flow — three labelled cards (READ from CSV / Parquet / SQL, TRANSFORM with assign / groupby / merge, WRITE to CSV / Parquet / SQL) connected by arrows, on a light PipeCode card.

pandas for data engineering — read, transform, write, repeat

pandas for data engineering is the most-used Python library in DE work. The DataFrame is essentially a table you can manipulate with SQL-like operations in Python: read from a source, transform with chained method calls, write to a sink. Every junior DE script eventually becomes a pandas script.

Read — every common source has a pd.read_* function.

import pandas as pd

df_csv     = pd.read_csv("orders.csv")
df_parquet = pd.read_parquet("orders.parquet")
df_json    = pd.read_json("orders.json")
df_excel   = pd.read_excel("orders.xlsx", sheet_name="Orders")
df_sql     = pd.read_sql("SELECT * FROM orders", conn)
df_html    = pd.read_html("https://example.com/table")[0]
Enter fullscreen mode Exit fullscreen mode
  • pd.read_csv — the most-used; supports compressed files (.gz, .zip) natively, chunked reading via chunksize=, and a dtype= mapping for explicit column types.
  • pd.read_parquet — fast columnar; requires pyarrow or fastparquet installed.
  • pd.read_sql — pulls a query result into a DataFrame; pair with SQLAlchemy or psycopg2.
  • pd.read_html — pulls tables from web pages; surprisingly useful for ad-hoc scraping.

Transform — the SQL-like method chain.

result = (
    df
    .query("status == 'paid' and amount > 0")        # WHERE
    .assign(revenue_eur = lambda d: d["amount"] * 0.92)  # SELECT new col
    .groupby(["region", "currency"])                  # GROUP BY
    .agg(total_revenue=("revenue_eur", "sum"),
         order_count =("order_id",   "count"))       # aggregates
    .reset_index()
    .sort_values("total_revenue", ascending=False)
    .head(10)
)
Enter fullscreen mode Exit fullscreen mode
  • .query("col == val")WHERE shape; supports and / or / not.
  • .assign(new_col = lambda d: d.x + d.y) — adds a column; chain-friendly.
  • .groupby(cols) + .agg(...) — GROUP BY + aggregates; the pandas equivalent of Blog73.
  • .merge(other, on="key", how="left") — JOIN; supports inner / left / right / outer.
  • .rename(columns={"old": "new"}) — rename columns.
  • .drop(columns=[...]) / .dropna() — remove columns or NULL rows.

Write — pd.to_* mirrors pd.read_*.

result.to_csv     ("revenue_by_region.csv",     index=False)
result.to_parquet ("revenue_by_region.parquet", compression="snappy")
result.to_json    ("revenue_by_region.json",    orient="records", indent=2)
result.to_sql     ("revenue_by_region", conn, if_exists="replace", index=False)
Enter fullscreen mode Exit fullscreen mode
  • index=False — almost always; the default True writes the DataFrame's index as a column.
  • to_parquet — preferred production format; columnar, compressed, type-preserving.
  • to_sql(if_exists="replace") — drops and recreates; "append" keeps existing rows; "fail" errors on conflict.

Date / time handling — pd.to_datetime and dt accessor.

df["order_ts"] = pd.to_datetime(df["order_ts"])
df["order_month"] = df["order_ts"].dt.to_period("M")
df["order_dow"]   = df["order_ts"].dt.day_name()
Enter fullscreen mode Exit fullscreen mode
  • pd.to_datetime — parses strings into datetime64; handles ISO and most common formats.
  • .dt accessoryear, month, day, dayofweek, to_period, floor, etc.
  • Time zone handlingdf["ts"].dt.tz_localize("UTC").dt.tz_convert("US/Pacific").

Common pandas gotchas.

  • SettingWithCopyWarningdf[df.x > 0]["y"] = 1 modifies a view, not the original; use df.loc[df.x > 0, "y"] = 1 instead.
  • Implicit string columns — CSV-read columns default to object (string); cast explicitly with astype(int).
  • Missing-value handlingNaN propagates through arithmetic; dropna() or fillna(0) early.
  • Memory blowupgroupby().apply() with a complex function can be slow; prefer .agg() with built-ins.

Python
Topic — data-analysis
pandas Python drills

Practice →

Python
Topic — data-manipulation
DataFrame transform patterns

Practice →


6. Database and API integration — SQLAlchemy, psycopg2, requests

Diagram of the Python integration toolkit — three side-by-side cards (SQLAlchemy ORM and engine, psycopg2 raw PostgreSQL driver, requests for HTTP APIs), each with a short code snippet and a one-line description, on a light PipeCode card.

python sqlalchemy and python requests — talking to databases and APIs

python sqlalchemy, python psycopg2, and python requests are the three libraries that connect Python to the outside world. Every DE pipeline either pulls from a database, pushes to a database, calls an API, or all three.

SQLAlchemy — the standard Python SQL toolkit.

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("postgresql+psycopg2://user:pwd@host:5432/dbname")

df = pd.read_sql("SELECT * FROM orders WHERE status = 'paid'", engine)

df.to_sql("revenue_by_region", engine, if_exists="replace", index=False)

# Lower-level: connection / transaction
with engine.begin() as conn:
    conn.execute("UPDATE orders SET status = 'paid' WHERE id = 42")
Enter fullscreen mode Exit fullscreen mode
  • create_engine(url) — connection URL format: dialect+driver://user:pwd@host:port/db.
  • Works with pandaspd.read_sql(query, engine) and df.to_sql(table, engine).
  • Multiple dialects — PostgreSQL (postgresql+psycopg2), MySQL (mysql+pymysql), SQLite (sqlite:///path.db), Snowflake (snowflake://...), BigQuery, Redshift, etc.
  • Connection poolingSQLAlchemy reuses connections; configure with pool_size=10.

psycopg2 — direct PostgreSQL driver (lower level).

import psycopg2
from psycopg2.extras import RealDictCursor

conn = psycopg2.connect(host="", database="", user="", password="")
with conn.cursor(cursor_factory=RealDictCursor) as cur:
    cur.execute("SELECT * FROM orders WHERE region = %s", ("US",))
    rows = cur.fetchall()   # list of dicts

conn.commit()
conn.close()
Enter fullscreen mode Exit fullscreen mode
  • %s parameter substitution — safe against SQL injection; never concatenate strings.
  • cursor_factory=RealDictCursor — returns dicts instead of tuples.
  • fetchone() / fetchall() / fetchmany(size) — three fetch modes.
  • executemany(sql, list_of_tuples) — bulk insert; faster than N execute calls.
  • COPY FROM STDIN — for huge bulk loads; the fastest Postgres ingest path.

requests — HTTP API client.

import requests

# GET
resp = requests.get(
    "https://api.example.com/v1/orders",
    params={"region": "US", "limit": 100},
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
resp.raise_for_status()
data = resp.json()

# POST
resp = requests.post(
    "https://api.example.com/v1/orders",
    json={"customer_id": 42, "amount": 100},
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
Enter fullscreen mode Exit fullscreen mode
  • params={...} — URL query parameters; properly encoded.
  • json={...} — request body as JSON; sets Content-Type: application/json automatically.
  • headers={...} — common: Authorization, User-Agent, Accept.
  • timeout=10always pass a timeout; no timeout = potential hang forever.
  • resp.raise_for_status() — raises HTTPError for 4xx / 5xx responses.
  • resp.json() — parses the body as JSON.

Session reuse — keep TCP / TLS connections alive.

with requests.Session() as session:
    session.headers.update({"Authorization": f"Bearer {token}"})
    for region in regions:
        resp = session.get(f"https://api.example.com/v1/orders?region={region}")
        process(resp.json())
Enter fullscreen mode Exit fullscreen mode
  • Session() — reuses underlying TCP connection across requests; 10-50× faster than fresh requests in a tight loop.
  • Session-level headers — set once; applied to every request.
  • Cookies persist within a session.

Pagination patterns.

def fetch_all(url, params):
    page = 1
    while True:
        params["page"] = page
        resp = session.get(url, params=params, timeout=10)
        resp.raise_for_status()
        items = resp.json()["items"]
        if not items:
            break
        yield from items
        page += 1
Enter fullscreen mode Exit fullscreen mode
  • Yield-based pagination — generator that streams pages.
  • Cursor-based — many APIs use next_cursor or next_page_token; same shape.
  • Rate limiting — respect Retry-After header on 429 responses.

asyncio + httpx — concurrent API calls.

  • asyncio — Python's async runtime for I/O-concurrent work.
  • httpx.AsyncClient — async drop-in replacement for requests.
  • Use case — fan out 100s of API calls in parallel without spawning threads.

Python
Topic — database
Database Python drills

Practice →

Python
Topic — etl
ETL + API patterns

Practice →


7. Error handling, retries, and pipeline safety patterns

try except python — the production-safety toolkit

Pipelines fail. Files arrive corrupt. APIs return 502. Disks fill. The senior signal in any python for data engineering interview is showing that you handle the failure modes — try / except, retries with exponential backoff, logging, idempotency, dead-letter queues.

try / except / else / finally — the full block.

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()
except requests.Timeout:
    logger.warning(f"Timeout fetching {url}")
    data = None
except requests.HTTPError as e:
    logger.error(f"HTTP {e.response.status_code} on {url}")
    data = None
except Exception as e:
    logger.exception(f"Unexpected error: {e}")
    raise
else:
    logger.info(f"Fetched {len(data)} records from {url}")
finally:
    response.close() if 'response' in dir() else None
Enter fullscreen mode Exit fullscreen mode
  • Specific exceptions firstrequests.Timeout, requests.HTTPError, then Exception.
  • logger.exception — logs the traceback; use for unexpected errors.
  • else — runs when no exception fires; cleaner than putting success code at the end of try.
  • finally — always runs; clean up resources.

Retries with exponential backoff.

import time
import random

def fetch_with_retries(url, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.json()
        except (requests.Timeout, requests.HTTPError) as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            logger.warning(f"Attempt {attempt+1} failed: {e}. Retrying in {delay:.1f}s")
            time.sleep(delay)
Enter fullscreen mode Exit fullscreen mode
  • Exponential backoff — wait time doubles each retry: 1s, 2s, 4s, 8s, 16s.
  • Jitterrandom.uniform(0, 1) prevents thundering-herd on shared services.
  • Maximum retries — bound the attempt count; re-raise on last failure.
  • tenacity library — production-grade retry decorator that handles all this declaratively.

Logging — not print.

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
))
logger.addHandler(handler)

logger.info("Pipeline started")
logger.warning("Retrying API call")
logger.error("Job failed")
logger.exception("Unexpected error")  # includes traceback
Enter fullscreen mode Exit fullscreen mode
  • logging.getLogger(__name__) — module-scoped logger; lets per-module log levels.
  • Five levels — DEBUG / INFO / WARNING / ERROR / CRITICAL.
  • Structured logginglogger.info("event", extra={"region": "US"}); pair with JSON formatters for log aggregation.
  • Never print in production code — print goes to stdout regardless of log level.

Idempotency — re-runnable pipelines.

  • INSERT … ON CONFLICT DO NOTHING (PostgreSQL) — see Blog79 §5.
  • Truncate-and-reload patternsDELETE FROM target; INSERT INTO target SELECT … FROM staging; inside a transaction.
  • Use natural keys + UPSERT — every row gets a stable key; re-runs UPDATE instead of duplicating.
  • Checkpointing — track "last processed" markers per partition.

Dead-letter queues — capture bad rows without aborting.

errors = []
for row in rows:
    try:
        process(row)
    except Exception as e:
        errors.append({"row": row, "error": str(e), "ts": datetime.now()})

if errors:
    with open("dlq.jsonl", "a") as f:
        for err in errors:
            f.write(json.dumps(err, default=str) + "\n")
    logger.warning(f"{len(errors)} rows sent to DLQ")
Enter fullscreen mode Exit fullscreen mode
  • Catch per row, not per batch — one bad row doesn't kill the whole job.
  • Persist DLQ entries — file, queue, table — anywhere downstream review can pick them up.
  • Alert on threshold — if DLQ count > 1% of input, page on-call.

The seven Python gotchas interviewers test most often.

  • Mutable default argumentsdef fn(x=[]): shares the same list across calls; use def fn(x=None): + x = x if x is not None else [].
  • dict[key] vs dict.get(key, default) — bare brackets KeyErrors; .get returns None or the default.
  • Float equality0.1 + 0.2 != 0.3; compare with math.isclose(a, b).
  • == vs is== compares values; is compares identity (only correct for None, True, False, singletons).
  • Generator exhaustion — a generator iterates once; re-create it to iterate again.
  • UnicodeDecodeError — non-UTF-8 file opened without encoding hint; pass encoding="…" explicitly.
  • SettingWithCopyWarning in pandas — see §5.

Python
Topic — etl
Pipeline safety drills

Practice →

Python
Topic — data-manipulation
Error handling Python practice

Practice →


Choosing the right Python pattern (cheat sheet)

A one-screen cheat sheet for using Python for data engineering — pick the pattern that matches your task.

You want to … Pattern Notes
Look up a value by key dict[key] or dict.get(key, default) O(1); safer with .get
Check fast membership if x in set: O(1) for sets; O(n) for lists
Iterate with index for i, x in enumerate(lst): Pythonic; no manual counter
Walk two lists in parallel for a, b in zip(lst_a, lst_b): Stops at shorter list
Build a filtered list [x for x in seq if cond] List comprehension
Build a lookup map {k: v for k, v in pairs} Dict comprehension
Read a CSV pd.read_csv(path) or csv.DictReader pandas for tables, csv for tight memory
Read a large CSV in chunks pd.read_csv(path, chunksize=10000) Returns an iterator of DataFrames
Stream rows from a file def reader(path): yield from open(path) Generator
Aggregate by group df.groupby("col").agg(...) (pandas) or defaultdict(int) Pick what fits the dataset
Pull from a database pd.read_sql(query, engine) SQLAlchemy engine
Push to a database df.to_sql(table, engine, if_exists="append", index=False) Always index=False
Call an HTTP API requests.get(url, params=..., timeout=10) Always set timeout
Retry on transient errors tenacity.retry decorator or manual backoff Exponential delay + jitter
Catch a specific error except requests.Timeout: Specific first, Exception last
Log instead of print logger.info("message") Module-scoped logger
Write Parquet df.to_parquet(path, compression="snappy") Columnar production default
Handle Decimal / datetime in JSON json.dumps(obj, default=str) Or pass a custom encoder

Frequently asked questions

Why is Python the default language for data engineering?

Python won the data engineering language wars because its data ecosystem is unmatched: pandas for analytical transforms, pyarrow and Parquet for fast columnar I/O, SQLAlchemy and psycopg2 for database integration, requests and httpx for APIs, pyspark for distributed compute, airflow for orchestration, and dbt for SQL transformation. The language stays approachable for SQL-first engineers learning to script, the syntax is forgiving, and the same Python skills transfer cleanly across ad-hoc analysis (pandas), batch ETL (pyspark / airflow), and streaming (Kafka clients, Faust). The interview-canonical answer: Python is dominant because the import → process → export loop is shorter and the ecosystem covers every step of the pipeline.

What Python skills do I need for a data engineering interview?

Six skill areas cover 90% of junior DE work. Core data structureslist, dict, set, tuple for modelling rows, lookups, uniqueness, and immutable records. File I/O — CSV (csv.DictReader), JSON (json.load / json.dumps), Parquet (pyarrow.parquet). List comprehensions and generators — Pythonic transforms ([x for x in seq if cond]) and memory-efficient streaming (yield). pandas DataFramesread_csvqueryassigngroupbymergeto_parquet; this is the bulk of every ETL script. Database and API clientsSQLAlchemy for cross-dialect SQL, psycopg2 for raw PostgreSQL, requests for HTTP APIs. Error handling and pipeline safetytry / except, retries with exponential backoff, structured logging, idempotent writes. Master those six and you've cleared the bar for most junior DE loops.

What's the difference between a list comprehension and a generator expression in Python?

A list comprehension[x*2 for x in seq if x > 0] — uses square brackets and materialises the entire result list in memory. It's slightly faster than the equivalent for-loop + append (~2× in CPython) and is the right default when the result fits in memory comfortably. A generator expression(x*2 for x in seq if x > 0) — uses round parentheses and produces values lazily, one at a time, with constant memory usage regardless of input size. Use a generator when streaming over millions of rows (sum(r["amount"] for r in big_iterable)), when only iterating once, or when composing pipelines (squared = (x*x for x in nums); large = (x for x in squared if x > 1000)). Generators are also a function pattern via yielddef reader(path): for line in open(path): yield line.strip() is the canonical streaming-ingest shape.

How do I avoid memory issues when processing large files in Python?

Three patterns. Stream with generatorsfor line in open("big.csv"): reads one line at a time; never f.readlines() (loads the whole file). Chunk with pandasfor chunk in pd.read_csv("big.csv", chunksize=10000): processes 10k rows at a time; aggregate across chunks with a running accumulator (defaultdict(int) or a partial DataFrame). Use Parquet + column projectionpd.read_parquet("file.parquet", columns=["region", "amount"]) reads only the columns you need; Parquet's columnar layout makes this dramatically faster than CSV. For truly large datasets (10s of GB+), graduate to PySpark or Polars, both of which handle out-of-memory transforms natively. The general rule: pick the smallest representation that supports your transform — generator > chunked DataFrame > full DataFrame > distributed DataFrame.

How do I safely make HTTP API calls in a Python data pipeline?

Five rules. Always set a timeoutrequests.get(url, timeout=10) prevents the call from hanging forever. Use a Session for reusewith requests.Session() as s: keeps the TCP / TLS connection alive across many calls (10-50× faster in a tight loop). Check the statusresp.raise_for_status() raises HTTPError on 4xx / 5xx so you don't silently process bad responses. Retry with exponential backoff and jittertenacity.retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=60)) declaratively handles transient failures. Respect rate limits — read the Retry-After header on 429 responses and sleep accordingly; pre-emptively throttle if the API has a known QPS limit. For high-fanout pipelines (100s of concurrent calls), switch to httpx.AsyncClient + asyncio.gather to parallelise I/O without spawning threads.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including Python practice keyed to data structures, list comprehensions, file I/O, CSV processing, pandas DataFrames, ETL workflows, and the production-safety patterns every junior DE must learn. Whether you're drilling python interview questions for data engineer or grinding python for data engineers end-to-end, the practice library mirrors the same six-skill mental model this guide teaches.

Kick off via Explore practice →; drill the Python practice lane →; fan out into the data-manipulation lane →; reinforce list comprehension drills →; rehearse CSV processing patterns →; widen coverage on the full ETL Python library →.

Top comments (0)