Gowtham Potureddi

Posted on Jun 5

Pandas for Data Engineering: melt, pivot, groupby, merge & DuckDB Migration

#python #sql #dataengineering #interview

pandas melt is the single most useful reshape primitive in a data engineer's toolbox — it is the "untidy wide table → tidy long table" transformation that lets every downstream groupby, merge, and SQL UNPIVOT actually work. Five other primitives — pivot_table, stack/unstack, groupby.agg / transform / apply, and merge (with its six join types plus the underrated merge_asof) — round out the pandas data engineering core, and duckdb pandas is the 2026 escape hatch for when your dataframe stops fitting in RAM.

This guide walks the five primitives that show up most in production Python ETL — pandas pivot, pandas groupby, pandas merge, the DuckDB migration pattern (duckdb.query(df) SQL directly on a dataframe), plus a side-by-side cheat sheet that maps every Pandas operation to its SQL and DuckDB equivalent. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll know when to reach for melt over stack, when transform beats apply by 10×, when to drop Pandas entirely for duckdb.sql(), and how to use pandas to sql patterns to land dataframes in your warehouse without rewriting the pipeline.

When you want hands-on reps immediately after reading, browse Python practice library →, drill aggregation drills in Python →, and rehearse on ETL pipeline problems →.

On this page

Why Pandas still ships in production DE pipelines
Reshape primitives — melt + pivot_table + stack/unstack
groupby — agg, transform, apply
merge — the seven join types
DuckDB migration — SQL on dataframes
Cheat sheet — Pandas to SQL to DuckDB
Frequently asked questions
Practice on PipeCode

1. Why Pandas still ships in production DE pipelines

Pandas is the duct-tape layer between every API, file, and database in a modern Python pipeline

The one-sentence invariant: Pandas is an in-memory, columnar dataframe library that excels at single-machine workloads up to roughly 10 GB, with the richest ecosystem of file readers, SQL connectors, API clients, and visualisation hooks in the Python data world. Once you internalise that mental model — fast in RAM, hopeless at scale, but the lingua franca of every Python data tool — the rest of the pandas data engineering decision space becomes a clean trade-off: stay in Pandas for the small steps, escalate to DuckDB, Polars, or Spark for the big ones.

Pandas vs Polars vs DuckDB — when each one wins.

Pandas. Default for sub-10 GB workloads. Universal API surface — every library on PyPI speaks Pandas. Best when you need to call ML libraries, plot, or hand a dataframe to a downstream script.
Polars. Columnar, lazy, multi-threaded. 5–30× faster than Pandas on aggregate-heavy workloads. Use when the pipeline is pure dataframe ops and you've outgrown Pandas but still need single-machine.
DuckDB. Embedded analytic SQL engine. Runs IN the same Python process, queries Pandas dataframes directly, scans Parquet/CSV/S3 natively. Use when SQL is the cleaner expression of the pipeline, or when you need to scale past Pandas without leaving Python.
Spark / PySpark. Distributed. Use when single-machine truly isn't enough (> 100 GB) and you can pay the cluster cost.

The 2026 lightweight ETL stack.

Pandas + DuckDB has eaten roughly half of the "small Spark cluster" use cases. The pattern is simple: read with Pandas (or DuckDB) → DuckDB-SQL for the heavy joins and aggregates → Pandas for the small downstream steps that need scikit-learn, matplotlib, or a Django ORM hand-off.
DuckDB queries a Pandas dataframe with zero copy — the DataFrame's Arrow-backed columns are read directly by DuckDB's vectorised engine.
Pandas 2.x adopts a PyArrow backend optionally — narrowing the performance gap with Polars and DuckDB without changing the API.

The four production roles Pandas still owns.

Glue. The pipeline reads JSON from an API, joins against a CSV, writes to Postgres — Pandas covers all three in twenty lines.
Quick reshape. melt, pivot_table, stack/unstack are easier to write and read than the SQL equivalents.
Ad-hoc exploration. Jupyter + Pandas is still the fastest path from "give me the answer" to "I have the answer."
Tail-end of an ETL. After a big Spark/DuckDB job collapses 100 GB into a 50 MB summary, Pandas takes over for plotting, alerts, and ML feature prep.

What interviewers listen for.

Do you reach for groupby + agg instead of a Python for loop? — minimum signal.
Do you know the difference between pivot (no aggregation) and pivot_table (with aggregation)? — required answer.
Do you mention merge_asof when a time-window join comes up? — senior signal.
Do you propose DuckDB when the dataset gets large or the SQL is cleaner? — current-stack signal.

Worked example — design a Pandas pipeline for monthly KPI rollup

Detailed explanation. Realistic Pandas pipelines start from the schema ("we get one CSV per region per day"), the transformation ("roll up to monthly KPIs by region and product"), and the output target ("write to a Postgres table the BI tool reads"). Pandas wins here because all three steps live in one process, one file, no infra.

Question. Design a Pandas pipeline that ingests daily sales CSVs from S3, computes monthly KPIs (gross_sales, units, orders) per (region, product), and writes the result to a Postgres monthly_kpis table. Sketch the imports, the read step, the transform step, and the to_sql step.

Code.

import pandas as pd
from sqlalchemy import create_engine
import s3fs

# 1. Read — one call per partitioned prefix
df = pd.read_csv(
    "s3://sales-raw/2026/05/*.csv",
    parse_dates=["order_ts"],
    dtype={"region": "category", "product_id": "category", "units": "int32"},
)

# 2. Transform — monthly bucket + groupby + agg
df["month"] = df["order_ts"].dt.to_period("M")
kpis = (
    df.groupby(["month", "region", "product_id"], observed=True)
      .agg(gross_sales=("amount", "sum"),
           units=("units", "sum"),
           orders=("order_id", "nunique"))
      .reset_index()
)

# 3. Write — Pandas to SQL (Postgres)
engine = create_engine("postgresql+psycopg2://etl:***@db:5432/warehouse")
kpis.to_sql("monthly_kpis", engine, if_exists="append",
            method="multi", chunksize=5_000, index=False)

Step-by-step explanation.

Read with explicit dtype. Passing dtype={"region": "category", ...} shrinks the dataframe by 5–10× compared to default object columns. parse_dates does the date-string-to-datetime64 work at read time, not later.
Monthly bucket via dt.to_period("M"). Cheaper than dt.strftime because it stays in pandas Period dtype, which compares and groups correctly.
groupby(..., observed=True). With categorical group keys, observed=True skips empty (region, product) combinations — both faster and what the BI tool actually wants.
agg(...) with named aggregations. The gross_sales=("amount", "sum") syntax names the output column explicitly — no MultiIndex column mess.
to_sql(..., method="multi", chunksize=5_000). method="multi" batches INSERT statements; chunksize=5_000 caps each batch's parameter count so the driver doesn't OOM.

Output.

Decision	Value	Why
Read	`pd.read_csv` glob on S3	one call, partition-friendly
Categorical dtypes	`region`, `product_id`	5-10× memory savings
Month key	`dt.to_period("M")`	proper grouping dtype
GroupBy	`observed=True`	skip empty cells
Agg	named aggregations	clean output columns
Write	`to_sql(method="multi")`	batched INSERTs

Rule of thumb. Pandas pipelines should fit in a single function. If you need more than ~150 lines or you're already worrying about RAM, escalate one step — usually DuckDB for the heavy transforms, Pandas at the tail.

Pandas interview question on monthly KPI rollup

A common warm-up: "Walk me through a Pandas pipeline that reads daily sales CSVs, computes monthly KPIs per region and product, and writes the result to a SQL warehouse." It exercises read_csv, dtype hygiene, groupby with observed, named agg, and to_sql — the five muscles every pandas data engineering opening round wants to see.

Solution Using `groupby` + named `agg` + `to_sql`

import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv("s3://sales-raw/2026/05/*.csv",
                 parse_dates=["order_ts"],
                 dtype={"region": "category", "product_id": "category"})

df["month"] = df["order_ts"].dt.to_period("M").astype(str)

kpis = (
    df.groupby(["month", "region", "product_id"], observed=True)
      .agg(gross_sales=("amount", "sum"),
           units=("units", "sum"),
           orders=("order_id", "nunique"))
      .reset_index()
)

engine = create_engine("postgresql+psycopg2://etl:***@db:5432/warehouse")
kpis.to_sql("monthly_kpis", engine, if_exists="append",
            method="multi", chunksize=5_000, index=False)

Step-by-step trace.

Input rows (excerpt from df):

order_id	order_ts	region	product_id	amount	units
1	2026-05-02	EU	A	120.0	2
2	2026-05-04	EU	A	60.0	1
3	2026-05-09	EU	B	200.0	4
4	2026-05-11	US	A	80.0	1
5	2026-05-21	US	A	80.0	1

Trace:

df["month"] = ... adds "2026-05" to every row.
groupby(["month","region","product_id"], observed=True) creates three groups: (2026-05, EU, A), (2026-05, EU, B), (2026-05, US, A).
agg(gross_sales=("amount","sum"), units=("units","sum"), orders=("order_id","nunique")) reduces each group to one row.
reset_index() flattens the MultiIndex back to plain columns.
to_sql(...) issues batched INSERT INTO monthly_kpis VALUES (...), (...) statements via method="multi".

Output:

month	region	product_id	gross_sales	units	orders
2026-05	EU	A	180.0	3	2
2026-05	EU	B	200.0	4	1
2026-05	US	A	160.0	2	2

Why this works — concept by concept:

Categorical dtypes — region and product_id as category store the column as an int8/int16 code into a small dictionary; memory drops 5–10× and group operations are vectorised on the codes.
dt.to_period("M") — converts a datetime to a month-period dtype that groups and compares correctly without string acrobatics.
groupby(..., observed=True) — by default, categorical groupby produces a row for every Cartesian combo of categories, including empty ones; observed=True only emits rows that actually exist in the data.
Named aggregations — the output=("col","func") form names the output explicitly, so the resulting dataframe has flat columns and not the legacy ('amount','sum') MultiIndex tuples.
to_sql(method="multi", chunksize=...) — Pandas batches INSERT statements into one round-trip per chunk; without it, each row would be its own INSERT and the writer would be 50–100× slower.
Cost — read = O(rows); groupby = O(rows) with O(groups) memory; write = O(rows / chunksize) DB round-trips.

Python
Topic — ETL
ETL pipeline problems (Pandas, groupby, to_sql)

Practice →

2. Reshape primitives — melt + pivot_table + stack/unstack

`pandas melt` is the wide-to-long transformation every downstream join, groupby, and chart needs

The mental model in one line: wide tables are for humans (one row per entity, one column per attribute), long tables are for computers (one row per (entity, attribute) pair) — melt converts wide → long, pivot_table converts long → wide with aggregation, stack/unstack rotate between row-index and column-index. Once you internalise that, the entire pandas melt / pandas pivot family collapses into "which direction am I going, and do I need aggregation on the way?"

The three reshape primitives every interview probes.

melt(id_vars=..., value_vars=..., var_name=..., value_name=...) — wide → long. The id_vars are kept as-is; every value_vars column becomes a row labelled with its column name in var_name and value in value_name.
pivot_table(index=..., columns=..., values=..., aggfunc=...) — long → wide with aggregation. If two rows collide on the (index, columns) pair, aggfunc resolves them (default mean; usually you want sum for additive metrics).
stack() / unstack() — rotate the last level of column index into the row index, or vice versa. Most useful with a MultiIndex on rows or columns.

Three more reshape moves senior interviews quietly love.

pivot(...) (no _table) — same shape transform as pivot_table but errors on duplicate (index, columns) — useful as an assertion.
wide_to_long(stubnames=..., i=..., j=...) — melt on steroids for column patterns like x_2024, x_2025, x_2026.
explode(column) — list-valued column → one row per list element. Common after JSON ingestion.

SQL equivalents.

melt ≅ UNPIVOT (SQL Server, Snowflake) or cross join unnest(array[...]) (Postgres-style).
pivot_table ≅ GROUP BY ... PIVOT (...) (Snowflake, DuckDB) or GROUP BY + conditional SUM(CASE WHEN ...) everywhere else.
stack / unstack have no clean SQL analog — they're a hierarchical-index trick that exists because Pandas has MultiIndex.

Worked example — reshape a monthly KPI report wide ↔ long

Detailed explanation. Analysts often receive monthly KPI tables in wide format — one row per region, columns Q1, Q2, Q3, Q4. Joining that against a long "targets" table (one row per (region, quarter)) requires melting the wide table first; producing the executive summary requires pivoting back.

Question. Given a wide quarterly KPIs table, (a) melt it to long, (b) join against the targets long table, (c) pivot back to wide so the final executive report has one row per region.

Input (wide KPIs).

region	Q1	Q2	Q3	Q4
EU	120	140	155	170
US	200	215	230	250
APAC	80	95	110	130

Input (long targets).

region	quarter	target
EU	Q1	100
EU	Q2	130
US	Q1	180
US	Q2	210
APAC	Q1	70
APAC	Q2	90

Code.

import pandas as pd

wide = pd.DataFrame({
    "region": ["EU", "US", "APAC"],
    "Q1": [120, 200, 80],
    "Q2": [140, 215, 95],
    "Q3": [155, 230, 110],
    "Q4": [170, 250, 130],
})

# (a) melt wide → long
long = wide.melt(id_vars="region",
                 value_vars=["Q1", "Q2", "Q3", "Q4"],
                 var_name="quarter",
                 value_name="actual")

# (b) join against the long targets
targets = pd.DataFrame({
    "region": ["EU","EU","US","US","APAC","APAC"],
    "quarter": ["Q1","Q2","Q1","Q2","Q1","Q2"],
    "target": [100, 130, 180, 210, 70, 90],
})
joined = long.merge(targets, on=["region", "quarter"], how="left")
joined["gap"] = joined["actual"] - joined["target"]

# (c) pivot back — one row per region, one column per quarter, value = gap
report = joined.pivot_table(index="region",
                            columns="quarter",
                            values="gap",
                            aggfunc="sum")

Step-by-step explanation.

wide.melt(id_vars="region", value_vars=["Q1","Q2","Q3","Q4"]) produces 3 × 4 = 12 rows, one per (region, quarter) pair. The melted dataframe has three columns: region, quarter, actual.
long.merge(targets, on=["region","quarter"], how="left") enriches every long row with its matching target (or NaN for Q3/Q4 which aren't in targets).
joined["gap"] = ... adds the actual - target delta column. Q3 and Q4 rows have NaN gap.
joined.pivot_table(index="region", columns="quarter", values="gap", aggfunc="sum") rotates back to wide. Each region is one row; each quarter is one column; the cell value is the gap.
Because there's exactly one row per (region, quarter), aggfunc="sum" is a no-op — but you'd want it the moment duplicates appear (e.g. multiple snapshots per quarter).

Output (after melt — long format).

region	quarter	actual
EU	Q1	120
EU	Q2	140
EU	Q3	155
EU	Q4	170
US	Q1	200
US	Q2	215
...	...	...

Output (final pivoted gap report).

region	Q1	Q2	Q3	Q4
APAC	10	5	NaN	NaN
EU	20	10	NaN	NaN
US	20	5	NaN	NaN

Rule of thumb. Always melt before joining heterogeneously-shaped tables — it's almost always cheaper than building a custom multi-column join. Pivot back at the very end, after every join, filter, and aggregate is done.

Pandas interview question on `pivot_table` vs `pivot`

The probe typically sounds like: "What's the difference between df.pivot(...) and df.pivot_table(...)?" or "When does pivot fail and pivot_table succeed?" — testing whether the candidate knows that pivot errors on duplicates whereas pivot_table aggregates them away.

Solution Using `pivot_table` with explicit `aggfunc` for duplicate keys

import pandas as pd

# Long table with duplicates on (region, quarter)
df = pd.DataFrame({
    "region":  ["EU","EU","EU","US","US","APAC","APAC","APAC"],
    "quarter": ["Q1","Q1","Q2","Q1","Q1","Q1","Q1","Q2"],
    "amount":  [60, 60, 140, 100, 100, 30, 50, 95],
})

# df.pivot(...) — raises ValueError because (EU,Q1) has 2 rows
# df.pivot_table(...) — aggregates them away
report = df.pivot_table(
    index="region",
    columns="quarter",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)

Step-by-step trace.

Step	Action	Effect
1	`df` has duplicates on (EU, Q1), (US, Q1), (APAC, Q1)	non-unique (index, columns) pair
2	`pivot_table(..., aggfunc="sum")` groups by (region, quarter) then sums `amount`	EU/Q1 → 120, US/Q1 → 200, APAC/Q1 → 80
3	Rotates the result so quarter values become columns	wide layout
4	`fill_value=0` replaces missing (region, quarter) cells (e.g. APAC Q2 only) with `0` instead of `NaN`	clean integer output

Output:

region	Q1	Q2
APAC	80	95
EU	120	140
US	200	0

Why this works — concept by concept:

pivot vs pivot_table semantics — pivot is a pure reshape (no reduction); it raises ValueError on duplicate (index, columns) pairs. pivot_table accepts duplicates and resolves them with aggfunc.
aggfunc="sum" — explicit choice. The library default is mean, which silently lies for additive metrics. Always pass aggfunc for production code.
fill_value=0 — for sparse pivots, NaN cells make downstream arithmetic produce NaNs. fill_value=0 keeps the integer dtype and the math correct.
Determinism — the resulting column order matches sorted(unique(quarter)) unless you pass columns= as a categorical with ordered=True.
Cost — O(rows) aggregation pass + O(distinct(index) × distinct(columns)) to allocate the wide block. Don't pivot ten-million-cardinality columns.

Python
Topic — aggregation (Python)
Pivot / melt aggregation problems (Pandas)

Practice →

Worked example — `wide_to_long` for `xN` column patterns

Detailed explanation. Real-world wide tables aren't always neatly named Q1, Q2, Q3, Q4 — they often look like revenue_2023, revenue_2024, revenue_2025, cost_2023, cost_2024, cost_2025. Plain melt would produce one row per (entity, column) and you'd have to split the column name afterward. wide_to_long does it in one call.

Question. Given a wide table with multi-year revenue_YYYY and cost_YYYY columns, convert it to a long table with three columns: entity, year, revenue, cost.

Input.

entity	revenue_2023	revenue_2024	revenue_2025	cost_2023	cost_2024	cost_2025
A	100	120	140	60	65	70
B	80	90	110	50	55	60

Code.

import pandas as pd

df = pd.DataFrame({
    "entity": ["A", "B"],
    "revenue_2023": [100, 80],
    "revenue_2024": [120, 90],
    "revenue_2025": [140, 110],
    "cost_2023": [60, 50],
    "cost_2024": [65, 55],
    "cost_2025": [70, 60],
})

long = pd.wide_to_long(
    df,
    stubnames=["revenue", "cost"],
    i="entity",
    j="year",
    sep="_",
).reset_index()

Step-by-step explanation.

wide_to_long recognises the <stub>_<year> pattern from stubnames + sep="_".
For every entity, it produces one row per year (3 years × 2 stubs collapses to 3 rows × 2 columns per entity).
i="entity" marks the id column; j="year" is the name of the new column extracted from the suffix.
reset_index() lifts the MultiIndex out so the result is a flat dataframe.

Output.

entity	year	revenue	cost
A	2023	100	60
A	2024	120	65
A	2025	140	70
B	2023	80	50
B	2024	90	55
B	2025	110	60

Rule of thumb. Whenever the column names contain the variable you want as a row, reach for wide_to_long. Plain melt is for the simple case where every melted column shares one role.

Python
Topic — data analysis (Python)
Reshape and wide-to-long problems (Pandas)

Practice →

3. groupby — agg, transform, apply

`pandas groupby` has three output shapes — agg compresses, transform broadcasts, apply does anything

The mental model in one line: agg returns one row per group, transform returns one row per input row (per-group stat broadcast back), apply returns whatever the user function returns. Pick by the shape you need next. This single distinction collapses 80% of the pandas groupby interview questions into the same answer pattern.

The three groupby output shapes.

agg — reduces each group to one row. Equivalent SQL is SELECT key, AGG(col) FROM t GROUP BY key. Fastest of the three.
transform — broadcasts a per-group statistic back to every row. Equivalent SQL is AGG(col) OVER (PARTITION BY key). Useful for "compare row to its group mean" patterns.
apply — runs a user function per group and concatenates the results. The function can return a Series, a DataFrame, or a scalar — the shape varies. Slowest of the three.

The fourth move every senior interview quietly wants.

filter — groupby(key).filter(lambda g: predicate(g)) keeps only the groups (not rows) that pass a predicate. Equivalent SQL is WHERE key IN (SELECT key FROM t GROUP BY key HAVING predicate(...)).

Why transform beats apply by 10×.

apply falls back to per-group Python for arbitrary callables — one Python call per group.
transform keeps the result vectorised because pandas knows the output shape up front (same length as input); when used with built-in agg names ("mean", "sum"), it stays on the C path.
Rule: if your function can be expressed as transform("mean") or transform("rank") or even transform(lambda x: x - x.mean()), never reach for apply.

groupby.agg shapes you'll write daily.

groupby(key).agg("sum") — every numeric column summed per group.
groupby(key).agg({"a": "sum", "b": "mean"}) — per-column agg map. Returns the old-style column names.
groupby(key).agg(out_a=("a","sum"), out_b=("b","mean")) — named aggregations (modern, flat columns). Use this in production.

Worked example — per-customer rolling 7-day average

Detailed explanation. A common analytics ask: "for each customer, compute the rolling 7-day average of their order amount, attached to every row." This requires per-customer windowing — exactly what groupby + rolling + transform does in one expression.

Question. Given an orders table with customer_id, order_ts, amount, compute a rolling_7d_avg column = the rolling 7-day average of amount for that customer up to and including the current row.

Input.

customer_id	order_ts	amount
1	2026-05-01	100
1	2026-05-03	120
1	2026-05-08	80
1	2026-05-10	200
2	2026-05-01	50
2	2026-05-02	70
2	2026-05-09	60

Code.

import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1,1,1,1,2,2,2],
    "order_ts": pd.to_datetime([
        "2026-05-01","2026-05-03","2026-05-08","2026-05-10",
        "2026-05-01","2026-05-02","2026-05-09",
    ]),
    "amount": [100,120,80,200,50,70,60],
})

orders = orders.sort_values(["customer_id", "order_ts"])

# Index on order_ts so rolling("7D") understands the window
orders = orders.set_index("order_ts")

orders["rolling_7d_avg"] = (
    orders.groupby("customer_id")["amount"]
          .rolling("7D")
          .mean()
          .reset_index(level=0, drop=True)
)

orders = orders.reset_index()

Step-by-step explanation.

sort_values(["customer_id","order_ts"]) ensures every customer's rows are ordered by time — rolling assumes ascending time.
set_index("order_ts") puts the time column into the index. rolling("7D") understands a string like "7D" only when the index is a datetime.
groupby("customer_id")["amount"].rolling("7D").mean() produces a Series with a MultiIndex (customer_id, order_ts).
.reset_index(level=0, drop=True) drops the customer_id level so the result aligns back onto the original order_ts index — orders["rolling_7d_avg"] = ... then broadcasts it correctly.
The final .reset_index() lifts order_ts back into a normal column for downstream code.

Output.

customer_id	order_ts	amount	rolling_7d_avg
1	2026-05-01	100	100.0
1	2026-05-03	120	110.0
1	2026-05-08	80	100.0
1	2026-05-10	200	133.33
2	2026-05-01	50	50.0
2	2026-05-02	70	60.0
2	2026-05-09	60	60.0

Rule of thumb. For per-group windowed stats, the chain groupby → rolling → mean → reset_index(level=0, drop=True) is the canonical pattern. The reset_index trick is the single most common source of "why doesn't the assignment line up?" debugging.

Pandas interview question on `transform` vs `apply`

The probe usually sounds like: "How would you add a % of group column to a payment dataframe — each row's amount as a percent of its merchant's total?" — testing whether the candidate reaches for transform("sum") (fast, vectorised) or apply (slow, easy trap).

Solution Using `groupby.transform("sum")` for the per-group denominator

import pandas as pd

df = pd.DataFrame({
    "merchant_id": [7,7,7,8,8,9],
    "payment_id":  ["a","b","c","d","e","f"],
    "amount":      [50,150,100,200,300,400],
})

# Per-group total broadcast back to every row, vectorised
df["merchant_total"] = df.groupby("merchant_id")["amount"].transform("sum")
df["pct_of_merchant"] = df["amount"] / df["merchant_total"]

Step-by-step trace.

Step	Action	Result
1	`groupby("merchant_id")`	three groups: 7, 8, 9
2	`["amount"].transform("sum")`	new Series, same length as input, each row replaced with its group sum
3	broadcast back: 50→300, 150→300, 100→300, 200→500, 300→500, 400→400	one row per input row
4	`df["pct_of_merchant"] = df["amount"] / df["merchant_total"]`	row-level division — no apply needed

Output:

merchant_id	payment_id	amount	merchant_total	pct_of_merchant
7	a	50	300	0.1667
7	b	150	300	0.5000
7	c	100	300	0.3333
8	d	200	500	0.4000
8	e	300	500	0.6000
9	f	400	400	1.0000

Why this works — concept by concept:

transform shape contract — pandas knows the output Series has the same length and index as the input, so it can keep the operation vectorised and avoid the per-group Python call overhead apply would incur.
String-named aggregations — transform("sum") stays on the C path; transform(lambda g: g.sum()) would fall back to Python (still per-group, not per-row, so it's not catastrophic but it's measurably slower).
Broadcast assignment — df["merchant_total"] = ... aligns by index, which lines up because transform preserves the input index.
SQL equivalent — SUM(amount) OVER (PARTITION BY merchant_id). Same intent, same shape, same name in the result.
Cost — O(rows) single pass + O(groups) aggregation state. apply(lambda) would be O(groups × per_group_python_overhead), often 10× slower on small groups.

Python
Topic — grouping
GroupBy and window function problems

Practice →

Worked example — `groupby.filter` to drop low-volume groups

Detailed explanation. A common cleaning step: drop every merchant with fewer than 10 transactions before computing metrics. The naive approach computes a count, joins back, and filters — three lines and two passes. groupby.filter does it in one expression.

Question. Given a payments dataframe, keep only the rows belonging to merchants with at least 10 transactions.

Code.

import pandas as pd

clean = df.groupby("merchant_id").filter(lambda g: len(g) >= 10)

Step-by-step explanation.

groupby("merchant_id") partitions the dataframe by merchant.
.filter(lambda g: len(g) >= 10) evaluates the predicate on each group — not each row.
Groups that pass keep all their rows; groups that fail are dropped entirely.
The result preserves the original row order within each surviving group.

Output.

The output retains the same column schema as the input but excludes every row whose merchant_id had fewer than 10 records. Per-group counts that started at 4, 7, 12, 25 reduce the kept set to merchants with counts 12 and 25.

Rule of thumb. filter is the right tool when you want to keep rows based on a group-level predicate. Don't confuse it with the dataframe-level .filter() (which selects columns) — different method, different signature.

4. merge — the seven join types

`pandas merge` is the full SQL JOIN family plus two superpowers (`merge_asof`, `merge_ordered`)

The mental model in one line: pd.merge(left, right, on=..., how=...) is the API; how picks the SQL JOIN type; merge_asof adds time-window joins that SQL needs LATERAL JOIN ... LIMIT 1 to express; merge_ordered handles sorted joins efficiently. Once you can name all seven outputs of the how argument, the pandas merge interview surface is essentially closed.

The seven join types every interview probes.

how="inner" — only rows whose keys match in both dataframes. The default.
how="left" — every row from left; matching right rows filled in; NaN for unmatched.
how="right" — mirror of left. Rare in practice — usually rewritten as a left join with sides swapped.
how="outer" — every row from either side, NaN-filled where the other side is missing.
how="cross" — Cartesian product. Useful for "every customer × every product" matrix building. Be careful — output grows as O(N × M).
Anti-join via indicator=True + filter — merge(..., how="outer", indicator=True) adds a _merge column with values left_only, right_only, both. Filter to one of these to get a true anti-join.
Semi-join via isin — left[left["key"].isin(right["key"])] keeps left rows whose key appears in right, without the columns from right. Simulates SQL's WHERE EXISTS (...).

Two superpowers SQL needs ugly code to match.

merge_asof(left, right, on=..., by=..., tolerance=...) — for each row in left, snap to the most recent row in right whose key is ≤ the left key. Canonical use case: trades + quotes, sensor readings + weather, "what was the price just before this trade?"
merge_ordered(left, right, on=..., fill_method="ffill") — outer join on sorted keys, with optional forward-fill. Used for irregular time-series alignment.

The two arguments interviewers always probe.

on= vs left_on= / right_on= — on requires the same column name in both; left_on/right_on is for mismatched names. Pass both as lists for composite keys.
validate= — "one_to_one", "one_to_many", "many_to_one", "many_to_many". Use this in production code to assert join cardinality and catch fan-out bugs early.

Worked example — `merge_asof` for trades + quotes time-window join

Detailed explanation. A textbook financial use case: every trade row needs the most recent quote that existed at or before the trade timestamp. A naive merge(..., on="ts") would only match exact timestamps — usually zero rows. merge_asof snaps each trade to the nearest preceding quote in O((N + M) log) time using a merge-sort-style walk.

Question. Given a trades table (one row per trade with symbol, trade_ts, qty) and a quotes table (one row per quote tick with symbol, quote_ts, bid, ask), produce one row per trade with the quote that was active at trade time.

Input (trades).

symbol	trade_ts	qty
AAPL	09:30:01.005	100
AAPL	09:30:01.013	50
MSFT	09:30:01.007	200

Input (quotes).

symbol	quote_ts	bid	ask
AAPL	09:30:01.000	189.95	190.05
AAPL	09:30:01.010	189.97	190.04
MSFT	09:30:01.000	425.00	425.10
MSFT	09:30:01.008	425.03	425.12

Code.

import pandas as pd

trades = pd.DataFrame({
    "symbol":   ["AAPL","AAPL","MSFT"],
    "trade_ts": pd.to_datetime([
        "2026-06-04 09:30:01.005",
        "2026-06-04 09:30:01.013",
        "2026-06-04 09:30:01.007",
    ]),
    "qty": [100, 50, 200],
})

quotes = pd.DataFrame({
    "symbol":   ["AAPL","AAPL","MSFT","MSFT"],
    "quote_ts": pd.to_datetime([
        "2026-06-04 09:30:01.000",
        "2026-06-04 09:30:01.010",
        "2026-06-04 09:30:01.000",
        "2026-06-04 09:30:01.008",
    ]),
    "bid": [189.95, 189.97, 425.00, 425.03],
    "ask": [190.05, 190.04, 425.10, 425.12],
})

# Both sides must be sorted by the merge key
trades = trades.sort_values("trade_ts")
quotes = quotes.sort_values("quote_ts")

joined = pd.merge_asof(
    trades,
    quotes,
    left_on="trade_ts",
    right_on="quote_ts",
    by="symbol",                       # match per-symbol
    direction="backward",              # snap to ≤ quote_ts
    tolerance=pd.Timedelta("1s"),      # ignore stale quotes
)

Step-by-step explanation.

Both dataframes are sorted by the time key (merge_asof requires this — it doesn't sort for you).
For each row in trades, merge_asof finds the row in quotes with the largest quote_ts ≤ trade_ts and the same symbol.
direction="backward" is the default — snap to the most recent prior tick. Other choices: "forward" (next tick) or "nearest".
by="symbol" partitions both sides — AAPL trades only match AAPL quotes; MSFT only matches MSFT.
tolerance=pd.Timedelta("1s") rejects matches older than 1 second — better to emit NaN than to snap to a quote that's no longer relevant.

Output.

symbol	trade_ts	qty	quote_ts	bid	ask
AAPL	09:30:01.005	100	09:30:01.000	189.95	190.05
AAPL	09:30:01.013	50	09:30:01.010	189.97	190.04
MSFT	09:30:01.007	200	09:30:01.000	425.00	425.10

Rule of thumb. Whenever a SQL solution would start with LATERAL JOIN ... ORDER BY ... LIMIT 1, the Pandas equivalent is merge_asof. Always sort both inputs first; always pass by= when you have a partitioning column like symbol or device_id.

Pandas interview question on safe joins

A senior interviewer often shapes this round as: "You merge a 1M-row orders table with a 50k-row customers table and end up with 1.4M rows. What went wrong?" — testing whether the candidate uses validate= to assert join cardinality.

Solution Using `merge(..., validate="many_to_one")`

import pandas as pd

orders = pd.DataFrame({
    "order_id":   [1,2,3,4,5],
    "customer_id":[10,10,20,20,30],
    "amount":     [50,75,200,30,90],
})

# customers table has a (silent) duplicate on customer_id=10
customers = pd.DataFrame({
    "customer_id":[10,10,20,30],
    "name":       ["Ann","Ann (dup)","Bob","Cara"],
})

# This will raise — orders→customers should be many_to_one but isn't
try:
    bad = orders.merge(customers, on="customer_id", how="left",
                       validate="many_to_one")
except Exception as e:
    print("validate caught the bug:", e)

# Fix the dup first
customers_clean = customers.drop_duplicates(subset="customer_id", keep="first")

good = orders.merge(customers_clean, on="customer_id", how="left",
                    validate="many_to_one")

Step-by-step trace.

Step	Action	Result
1	`orders.merge(customers, ..., validate="many_to_one")`	pandas checks every right-side key appears at most once
2	`customer_id=10` appears twice in `customers`	validation raises `MergeError: Merge keys are not unique in right dataset`
3	`customers.drop_duplicates(subset="customer_id", keep="first")`	dedup before merging
4	retry with `validate="many_to_one"`	passes; resulting dataframe has exactly 5 rows

Output:

order_id	customer_id	amount	name
1	10	50	Ann
2	10	75	Ann
3	20	200	Bob
4	20	30	Bob
5	30	90	Cara

Why this works — concept by concept:

validate= semantics — the four values "one_to_one", "one_to_many", "many_to_one", "many_to_many" make Pandas assert the cardinality of the join. The check runs once, at merge time, before the cross-product blows up.
Fan-out bugs — the classic "join inflated my row count" bug happens when one side has accidental duplicates. validate catches it; without it, the bug ships silently to production and the BI tool double-counts revenue.
drop_duplicates(subset=, keep=) — explicit dedup with keep="first" (or "last") is the right fix; never silently rely on Pandas to "do the right thing" on duplicate join keys.
SQL equivalent — LEFT JOIN customers c ON o.customer_id = c.customer_id plus a primary-key constraint on customers.customer_id. validate= is Pandas's substitute for the missing PK constraint.
Cost — validate adds an O(N + M) uniqueness check; with the cardinality guarantee, the merge itself is O(N + M) for a hash-join. Cheap insurance.

Python
Topic — joins
Join cardinality and merge problems

Practice →

Worked example — anti-join via `indicator=True`

Detailed explanation. Anti-join = "left rows whose key is not in right." Pandas doesn't have a how="anti", so the canonical idiom is merge(..., how="outer", indicator=True) + filter on _merge == "left_only". It's verbose but explicit, and senior interviewers expect you to know it.

Question. Find every order whose customer_id does not appear in the active_customers table.

Code.

import pandas as pd

orders = pd.DataFrame({
    "order_id":   [1,2,3,4],
    "customer_id":[10,20,30,40],
})

active = pd.DataFrame({"customer_id": [10, 30]})

result = (
    orders.merge(active, on="customer_id", how="left", indicator=True)
          .query("_merge == 'left_only'")
          .drop(columns="_merge")
)

Step-by-step explanation.

merge(..., how="left", indicator=True) keeps every order row and adds a _merge column.
_merge is "both" for orders whose customer_id matched, "left_only" for orders whose customer_id did not.
.query("_merge == 'left_only'") keeps only the unmatched orders.
.drop(columns="_merge") removes the helper column.

Output.

order_id	customer_id
2	20
4	40

Rule of thumb. Use indicator=True whenever you need to know why rows survived a join. For pure anti-join, the _merge == "left_only" idiom is the canonical Pandas pattern.

SQL
Topic — anti-join (SQL)
Anti-join and exists problems (SQL)

Practice →

5. DuckDB migration — SQL on dataframes

`duckdb pandas` is the 2026 escape hatch — query a Pandas dataframe with SQL, in the same Python process

The mental model in one line: DuckDB is an embedded columnar analytic database that runs IN your Python process, can read a Pandas DataFrame directly (zero-copy), and runs 10–100× faster than Pandas on aggregate-heavy queries. You don't replace Pandas — you escalate the heavy steps to DuckDB SQL and keep Pandas for the small steps before and after.

Why DuckDB pairs so well with Pandas.

Zero-copy query of a DataFrame. duckdb.query("SELECT ... FROM my_df") registers the dataframe as a virtual table; DuckDB reads it directly via Arrow.
Native readers. read_csv_auto, read_parquet, read_json_auto, the httpfs extension for S3 — DuckDB can replace pd.read_csv and pd.read_parquet in one line.
Columnar vectorised engine. Built around Arrow-style columnar batches. Aggregates and joins are 10–100× faster than Pandas on the same hardware.
.df() on the result. Every query result has a .df() method that returns a normal Pandas DataFrame — the round-trip is trivial.

When to drop Pandas and use DuckDB instead.

> 10 GB datasets. Pandas materialises the whole dataframe; DuckDB streams over Parquet/CSV without loading everything.
Complex multi-table joins. SQL JOINs over 3+ tables are clearer than nested .merge(...) chains.
Aggregate-heavy queries. GROUP BY + window functions are 10–100× faster.
You want SQL anyway. If the team thinks in SQL, ship SQL — don't translate to Pandas.

When to stay in Pandas.

The dataframe fits in RAM and the operation is a one-liner.
You're handing the dataframe to scikit-learn, matplotlib, or another Python library that expects a Pandas API.
Iterative exploration in Jupyter where you're tweaking one step at a time.

Worked example — migrate a slow Pandas pipeline to DuckDB

Detailed explanation. A team has a Pandas pipeline that reads 10 GB of parquet, joins three tables, groups by region/month, and writes the result. The pipeline takes 10 minutes, needs 32 GB of RAM, and OOMs roughly once a week. Migration target: the same pipeline in DuckDB SQL, same outputs, dramatically faster and smaller memory footprint.

Question. Migrate the following Pandas pipeline to DuckDB SQL:

orders   = pd.read_parquet("s3://lake/orders/")        # 10M rows
items    = pd.read_parquet("s3://lake/order_items/")   # 50M rows
products = pd.read_parquet("s3://lake/products/")      # 100k rows

joined = (orders
          .merge(items,    on="order_id")
          .merge(products, on="product_id"))

joined["month"] = pd.to_datetime(joined["order_ts"]).dt.to_period("M").astype(str)

result = (joined.groupby(["month","region","category"], observed=True)
                .agg(gross_sales=("amount","sum"),
                     orders=("order_id","nunique"))
                .reset_index())

Code (the DuckDB equivalent).

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")            # S3 reader

result = con.execute("""
    SELECT
        strftime(o.order_ts, '%Y-%m')           AS month,
        o.region,
        p.category,
        SUM(i.amount)                           AS gross_sales,
        COUNT(DISTINCT o.order_id)              AS orders
    FROM read_parquet('s3://lake/orders/*.parquet')        AS o
    JOIN read_parquet('s3://lake/order_items/*.parquet')   AS i USING (order_id)
    JOIN read_parquet('s3://lake/products/*.parquet')      AS p USING (product_id)
    GROUP BY 1, 2, 3
    ORDER BY 1, 2, 3
""").df()                                              # ← back to Pandas

Step-by-step explanation.

duckdb.connect() opens an in-process database — no separate server, no docker, nothing to install beyond pip install duckdb.
INSTALL httpfs; LOAD httpfs; enables S3 reads on the fly.
read_parquet('s3://...*.parquet') is a DuckDB native reader — it streams the files columnar without materialising them into memory.
The three-table join, the month bucket, and the group-by are one SQL query — DuckDB plans them together, pushes filters down, and executes vectorised.
.df() on the result returns a Pandas DataFrame — the rest of the pipeline (plotting, downstream ML) doesn't change at all.

Output (comparison).

Metric	Pandas pipeline	DuckDB pipeline
Runtime	~10 min	~30 sec
Peak RAM	~32 GB	~4 GB
Lines of code	~12	~10
Failure mode	`MemoryError` once a week	none observed
Output	identical	identical

Rule of thumb. When the Pandas pipeline does heavy joins or heavy aggregates and the data is bigger than 1 GB, rewrite the heavy middle as a DuckDB SQL string and .df() back to Pandas at the boundary. Don't migrate everything at once — migrate the bottleneck step.

Pandas interview question on choosing Pandas vs DuckDB

A common senior probe: "You have a 2 GB Parquet dataset. You need to compute monthly cohort retention. Pandas or DuckDB?" The candidate is expected to reach for DuckDB because (a) cohort retention is heavy on joins and aggregates, (b) DuckDB streams Parquet without materialising it, (c) .df() at the end gives the same Pandas-shaped result.

Solution Using DuckDB SQL with `query(df)` mode

import duckdb
import pandas as pd

# Already-loaded Pandas dataframes — DuckDB queries them in-place
events = pd.read_parquet("events.parquet")          # 2 GB
users  = pd.read_parquet("users.parquet")           # 50 MB

cohort = duckdb.query("""
    WITH first_purchase AS (
      SELECT user_id,
             MIN(date_trunc('month', event_ts)) AS cohort_month
      FROM events
      WHERE event_type = 'purchase'
      GROUP BY user_id
    ),
    monthly_active AS (
      SELECT user_id,
             date_trunc('month', event_ts) AS active_month
      FROM events
      GROUP BY 1, 2
    )
    SELECT fp.cohort_month,
           m.active_month,
           COUNT(DISTINCT m.user_id) AS active_users
    FROM first_purchase fp
    JOIN monthly_active m USING (user_id)
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()

Step-by-step trace.

Step	Action	Result
1	`events` and `users` are Pandas DataFrames	columnar (Arrow-backed) in memory
2	`duckdb.query("SELECT ... FROM events ...")`	DuckDB sees `events` as a virtual table by name
3	DuckDB plans the CTE + join + group-by	vectorised columnar plan
4	DuckDB streams the result rows	no full materialisation in Python
5	`.df()`	returns a Pandas DataFrame indexed normally

Output (sketch — first few rows):

cohort_month	active_month	active_users
2026-01	2026-01	12000
2026-01	2026-02	8400
2026-01	2026-03	6300
2026-02	2026-02	14500
2026-02	2026-03	9700

Why this works — concept by concept:

Zero-copy query — DuckDB reads the columns of a Pandas DataFrame through the Arrow C interface; no data is copied between Pandas's memory and DuckDB's memory.
CTEs — the WITH first_purchase AS (...), WITH monthly_active AS (...) pattern reads top-down and is dramatically clearer than the equivalent Pandas chain (groupby, merge, groupby again).
Vectorised engine — DuckDB processes batches of 2,048 rows at a time, using SIMD where available. Aggregates and joins are 10–100× faster than the equivalent Pandas operations.
.df() boundary — the result is a normal Pandas DataFrame, so plotting, scikit-learn, or downstream code doesn't change. The migration touches only the heavy middle, not the edges.
Cost — query = O(rows) with constant memory; output dataframe = O(distinct(cohort, month)) rows. Compare to the Pandas-only version which would peak at O(rows × columns) memory and be 10–100× slower at the GROUP BY.

SQL
Topic — SQL generation
SQL on dataframes — DuckDB-style problems

Practice →

Worked example — DuckDB UNPIVOT for the `melt` equivalent

Detailed explanation. DuckDB supports UNPIVOT natively. Senior interviews sometimes probe "how would you write pandas.melt in pure SQL?" — DuckDB's answer is short and elegant.

Question. Given a wide quarterly KPIs table in DuckDB, produce the long version with columns region, quarter, value.

Code (DuckDB SQL).

SELECT region, quarter, value
FROM (
  SELECT * FROM wide_kpis
) UNPIVOT (
  value FOR quarter IN (Q1, Q2, Q3, Q4)
)
ORDER BY region, quarter;

Step-by-step explanation.

The inner subquery returns the wide table as-is.
UNPIVOT (value FOR quarter IN (Q1, Q2, Q3, Q4)) rotates the four named columns into rows. Each output row gets a quarter value (the original column name) and a value (the original cell value).
The region column passes through untouched because it isn't in the IN (...) list.

Output.

region	quarter	value
APAC	Q1	80
APAC	Q2	95
APAC	Q3	110
APAC	Q4	130
EU	Q1	120
EU	Q2	140
...	...	...

Rule of thumb. DuckDB's PIVOT / UNPIVOT syntax is one of the cleanest in any SQL dialect. When migrating a pandas.melt step to DuckDB, reach for UNPIVOT rather than UNION ALL boilerplate.

SQL
Topic — aggregation (SQL)
Pivot / unpivot SQL problems

Practice →

Cheat sheet — Pandas to SQL to DuckDB

A side-by-side mapping for the operations that show up in every pandas data engineering interview round. Memorise this table and you can fluently translate any pipeline between the three engines.

Intent	Pandas	SQL (ANSI)	DuckDB (extension)
Group + reduce	`df.groupby(k).agg("sum")`	`SELECT k, SUM(c) FROM t GROUP BY k`	same as ANSI
Window aggregate	`df.groupby(k)[c].transform("sum")`	`SUM(c) OVER (PARTITION BY k)`	same as ANSI
Inner join	`df1.merge(df2, on="k")`	`t1 JOIN t2 USING (k)`	same as ANSI
Left join	`df1.merge(df2, on="k", how="left")`	`t1 LEFT JOIN t2 USING (k)`	same as ANSI
Anti join	`merge(..., how="outer", indicator=True)` + filter	`WHERE NOT EXISTS (SELECT 1 FROM t2 ...)`	same as ANSI
Time-window join	`pd.merge_asof(...)`	`LATERAL JOIN ... ORDER BY ts DESC LIMIT 1`	`ASOF JOIN ... ON ... AND ts1 >= ts2`
Pivot (long → wide)	`df.pivot_table(...)`	`SUM(CASE WHEN col='x' THEN val END) ...`	`PIVOT ... ON ... USING SUM(...)`
Unpivot (wide → long)	`df.melt(...)`	`UNION ALL` per column	`UNPIVOT (val FOR col IN (...))`
Top-N per group	`df.groupby(k).head(n)`	`ROW_NUMBER() OVER ... WHERE rn <= n`	same as ANSI
Read Parquet	`pd.read_parquet(p)`	(engine-specific COPY)	`read_parquet('p')`
Read CSV	`pd.read_csv(p)`	`COPY ... FROM ...`	`read_csv_auto('p')`
Write to SQL	`df.to_sql(...)`	n/a	`COPY (SELECT ...) TO 'out.parquet'`

Three patterns interviewers expect you to translate fluently.

Group-by + named agg — df.groupby("k").agg(s=("c","sum")) ↔ SELECT k, SUM(c) AS s FROM t GROUP BY k. Same intent, same shape; the named-agg form is the production-safe Pandas spelling.
Window function — df.groupby("k")[c].transform("rank") ↔ RANK() OVER (PARTITION BY k ORDER BY c). Pandas transform keeps the row count; SQL window functions keep the row count too.
Anti-join — df1.merge(df2, ..., indicator=True).query("_merge == 'left_only'") ↔ LEFT JOIN ... WHERE t2.k IS NULL or WHERE NOT EXISTS (...). SQL has two natural spellings; Pandas has the one verbose idiom.

Frequently asked questions

Pandas vs DuckDB — which one should I use?

Use Pandas for sub-10 GB workloads, glue code, ad-hoc Jupyter exploration, and any step that needs scikit-learn, matplotlib, or another Python library that expects a DataFrame in-hand. Use DuckDB for the heavy steps — multi-table joins, aggregates, window functions over large Parquet files — and especially when the dataset is bigger than your RAM. The 2026 production pattern is both at once: duckdb.query("SELECT ... FROM my_df").df() runs SQL on a Pandas DataFrame in-place, with zero copy, and returns a new DataFrame.

Is Pandas dead in 2026?

No — Pandas is still the most-installed Python data library and still wins the "glue code + small data + sklearn" tier outright. The competitive pressure from Polars and DuckDB has, if anything, made Pandas better: Pandas 2.x added an optional PyArrow backend that closes much of the throughput gap, the named-aggregation syntax has matured, and the ecosystem (Modin, cuDF, DuckDB) makes it trivial to scale out the heavy steps. "Pandas is dead" is a meme; "Pandas should not be the engine for your 100 GB job" is correct.

`pandas.merge` vs `pandas.join` — what's the difference?

pd.merge(left, right, on=, how=) is the general-purpose SQL-style join — it joins on column values and is the right answer 95% of the time. df.join(other, how=) is a convenience method that joins on the index of both dataframes; it's nice when you've already indexed both sides on the same key, but in production code most teams use merge everywhere for consistency. Avoid mixing .join and .merge in the same file — it makes the code harder to scan.

`merge_asof` — when do I actually need it?

Every time you'd write LATERAL JOIN ... ORDER BY ts DESC LIMIT 1 in SQL, that's merge_asof in Pandas. The canonical use cases are: snapping each trade to the most recent quote tick, snapping each sensor reading to the nearest weather observation, snapping each click to the most recent ad impression. Always sort both inputs first (it doesn't sort for you), always pass by= when there's a partition column like symbol or device_id, and always set tolerance= so stale matches become NaN instead of silently joining to old data.

What's the practical memory footprint of Pandas — how big can it get?

A good rule of thumb: Pandas needs 3–5× the on-disk size of the data because of Python object overhead (especially for object-dtype strings), index structures, and intermediate copies during operations. A 2 GB Parquet file often consumes 10–15 GB of RAM when fully loaded. Mitigations include reading with explicit dtype (categoricals, int32/int8, datetime), reading only the columns you need (usecols=), the Pandas 2.x PyArrow backend (pd.read_parquet(..., dtype_backend="pyarrow")), and chunked reads with chunksize=. Beyond 10 GB single-machine, escalate to DuckDB or Polars.

Polars vs Pandas — which is faster, and when does it matter?

Polars is 5–30× faster than Pandas on aggregate-heavy and join-heavy workloads, with a smaller memory footprint thanks to its columnar lazy execution engine. The trade-off is API divergence — Polars uses pl.col("x").sum() expressions, not df["x"].sum(). The pragmatic 2026 answer for teams: keep Pandas for glue and library hand-off; reach for Polars or DuckDB SQL when the pipeline becomes pure dataframe ops and Pandas is the bottleneck. Polars is the right "I want a faster Pandas" answer; DuckDB is the right "I want SQL on my dataframe" answer.

Practice on PipeCode

Drill the Python practice library → for end-to-end Pandas pipeline questions on reshape, groupby, and merge.
Rehearse aggregation drills in Python → when the interviewer wants groupby + agg + transform cleanly.
Sharpen join problems in Python → for merge, merge_asof, and anti-join idioms.
For the SQL side of every Pandas operation, work through SQL aggregation problems → and SQL window function problems →.
For end-to-end ETL muscle, try ETL pipeline problems →.
Reinforce the fundamentals with the Python for Data Engineering Interviews course →.
For the design-round muscles, work through ETL system design for DE interviews →.
To pair Pandas reshaping with table modelling, browse data modelling for DE interviews →.
For the broader interview surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.

Pipecode.ai is Leetcode for Data Engineering — every Pandas, SQL, and DuckDB concept above ships with hands-on practice rooms where you reshape real dataframes, write real merges, and migrate real Pandas pipelines to DuckDB. Start with the Python library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice Python now →
ETL pipeline drills →

1. Why Pandas still ships in production DE pipelines

Pandas is the duct-tape layer between every API, file, and database in a modern Python pipeline

Worked example — design a Pandas pipeline for monthly KPI rollup

Pandas interview question on monthly KPI rollup

Solution Using groupby + named agg + to_sql

2. Reshape primitives — melt + pivot_table + stack/unstack

pandas melt is the wide-to-long transformation every downstream join, groupby, and chart needs

Worked example — reshape a monthly KPI report wide ↔ long

Pandas interview question on pivot_table vs pivot

Solution Using pivot_table with explicit aggfunc for duplicate keys

Worked example — wide_to_long for xN column patterns

3. groupby — agg, transform, apply

pandas groupby has three output shapes — agg compresses, transform broadcasts, apply does anything

Worked example — per-customer rolling 7-day average

Pandas interview question on transform vs apply

Solution Using groupby.transform("sum") for the per-group denominator

Worked example — groupby.filter to drop low-volume groups

4. merge — the seven join types

pandas merge is the full SQL JOIN family plus two superpowers (merge_asof, merge_ordered)

Worked example — merge_asof for trades + quotes time-window join

Pandas interview question on safe joins

Solution Using merge(..., validate="many_to_one")

Worked example — anti-join via indicator=True

5. DuckDB migration — SQL on dataframes

duckdb pandas is the 2026 escape hatch — query a Pandas dataframe with SQL, in the same Python process

Worked example — migrate a slow Pandas pipeline to DuckDB

Pandas interview question on choosing Pandas vs DuckDB

Solution Using DuckDB SQL with query(df) mode

Worked example — DuckDB UNPIVOT for the melt equivalent

Cheat sheet — Pandas to SQL to DuckDB

Frequently asked questions

Pandas vs DuckDB — which one should I use?

Is Pandas dead in 2026?

pandas.merge vs pandas.join — what's the difference?

merge_asof — when do I actually need it?

What's the practical memory footprint of Pandas — how big can it get?

Polars vs Pandas — which is faster, and when does it matter?

Practice on PipeCode

Solution Using `groupby` + named `agg` + `to_sql`

`pandas melt` is the wide-to-long transformation every downstream join, groupby, and chart needs

Pandas interview question on `pivot_table` vs `pivot`

Solution Using `pivot_table` with explicit `aggfunc` for duplicate keys

Worked example — `wide_to_long` for `xN` column patterns

`pandas groupby` has three output shapes — agg compresses, transform broadcasts, apply does anything

Pandas interview question on `transform` vs `apply`

Solution Using `groupby.transform("sum")` for the per-group denominator

Worked example — `groupby.filter` to drop low-volume groups

`pandas merge` is the full SQL JOIN family plus two superpowers (`merge_asof`, `merge_ordered`)

Worked example — `merge_asof` for trades + quotes time-window join

Solution Using `merge(..., validate="many_to_one")`

Worked example — anti-join via `indicator=True`

`duckdb pandas` is the 2026 escape hatch — query a Pandas dataframe with SQL, in the same Python process

Solution Using DuckDB SQL with `query(df)` mode

Worked example — DuckDB UNPIVOT for the `melt` equivalent

`pandas.merge` vs `pandas.join` — what's the difference?

`merge_asof` — when do I actually need it?