pandas melt is the single most useful reshape primitive in a data engineer's toolbox — it is the "untidy wide table → tidy long table" transformation that lets every downstream groupby, merge, and SQL UNPIVOT actually work. Five other primitives — pivot_table, stack/unstack, groupby.agg / transform / apply, and merge (with its six join types plus the underrated merge_asof) — round out the pandas data engineering core, and duckdb pandas is the 2026 escape hatch for when your dataframe stops fitting in RAM.
This guide walks the five primitives that show up most in production Python ETL — pandas pivot, pandas groupby, pandas merge, the DuckDB migration pattern (duckdb.query(df) SQL directly on a dataframe), plus a side-by-side cheat sheet that maps every Pandas operation to its SQL and DuckDB equivalent. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll know when to reach for melt over stack, when transform beats apply by 10×, when to drop Pandas entirely for duckdb.sql(), and how to use pandas to sql patterns to land dataframes in your warehouse without rewriting the pipeline.
When you want hands-on reps immediately after reading, browse Python practice library →, drill aggregation drills in Python →, and rehearse on ETL pipeline problems →.
On this page
- Why Pandas still ships in production DE pipelines
- Reshape primitives — melt + pivot_table + stack/unstack
- groupby — agg, transform, apply
- merge — the seven join types
- DuckDB migration — SQL on dataframes
- Cheat sheet — Pandas to SQL to DuckDB
- Frequently asked questions
- Practice on PipeCode
1. Why Pandas still ships in production DE pipelines
Pandas is the duct-tape layer between every API, file, and database in a modern Python pipeline
The one-sentence invariant: Pandas is an in-memory, columnar dataframe library that excels at single-machine workloads up to roughly 10 GB, with the richest ecosystem of file readers, SQL connectors, API clients, and visualisation hooks in the Python data world. Once you internalise that mental model — fast in RAM, hopeless at scale, but the lingua franca of every Python data tool — the rest of the pandas data engineering decision space becomes a clean trade-off: stay in Pandas for the small steps, escalate to DuckDB, Polars, or Spark for the big ones.
Pandas vs Polars vs DuckDB — when each one wins.
- Pandas. Default for sub-10 GB workloads. Universal API surface — every library on PyPI speaks Pandas. Best when you need to call ML libraries, plot, or hand a dataframe to a downstream script.
- Polars. Columnar, lazy, multi-threaded. 5–30× faster than Pandas on aggregate-heavy workloads. Use when the pipeline is pure dataframe ops and you've outgrown Pandas but still need single-machine.
- DuckDB. Embedded analytic SQL engine. Runs IN the same Python process, queries Pandas dataframes directly, scans Parquet/CSV/S3 natively. Use when SQL is the cleaner expression of the pipeline, or when you need to scale past Pandas without leaving Python.
- Spark / PySpark. Distributed. Use when single-machine truly isn't enough (> 100 GB) and you can pay the cluster cost.
The 2026 lightweight ETL stack.
- Pandas + DuckDB has eaten roughly half of the "small Spark cluster" use cases. The pattern is simple: read with Pandas (or DuckDB) → DuckDB-SQL for the heavy joins and aggregates → Pandas for the small downstream steps that need scikit-learn, matplotlib, or a Django ORM hand-off.
- DuckDB queries a Pandas dataframe with zero copy — the DataFrame's Arrow-backed columns are read directly by DuckDB's vectorised engine.
- Pandas 2.x adopts a PyArrow backend optionally — narrowing the performance gap with Polars and DuckDB without changing the API.
The four production roles Pandas still owns.
- Glue. The pipeline reads JSON from an API, joins against a CSV, writes to Postgres — Pandas covers all three in twenty lines.
-
Quick reshape.
melt,pivot_table,stack/unstackare easier to write and read than the SQL equivalents. - Ad-hoc exploration. Jupyter + Pandas is still the fastest path from "give me the answer" to "I have the answer."
- Tail-end of an ETL. After a big Spark/DuckDB job collapses 100 GB into a 50 MB summary, Pandas takes over for plotting, alerts, and ML feature prep.
What interviewers listen for.
- Do you reach for
groupby+agginstead of a Pythonforloop? — minimum signal. - Do you know the difference between
pivot(no aggregation) andpivot_table(with aggregation)? — required answer. - Do you mention
merge_asofwhen a time-window join comes up? — senior signal. - Do you propose DuckDB when the dataset gets large or the SQL is cleaner? — current-stack signal.
Worked example — design a Pandas pipeline for monthly KPI rollup
Detailed explanation. Realistic Pandas pipelines start from the schema ("we get one CSV per region per day"), the transformation ("roll up to monthly KPIs by region and product"), and the output target ("write to a Postgres table the BI tool reads"). Pandas wins here because all three steps live in one process, one file, no infra.
Question. Design a Pandas pipeline that ingests daily sales CSVs from S3, computes monthly KPIs (gross_sales, units, orders) per (region, product), and writes the result to a Postgres monthly_kpis table. Sketch the imports, the read step, the transform step, and the to_sql step.
Code.
import pandas as pd
from sqlalchemy import create_engine
import s3fs
# 1. Read — one call per partitioned prefix
df = pd.read_csv(
"s3://sales-raw/2026/05/*.csv",
parse_dates=["order_ts"],
dtype={"region": "category", "product_id": "category", "units": "int32"},
)
# 2. Transform — monthly bucket + groupby + agg
df["month"] = df["order_ts"].dt.to_period("M")
kpis = (
df.groupby(["month", "region", "product_id"], observed=True)
.agg(gross_sales=("amount", "sum"),
units=("units", "sum"),
orders=("order_id", "nunique"))
.reset_index()
)
# 3. Write — Pandas to SQL (Postgres)
engine = create_engine("postgresql+psycopg2://etl:***@db:5432/warehouse")
kpis.to_sql("monthly_kpis", engine, if_exists="append",
method="multi", chunksize=5_000, index=False)
Step-by-step explanation.
-
Read with explicit
dtype. Passingdtype={"region": "category", ...}shrinks the dataframe by 5–10× compared to defaultobjectcolumns.parse_datesdoes the date-string-to-datetime64work at read time, not later. -
Monthly bucket via
dt.to_period("M"). Cheaper thandt.strftimebecause it stays in pandas Period dtype, which compares and groups correctly. -
groupby(..., observed=True). With categorical group keys,observed=Trueskips empty (region, product) combinations — both faster and what the BI tool actually wants. -
agg(...)with named aggregations. Thegross_sales=("amount", "sum")syntax names the output column explicitly — no MultiIndex column mess. -
to_sql(..., method="multi", chunksize=5_000).method="multi"batchesINSERTstatements;chunksize=5_000caps each batch's parameter count so the driver doesn't OOM.
Output.
| Decision | Value | Why |
|---|---|---|
| Read |
pd.read_csv glob on S3 |
one call, partition-friendly |
| Categorical dtypes |
region, product_id
|
5-10× memory savings |
| Month key | dt.to_period("M") |
proper grouping dtype |
| GroupBy | observed=True |
skip empty cells |
| Agg | named aggregations | clean output columns |
| Write | to_sql(method="multi") |
batched INSERTs |
Rule of thumb. Pandas pipelines should fit in a single function. If you need more than ~150 lines or you're already worrying about RAM, escalate one step — usually DuckDB for the heavy transforms, Pandas at the tail.
Pandas interview question on monthly KPI rollup
A common warm-up: "Walk me through a Pandas pipeline that reads daily sales CSVs, computes monthly KPIs per region and product, and writes the result to a SQL warehouse." It exercises read_csv, dtype hygiene, groupby with observed, named agg, and to_sql — the five muscles every pandas data engineering opening round wants to see.
Solution Using groupby + named agg + to_sql
import pandas as pd
from sqlalchemy import create_engine
df = pd.read_csv("s3://sales-raw/2026/05/*.csv",
parse_dates=["order_ts"],
dtype={"region": "category", "product_id": "category"})
df["month"] = df["order_ts"].dt.to_period("M").astype(str)
kpis = (
df.groupby(["month", "region", "product_id"], observed=True)
.agg(gross_sales=("amount", "sum"),
units=("units", "sum"),
orders=("order_id", "nunique"))
.reset_index()
)
engine = create_engine("postgresql+psycopg2://etl:***@db:5432/warehouse")
kpis.to_sql("monthly_kpis", engine, if_exists="append",
method="multi", chunksize=5_000, index=False)
Step-by-step trace.
Input rows (excerpt from df):
| order_id | order_ts | region | product_id | amount | units |
|---|---|---|---|---|---|
| 1 | 2026-05-02 | EU | A | 120.0 | 2 |
| 2 | 2026-05-04 | EU | A | 60.0 | 1 |
| 3 | 2026-05-09 | EU | B | 200.0 | 4 |
| 4 | 2026-05-11 | US | A | 80.0 | 1 |
| 5 | 2026-05-21 | US | A | 80.0 | 1 |
Trace:
-
df["month"] = ...adds"2026-05"to every row. -
groupby(["month","region","product_id"], observed=True)creates three groups: (2026-05, EU, A), (2026-05, EU, B), (2026-05, US, A). -
agg(gross_sales=("amount","sum"), units=("units","sum"), orders=("order_id","nunique"))reduces each group to one row. -
reset_index()flattens the MultiIndex back to plain columns. -
to_sql(...)issues batchedINSERT INTO monthly_kpis VALUES (...), (...)statements viamethod="multi".
Output:
| month | region | product_id | gross_sales | units | orders |
|---|---|---|---|---|---|
| 2026-05 | EU | A | 180.0 | 3 | 2 |
| 2026-05 | EU | B | 200.0 | 4 | 1 |
| 2026-05 | US | A | 160.0 | 2 | 2 |
Why this works — concept by concept:
-
Categorical dtypes —
regionandproduct_idascategorystore the column as an int8/int16 code into a small dictionary; memory drops 5–10× and group operations are vectorised on the codes. -
dt.to_period("M")— converts a datetime to a month-period dtype that groups and compares correctly without string acrobatics. -
groupby(..., observed=True)— by default, categorical groupby produces a row for every Cartesian combo of categories, including empty ones;observed=Trueonly emits rows that actually exist in the data. -
Named aggregations — the
output=("col","func")form names the output explicitly, so the resulting dataframe has flat columns and not the legacy('amount','sum')MultiIndex tuples. -
to_sql(method="multi", chunksize=...)— Pandas batchesINSERTstatements into one round-trip per chunk; without it, each row would be its own INSERT and the writer would be 50–100× slower. - Cost — read = O(rows); groupby = O(rows) with O(groups) memory; write = O(rows / chunksize) DB round-trips.
Python
Topic — ETL
ETL pipeline problems (Pandas, groupby, to_sql)
2. Reshape primitives — melt + pivot_table + stack/unstack
pandas melt is the wide-to-long transformation every downstream join, groupby, and chart needs
The mental model in one line: wide tables are for humans (one row per entity, one column per attribute), long tables are for computers (one row per (entity, attribute) pair) — melt converts wide → long, pivot_table converts long → wide with aggregation, stack/unstack rotate between row-index and column-index. Once you internalise that, the entire pandas melt / pandas pivot family collapses into "which direction am I going, and do I need aggregation on the way?"
The three reshape primitives every interview probes.
-
melt(id_vars=..., value_vars=..., var_name=..., value_name=...)— wide → long. Theid_varsare kept as-is; everyvalue_varscolumn becomes a row labelled with its column name invar_nameand value invalue_name. -
pivot_table(index=..., columns=..., values=..., aggfunc=...)— long → wide with aggregation. If two rows collide on the (index, columns) pair,aggfuncresolves them (defaultmean; usually you wantsumfor additive metrics). -
stack()/unstack()— rotate the last level of column index into the row index, or vice versa. Most useful with a MultiIndex on rows or columns.
Three more reshape moves senior interviews quietly love.
-
pivot(...)(no_table) — same shape transform aspivot_tablebut errors on duplicate (index, columns) — useful as an assertion. -
wide_to_long(stubnames=..., i=..., j=...)—melton steroids for column patterns likex_2024, x_2025, x_2026. -
explode(column)— list-valued column → one row per list element. Common after JSON ingestion.
SQL equivalents.
-
melt≅UNPIVOT(SQL Server, Snowflake) orcross join unnest(array[...])(Postgres-style). -
pivot_table≅GROUP BY ... PIVOT (...)(Snowflake, DuckDB) orGROUP BY + conditional SUM(CASE WHEN ...)everywhere else. -
stack/unstackhave no clean SQL analog — they're a hierarchical-index trick that exists because Pandas has MultiIndex.
Worked example — reshape a monthly KPI report wide ↔ long
Detailed explanation. Analysts often receive monthly KPI tables in wide format — one row per region, columns Q1, Q2, Q3, Q4. Joining that against a long "targets" table (one row per (region, quarter)) requires melting the wide table first; producing the executive summary requires pivoting back.
Question. Given a wide quarterly KPIs table, (a) melt it to long, (b) join against the targets long table, (c) pivot back to wide so the final executive report has one row per region.
Input (wide KPIs).
| region | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| EU | 120 | 140 | 155 | 170 |
| US | 200 | 215 | 230 | 250 |
| APAC | 80 | 95 | 110 | 130 |
Input (long targets).
| region | quarter | target |
|---|---|---|
| EU | Q1 | 100 |
| EU | Q2 | 130 |
| US | Q1 | 180 |
| US | Q2 | 210 |
| APAC | Q1 | 70 |
| APAC | Q2 | 90 |
Code.
import pandas as pd
wide = pd.DataFrame({
"region": ["EU", "US", "APAC"],
"Q1": [120, 200, 80],
"Q2": [140, 215, 95],
"Q3": [155, 230, 110],
"Q4": [170, 250, 130],
})
# (a) melt wide → long
long = wide.melt(id_vars="region",
value_vars=["Q1", "Q2", "Q3", "Q4"],
var_name="quarter",
value_name="actual")
# (b) join against the long targets
targets = pd.DataFrame({
"region": ["EU","EU","US","US","APAC","APAC"],
"quarter": ["Q1","Q2","Q1","Q2","Q1","Q2"],
"target": [100, 130, 180, 210, 70, 90],
})
joined = long.merge(targets, on=["region", "quarter"], how="left")
joined["gap"] = joined["actual"] - joined["target"]
# (c) pivot back — one row per region, one column per quarter, value = gap
report = joined.pivot_table(index="region",
columns="quarter",
values="gap",
aggfunc="sum")
Step-by-step explanation.
-
wide.melt(id_vars="region", value_vars=["Q1","Q2","Q3","Q4"])produces 3 × 4 = 12 rows, one per (region, quarter) pair. The melted dataframe has three columns:region,quarter,actual. -
long.merge(targets, on=["region","quarter"], how="left")enriches every long row with its matching target (orNaNfor Q3/Q4 which aren't intargets). -
joined["gap"] = ...adds theactual - targetdelta column. Q3 and Q4 rows haveNaNgap. -
joined.pivot_table(index="region", columns="quarter", values="gap", aggfunc="sum")rotates back to wide. Each region is one row; each quarter is one column; the cell value is the gap. - Because there's exactly one row per (region, quarter),
aggfunc="sum"is a no-op — but you'd want it the moment duplicates appear (e.g. multiple snapshots per quarter).
Output (after melt — long format).
| region | quarter | actual |
|---|---|---|
| EU | Q1 | 120 |
| EU | Q2 | 140 |
| EU | Q3 | 155 |
| EU | Q4 | 170 |
| US | Q1 | 200 |
| US | Q2 | 215 |
| ... | ... | ... |
Output (final pivoted gap report).
| region | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| APAC | 10 | 5 | NaN | NaN |
| EU | 20 | 10 | NaN | NaN |
| US | 20 | 5 | NaN | NaN |
Rule of thumb. Always melt before joining heterogeneously-shaped tables — it's almost always cheaper than building a custom multi-column join. Pivot back at the very end, after every join, filter, and aggregate is done.
Pandas interview question on pivot_table vs pivot
The probe typically sounds like: "What's the difference between df.pivot(...) and df.pivot_table(...)?" or "When does pivot fail and pivot_table succeed?" — testing whether the candidate knows that pivot errors on duplicates whereas pivot_table aggregates them away.
Solution Using pivot_table with explicit aggfunc for duplicate keys
import pandas as pd
# Long table with duplicates on (region, quarter)
df = pd.DataFrame({
"region": ["EU","EU","EU","US","US","APAC","APAC","APAC"],
"quarter": ["Q1","Q1","Q2","Q1","Q1","Q1","Q1","Q2"],
"amount": [60, 60, 140, 100, 100, 30, 50, 95],
})
# df.pivot(...) — raises ValueError because (EU,Q1) has 2 rows
# df.pivot_table(...) — aggregates them away
report = df.pivot_table(
index="region",
columns="quarter",
values="amount",
aggfunc="sum",
fill_value=0,
)
Step-by-step trace.
| Step | Action | Effect |
|---|---|---|
| 1 |
df has duplicates on (EU, Q1), (US, Q1), (APAC, Q1) |
non-unique (index, columns) pair |
| 2 |
pivot_table(..., aggfunc="sum") groups by (region, quarter) then sums amount
|
EU/Q1 → 120, US/Q1 → 200, APAC/Q1 → 80 |
| 3 | Rotates the result so quarter values become columns | wide layout |
| 4 |
fill_value=0 replaces missing (region, quarter) cells (e.g. APAC Q2 only) with 0 instead of NaN
|
clean integer output |
Output:
| region | Q1 | Q2 |
|---|---|---|
| APAC | 80 | 95 |
| EU | 120 | 140 |
| US | 200 | 0 |
Why this works — concept by concept:
-
pivotvspivot_tablesemantics —pivotis a pure reshape (no reduction); it raisesValueErroron duplicate (index, columns) pairs.pivot_tableaccepts duplicates and resolves them withaggfunc. -
aggfunc="sum"— explicit choice. The library default ismean, which silently lies for additive metrics. Always passaggfuncfor production code. -
fill_value=0— for sparse pivots,NaNcells make downstream arithmetic produceNaNs.fill_value=0keeps the integer dtype and the math correct. -
Determinism — the resulting column order matches
sorted(unique(quarter))unless you passcolumns=as a categorical withordered=True. -
Cost —
O(rows)aggregation pass +O(distinct(index) × distinct(columns))to allocate the wide block. Don't pivot ten-million-cardinality columns.
Python
Topic — aggregation (Python)
Pivot / melt aggregation problems (Pandas)
Worked example — wide_to_long for xN column patterns
Detailed explanation. Real-world wide tables aren't always neatly named Q1, Q2, Q3, Q4 — they often look like revenue_2023, revenue_2024, revenue_2025, cost_2023, cost_2024, cost_2025. Plain melt would produce one row per (entity, column) and you'd have to split the column name afterward. wide_to_long does it in one call.
Question. Given a wide table with multi-year revenue_YYYY and cost_YYYY columns, convert it to a long table with three columns: entity, year, revenue, cost.
Input.
| entity | revenue_2023 | revenue_2024 | revenue_2025 | cost_2023 | cost_2024 | cost_2025 |
|---|---|---|---|---|---|---|
| A | 100 | 120 | 140 | 60 | 65 | 70 |
| B | 80 | 90 | 110 | 50 | 55 | 60 |
Code.
import pandas as pd
df = pd.DataFrame({
"entity": ["A", "B"],
"revenue_2023": [100, 80],
"revenue_2024": [120, 90],
"revenue_2025": [140, 110],
"cost_2023": [60, 50],
"cost_2024": [65, 55],
"cost_2025": [70, 60],
})
long = pd.wide_to_long(
df,
stubnames=["revenue", "cost"],
i="entity",
j="year",
sep="_",
).reset_index()
Step-by-step explanation.
-
wide_to_longrecognises the<stub>_<year>pattern fromstubnames+sep="_". - For every entity, it produces one row per year (3 years × 2 stubs collapses to 3 rows × 2 columns per entity).
-
i="entity"marks the id column;j="year"is the name of the new column extracted from the suffix. -
reset_index()lifts the MultiIndex out so the result is a flat dataframe.
Output.
| entity | year | revenue | cost |
|---|---|---|---|
| A | 2023 | 100 | 60 |
| A | 2024 | 120 | 65 |
| A | 2025 | 140 | 70 |
| B | 2023 | 80 | 50 |
| B | 2024 | 90 | 55 |
| B | 2025 | 110 | 60 |
Rule of thumb. Whenever the column names contain the variable you want as a row, reach for wide_to_long. Plain melt is for the simple case where every melted column shares one role.
Python
Topic — data analysis (Python)
Reshape and wide-to-long problems (Pandas)
3. groupby — agg, transform, apply
pandas groupby has three output shapes — agg compresses, transform broadcasts, apply does anything
The mental model in one line: agg returns one row per group, transform returns one row per input row (per-group stat broadcast back), apply returns whatever the user function returns. Pick by the shape you need next. This single distinction collapses 80% of the pandas groupby interview questions into the same answer pattern.
The three groupby output shapes.
-
agg— reduces each group to one row. Equivalent SQL isSELECT key, AGG(col) FROM t GROUP BY key. Fastest of the three. -
transform— broadcasts a per-group statistic back to every row. Equivalent SQL isAGG(col) OVER (PARTITION BY key). Useful for "compare row to its group mean" patterns. -
apply— runs a user function per group and concatenates the results. The function can return a Series, a DataFrame, or a scalar — the shape varies. Slowest of the three.
The fourth move every senior interview quietly wants.
-
filter—groupby(key).filter(lambda g: predicate(g))keeps only the groups (not rows) that pass a predicate. Equivalent SQL isWHERE key IN (SELECT key FROM t GROUP BY key HAVING predicate(...)).
Why transform beats apply by 10×.
-
applyfalls back to per-group Python for arbitrary callables — one Python call per group. -
transformkeeps the result vectorised because pandas knows the output shape up front (same length as input); when used with built-in agg names ("mean","sum"), it stays on the C path. -
Rule: if your function can be expressed as
transform("mean")ortransform("rank")or eventransform(lambda x: x - x.mean()), never reach forapply.
groupby.agg shapes you'll write daily.
-
groupby(key).agg("sum")— every numeric column summed per group. -
groupby(key).agg({"a": "sum", "b": "mean"})— per-column agg map. Returns the old-style column names. -
groupby(key).agg(out_a=("a","sum"), out_b=("b","mean"))— named aggregations (modern, flat columns). Use this in production.
Worked example — per-customer rolling 7-day average
Detailed explanation. A common analytics ask: "for each customer, compute the rolling 7-day average of their order amount, attached to every row." This requires per-customer windowing — exactly what groupby + rolling + transform does in one expression.
Question. Given an orders table with customer_id, order_ts, amount, compute a rolling_7d_avg column = the rolling 7-day average of amount for that customer up to and including the current row.
Input.
| customer_id | order_ts | amount |
|---|---|---|
| 1 | 2026-05-01 | 100 |
| 1 | 2026-05-03 | 120 |
| 1 | 2026-05-08 | 80 |
| 1 | 2026-05-10 | 200 |
| 2 | 2026-05-01 | 50 |
| 2 | 2026-05-02 | 70 |
| 2 | 2026-05-09 | 60 |
Code.
import pandas as pd
orders = pd.DataFrame({
"customer_id": [1,1,1,1,2,2,2],
"order_ts": pd.to_datetime([
"2026-05-01","2026-05-03","2026-05-08","2026-05-10",
"2026-05-01","2026-05-02","2026-05-09",
]),
"amount": [100,120,80,200,50,70,60],
})
orders = orders.sort_values(["customer_id", "order_ts"])
# Index on order_ts so rolling("7D") understands the window
orders = orders.set_index("order_ts")
orders["rolling_7d_avg"] = (
orders.groupby("customer_id")["amount"]
.rolling("7D")
.mean()
.reset_index(level=0, drop=True)
)
orders = orders.reset_index()
Step-by-step explanation.
-
sort_values(["customer_id","order_ts"])ensures every customer's rows are ordered by time —rollingassumes ascending time. -
set_index("order_ts")puts the time column into the index.rolling("7D")understands a string like"7D"only when the index is a datetime. -
groupby("customer_id")["amount"].rolling("7D").mean()produces a Series with a MultiIndex(customer_id, order_ts). -
.reset_index(level=0, drop=True)drops thecustomer_idlevel so the result aligns back onto the originalorder_tsindex —orders["rolling_7d_avg"] = ...then broadcasts it correctly. - The final
.reset_index()liftsorder_tsback into a normal column for downstream code.
Output.
| customer_id | order_ts | amount | rolling_7d_avg |
|---|---|---|---|
| 1 | 2026-05-01 | 100 | 100.0 |
| 1 | 2026-05-03 | 120 | 110.0 |
| 1 | 2026-05-08 | 80 | 100.0 |
| 1 | 2026-05-10 | 200 | 133.33 |
| 2 | 2026-05-01 | 50 | 50.0 |
| 2 | 2026-05-02 | 70 | 60.0 |
| 2 | 2026-05-09 | 60 | 60.0 |
Rule of thumb. For per-group windowed stats, the chain groupby → rolling → mean → reset_index(level=0, drop=True) is the canonical pattern. The reset_index trick is the single most common source of "why doesn't the assignment line up?" debugging.
Pandas interview question on transform vs apply
The probe usually sounds like: "How would you add a % of group column to a payment dataframe — each row's amount as a percent of its merchant's total?" — testing whether the candidate reaches for transform("sum") (fast, vectorised) or apply (slow, easy trap).
Solution Using groupby.transform("sum") for the per-group denominator
import pandas as pd
df = pd.DataFrame({
"merchant_id": [7,7,7,8,8,9],
"payment_id": ["a","b","c","d","e","f"],
"amount": [50,150,100,200,300,400],
})
# Per-group total broadcast back to every row, vectorised
df["merchant_total"] = df.groupby("merchant_id")["amount"].transform("sum")
df["pct_of_merchant"] = df["amount"] / df["merchant_total"]
Step-by-step trace.
| Step | Action | Result |
|---|---|---|
| 1 | groupby("merchant_id") |
three groups: 7, 8, 9 |
| 2 | ["amount"].transform("sum") |
new Series, same length as input, each row replaced with its group sum |
| 3 | broadcast back: 50→300, 150→300, 100→300, 200→500, 300→500, 400→400 | one row per input row |
| 4 | df["pct_of_merchant"] = df["amount"] / df["merchant_total"] |
row-level division — no apply needed |
Output:
| merchant_id | payment_id | amount | merchant_total | pct_of_merchant |
|---|---|---|---|---|
| 7 | a | 50 | 300 | 0.1667 |
| 7 | b | 150 | 300 | 0.5000 |
| 7 | c | 100 | 300 | 0.3333 |
| 8 | d | 200 | 500 | 0.4000 |
| 8 | e | 300 | 500 | 0.6000 |
| 9 | f | 400 | 400 | 1.0000 |
Why this works — concept by concept:
-
transformshape contract — pandas knows the output Series has the same length and index as the input, so it can keep the operation vectorised and avoid the per-group Python call overheadapplywould incur. -
String-named aggregations —
transform("sum")stays on the C path;transform(lambda g: g.sum())would fall back to Python (still per-group, not per-row, so it's not catastrophic but it's measurably slower). -
Broadcast assignment —
df["merchant_total"] = ...aligns by index, which lines up becausetransformpreserves the input index. -
SQL equivalent —
SUM(amount) OVER (PARTITION BY merchant_id). Same intent, same shape, same name in the result. -
Cost —
O(rows)single pass +O(groups)aggregation state.apply(lambda)would beO(groups × per_group_python_overhead), often 10× slower on small groups.
Python
Topic — grouping
GroupBy and window function problems
Worked example — groupby.filter to drop low-volume groups
Detailed explanation. A common cleaning step: drop every merchant with fewer than 10 transactions before computing metrics. The naive approach computes a count, joins back, and filters — three lines and two passes. groupby.filter does it in one expression.
Question. Given a payments dataframe, keep only the rows belonging to merchants with at least 10 transactions.
Code.
import pandas as pd
clean = df.groupby("merchant_id").filter(lambda g: len(g) >= 10)
Step-by-step explanation.
-
groupby("merchant_id")partitions the dataframe by merchant. -
.filter(lambda g: len(g) >= 10)evaluates the predicate on each group — not each row. - Groups that pass keep all their rows; groups that fail are dropped entirely.
- The result preserves the original row order within each surviving group.
Output.
The output retains the same column schema as the input but excludes every row whose merchant_id had fewer than 10 records. Per-group counts that started at 4, 7, 12, 25 reduce the kept set to merchants with counts 12 and 25.
Rule of thumb. filter is the right tool when you want to keep rows based on a group-level predicate. Don't confuse it with the dataframe-level .filter() (which selects columns) — different method, different signature.
4. merge — the seven join types
pandas merge is the full SQL JOIN family plus two superpowers (merge_asof, merge_ordered)
The mental model in one line: pd.merge(left, right, on=..., how=...) is the API; how picks the SQL JOIN type; merge_asof adds time-window joins that SQL needs LATERAL JOIN ... LIMIT 1 to express; merge_ordered handles sorted joins efficiently. Once you can name all seven outputs of the how argument, the pandas merge interview surface is essentially closed.
The seven join types every interview probes.
-
how="inner"— only rows whose keys match in both dataframes. The default. -
how="left"— every row from left; matching right rows filled in;NaNfor unmatched. -
how="right"— mirror of left. Rare in practice — usually rewritten as a left join with sides swapped. -
how="outer"— every row from either side,NaN-filled where the other side is missing. -
how="cross"— Cartesian product. Useful for "every customer × every product" matrix building. Be careful — output grows as O(N × M). -
Anti-join via
indicator=True+ filter —merge(..., how="outer", indicator=True)adds a_mergecolumn with valuesleft_only,right_only,both. Filter to one of these to get a true anti-join. -
Semi-join via
isin—left[left["key"].isin(right["key"])]keeps left rows whose key appears in right, without the columns from right. Simulates SQL'sWHERE EXISTS (...).
Two superpowers SQL needs ugly code to match.
-
merge_asof(left, right, on=..., by=..., tolerance=...)— for each row inleft, snap to the most recent row inrightwhose key is≤the left key. Canonical use case: trades + quotes, sensor readings + weather, "what was the price just before this trade?" -
merge_ordered(left, right, on=..., fill_method="ffill")— outer join on sorted keys, with optional forward-fill. Used for irregular time-series alignment.
The two arguments interviewers always probe.
-
on=vsleft_on=/right_on=—onrequires the same column name in both;left_on/right_onis for mismatched names. Pass both as lists for composite keys. -
validate=—"one_to_one","one_to_many","many_to_one","many_to_many". Use this in production code to assert join cardinality and catch fan-out bugs early.
Worked example — merge_asof for trades + quotes time-window join
Detailed explanation. A textbook financial use case: every trade row needs the most recent quote that existed at or before the trade timestamp. A naive merge(..., on="ts") would only match exact timestamps — usually zero rows. merge_asof snaps each trade to the nearest preceding quote in O((N + M) log) time using a merge-sort-style walk.
Question. Given a trades table (one row per trade with symbol, trade_ts, qty) and a quotes table (one row per quote tick with symbol, quote_ts, bid, ask), produce one row per trade with the quote that was active at trade time.
Input (trades).
| symbol | trade_ts | qty |
|---|---|---|
| AAPL | 09:30:01.005 | 100 |
| AAPL | 09:30:01.013 | 50 |
| MSFT | 09:30:01.007 | 200 |
Input (quotes).
| symbol | quote_ts | bid | ask |
|---|---|---|---|
| AAPL | 09:30:01.000 | 189.95 | 190.05 |
| AAPL | 09:30:01.010 | 189.97 | 190.04 |
| MSFT | 09:30:01.000 | 425.00 | 425.10 |
| MSFT | 09:30:01.008 | 425.03 | 425.12 |
Code.
import pandas as pd
trades = pd.DataFrame({
"symbol": ["AAPL","AAPL","MSFT"],
"trade_ts": pd.to_datetime([
"2026-06-04 09:30:01.005",
"2026-06-04 09:30:01.013",
"2026-06-04 09:30:01.007",
]),
"qty": [100, 50, 200],
})
quotes = pd.DataFrame({
"symbol": ["AAPL","AAPL","MSFT","MSFT"],
"quote_ts": pd.to_datetime([
"2026-06-04 09:30:01.000",
"2026-06-04 09:30:01.010",
"2026-06-04 09:30:01.000",
"2026-06-04 09:30:01.008",
]),
"bid": [189.95, 189.97, 425.00, 425.03],
"ask": [190.05, 190.04, 425.10, 425.12],
})
# Both sides must be sorted by the merge key
trades = trades.sort_values("trade_ts")
quotes = quotes.sort_values("quote_ts")
joined = pd.merge_asof(
trades,
quotes,
left_on="trade_ts",
right_on="quote_ts",
by="symbol", # match per-symbol
direction="backward", # snap to ≤ quote_ts
tolerance=pd.Timedelta("1s"), # ignore stale quotes
)
Step-by-step explanation.
- Both dataframes are sorted by the time key (
merge_asofrequires this — it doesn't sort for you). - For each row in
trades,merge_asoffinds the row inquoteswith the largestquote_ts≤trade_tsand the samesymbol. -
direction="backward"is the default — snap to the most recent prior tick. Other choices:"forward"(next tick) or"nearest". -
by="symbol"partitions both sides — AAPL trades only match AAPL quotes; MSFT only matches MSFT. -
tolerance=pd.Timedelta("1s")rejects matches older than 1 second — better to emitNaNthan to snap to a quote that's no longer relevant.
Output.
| symbol | trade_ts | qty | quote_ts | bid | ask |
|---|---|---|---|---|---|
| AAPL | 09:30:01.005 | 100 | 09:30:01.000 | 189.95 | 190.05 |
| AAPL | 09:30:01.013 | 50 | 09:30:01.010 | 189.97 | 190.04 |
| MSFT | 09:30:01.007 | 200 | 09:30:01.000 | 425.00 | 425.10 |
Rule of thumb. Whenever a SQL solution would start with LATERAL JOIN ... ORDER BY ... LIMIT 1, the Pandas equivalent is merge_asof. Always sort both inputs first; always pass by= when you have a partitioning column like symbol or device_id.
Pandas interview question on safe joins
A senior interviewer often shapes this round as: "You merge a 1M-row orders table with a 50k-row customers table and end up with 1.4M rows. What went wrong?" — testing whether the candidate uses validate= to assert join cardinality.
Solution Using merge(..., validate="many_to_one")
import pandas as pd
orders = pd.DataFrame({
"order_id": [1,2,3,4,5],
"customer_id":[10,10,20,20,30],
"amount": [50,75,200,30,90],
})
# customers table has a (silent) duplicate on customer_id=10
customers = pd.DataFrame({
"customer_id":[10,10,20,30],
"name": ["Ann","Ann (dup)","Bob","Cara"],
})
# This will raise — orders→customers should be many_to_one but isn't
try:
bad = orders.merge(customers, on="customer_id", how="left",
validate="many_to_one")
except Exception as e:
print("validate caught the bug:", e)
# Fix the dup first
customers_clean = customers.drop_duplicates(subset="customer_id", keep="first")
good = orders.merge(customers_clean, on="customer_id", how="left",
validate="many_to_one")
Step-by-step trace.
| Step | Action | Result |
|---|---|---|
| 1 | orders.merge(customers, ..., validate="many_to_one") |
pandas checks every right-side key appears at most once |
| 2 |
customer_id=10 appears twice in customers
|
validation raises MergeError: Merge keys are not unique in right dataset
|
| 3 | customers.drop_duplicates(subset="customer_id", keep="first") |
dedup before merging |
| 4 | retry with validate="many_to_one"
|
passes; resulting dataframe has exactly 5 rows |
Output:
| order_id | customer_id | amount | name |
|---|---|---|---|
| 1 | 10 | 50 | Ann |
| 2 | 10 | 75 | Ann |
| 3 | 20 | 200 | Bob |
| 4 | 20 | 30 | Bob |
| 5 | 30 | 90 | Cara |
Why this works — concept by concept:
-
validate=semantics — the four values"one_to_one","one_to_many","many_to_one","many_to_many"make Pandas assert the cardinality of the join. The check runs once, at merge time, before the cross-product blows up. -
Fan-out bugs — the classic "join inflated my row count" bug happens when one side has accidental duplicates.
validatecatches it; without it, the bug ships silently to production and the BI tool double-counts revenue. -
drop_duplicates(subset=, keep=)— explicit dedup withkeep="first"(or"last") is the right fix; never silently rely on Pandas to "do the right thing" on duplicate join keys. -
SQL equivalent —
LEFT JOIN customers c ON o.customer_id = c.customer_idplus a primary-key constraint oncustomers.customer_id.validate=is Pandas's substitute for the missing PK constraint. -
Cost — validate adds an
O(N + M)uniqueness check; with the cardinality guarantee, the merge itself isO(N + M)for a hash-join. Cheap insurance.
Python
Topic — joins
Join cardinality and merge problems
Worked example — anti-join via indicator=True
Detailed explanation. Anti-join = "left rows whose key is not in right." Pandas doesn't have a how="anti", so the canonical idiom is merge(..., how="outer", indicator=True) + filter on _merge == "left_only". It's verbose but explicit, and senior interviewers expect you to know it.
Question. Find every order whose customer_id does not appear in the active_customers table.
Code.
import pandas as pd
orders = pd.DataFrame({
"order_id": [1,2,3,4],
"customer_id":[10,20,30,40],
})
active = pd.DataFrame({"customer_id": [10, 30]})
result = (
orders.merge(active, on="customer_id", how="left", indicator=True)
.query("_merge == 'left_only'")
.drop(columns="_merge")
)
Step-by-step explanation.
-
merge(..., how="left", indicator=True)keeps every order row and adds a_mergecolumn. -
_mergeis"both"for orders whosecustomer_idmatched,"left_only"for orders whosecustomer_iddid not. -
.query("_merge == 'left_only'")keeps only the unmatched orders. -
.drop(columns="_merge")removes the helper column.
Output.
| order_id | customer_id |
|---|---|
| 2 | 20 |
| 4 | 40 |
Rule of thumb. Use indicator=True whenever you need to know why rows survived a join. For pure anti-join, the _merge == "left_only" idiom is the canonical Pandas pattern.
SQL
Topic — anti-join (SQL)
Anti-join and exists problems (SQL)
5. DuckDB migration — SQL on dataframes
duckdb pandas is the 2026 escape hatch — query a Pandas dataframe with SQL, in the same Python process
The mental model in one line: DuckDB is an embedded columnar analytic database that runs IN your Python process, can read a Pandas DataFrame directly (zero-copy), and runs 10–100× faster than Pandas on aggregate-heavy queries. You don't replace Pandas — you escalate the heavy steps to DuckDB SQL and keep Pandas for the small steps before and after.
Why DuckDB pairs so well with Pandas.
-
Zero-copy query of a DataFrame.
duckdb.query("SELECT ... FROM my_df")registers the dataframe as a virtual table; DuckDB reads it directly via Arrow. -
Native readers.
read_csv_auto,read_parquet,read_json_auto, thehttpfsextension for S3 — DuckDB can replacepd.read_csvandpd.read_parquetin one line. - Columnar vectorised engine. Built around Arrow-style columnar batches. Aggregates and joins are 10–100× faster than Pandas on the same hardware.
-
.df()on the result. Every query result has a.df()method that returns a normal Pandas DataFrame — the round-trip is trivial.
When to drop Pandas and use DuckDB instead.
- > 10 GB datasets. Pandas materialises the whole dataframe; DuckDB streams over Parquet/CSV without loading everything.
-
Complex multi-table joins. SQL JOINs over 3+ tables are clearer than nested
.merge(...)chains. - Aggregate-heavy queries. GROUP BY + window functions are 10–100× faster.
- You want SQL anyway. If the team thinks in SQL, ship SQL — don't translate to Pandas.
When to stay in Pandas.
- The dataframe fits in RAM and the operation is a one-liner.
- You're handing the dataframe to scikit-learn, matplotlib, or another Python library that expects a Pandas API.
- Iterative exploration in Jupyter where you're tweaking one step at a time.
Worked example — migrate a slow Pandas pipeline to DuckDB
Detailed explanation. A team has a Pandas pipeline that reads 10 GB of parquet, joins three tables, groups by region/month, and writes the result. The pipeline takes 10 minutes, needs 32 GB of RAM, and OOMs roughly once a week. Migration target: the same pipeline in DuckDB SQL, same outputs, dramatically faster and smaller memory footprint.
Question. Migrate the following Pandas pipeline to DuckDB SQL:
orders = pd.read_parquet("s3://lake/orders/") # 10M rows
items = pd.read_parquet("s3://lake/order_items/") # 50M rows
products = pd.read_parquet("s3://lake/products/") # 100k rows
joined = (orders
.merge(items, on="order_id")
.merge(products, on="product_id"))
joined["month"] = pd.to_datetime(joined["order_ts"]).dt.to_period("M").astype(str)
result = (joined.groupby(["month","region","category"], observed=True)
.agg(gross_sales=("amount","sum"),
orders=("order_id","nunique"))
.reset_index())
Code (the DuckDB equivalent).
import duckdb
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;") # S3 reader
result = con.execute("""
SELECT
strftime(o.order_ts, '%Y-%m') AS month,
o.region,
p.category,
SUM(i.amount) AS gross_sales,
COUNT(DISTINCT o.order_id) AS orders
FROM read_parquet('s3://lake/orders/*.parquet') AS o
JOIN read_parquet('s3://lake/order_items/*.parquet') AS i USING (order_id)
JOIN read_parquet('s3://lake/products/*.parquet') AS p USING (product_id)
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3
""").df() # ← back to Pandas
Step-by-step explanation.
-
duckdb.connect()opens an in-process database — no separate server, no docker, nothing to install beyondpip install duckdb. -
INSTALL httpfs; LOAD httpfs;enables S3 reads on the fly. -
read_parquet('s3://...*.parquet')is a DuckDB native reader — it streams the files columnar without materialising them into memory. - The three-table join, the month bucket, and the group-by are one SQL query — DuckDB plans them together, pushes filters down, and executes vectorised.
-
.df()on the result returns a Pandas DataFrame — the rest of the pipeline (plotting, downstream ML) doesn't change at all.
Output (comparison).
| Metric | Pandas pipeline | DuckDB pipeline |
|---|---|---|
| Runtime | ~10 min | ~30 sec |
| Peak RAM | ~32 GB | ~4 GB |
| Lines of code | ~12 | ~10 |
| Failure mode |
MemoryError once a week |
none observed |
| Output | identical | identical |
Rule of thumb. When the Pandas pipeline does heavy joins or heavy aggregates and the data is bigger than 1 GB, rewrite the heavy middle as a DuckDB SQL string and .df() back to Pandas at the boundary. Don't migrate everything at once — migrate the bottleneck step.
Pandas interview question on choosing Pandas vs DuckDB
A common senior probe: "You have a 2 GB Parquet dataset. You need to compute monthly cohort retention. Pandas or DuckDB?" The candidate is expected to reach for DuckDB because (a) cohort retention is heavy on joins and aggregates, (b) DuckDB streams Parquet without materialising it, (c) .df() at the end gives the same Pandas-shaped result.
Solution Using DuckDB SQL with query(df) mode
import duckdb
import pandas as pd
# Already-loaded Pandas dataframes — DuckDB queries them in-place
events = pd.read_parquet("events.parquet") # 2 GB
users = pd.read_parquet("users.parquet") # 50 MB
cohort = duckdb.query("""
WITH first_purchase AS (
SELECT user_id,
MIN(date_trunc('month', event_ts)) AS cohort_month
FROM events
WHERE event_type = 'purchase'
GROUP BY user_id
),
monthly_active AS (
SELECT user_id,
date_trunc('month', event_ts) AS active_month
FROM events
GROUP BY 1, 2
)
SELECT fp.cohort_month,
m.active_month,
COUNT(DISTINCT m.user_id) AS active_users
FROM first_purchase fp
JOIN monthly_active m USING (user_id)
GROUP BY 1, 2
ORDER BY 1, 2
""").df()
Step-by-step trace.
| Step | Action | Result |
|---|---|---|
| 1 |
events and users are Pandas DataFrames |
columnar (Arrow-backed) in memory |
| 2 | duckdb.query("SELECT ... FROM events ...") |
DuckDB sees events as a virtual table by name |
| 3 | DuckDB plans the CTE + join + group-by | vectorised columnar plan |
| 4 | DuckDB streams the result rows | no full materialisation in Python |
| 5 | .df() |
returns a Pandas DataFrame indexed normally |
Output (sketch — first few rows):
| cohort_month | active_month | active_users |
|---|---|---|
| 2026-01 | 2026-01 | 12000 |
| 2026-01 | 2026-02 | 8400 |
| 2026-01 | 2026-03 | 6300 |
| 2026-02 | 2026-02 | 14500 |
| 2026-02 | 2026-03 | 9700 |
Why this works — concept by concept:
- Zero-copy query — DuckDB reads the columns of a Pandas DataFrame through the Arrow C interface; no data is copied between Pandas's memory and DuckDB's memory.
-
CTEs — the
WITH first_purchase AS (...),WITH monthly_active AS (...)pattern reads top-down and is dramatically clearer than the equivalent Pandas chain (groupby,merge,groupbyagain). - Vectorised engine — DuckDB processes batches of 2,048 rows at a time, using SIMD where available. Aggregates and joins are 10–100× faster than the equivalent Pandas operations.
-
.df()boundary — the result is a normal Pandas DataFrame, so plotting, scikit-learn, or downstream code doesn't change. The migration touches only the heavy middle, not the edges. -
Cost — query =
O(rows)with constant memory; output dataframe =O(distinct(cohort, month))rows. Compare to the Pandas-only version which would peak atO(rows × columns)memory and be 10–100× slower at the GROUP BY.
SQL
Topic — SQL generation
SQL on dataframes — DuckDB-style problems
Worked example — DuckDB UNPIVOT for the melt equivalent
Detailed explanation. DuckDB supports UNPIVOT natively. Senior interviews sometimes probe "how would you write pandas.melt in pure SQL?" — DuckDB's answer is short and elegant.
Question. Given a wide quarterly KPIs table in DuckDB, produce the long version with columns region, quarter, value.
Code (DuckDB SQL).
SELECT region, quarter, value
FROM (
SELECT * FROM wide_kpis
) UNPIVOT (
value FOR quarter IN (Q1, Q2, Q3, Q4)
)
ORDER BY region, quarter;
Step-by-step explanation.
- The inner subquery returns the wide table as-is.
-
UNPIVOT (value FOR quarter IN (Q1, Q2, Q3, Q4))rotates the four named columns into rows. Each output row gets aquartervalue (the original column name) and avalue(the original cell value). - The
regioncolumn passes through untouched because it isn't in theIN (...)list.
Output.
| region | quarter | value |
|---|---|---|
| APAC | Q1 | 80 |
| APAC | Q2 | 95 |
| APAC | Q3 | 110 |
| APAC | Q4 | 130 |
| EU | Q1 | 120 |
| EU | Q2 | 140 |
| ... | ... | ... |
Rule of thumb. DuckDB's PIVOT / UNPIVOT syntax is one of the cleanest in any SQL dialect. When migrating a pandas.melt step to DuckDB, reach for UNPIVOT rather than UNION ALL boilerplate.
SQL
Topic — aggregation (SQL)
Pivot / unpivot SQL problems
Cheat sheet — Pandas to SQL to DuckDB
A side-by-side mapping for the operations that show up in every pandas data engineering interview round. Memorise this table and you can fluently translate any pipeline between the three engines.
| Intent | Pandas | SQL (ANSI) | DuckDB (extension) |
|---|---|---|---|
| Group + reduce | df.groupby(k).agg("sum") |
SELECT k, SUM(c) FROM t GROUP BY k |
same as ANSI |
| Window aggregate | df.groupby(k)[c].transform("sum") |
SUM(c) OVER (PARTITION BY k) |
same as ANSI |
| Inner join | df1.merge(df2, on="k") |
t1 JOIN t2 USING (k) |
same as ANSI |
| Left join | df1.merge(df2, on="k", how="left") |
t1 LEFT JOIN t2 USING (k) |
same as ANSI |
| Anti join |
merge(..., how="outer", indicator=True) + filter |
WHERE NOT EXISTS (SELECT 1 FROM t2 ...) |
same as ANSI |
| Time-window join | pd.merge_asof(...) |
LATERAL JOIN ... ORDER BY ts DESC LIMIT 1 |
ASOF JOIN ... ON ... AND ts1 >= ts2 |
| Pivot (long → wide) | df.pivot_table(...) |
SUM(CASE WHEN col='x' THEN val END) ... |
PIVOT ... ON ... USING SUM(...) |
| Unpivot (wide → long) | df.melt(...) |
UNION ALL per column |
UNPIVOT (val FOR col IN (...)) |
| Top-N per group | df.groupby(k).head(n) |
ROW_NUMBER() OVER ... WHERE rn <= n |
same as ANSI |
| Read Parquet | pd.read_parquet(p) |
(engine-specific COPY) | read_parquet('p') |
| Read CSV | pd.read_csv(p) |
COPY ... FROM ... |
read_csv_auto('p') |
| Write to SQL | df.to_sql(...) |
n/a | COPY (SELECT ...) TO 'out.parquet' |
Three patterns interviewers expect you to translate fluently.
-
Group-by + named agg —
df.groupby("k").agg(s=("c","sum"))↔SELECT k, SUM(c) AS s FROM t GROUP BY k. Same intent, same shape; the named-agg form is the production-safe Pandas spelling. -
Window function —
df.groupby("k")[c].transform("rank")↔RANK() OVER (PARTITION BY k ORDER BY c). Pandastransformkeeps the row count; SQL window functions keep the row count too. -
Anti-join —
df1.merge(df2, ..., indicator=True).query("_merge == 'left_only'")↔LEFT JOIN ... WHERE t2.k IS NULLorWHERE NOT EXISTS (...). SQL has two natural spellings; Pandas has the one verbose idiom.
Frequently asked questions
Pandas vs DuckDB — which one should I use?
Use Pandas for sub-10 GB workloads, glue code, ad-hoc Jupyter exploration, and any step that needs scikit-learn, matplotlib, or another Python library that expects a DataFrame in-hand. Use DuckDB for the heavy steps — multi-table joins, aggregates, window functions over large Parquet files — and especially when the dataset is bigger than your RAM. The 2026 production pattern is both at once: duckdb.query("SELECT ... FROM my_df").df() runs SQL on a Pandas DataFrame in-place, with zero copy, and returns a new DataFrame.
Is Pandas dead in 2026?
No — Pandas is still the most-installed Python data library and still wins the "glue code + small data + sklearn" tier outright. The competitive pressure from Polars and DuckDB has, if anything, made Pandas better: Pandas 2.x added an optional PyArrow backend that closes much of the throughput gap, the named-aggregation syntax has matured, and the ecosystem (Modin, cuDF, DuckDB) makes it trivial to scale out the heavy steps. "Pandas is dead" is a meme; "Pandas should not be the engine for your 100 GB job" is correct.
pandas.merge vs pandas.join — what's the difference?
pd.merge(left, right, on=, how=) is the general-purpose SQL-style join — it joins on column values and is the right answer 95% of the time. df.join(other, how=) is a convenience method that joins on the index of both dataframes; it's nice when you've already indexed both sides on the same key, but in production code most teams use merge everywhere for consistency. Avoid mixing .join and .merge in the same file — it makes the code harder to scan.
merge_asof — when do I actually need it?
Every time you'd write LATERAL JOIN ... ORDER BY ts DESC LIMIT 1 in SQL, that's merge_asof in Pandas. The canonical use cases are: snapping each trade to the most recent quote tick, snapping each sensor reading to the nearest weather observation, snapping each click to the most recent ad impression. Always sort both inputs first (it doesn't sort for you), always pass by= when there's a partition column like symbol or device_id, and always set tolerance= so stale matches become NaN instead of silently joining to old data.
What's the practical memory footprint of Pandas — how big can it get?
A good rule of thumb: Pandas needs 3–5× the on-disk size of the data because of Python object overhead (especially for object-dtype strings), index structures, and intermediate copies during operations. A 2 GB Parquet file often consumes 10–15 GB of RAM when fully loaded. Mitigations include reading with explicit dtype (categoricals, int32/int8, datetime), reading only the columns you need (usecols=), the Pandas 2.x PyArrow backend (pd.read_parquet(..., dtype_backend="pyarrow")), and chunked reads with chunksize=. Beyond 10 GB single-machine, escalate to DuckDB or Polars.
Polars vs Pandas — which is faster, and when does it matter?
Polars is 5–30× faster than Pandas on aggregate-heavy and join-heavy workloads, with a smaller memory footprint thanks to its columnar lazy execution engine. The trade-off is API divergence — Polars uses pl.col("x").sum() expressions, not df["x"].sum(). The pragmatic 2026 answer for teams: keep Pandas for glue and library hand-off; reach for Polars or DuckDB SQL when the pipeline becomes pure dataframe ops and Pandas is the bottleneck. Polars is the right "I want a faster Pandas" answer; DuckDB is the right "I want SQL on my dataframe" answer.
Practice on PipeCode
- Drill the Python practice library → for end-to-end Pandas pipeline questions on reshape, groupby, and merge.
- Rehearse aggregation drills in Python → when the interviewer wants
groupby + agg + transformcleanly. - Sharpen join problems in Python → for
merge,merge_asof, and anti-join idioms. - For the SQL side of every Pandas operation, work through SQL aggregation problems → and SQL window function problems →.
- For end-to-end ETL muscle, try ETL pipeline problems →.
- Reinforce the fundamentals with the Python for Data Engineering Interviews course →.
- For the design-round muscles, work through ETL system design for DE interviews →.
- To pair Pandas reshaping with table modelling, browse data modelling for DE interviews →.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Pipecode.ai is Leetcode for Data Engineering — every Pandas, SQL, and DuckDB concept above ships with hands-on practice rooms where you reshape real dataframes, write real merges, and migrate real Pandas pipelines to DuckDB. Start with the Python library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)