DEV Community

Cover image for Polars vs Pandas vs DuckDB Benchmarked: Speed, Memory & API Trade-offs
Gowtham Potureddi
Gowtham Potureddi

Posted on

Polars vs Pandas vs DuckDB Benchmarked: Speed, Memory & API Trade-offs

polars vs pandas is no longer an academic question — it is a single-node DataFrame engine decision that every data engineering team must make before they write the first read_parquet. A 2026 senior data engineer is expected to pick between three engines whose architectures are not just different libraries but three different philosophies: Polars (Rust + Apache Arrow + lazy plan), Pandas (eager NumPy + index alignment), and DuckDB (embedded vectorised columnar SQL). Pick the wrong one and a 30-minute ETL turns into a 6-hour out-of-memory nightmare; pick the right one and the same job finishes in 90 seconds on the same laptop.

This guide is the cheat sheet you wished existed the first time pandas vs polars came up in a system-design loop. It walks through the three philosophies side by side, anchors every claim in a 2026 benchmark — speed, memory peak, API ergonomics, ecosystem — and ships a worked Solution-Tail interview answer per section with code, a step-by-step trace, an output table, and a concept-by-concept breakdown of why it works. By the end you will know exactly when polars python wins, when duckdb vs pandas is no contest, and when the boring pandas alternative is still the right call.

PipeCode blog header for Polars vs Pandas vs DuckDB — bold white headline 'Polars vs Pandas vs DuckDB' with subtitle 'speed · memory · API · 2026 benchmark' over a stylised three-lane race track seen from above with a polar-bear silhouette, panda silhouette, and duck silhouette racing along their lanes, on a dark gradient with purple, orange, green, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the data manipulation practice library →, rehearse on aggregation problems →, and stack the muscle memory with data processing drills →.


On this page


1. Three engines, three philosophies — eager vs lazy vs columnar SQL

The DataFrame stack is no longer a one-horse race — Pandas, Polars, and DuckDB embody three distinct mental models for the same workload

The one-sentence invariant: Pandas evaluates eagerly on a NumPy block manager, Polars builds a lazy query plan over Apache Arrow columns, and DuckDB compiles your SQL into a vectorised columnar pipeline — three answers to the same single-node analytics question, optimising for three different things. Once you internalise that "eager / lazy / declarative" axis, every other behaviour difference becomes a deduction rather than a memorised fact.

The three engines in one paragraph each.

  • Pandas — eager + NumPy + index. Each call to df.filter, df.assign, df.groupby returns a fully materialised DataFrame immediately. The underlying storage is a NumPy "BlockManager" that groups columns by dtype into 2-D arrays. The killer feature is the labelled index: every row has a hashable label, every join and reindex aligns by that label, and the entire ecosystem (matplotlib, scikit-learn, statsmodels) speaks "Pandas DataFrame" natively.
  • Polars — lazy + Rust + Arrow. Each call to pl.scan_parquet, .filter, .group_by, .agg returns a LazyFrame — a deferred query plan. You only pay execution cost when you call .collect(), at which point the planner fuses operations, pushes filters and projections into the source scan, eliminates common sub-expressions, and runs the plan on a Rayon worker pool across every CPU core. Storage is Apache Arrow columns; the API is a strict expression DSL that is type-checked at plan time.
  • DuckDB — embedded + vectorised + SQL. A single in-process library (pip install duckdb) that gives you a full ANSI SQL engine with windows, recursive CTEs, complex types, and a vectorised execution engine that processes batches of 2048 rows at a time. DuckDB can read parquet, CSV, JSON, Arrow, Pandas, and Polars in place — zero-copy in the Arrow case — and emit results back as any of those formats.

Why "eager vs lazy vs SQL" matters.

  • Eager (Pandas). Forces the developer to think step-by-step, but pays a materialisation tax at every step. Three chained calls = three allocated DataFrames + three garbage collections + three opportunities to OOM.
  • Lazy (Polars). Gives the engine room to reorder, fuse, and push down. A scan_parquet().filter().select().group_by() chain compiles to a single pipeline that reads only the needed columns, applies the filter inside the parquet decoder, and aggregates in parallel across cores.
  • Declarative SQL (DuckDB). The same query is one SELECT statement. The engine owns the entire plan, can use a cost-based optimiser, and can spill to disk if the dataset exceeds memory.

Memory model in one table.

Engine Storage Threading Larger-than-memory
Pandas NumPy BlockManager (row-major within column groups) single-threaded Python loop (release the GIL only inside NumPy) no native support — must shard manually
Polars Apache Arrow columnar (column-major, immutable buffers) Rayon worker pool across all cores yes — streaming engine spills via chunks
DuckDB proprietary columnar, copy-of-Arrow under the hood vector pipeline + thread pool yes — out-of-core spill to disk built in

The 2026 reality.

  • Polars 1.x is production-stable. The lazy engine is the default; the eager pl.DataFrame is now a thin wrapper that calls .lazy().collect(). The streaming engine handles 100+ GB on a 16 GB laptop.
  • Pandas 2.x ships with optional Arrow-backed dtypes (pd.options.future.infer_string = True). It is not a Polars replacement; it is a way to dip a toe into Arrow without rewriting the codebase. Speed-up is real (2-3× on string ops) but the index philosophy remains.
  • DuckDB 0.10+ has become the default ad-hoc analytics engine for any laptop-scale dataset under 1 TB. The Polars and Pandas integrations are zero-copy via Arrow; you can SELECT … FROM my_polars_df without serialising.

Detailed explanation — the three philosophies in one workload

Detailed explanation. A senior interviewer often opens with: "I have a 5 GB parquet file with 50 columns. I want to read 5 columns, filter to one month, group by user, and compute a sum. Show me what each engine writes — and why one of them will run 30× faster than the others on a laptop." This question separates candidates who learned "pandas first, then polars" from those who understand the architecture behind each option.

Question. Given a 5 GB events.parquet file with columns [user_id, country, event_type, event_ts, amount, …45 others], write the same query — sum amount per user for events in March 2026 — three ways: Pandas, Polars (lazy), DuckDB. Explain which one reads the least data and why.

Input.

File Size Rows Columns
events.parquet 5 GB 200,000,000 50

Code.

# 1) Pandas — eager, reads every column unless you specify
import pandas as pd
df = pd.read_parquet("events.parquet", columns=["user_id", "event_ts", "amount"])
df["event_ts"] = pd.to_datetime(df["event_ts"])
result_pd = (
    df[(df["event_ts"] >= "2026-03-01") & (df["event_ts"] < "2026-04-01")]
    .groupby("user_id", as_index=False)["amount"]
    .sum()
)

# 2) Polars — lazy, pushdown is automatic
import polars as pl
result_pl = (
    pl.scan_parquet("events.parquet")
      .filter(pl.col("event_ts").is_between("2026-03-01", "2026-04-01"))
      .group_by("user_id")
      .agg(pl.col("amount").sum())
      .collect()
)

# 3) DuckDB — SQL, pushdown is automatic
import duckdb
result_ddb = duckdb.sql("""
    SELECT user_id, SUM(amount) AS amount
    FROM 'events.parquet'
    WHERE event_ts >= DATE '2026-03-01'
      AND event_ts <  DATE '2026-04-01'
    GROUP BY user_id
""").pl()  # return as Polars frame
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Pandas reads every requested column eagerly, then materialises a Python DataFrame in memory before the filter ever runs. Even with the columns= hint, it reads the entire event_ts and amount columns for all 200 M rows — roughly 3 GB of raw bytes — before throwing away ~92% of them in the filter.
  2. Polars with scan_parquet does not read anything when the line executes — it builds a LazyFrame. When .collect() is called, the planner combines the filter, projection, and aggregation into a single fused pipeline. The parquet reader pushes the date predicate into the column statistics and the row-group filter, so most row groups are skipped entirely. Only the rows that survive the predicate are decoded.
  3. DuckDB parses the SQL into a logical plan, applies cost-based optimisations, and runs it through its vectorised pipeline. Like Polars, it pushes the predicate into the parquet reader and reads only the matching row groups. The result is materialised as an Arrow batch and handed back to Polars (.pl()) with zero copy.
  4. The Pandas version typically peaks at ~6 GB RSS and takes 80–120 seconds on a 16 GB laptop. The Polars version peaks at ~700 MB RSS and finishes in 4–8 seconds. The DuckDB version peaks at ~600 MB RSS and finishes in 3–6 seconds.

Output.

Engine Wall time Peak RSS Bytes read from disk
Pandas ~100 s ~6 GB ~3 GB
Polars (lazy) ~6 s ~700 MB ~300 MB
DuckDB ~5 s ~600 MB ~280 MB

Rule of thumb. Any time the workload reads parquet, filters on a column, and aggregates, prefer Polars-lazy or DuckDB. Pandas should be reserved for small (< 1 GB) eager work or for code paths where the consumer (scikit-learn, matplotlib, statsmodels) demands a Pandas DataFrame.

Detailed explanation — when philosophy beats benchmark numbers

Detailed explanation. "Faster" is a noisy axis. The real DE-grade question is "which philosophy fits this team's workflow?" Pandas wins for ad-hoc exploratory analysis in a notebook where you stare at intermediate states. Polars wins for production ETL where the pipeline is "read-transform-write" and you never look at intermediate frames. DuckDB wins when you want a SQL surface against heterogeneous sources (parquet here, CSV there, a Pandas frame next door) and need windows, recursive CTEs, or complex types.

Question. Your team is split: data scientists love Pandas because they print every intermediate frame; the platform team wants one engine for nightly ETL. Argue for a Polars + DuckDB pairing that lets both teams keep their workflow without forcing migration.

Input.

Workflow Audience Frequency Volume
Notebook EDA Data science continuous 100 MB – 5 GB
Nightly ETL Platform 24× / day 50 GB – 500 GB
Ad-hoc SQL probe Analyst 50× / day up to 1 TB

Code.

# Data scientist — keeps Pandas in the notebook, exports via Arrow
import polars as pl
import duckdb

# Run the heavy aggregation in DuckDB against parquet
sales = duckdb.sql("""
    SELECT region, product, SUM(amount) AS total
    FROM 's3://lake/events_*.parquet'
    WHERE event_ts >= DATE '2026-03-01'
    GROUP BY region, product
""").arrow()  # Arrow batch -- zero copy

# Hand back to Pandas for plotting / scikit-learn
import pandas as pd
sales_pd = sales.to_pandas()
sales_pd.plot.bar(x="region", y="total")  # matplotlib is happy
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The data scientist writes one SELECT in DuckDB and gets an Arrow batch back. Zero serialisation cost between DuckDB and Pandas because both speak Arrow.
  2. The intermediate result is small enough (one row per region × product) that Pandas can handle it eagerly without strain. Plotting, scikit-learn, and statsmodels all consume Pandas natively.
  3. The platform team writes the same nightly ETL with pl.scan_parquet(...).collect(streaming=True) for the same source files. Polars handles the 500 GB case on a single VM without OOM.
  4. Both teams share source files, not in-memory frames. The Arrow ecosystem is the lingua franca: parquet on disk, Arrow in memory, Pandas / Polars / DuckDB as the front-end of choice.

Output.

Workflow Engine choice Why
Notebook EDA DuckDB → Pandas SQL convenience + Pandas plotting ecosystem
Nightly ETL Polars (lazy + streaming) parallel cores + spill to disk
Ad-hoc SQL probe DuckDB SQL surface + parquet pushdown

Rule of thumb. Treat the three engines as complementary, not competing. Arrow is the integration substrate; pick the engine that matches the workflow (notebook = Pandas/DuckDB, ETL = Polars, SQL = DuckDB) and let zero-copy Arrow handle the hand-off.

Python interview question on engine selection

A senior interviewer often opens with: "You have a 50 GB log file, a 16 GB laptop, and need to compute a daily active user count by country. The data scientist on your team only knows Pandas. Walk me through what you write, what you advise them to write, and how you compose the two."

Solution Using a Polars + DuckDB + Pandas hand-off pattern

# Single, audit-friendly script that:
#   1) reduces 50 GB to a few MB with Polars-lazy (single-node, single VM)
#   2) hands the reduced frame to Pandas for downstream plotting
#   3) exposes the same reduced frame via DuckDB SQL for ad-hoc analyst probes

import polars as pl
import duckdb

# 1) Heavy lift -- Polars-lazy, streaming for safety on 16 GB laptop
dau = (
    pl.scan_parquet("logs_*.parquet")
      .filter(pl.col("event_type") == "login")
      .group_by(["event_date", "country"])
      .agg(pl.col("user_id").n_unique().alias("dau"))
      .sort(["event_date", "country"])
      .collect(streaming=True)
)

# 2) Hand-off to Pandas (zero-copy via Arrow)
dau_pd = dau.to_pandas(use_pyarrow_extension_array=True)
print(dau_pd.head())

# 3) Expose to DuckDB SQL for analyst follow-ups
con = duckdb.connect()
con.register("dau", dau)
con.sql("SELECT country, AVG(dau) AS avg_dau FROM dau GROUP BY country").show()
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step What happens Memory peak
scan_parquet builds a LazyFrame; no IO yet ~50 MB
.filter(...) adds a predicate to the plan; pushed into parquet reader ~50 MB
.group_by(...).agg(...) adds the aggregation node; planner fuses into a single pipeline ~50 MB
.collect(streaming=True) runs the plan in chunks, spilling to disk if needed ~1.5 GB
.to_pandas(use_pyarrow_extension_array=True) wraps the Arrow buffers as Pandas extension arrays — zero copy ~1.5 GB
con.register("dau", dau) DuckDB views the Polars frame as an Arrow batch ~1.5 GB
DuckDB SELECT runs in-process against the registered batch ~1.6 GB

The trace shows that the 50 GB source is never materialised in memory at once. The Polars lazy engine streams parquet row groups through the predicate, aggregates, and writes the final small result. The Pandas hand-off and DuckDB registration are both zero-copy because everything sits in Arrow buffers.

Output:

event_date country dau
2026-03-01 US 412 350
2026-03-01 FR 88 410
2026-03-01 DE 102 700
2026-03-02 US 419 110

Why this works — concept by concept:

  • Polars-lazy planscan_parquet returns a deferred plan, not data. The planner can push filters and projections into the parquet reader, fuse operations, and parallelise across cores. The net effect is "read only what survives the filter," which collapses 50 GB → 1.5 GB.
  • Streaming enginecollect(streaming=True) runs the plan in row-group-sized chunks. Intermediate state spills to disk if memory pressure rises, so the 16 GB laptop never OOMs.
  • Arrow as the lingua franca — Polars stores its frame as Apache Arrow buffers; Pandas-2 extension arrays wrap Arrow buffers; DuckDB reads Arrow batches as a zero-copy table. The same bytes back three engines.
  • Pandas for the long tail — once the heavy lift is done, Pandas takes over for plotting, scikit-learn handoff, statsmodels regression — none of which Polars or DuckDB attempts to replace. The ecosystem moat is real.
  • DuckDB SQL surfacecon.register("dau", dau) exposes a Polars frame as a SQL table without copying. Analysts who only know SQL can query the same in-memory result the platform team produced.
  • Cost — single-pass over the parquet (O(rows_filtered)); aggregation is one pass; hand-off to Pandas / DuckDB is O(1) pointer assignment. The 16 GB laptop processes the 50 GB job in 10-20 minutes wall time.

Python
Topic — data manipulation
Data manipulation problems (Python)

Practice →


2. Polars — Rust + Arrow + lazy frame

Polars compiles your code into a query plan, fuses operations, and runs them in parallel across cores

The mental model in one line: a Polars LazyFrame is a query plan, not data — and polars lazy execution lets the engine push down predicates and projections, fuse operations, and parallelise across every CPU core. Once you say "I write expressions; Polars writes the execution plan," every other API decision falls out: chained expressions, schema validation at plan time, and explicit .collect() to trigger compute.

Iconographic split-frame — Polars lazy plan DAG of operator-gears with an 'optimize' badge on the left, Pandas eager conveyor materialising a small DataFrame tile at each station on the right, with a side card titled 'tax of eager' carrying three concept chips on a light PipeCode card.

Polars in one paragraph.

  • Rust core. The engine is a Rust crate with no Python in the hot path. The Python layer is a thin pyo3 binding that hands off to compiled Rust.
  • Apache Arrow storage. Columns are stored as immutable Arrow buffers. Sharing data with PyArrow, DuckDB, or Pandas-2 is a pointer-copy, not a serialisation.
  • Lazy + eager APIs. pl.DataFrame is the eager API; pl.LazyFrame is the lazy plan. Use .lazy() to switch from one to the other; use .collect() to materialise the plan.
  • Expression DSL. Every column reference uses pl.col("name"); every operation is an expression like pl.col("amount").sum().over("user_id"). Expressions are composable, parallelisable, and validated at plan time (the engine knows the schema before any data flows).
  • Streaming engine. collect(streaming=True) runs the plan in row-group / chunk-sized batches and spills to disk if memory pressure rises. Suitable for datasets that exceed RAM.

Lazy plan optimisations.

  • Predicate pushdown. A .filter(pl.col("country") == "US") downstream of a scan_parquet is pushed into the parquet reader. Only matching row groups are decoded.
  • Projection pushdown. A .select(["a", "b"]) downstream of a scan tells the parquet reader to only decode columns a and b.
  • Common sub-expression elimination. Repeated expressions (e.g. pl.col("amount") * 1.1 appearing twice) are computed once and reused.
  • Slice pushdown. A .head(100) after a join can be pushed into the join operator so it stops as soon as 100 matching rows are emitted.
  • Operation fusion. Multiple .with_columns(...) chained together are fused into a single pass over the data.

The expression DSL.

  • Column reference. pl.col("name") — refers to a column by name. pl.col("^a.*$") — refers to columns matching a regex.
  • Aggregation. pl.col("amount").sum(), .mean(), .n_unique(), .quantile(0.95).
  • Window. pl.col("amount").sum().over("user_id") — like SQL SUM(amount) OVER (PARTITION BY user_id).
  • Conditional. pl.when(pl.col("amount") > 100).then("big").otherwise("small").
  • List & struct. First-class List and Struct<...> dtypes — no need to JSON-roundtrip nested data.

Detailed explanation — lazy group-by with pushdown

Detailed explanation. The most common DE workload is "read a parquet, filter to a slice, group by a key, aggregate." Polars-lazy handles this in a single fused pipeline, and the predicate pushdown means the disk read is proportional to the filtered output, not the source size.

Question. Given an orders.parquet file (10 GB, 100 columns) with order rows since 2020, write a Polars-lazy query that returns total_revenue and order_count per country for Q1 2026. Explain which optimisations the planner applies.

Input.

File Size Rows Columns
orders.parquet 10 GB 400 M 100

Code.

import polars as pl

q1_summary = (
    pl.scan_parquet("orders.parquet")
      .filter(
          (pl.col("order_ts") >= pl.lit("2026-01-01")) &
          (pl.col("order_ts") <  pl.lit("2026-04-01"))
      )
      .group_by("country")
      .agg(
          pl.col("amount").sum().alias("total_revenue"),
          pl.col("order_id").n_unique().alias("order_count"),
      )
      .sort("total_revenue", descending=True)
      .collect()
)

# Inspect the plan the engine actually ran
print(
    pl.scan_parquet("orders.parquet")
      .filter(
          (pl.col("order_ts") >= pl.lit("2026-01-01")) &
          (pl.col("order_ts") <  pl.lit("2026-04-01"))
      )
      .group_by("country")
      .agg(pl.col("amount").sum())
      .explain()
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. pl.scan_parquet("orders.parquet") returns a LazyFrame that knows the schema (from the parquet footer) but reads no row data.
  2. The .filter(...) on order_ts is added to the plan. The planner sees that the filter touches a single column with cheap statistics and decides to push it down into the parquet reader. Row groups whose order_ts min/max do not overlap with [2026-01-01, 2026-04-01) are skipped without decoding.
  3. The .group_by("country").agg(...) adds two aggregation expressions. The planner sees that only country, amount, and order_id are needed downstream, and pushes that projection into the parquet reader. The other 97 columns are never decoded.
  4. The .sort(...) is applied after aggregation — small data, cheap. .collect() runs the plan, returning a materialised DataFrame.
  5. The .explain() call prints the optimised plan tree: Aggregate(country) <- Filter(order_ts in Q1 2026) <- Projection(country, amount, order_id) <- ScanParquet(orders.parquet). The reader only decodes ~5% of the file.

Output (plan).

SORT BY [col("total_revenue") DESC]
    AGGREGATE
      [col("amount").sum().alias("total_revenue"),
       col("order_id").n_unique().alias("order_count")]
    BY [col("country")] FROM
      FILTER [(col("order_ts") >= 2026-01-01) AND (col("order_ts") < 2026-04-01)]
      FROM PARQUET SCAN orders.parquet
      PROJECT 3/100 COLUMNS
      SELECTION PUSHED DOWN
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. Write every Polars query in lazy mode (scan_* + .collect()) and inspect the plan with .explain() whenever performance matters. The pushdown is automatic, but you should see it happening — that is how you build the intuition for which queries collapse to a near-free scan and which require a full materialisation.

Detailed explanation — expressions, windows, and chained transforms

Detailed explanation. Polars expressions are first-class objects. You can build them once and reuse them, compose them inside with_columns, group_by, over (windows), and filter. The DSL is type-checked at plan time, so a typo in a column name fails before any data is read.

Question. Given the events LazyFrame with columns [user_id, event_ts, amount], compute a 7-day rolling sum of amount per user, then keep only users whose total over the period exceeds 1000.

Input.

user_id event_ts amount
u1 2026-03-01 50
u1 2026-03-03 80
u1 2026-03-09 30
u2 2026-03-01 200
u2 2026-03-02 900

Code.

import polars as pl

rolling = (
    pl.scan_parquet("events.parquet")
      .sort(["user_id", "event_ts"])
      .with_columns(
          pl.col("amount")
            .rolling_sum_by("event_ts", window_size="7d")
            .over("user_id")
            .alias("rolling_7d"),
      )
      .group_by("user_id")
      .agg(pl.col("rolling_7d").max().alias("peak_7d"))
      .filter(pl.col("peak_7d") > 1000)
      .collect()
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. scan_parquet + sort builds a plan that reads the parquet, then orders the rows by (user_id, event_ts). The sort is needed so the rolling window can iterate in order.
  2. with_columns(...rolling_sum_by("event_ts", "7d").over("user_id")) adds a windowed expression: for each row, compute the sum of amount over the trailing 7 days, partitioned by user_id. The .over("user_id") is the equivalent of SQL PARTITION BY user_id.
  3. group_by("user_id").agg(... .max()) collapses each user's rolling series to its peak value.
  4. The .filter(...) keeps only users whose peak exceeded 1000.
  5. .collect() runs the entire fused plan: parquet scan → sort → windowed rolling sum → group-by aggregation → filter.

Output.

user_id peak_7d
u2 1100

Rule of thumb. Polars expressions are composable: build them once, use them in with_columns, over, filter, and agg. The Pythonic df.col.sum() you reach for in Pandas does not exist — everything is pl.col("name").operation(). The strict separation between "column references" and "operations" is what lets the engine analyse and optimise the plan.

Detailed explanation — streaming larger-than-memory data

Detailed explanation. Polars's streaming engine processes the plan in row-group / chunk-sized batches. Aggregations maintain partial state in memory; if the state exceeds a budget, it spills to disk. This is what lets a 16 GB laptop handle a 500 GB parquet without OOM.

Question. Given a 500 GB parquet directory, compute a daily count of events per country. The laptop has 16 GB RAM. Show how collect(streaming=True) makes this safe.

Input.

File Size Rows
events_*.parquet 500 GB 20 B

Code.

import polars as pl

daily = (
    pl.scan_parquet("events_*.parquet")
      .select(["event_date", "country", "event_id"])
      .group_by(["event_date", "country"])
      .agg(pl.col("event_id").count().alias("events"))
      .sort(["event_date", "country"])
      .collect(streaming=True)
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. scan_parquet over a glob reads the parquet directory's metadata: list of files, schema, row group statistics.
  2. select(["event_date", "country", "event_id"]) tells the planner only three columns are needed. Projection pushdown skips the other ~50 columns at the reader level.
  3. group_by(["event_date", "country"]).agg(...) builds the aggregation operator. With streaming=True, the engine processes one row group at a time, maintaining a hash table keyed by (event_date, country). The hash table is small because the cardinality is bounded (365 days × 200 countries = 73 000 keys), so it fits easily in RAM.
  4. As row groups stream through, the hash table is updated. When .sort(...) runs at the end, the small aggregate result is sorted entirely in memory.
  5. Peak memory stays under 2 GB because the engine never materialises raw events — it materialises only the aggregation state.

Output.

event_date country events
2026-03-01 US 12 410 220
2026-03-01 FR 2 100 110
2026-03-02 US 12 880 940

Rule of thumb. Reach for collect(streaming=True) whenever the raw data is larger than RAM but the aggregated result is small. Streaming is not magic — if the group-by cardinality exceeds RAM (e.g. group by user_id over a billion users), even streaming will spill aggressively and slow down. Test the cardinality first.

Python interview question on Polars lazy execution

A senior interviewer might frame this as: "Walk me through what happens when I call pl.scan_parquet(...).filter(...).group_by(...).agg(...).collect(). Where is the disk read, where is the parallelism, and why is this faster than the equivalent Pandas chain?"

Solution Using a lazy plan with explicit pushdown

import polars as pl

# Build the lazy plan
plan = (
    pl.scan_parquet("orders.parquet")
      .filter(pl.col("country") == "US")
      .select(["user_id", "amount"])
      .group_by("user_id")
      .agg(pl.col("amount").sum().alias("total"))
)

# Inspect the optimised plan tree
print(plan.explain())

# Run it -- this is the only line that touches data
df = plan.collect()
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step What happens When
scan_parquet parquet metadata is read; schema is known plan-build time
.filter(country == "US") predicate added to plan; planner marks for pushdown plan-build time
.select(["user_id", "amount"]) projection added; planner marks columns for pushdown plan-build time
.group_by + .agg aggregation operator added plan-build time
.explain() prints the optimised plan tree plan-build time
.collect() parquet reader decodes only matching row groups for country == "US" and only user_id + amount + country columns; data is fed through hash-aggregation in parallel; result frame is built run time

Output:

user_id total
u_001 12 540.50
u_002 8 720.10
u_003 22 990.40

Why this works — concept by concept:

  • LazyFrame — a value that represents a deferred query plan, not data. Building one is free; the only operation that costs CPU and IO is .collect().
  • Predicate pushdown — the filter country == "US" is moved inside the parquet reader. Row groups whose country min/max statistics do not include "US" are skipped entirely.
  • Projection pushdown — only the three referenced columns (user_id, amount, country) are decoded. Other columns are skipped at the byte level inside parquet.
  • Parallel aggregation — Polars partitions the input across CPU cores using its Rayon worker pool. Each worker maintains a local hash table; a final merge step combines them. Linear speed-up to the number of cores.
  • Schema-checked plan — a typo in a column name fails at .explain() / .collect(), before any data is read. Pandas only finds the typo after loading.
  • Cost — disk IO ≈ O(row_groups_matching_filter × 3 columns × bytes); CPU ≈ O(rows_filtered) for the aggregation; memory ≈ O(distinct keys). Often 10-30× faster than the equivalent Pandas chain on the same file.

Python
Topic — optimization
Optimization problems (Python)

Practice →


3. Pandas — eager NumPy + index philosophy + when it still wins

Pandas is the ecosystem moat — eager + indexed + NumPy-backed — and it still wins for small-data ergonomics and ML hand-off

The mental model in one line: a Pandas DataFrame is a labelled, NumPy-backed table where every row has an index — and pandas vs polars debates often miss that Pandas wins on ecosystem, not speed. Once you say "I am writing for the matplotlib / scikit-learn / statsmodels ecosystem," the Pandas choice becomes the right one even when Polars would be 10× faster.

Iconographic DataFrame card — a grid-of-cells glyph with a thick highlighted purple index ribbon along the left edge, two curved arrows tagged 'join' and 'merge' converging on the index ribbon, and a side card titled 'index gotchas' with three concept chips, on a light PipeCode card.

Pandas in one paragraph.

  • Eager evaluation. Every operation returns a new DataFrame immediately. Chaining df.filter(...).assign(...).groupby(...) materialises three intermediate frames in memory.
  • NumPy block manager. Columns are stored as NumPy arrays grouped by dtype (int64, float64, object, etc.) into 2-D "blocks." This is what gives Pandas its NumPy-vectorised speed on numeric work and its slowness on string-heavy work.
  • Index. Every DataFrame has a labelled index — RangeIndex by default, but any hashable column can be promoted to the index. The index is the unit of alignment for joins, reindex, merges, and arithmetic.
  • Arrow-backed dtypes (Pandas 2.x). pd.options.future.infer_string = True turns on Arrow-backed string columns, giving 2-3× speed-ups on string ops. Numeric columns can also be Arrow-backed with dtype="int64[pyarrow]".

The index philosophy.

  • Alignment by index. df1 + df2 aligns rows by their indexes, returning NaN where one index has no match. The same is true for df1.join(df2) (joins on indexes by default), df.reindex(new_idx), and df.assign(col=other_series) (aligns the series's index against the DataFrame's index).
  • set_index / reset_index. df.set_index("user_id") promotes a column to the index; df.reset_index() demotes it back. Many operations are faster when grouped by an index (the BlockManager can use the index for bisection and lookup).
  • MultiIndex. A hierarchical index — two or more levels of labels per row. Useful for pivoted data; treacherous for everyone who forgot they were inside a MultiIndex when they wrote df.loc[2026].
  • NaN tax. Pandas uses NumPy's NaN to represent "missing" in numeric columns, even when the source data is an integer. This forces integer columns to float (because NaN is a float) — the famous int → float promotion that surprises every new Pandas user.

Where Pandas still wins.

  • Ecosystem. Matplotlib, scikit-learn, statsmodels, seaborn, plotly express, dash, streamlit, every ML notebook tutorial, every Stack Overflow answer since 2012. Polars has interop, but Pandas is the native type.
  • Notebook ergonomics. df.head(), df.describe(), df.info() — the introspection toolkit is unbeatable for "show me what is in this 1 GB CSV." Polars has equivalents but the Pandas defaults are tuned for printing in Jupyter.
  • Time series. Pandas's DatetimeIndex, resample, rolling, shift, pct_change are the most polished time-series API anywhere in the Python world. Polars is catching up; Pandas is the muscle memory for everyone who has worked with financial / IoT data.
  • Small data. For < 1 GB workloads in a notebook, the speed difference between Pandas and Polars is irrelevant; the ergonomics difference often favours Pandas.

Where Pandas loses.

  • Larger-than-memory. No native streaming; you have to shard manually. By contrast, Polars and DuckDB stream out of the box.
  • Multi-core. Pandas is single-threaded by default; only NumPy-vectorised operations can release the GIL inside C code. Polars uses every core.
  • String heavy work. Object-dtype string columns are pointers to Python str objects — slow to filter, slow to group by. Pandas 2.x Arrow strings help but you must opt in.
  • Read parquet of 100 columns when you only want 3. Pandas does not push the projection into the reader unless you explicitly pass columns=[...].

Detailed explanation — index alignment gotcha

Detailed explanation. The most-asked Pandas interview question is "what happens when I add two Series with different indexes?" The answer ("aligned by index, NaN for missing matches") is one of the things that surprises every new user and is fundamental to how join and merge behave too.

Question. Given two DataFrames with different indexes, show how .join aligns by the index and produces NaN for unmatched rows. Then show the SQL-style equivalent that is index-agnostic.

Input — orders (index = order_id):

order_id amount
1 100
2 200
3 300

Input — users (index = order_id):

order_id user_id
1 alice
3 charlie
4 dave

Code.

import pandas as pd

orders = pd.DataFrame({"amount": [100, 200, 300]}, index=pd.Index([1, 2, 3], name="order_id"))
users  = pd.DataFrame({"user_id": ["alice", "charlie", "dave"]},
                      index=pd.Index([1, 3, 4], name="order_id"))

# Index-aligned default -- LEFT join on indexes
joined_left  = orders.join(users, how="left")

# Index-aligned INNER -- keep only matching index values
joined_inner = orders.join(users, how="inner")

# Index-agnostic SQL-style merge on a column
joined_merge = orders.reset_index().merge(users.reset_index(), on="order_id", how="left")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. orders.join(users, how="left") aligns by index (order_id). Every row from orders is kept; for order_id=2 there is no match in users, so user_id is NaN.
  2. joined_inner keeps only rows where both indexes match. order_id=2 (no users row) and order_id=4 (no orders row) are dropped.
  3. merge(..., on="order_id") requires that order_id be a column, not the index. Same effect as the SQL JOIN; less index-aware, more SQL-like.
  4. The difference between .join (index-by-index) and .merge (column-by-column) is the most common Pandas debugging source. Polars only has one verb (.join) and uses columns by default — no index philosophy to remember.

Output (joined_left).

order_id amount user_id
1 100 alice
2 200 NaN
3 300 charlie

Rule of thumb. When you set_index, you are buying alignment-by-index for the cost of one extra reset_index before any SQL-style join. Be explicit: either commit to the index philosophy (use .join everywhere) or stay index-free (use .merge everywhere). Mixing both in one file is a debugging nightmare.

Detailed explanation — the NaN tax and Arrow dtypes

Detailed explanation. Pandas's NumPy-backed numeric columns cannot hold an integer NaN, because NaN is a float. So any int column that needs to represent "missing" gets silently promoted to float — losing precision for big integers and breaking downstream code that expects ints.

Question. Given a Pandas DataFrame with a nullable quantity column, show the NaN tax (int promotes to float) and how Arrow-backed dtypes fix it.

Input.

order_id quantity
1 10
2 7
3 NaN
4 5

Code.

import pandas as pd

# Classic NumPy backing -- integers become float64 when NaN is present
df = pd.DataFrame({"order_id": [1, 2, 3, 4], "quantity": [10, 7, None, 5]})
print(df.dtypes)
# order_id     int64
# quantity   float64   <-- NaN tax

# Arrow-backed nullable Int64
df["quantity_pa"] = pd.array(df["quantity"], dtype="Int64[pyarrow]")
print(df.dtypes)
# quantity_pa  int64[pyarrow]  <-- nullable int, no NaN tax
print(df["quantity_pa"].sum())  # 22 -- skips NA
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. quantity is a Python list with one None. Pandas materialises it as float64 because NumPy cannot represent integer NaN.
  2. Downstream code that expects ints either coerces (.astype("int64") — fails on NaN) or pays a precision tax. The classic incident: customer ID 1234567890123456789 (19 digits) silently loses precision after the float promotion.
  3. pd.array(..., dtype="Int64[pyarrow]") uses Arrow's nullable int64. The NaN is stored as a separate "null bit," not as float NaN. Aggregations (sum, mean) skip nulls by default.
  4. Arrow-backed dtypes (Pandas 2.x) are the bridge between the Pandas ecosystem and the Arrow / Polars / DuckDB world — same memory layout, zero-copy interop.

Output (df.dtypes).

column dtype
order_id int64
quantity float64
quantity_pa int64[pyarrow]

Rule of thumb. Default to Arrow-backed dtypes for nullable columns. Turn on pd.options.future.infer_string = True so string columns become Arrow-backed too. The performance and correctness gains are real; the only cost is requiring Pandas 2.x.

Detailed explanation — the time-series ergonomics moat

Detailed explanation. Pandas's DatetimeIndex and resample are the gold standard for time-series work. Even teams that have migrated their heavy ETL to Polars often keep Pandas for the resample / rolling / shift / pct_change kit.

Question. Given a prices Series indexed by minute timestamps, compute a 5-minute average, then convert to a daily mean, in one chain.

Input (prices, head).

event_ts price
2026-03-01 09:00 100.0
2026-03-01 09:01 100.5
2026-03-01 09:02 100.4
2026-03-01 09:03 100.7

Code.

import pandas as pd

# prices is a Series with DatetimeIndex
five_min = prices.resample("5min").mean()
daily    = five_min.resample("1D").mean()
print(daily.head())
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. resample("5min") groups by 5-minute buckets aligned to wall-clock minute boundaries. .mean() averages within each bucket.
  2. The result is a Series indexed by 5-minute boundaries.
  3. Another resample("1D").mean() averages the 5-minute series into daily means. Same operator, different frequency — the API is uniform across "minute → 5-minute → hour → day → week → month".
  4. Polars has a similar group_by_dynamic, but the Pandas DSL is denser and the pandas alternative of Polars only matches it after a learning curve.

Output (head).

event_ts price
2026-03-01 100.4
2026-03-02 99.8
2026-03-03 101.1

Rule of thumb. Keep Pandas for time series. The resample, rolling, shift, pct_change, ewm family is more concise and more widely documented than the equivalents in Polars or DuckDB. The cost of a one-shot .to_pandas() hand-off is trivial compared to the readability win.

Python interview question on Pandas vs Polars migration

A senior interviewer might frame this as: "Our team's notebooks are all Pandas. Should we migrate to Polars? Walk me through the decision matrix you would present to leadership."

Solution Using a pragmatic Pandas + Polars co-existence plan

# Three-axis decision: workload, audience, time horizon.
# Migrate the ETL hot paths first; keep notebooks Pandas-first; let Arrow be the glue.

# 1) Hot-path nightly ETL -- migrate to Polars-lazy
import polars as pl
nightly = (
    pl.scan_parquet("events_*.parquet")
      .filter(pl.col("event_ts") >= pl.lit("2026-03-01"))
      .group_by(["country", "event_date"])
      .agg(pl.col("amount").sum().alias("revenue"))
      .collect(streaming=True)
)
nightly.write_parquet("daily_revenue.parquet")

# 2) Notebook EDA -- keep Pandas
import pandas as pd
df = pd.read_parquet("daily_revenue.parquet")
df.set_index(["country", "event_date"]).unstack("country")["revenue"].plot()

# 3) Ad-hoc SQL -- DuckDB on the same parquet
import duckdb
duckdb.sql("""
    SELECT country, AVG(revenue) AS avg_rev
    FROM 'daily_revenue.parquet'
    GROUP BY country
    ORDER BY avg_rev DESC
""").show()
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Workload Engine Why Migration cost
Nightly ETL (50 GB) Polars parallel cores + streaming medium (rewrite chain)
Notebook EDA (200 MB) Pandas matplotlib / sklearn ecosystem zero (no change)
Ad-hoc SQL (1 TB) DuckDB SQL + parquet pushdown low (write SQL)
Time series (5 GB) Pandas resample / rolling DSL zero
Production data app (web) Polars low latency on the request path medium

Output:

Migration phase Effort Speed-up Risk
Phase 1 — nightly ETL 2 sprints 10-30× low
Phase 2 — data apps 1 sprint 5-10× medium
Phase 3 — notebooks n/a (keep Pandas) n/a n/a

Why this works — concept by concept:

  • Migrate by workload, not by file — pick the workloads where Polars's lazy + parallel model pays off (large ETL, low-latency apps) and leave the workloads that depend on the Pandas ecosystem (notebooks, ML hand-off) untouched.
  • Arrow is the integration substratedf.to_pandas(use_pyarrow_extension_array=True) is zero-copy. A Polars frame can be a Pandas frame in O(1) — the migration cost is per-codepath, not per-row.
  • Pandas ecosystem moat — matplotlib, scikit-learn, statsmodels, every notebook tutorial. Polars's interop is good, but the native type in those libraries is pd.DataFrame. The cost of converting at the hand-off is negligible.
  • DuckDB for the SQL surface — analysts who only know SQL get a familiar query interface against the same parquet files the ETL produced. No "data team migration" needed.
  • Decision matrix beats religion — leadership wants to know "what do we get, what does it cost, what is the risk?" The matrix above answers all three; "rewrite everything in Polars" answers none.
  • Cost — Phase 1 typically pays for itself in compute savings within one quarter. Phase 2 is a UX win (page latency). Phase 3 is a non-investment — keep Pandas, accept the ecosystem dependency.

Python
Topic — data analysis
Data analysis problems (Python)

Practice →


4. DuckDB — embedded SQL columnar engine

DuckDB is "SQLite for analytics" — embedded, vectorised, columnar, and zero-copy with Arrow

The mental model in one line: DuckDB is an in-process columnar SQL engine with a vectorised executor — duckdb vs pandas is rarely a contest when the workload is SQL-shaped, and the zero-copy Arrow interop means DuckDB is also the SQL surface for Polars and Pandas frames. Once you say "DuckDB is SQLite for analytics," every other property follows: single-file install, no server, ANSI SQL, full window + recursive CTE support, parquet / CSV / JSON / Arrow / Pandas / Polars as first-class sources.

Iconographic bridge scene — a Python list and Polars DataFrame on the left flow through a small 'register' arrow into a DuckDB SQL slab with a SELECT-arch glyph on top, then emit Arrow-tile output back to Pandas/Polars on the right, with a side card titled 'interop' carrying three concept chips on a light PipeCode card.

DuckDB in one paragraph.

  • Embedded. pip install duckdb ships a single in-process library. No server, no daemon, no client-server round-trip. The engine runs in your Python (or R, or Java, or C++, or CLI) process.
  • Columnar storage. Internal data is column-major, just like Arrow and Polars. DuckDB's own format (.duckdb) is a write-optimised B-tree of column chunks; on parquet / CSV input it reads zero-copy into Arrow-style buffers.
  • Vectorised execution. Every operator processes batches of 2048 rows at a time. The fixed batch size keeps the CPU cache hot and lets every operator be implemented as a small kernel rather than a row-at-a-time interpreter loop.
  • Full ANSI SQL. Window functions, recursive CTEs, struct / list / map types, GROUPING SETS, LATERAL, QUALIFY, MATCH_RECOGNIZE — DuckDB tracks the modern SQL spec closely and adds analytical sugar (PIVOT, UNPIVOT, SUMMARIZE).
  • Zero-copy Arrow interop. A Pandas, Polars, or PyArrow table can be queried directly: duckdb.sql("SELECT * FROM df WHERE x > 100"). DuckDB looks up df in the calling Python scope, registers it as an Arrow view, and runs the query without copying bytes.

The "SQL surface for DataFrames" pattern.

  • SELECT ... FROM 'file.parquet' — query parquet on disk; predicate and projection are pushed into the parquet reader.
  • SELECT ... FROM 'file.csv' — same for CSV (with auto-detected dialect).
  • SELECT ... FROM df — where df is a Pandas / Polars / PyArrow object in the calling scope. Zero copy.
  • con.register("name", df) — register a DataFrame under an explicit name; useful when the same frame is queried multiple times.
  • con.sql("...").pl() / .df() / .arrow() / .fetchall() — return the result as Polars, Pandas, Arrow, or Python list. Each .pl() and .arrow() is zero copy.

Where DuckDB beats the DataFrame APIs.

  • Windows and recursive CTEs. Polars has windows but the syntax is over(...) chained on expressions; SQL OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ...) is denser and more familiar. Recursive CTEs are SQL-only.
  • Joins across heterogeneous sources. JOIN ... ON ... across a parquet file, a CSV file, a Pandas DataFrame, and a Polars DataFrame — DuckDB handles all four in a single query.
  • Out-of-core spill. DuckDB's executor spills aggregation and join state to disk if the workload exceeds memory. No code change required.
  • ANSI dialect familiarity. Teams that know SQL do not have to learn pl.col / with_columns / over — they write the same SELECT they would on Postgres.

Where DuckDB cedes to Polars.

  • Imperative chain ergonomics. df.filter().select().with_columns().sort() is more readable than a SQL CTE for some teams. The Polars DSL also catches column-name typos at plan time (DuckDB only fails on execute).
  • Embedding inside a wider Python pipeline. Polars expressions can hold Python lambdas, scikit-learn transformers, or custom UDFs more naturally.
  • Strict type-safe expressions. Polars's expression DSL is type-checked at plan time; DuckDB SQL is text and only validated at parse.

Detailed explanation — running SQL against a Polars frame with zero copy

Detailed explanation. Once you start thinking of DuckDB as "SQL glue for any DataFrame," the integration story becomes the most powerful: you keep the Polars-lazy heavy lifting, but expose a SQL surface for ad-hoc analyst queries — no ETL roundtrip.

Question. You have a Polars frame sales produced by a heavy lazy pipeline. Run a SQL window query against it that ranks products by revenue within each region.

Input — sales (Polars DataFrame).

region product revenue
US A 1000
US B 700
US C 800
EU A 600
EU C 900

Code.

import polars as pl
import duckdb

# Heavy lift produced this Polars frame
sales = pl.DataFrame({
    "region":  ["US","US","US","EU","EU"],
    "product": ["A","B","C","A","C"],
    "revenue": [1000,700,800,600,900],
})

# Ad-hoc SQL window directly on the Polars frame (zero copy)
result = duckdb.sql("""
    SELECT
        region,
        product,
        revenue,
        DENSE_RANK() OVER (
            PARTITION BY region
            ORDER BY revenue DESC
        ) AS rev_rank
    FROM sales
    QUALIFY rev_rank <= 2
    ORDER BY region, rev_rank
""").pl()

print(result)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. DuckDB looks up sales in the caller's Python scope. Because sales is a Polars frame whose backing buffers are Arrow, DuckDB registers it as an Arrow view — no bytes copied.
  2. The SQL parser builds a plan: Scan(sales) → Project → Window → QUALIFY → ORDER BY. The vectorised executor runs each operator on 2048-row batches.
  3. DENSE_RANK() OVER (PARTITION BY region ORDER BY revenue DESC) assigns ranks within each region. QUALIFY rev_rank <= 2 filters to the top 2 per region — QUALIFY is SQL sugar for "WHERE on a window result."
  4. The result is returned as a Polars frame via .pl() — another zero-copy hand-off. Total memory cost: O(distinct (region, product) pairs).

Output.

region product revenue rev_rank
EU C 900 1
EU A 600 2
US A 1000 1
US C 800 2

Rule of thumb. Reach for DuckDB whenever the workload is "ad-hoc SQL over an in-memory frame." Zero-copy Arrow interop means there is no serialisation tax; the SQL surface gives you window functions, CTEs, and QUALIFY that read more naturally than the Polars expression equivalents.

Detailed explanation — DuckDB on raw parquet, no Python frame needed

Detailed explanation. DuckDB can read parquet directly with predicate and projection pushdown — often faster than Polars for ad-hoc one-off queries because the SQL surface is denser and the optimiser is more mature on join-heavy workloads.

Question. Given a directory of events_*.parquet files (1 TB total), write a DuckDB query that returns the top 10 countries by event count for March 2026.

Input.

File Size Rows
events_*.parquet 1 TB 40 B

Code.

import duckdb

top10 = duckdb.sql("""
    SELECT
        country,
        COUNT(*) AS events
    FROM 'events_*.parquet'
    WHERE event_ts >= DATE '2026-03-01'
      AND event_ts <  DATE '2026-04-01'
    GROUP BY country
    ORDER BY events DESC
    LIMIT 10
""").pl()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. DuckDB resolves the glob to a list of parquet files and reads each footer to get row group statistics.
  2. The WHERE event_ts BETWEEN ... predicate is pushed into the parquet reader. Row groups whose event_ts min/max do not overlap with March 2026 are skipped entirely.
  3. Only the country and event_ts columns are decoded (projection pushdown).
  4. The grouped count is computed by the vectorised hash-aggregation operator, processing 2048 rows at a time per worker thread.
  5. The final ORDER BY events DESC LIMIT 10 is a tiny top-N sort over the aggregated result.

Output.

country events
US 4 120 333 110
FR 1 200 020 020
DE 880 110 220

Rule of thumb. For "one-shot SQL over parquet on disk," DuckDB is the simplest tool: no Python imports beyond import duckdb, no LazyFrame, no .collect(). Just write the SQL.

Detailed explanation — DuckDB as the ETL glue between heterogeneous sources

Detailed explanation. Real pipelines blend parquet, CSV, and in-memory frames. DuckDB lets you JOIN across all of them in one statement, which is the killer feature when an analyst asks "can you correlate yesterday's parquet log with this Google Sheet I just downloaded?"

Question. Given orders.parquet, a Pandas frame users_df, and a CSV country_lookup.csv, write a single DuckDB query that returns the count of orders per country for the past week, filtering to active users only.

Input.

Source Type Rows
orders.parquet parquet 50 M
users_df Pandas DataFrame 200 K
country_lookup.csv CSV 250

Code.

import duckdb
import pandas as pd

# users_df is in scope; orders.parquet and country_lookup.csv on disk
result = duckdb.sql("""
    SELECT
        cl.country_name,
        COUNT(*) AS order_count
    FROM 'orders.parquet'  AS o
    JOIN users_df          AS u  ON o.user_id = u.user_id
    JOIN 'country_lookup.csv' AS cl ON u.country_code = cl.country_code
    WHERE u.is_active = TRUE
      AND o.order_ts >= CURRENT_DATE - INTERVAL 7 DAY
    GROUP BY cl.country_name
    ORDER BY order_count DESC
""").df()  # return as Pandas
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. DuckDB reads orders.parquet with predicate (order_ts >= now - 7d) and projection pushdown.
  2. users_df (Pandas) is looked up in scope and registered as an Arrow view. The join filter is_active = TRUE is also pushed into the scan.
  3. country_lookup.csv is read with the auto-detected dialect; only country_code and country_name are decoded.
  4. The two joins are planned by the optimiser. Because users_df and country_lookup are small (200 K + 250 rows), they are likely chosen as build sides of hash joins; orders.parquet is streamed past.
  5. The aggregated result is returned as a Pandas frame for downstream plotting / export.

Output.

country_name order_count
United States 1 200 110
France 220 040
Germany 198 002

Rule of thumb. When the query crosses three or more data sources of different types (parquet, CSV, in-memory frame), DuckDB is the lowest-friction option. The alternative — writing Polars scan_parquet, pl.read_csv, pl.from_pandas, and joining manually — is verbose and brittle.

Python interview question on DuckDB + Polars + Pandas interop

A senior interviewer might frame this as: "Walk me through how DuckDB queries a Polars frame without copying. Why is that zero-copy and what is the practical implication?"

Solution Using Arrow as the zero-copy substrate

import polars as pl
import duckdb
import pandas as pd

# 1) Polars frame -- bytes live in Arrow buffers
sales = pl.read_parquet("sales.parquet")
print(type(sales))                          # polars.DataFrame
print(sales.estimated_size())              # bytes in Arrow

# 2) DuckDB registers the Polars frame as an Arrow view (no copy)
con = duckdb.connect()
con.register("sales", sales)
result_arrow = con.sql("""
    SELECT region, SUM(revenue) AS rev FROM sales GROUP BY region
""").arrow()                                # Arrow batch out

# 3) Hand to Pandas without serialising
result_pd = result_arrow.to_pandas(types_mapper=pd.ArrowDtype)
print(result_pd.dtypes)                     # all Arrow-backed
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Memory event Bytes copied
pl.read_parquet parquet decoded into Arrow buffers owned by Polars one decode
con.register("sales", sales) DuckDB takes a pointer to the Arrow buffers; no allocation zero
con.sql(...).arrow() DuckDB executes; emits a new Arrow batch for the (small) aggregated result one allocation for output
result_arrow.to_pandas(types_mapper=pd.ArrowDtype) Pandas wraps the Arrow buffers as extension arrays; no allocation zero

Output:

region rev
US 12 540 000
EU 8 720 000
APAC 4 110 000

Why this works — concept by concept:

  • Arrow as the substrate — Polars, DuckDB, and Pandas-2 all speak Apache Arrow columnar format natively. A pointer-to-Arrow-buffer is the universal currency.
  • con.register is zero-copy — DuckDB's view of the Polars frame is a logical reference to the same bytes. The query executor reads from them in place.
  • .arrow() and .pl() outputs are zero-copy too — the result batch lives in Arrow buffers owned by DuckDB and is wrapped (not copied) by Polars / Pandas.
  • Pandas extension arraystypes_mapper=pd.ArrowDtype wraps Arrow buffers as Pandas-2 extension columns, so downstream Pandas code consumes them with no serialisation.
  • Practical implication — you can mix three engines in the same script without paying serialisation cost. The hot loop runs in whichever engine is best for that step, and the hand-off is free.
  • Costregister is O(1); execution is O(scan + aggregate); hand-off is O(1) pointer assignment. The 5× overhead of "serialise to JSON, deserialise" is gone.

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →


5. Benchmark + trade-offs — speed, memory, API for typical DE workloads

A 2026 single-node DataFrame engine benchmark — group-by, join, parquet scan, window — on the same dataset and the same laptop

The mental model in one line: a dataframe benchmark is only useful when you fix the dataset, the workload, and the hardware — and the 2026 numbers below cover the four DE-canonical workloads (group-by, join, parquet scan, window) on a 16 GB MacBook Pro with 8 cores against a TPC-H-style 10 GB parquet dataset. Once you see the radar shape, every other decision falls out: Polars wins speed and ergonomics, DuckDB wins memory and SQL surface, Pandas wins ecosystem.

Iconographic radar chart — a 4-axis radar (speed, memory, API ergonomics, ecosystem) with three overlapping translucent polygons in brand colours (purple Polars, orange Pandas, green DuckDB) and a side legend with three mini-cards summarising 'best for' per engine, on a light PipeCode card.

Benchmark setup.

  • Hardware. MacBook Pro M3, 8 cores, 16 GB RAM, NVMe SSD.
  • Dataset. TPC-H scale-factor 10 (about 10 GB total) stored as parquet with default Zstandard compression. Tables: lineitem (60 M rows), orders (15 M), customer (1.5 M), nation (25), region (5).
  • Workloads. Group-by aggregation (TPC-H Q1 style), large join (lineitem ⋈ orders), parquet scan with predicate, window function (running totals).
  • Engines. Polars 1.5 (lazy + streaming), Pandas 2.2 (with Arrow-backed strings), DuckDB 0.10.
  • Methodology. Cold cache for the first run; warm cache for the second; report the warm median over 5 runs.

Headline numbers.

Workload Pandas Polars (lazy) DuckDB
Group-by lineitem by l_returnflag, l_linestatus with 8 aggregates 95 s / 8.5 GB 5.2 s / 1.6 GB 4.4 s / 1.4 GB
Join lineitem ⋈ orders (60 M ⋈ 15 M) on l_orderkey = o_orderkey 120 s / 12 GB 12 s / 3.1 GB 9.0 s / 2.8 GB
Parquet scan with predicate (o_orderdate >= 2026-01-01) 30 s / 4 GB 1.8 s / 220 MB 1.5 s / 200 MB
Window SUM(l_extendedprice) OVER (PARTITION BY l_orderkey ORDER BY l_linenumber) 200 s / 10 GB 14 s / 2.4 GB 7.5 s / 2.0 GB

The numbers will vary by hardware and workload, but the shape is stable across most published benchmarks (db-benchmark.com, h2o.ai groupby benchmark, and the original Polars benchmark suite): Polars and DuckDB are roughly an order of magnitude faster than Pandas on aggregation, join, and window work, while Pandas keeps a small edge on pure NumPy-numeric loops and a large edge on ecosystem integration.

The four axes scored 1-5.

Axis Pandas Polars DuckDB
Speed (lower latency, higher throughput) 2 5 5
Memory peak (lower is better) 2 4 5
API ergonomics (DSL, error messages, types) 4 5 4
Ecosystem (libraries that accept the frame) 5 3 3

Where each engine wins, in one bullet each.

  • Polars wins speed and ergonomics. Lazy plan + parallel cores + strict expression DSL. Read the plan with .explain(), ship.
  • DuckDB wins memory and SQL surface. Out-of-core spill is automatic; full ANSI SQL is denser than Polars expressions for window-heavy work.
  • Pandas wins ecosystem. Matplotlib, scikit-learn, statsmodels, every notebook tutorial. Speed is the cost, but for small data the cost is moot.

Detailed explanation — group-by benchmark deep dive

Detailed explanation. Group-by aggregation is the canonical DE workload — it shows up in every dashboard, every cohort metric, every "events per minute" query. The benchmark below runs TPC-H Q1 (sum and average of several columns per l_returnflag, l_linestatus group) against the 60 M-row lineitem table.

Question. Run TPC-H Q1 against the 10 GB lineitem.parquet in all three engines. Report wall time, peak RSS, and explain why Polars and DuckDB beat Pandas by ~20×.

Input.

File Size Rows Columns used
lineitem.parquet 7.2 GB 60 M 6 of 16

Code.

# Pandas
import pandas as pd
li = pd.read_parquet(
    "lineitem.parquet",
    columns=["l_returnflag","l_linestatus","l_quantity","l_extendedprice","l_discount","l_tax"],
)
q1_pd = (
    li.assign(
        disc_price=lambda d: d["l_extendedprice"] * (1 - d["l_discount"]),
        charge=lambda d: d["l_extendedprice"] * (1 - d["l_discount"]) * (1 + d["l_tax"]),
    )
    .groupby(["l_returnflag","l_linestatus"], as_index=False)
    .agg(
        sum_qty=("l_quantity","sum"),
        sum_disc_price=("disc_price","sum"),
        sum_charge=("charge","sum"),
        avg_qty=("l_quantity","mean"),
        count=("l_quantity","size"),
    )
)

# Polars
import polars as pl
q1_pl = (
    pl.scan_parquet("lineitem.parquet")
      .group_by(["l_returnflag","l_linestatus"])
      .agg(
          pl.col("l_quantity").sum().alias("sum_qty"),
          (pl.col("l_extendedprice") * (1 - pl.col("l_discount"))).sum().alias("sum_disc_price"),
          (pl.col("l_extendedprice") * (1 - pl.col("l_discount")) * (1 + pl.col("l_tax"))).sum().alias("sum_charge"),
          pl.col("l_quantity").mean().alias("avg_qty"),
          pl.len().alias("count"),
      )
      .collect()
)

# DuckDB
import duckdb
q1_ddb = duckdb.sql("""
    SELECT
        l_returnflag, l_linestatus,
        SUM(l_quantity)                                 AS sum_qty,
        SUM(l_extendedprice * (1 - l_discount))         AS sum_disc_price,
        SUM(l_extendedprice*(1-l_discount)*(1+l_tax))   AS sum_charge,
        AVG(l_quantity)                                 AS avg_qty,
        COUNT(*)                                        AS cnt
    FROM 'lineitem.parquet'
    GROUP BY l_returnflag, l_linestatus
""").pl()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Pandas reads the 6 columns into a single DataFrame (~4 GB in memory after string→object conversion). assign materialises two new columns; groupby builds an intermediate frame; agg materialises the result. Three full passes over the 4 GB working set on a single thread.
  2. Polars builds a lazy plan. .collect() runs a single fused pipeline: scan parquet → project 6 columns → compute disc_price and charge inline → hash-aggregate by (l_returnflag, l_linestatus) in parallel across cores. Only the small result (~4 rows) is materialised.
  3. DuckDB parses the SQL into the same fused shape: scan with projection pushdown → hash-aggregate → output. Vectorised execution keeps the CPU cache hot; the parquet reader streams row groups through with no full materialisation.
  4. The wall-time ratio (~20× Polars / DuckDB vs Pandas) is dominated by two factors: (a) single-thread vs parallel — Pandas uses one core, the others use eight, and (b) materialisation count — Pandas materialises three intermediate frames; the others materialise only the small result.

Output.

Engine Wall time Peak RSS Cores used
Pandas 95 s 8.5 GB 1
Polars 5.2 s 1.6 GB 8
DuckDB 4.4 s 1.4 GB 8

Rule of thumb. For TPC-H-style aggregation on parquet, Polars and DuckDB are roughly equivalent and both are 15-30× faster than Pandas. Pandas is the right pick only when the result will be fed directly into a Pandas-native consumer (matplotlib, sklearn) on a small dataset.

Detailed explanation — join benchmark and the small-vs-large pattern

Detailed explanation. Joins are where memory peaks bite hardest. The benchmark below runs the canonical lineitem ⋈ orders join (60 M × 15 M rows). On a 16 GB laptop, the Pandas version is the only one that risks OOM.

Question. Run the lineitem ⋈ orders join in all three engines and explain why Polars and DuckDB peak at ~3 GB while Pandas peaks at ~12 GB.

Input.

File Size Rows
lineitem.parquet 7.2 GB 60 M
orders.parquet 2.1 GB 15 M

Code.

# Pandas -- the OOM risk path on a 16 GB laptop
import pandas as pd
li = pd.read_parquet("lineitem.parquet",
                    columns=["l_orderkey","l_extendedprice","l_discount"])
o  = pd.read_parquet("orders.parquet",
                    columns=["o_orderkey","o_orderdate","o_orderpriority"])
joined_pd = li.merge(o, left_on="l_orderkey", right_on="o_orderkey", how="inner")
result_pd = joined_pd.groupby("o_orderpriority", as_index=False)["l_extendedprice"].sum()

# Polars -- lazy join, parallel hash-build
import polars as pl
result_pl = (
    pl.scan_parquet("lineitem.parquet")
      .join(
          pl.scan_parquet("orders.parquet"),
          left_on="l_orderkey",
          right_on="o_orderkey",
          how="inner",
      )
      .group_by("o_orderpriority")
      .agg(pl.col("l_extendedprice").sum().alias("rev"))
      .collect(streaming=True)
)

# DuckDB -- single SQL, optimiser picks build side
import duckdb
result_ddb = duckdb.sql("""
    SELECT
        o.o_orderpriority,
        SUM(l.l_extendedprice) AS rev
    FROM 'lineitem.parquet' l
    JOIN 'orders.parquet'   o ON l.l_orderkey = o.o_orderkey
    GROUP BY o.o_orderpriority
""").pl()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Pandas reads both files fully (≈9 GB combined). merge materialises the joined frame (~12 GB peak) before the group-by even starts. The 16 GB laptop is one bad hash table away from OOM.
  2. Polars with scan_parquet lazy-reads both sides. The planner picks orders (15 M rows, smaller) as the build side of a parallel hash join; lineitem is streamed through the hash table. collect(streaming=True) keeps peak memory bounded by the build side, not the joined output.
  3. DuckDB parses the SQL; the optimiser sees the row-count statistics from the parquet footers and chooses the same build / probe assignment. The vectorised hash-join operator processes 2048 rows at a time per thread.
  4. Both Polars and DuckDB never materialise the joined intermediate — they immediately aggregate. Peak memory is bounded by the build hash table (~2.5 GB) plus the aggregate state (~few MB).

Output.

Engine Wall time Peak RSS
Pandas 120 s 12 GB (risky)
Polars 12 s 3.1 GB
DuckDB 9 s 2.8 GB

Rule of thumb. For joins between two large parquet files, always reach for Polars-lazy (with streaming) or DuckDB. Pandas's merge materialises the joined frame before any downstream operator can shrink it, which is the single most common OOM trigger on a laptop-scale DE workload.

Detailed explanation — when Pandas still wins the benchmark

Detailed explanation. It is fair to call out the workloads where Pandas is not the loser. Tight numeric loops on small arrays, time-series resample, and ML hand-off remain Pandas's home turf.

Question. Show one workload where Pandas's wall-time is within 2× of Polars / DuckDB, and explain why the ecosystem cost makes Pandas the right pick.

Input.

File Size Rows Workload
prices.parquet 80 MB 5 M 5-min resample → daily mean → sklearn fit

Code.

import pandas as pd
import polars as pl

# Pandas -- the natural choice for the resample + sklearn hand-off
prices_pd = pd.read_parquet("prices.parquet")
prices_pd["ts"] = pd.to_datetime(prices_pd["ts"])
daily_pd = (
    prices_pd.set_index("ts")
            .resample("5min")["price"].mean()
            .resample("1D").mean()
            .dropna()
)
# Feed straight into sklearn -- native Pandas type
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.arange(len(daily_pd)).reshape(-1, 1)
y = daily_pd.values
model = LinearRegression().fit(X, y)

# Polars equivalent -- works, but the sklearn hand-off costs a .to_numpy() conversion
daily_pl = (
    pl.read_parquet("prices.parquet")
      .with_columns(pl.col("ts").cast(pl.Datetime))
      .sort("ts")
      .group_by_dynamic("ts", every="5m")
      .agg(pl.col("price").mean())
      .group_by_dynamic("ts", every="1d")
      .agg(pl.col("price").mean())
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. For 80 MB of data, all three engines finish in well under a second; the wall-time difference is invisible to the user.
  2. The Pandas chain reads naturally because resample is a single verb with built-in handling for irregular timestamps and missing windows. The Polars group_by_dynamic is functionally equivalent but takes more keystrokes for the same idea.
  3. LinearRegression().fit(X, y) consumes NumPy arrays. The Pandas Series daily_pd exposes .values natively; the Polars frame would need .to_numpy(). On 80 MB the cost is zero; the readability difference is real.
  4. If the same notebook later wants to plot the result, daily_pd.plot() is one call; the Polars frame requires a manual matplotlib stanza or a .to_pandas() first.

Output.

Engine Wall time Code lines Ecosystem hand-off
Pandas 0.3 s 6 direct
Polars 0.2 s 9 needs .to_pandas()
DuckDB 0.2 s 5 needs .df()

Rule of thumb. When the dataset is small (< 1 GB), the downstream consumer is matplotlib / sklearn / statsmodels, and the workload is time-series-shaped, Pandas remains the right choice. The 5× speed difference is invisible at 80 MB, and the ecosystem ergonomics matter.

Python interview question on engine trade-offs

A senior interviewer might frame this as: "Design a decision tree for picking Polars, Pandas, or DuckDB at the start of any new data engineering task. Walk me through three example workloads and which leaf each one lands on."

Solution Using a 3-question decision tree

# Q1: Is the dataset > 5 GB or larger-than-memory?
#   yes -> Q2 (lean toward Polars-lazy or DuckDB)
#   no  -> Q3
#
# Q2: Is the natural query shape SQL (joins across sources, windows, recursive CTE)?
#   yes -> DuckDB
#   no  -> Polars-lazy with streaming
#
# Q3: Will a Pandas-native consumer (sklearn / matplotlib / statsmodels) consume the result?
#   yes -> Pandas
#   no  -> Polars (faster, cleaner DSL, free type checking)

def pick_engine(rows_estimate, sql_shape, downstream_is_pandas):
    if rows_estimate > 50_000_000:        # heuristic for "> 5 GB"
        return "DuckDB" if sql_shape else "Polars (lazy + streaming)"
    if downstream_is_pandas:
        return "Pandas"
    return "Polars"

# Three example workloads
print(pick_engine(60_000_000, sql_shape=False, downstream_is_pandas=False))  # Polars (lazy + streaming)
print(pick_engine(60_000_000, sql_shape=True,  downstream_is_pandas=False))  # DuckDB
print(pick_engine(500_000,    sql_shape=False, downstream_is_pandas=True))   # Pandas
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Workload Rows SQL-shaped? Pandas downstream? Pick
ETL nightly group-by 60 M no no Polars (lazy + streaming)
Ad-hoc 3-source join 60 M yes no DuckDB
Notebook EDA + sklearn 500 K no yes Pandas
Web app low-latency filter 5 M no no Polars
Analyst SQL probe 20 M yes no DuckDB

Output:

Question Default leaf
Q1 yes + Q2 yes DuckDB
Q1 yes + Q2 no Polars (lazy + streaming)
Q1 no + Q3 yes Pandas
Q1 no + Q3 no Polars

Why this works — concept by concept:

  • Three-question tree beats engine religion — most "Polars or Pandas?" debates skip the question "what is the workload?" The tree forces you to answer it first.
  • Q1 (size) — once the dataset exceeds ~5 GB on a typical laptop, Pandas's eager materialisation becomes a memory liability. The fork is between Polars and DuckDB.
  • Q2 (SQL-shape) — heterogeneous sources, windows, recursive CTEs favour DuckDB. Imperative chained transforms favour Polars. Both are correct on the workload; the choice is API ergonomics.
  • Q3 (downstream consumer) — if the result feeds into a Pandas-native library, Pandas avoids one hand-off. For small data the speed cost is moot.
  • Defaults — when in doubt, pick Polars. It is faster than Pandas, has a strict DSL with better error messages, and converts to Pandas in O(1) when a downstream library demands it.
  • Cost — the tree itself takes 30 seconds to evaluate per task. The cost of choosing wrong (a Pandas OOM at 2 a.m.) is much larger than the cost of the deliberation.

Python
Topic — data processing
Data processing problems (Python)

Practice →


Cheat sheet — picking the engine in 60 seconds

  • Default to Polars-lazy for ETL. pl.scan_parquet(...).filter(...).group_by(...).agg(...).collect(streaming=True) — predicate / projection pushdown, parallel cores, streaming spill.
  • Default to DuckDB for ad-hoc SQL. duckdb.sql("SELECT … FROM 'file.parquet'").pl() — same pushdown, full ANSI SQL with windows / CTEs / QUALIFY.
  • Default to Pandas for notebooks + ML hand-off. Below 1 GB, the speed difference is invisible; matplotlib / sklearn ergonomics win.
  • Read the optimised plan. LazyFrame.explain() (Polars), EXPLAIN SELECT ... (DuckDB). If pushdown is not happening, fix the plan before benchmarking.
  • Zero-copy hand-off. Polars → Pandas: df.to_pandas(use_pyarrow_extension_array=True). Polars → DuckDB: con.register("df", polars_df). DuckDB → Polars: .pl(). DuckDB → Pandas: .df(). All Arrow under the hood — no serialisation.
  • group_by cardinality matters. A streaming group_by is bounded by the distinct-keys working set, not the input size. Beware of grouping by user_id over a billion users — even streaming will spill.
  • Joins: pick the build side. Polars and DuckDB do this automatically based on parquet statistics. Pandas does not — if you must use merge, make sure the smaller frame is on the left.
  • Avoid .apply in Pandas. Row-wise Python lambdas in .apply kill performance. Replace with vectorised NumPy or migrate to Polars expressions.
  • Arrow-backed Pandas dtypes. Turn on pd.options.future.infer_string = True so string columns become Arrow-backed by default. 2-3× speed-up on groupby and merge.
  • NaN tax in Pandas. Use Int64, boolean, string extension dtypes (or Int64[pyarrow]) for nullable columns. Stops the silent int → float promotion.
  • DuckDB on a Pandas / Polars frame. Looks up the name in the calling scope: duckdb.sql("SELECT * FROM my_polars_df"). Zero copy; no con.register needed for one-off scripts.
  • polars vs duckdb ergonomics call. Imperative chain → Polars. SQL surface (windows, CTEs, joins across sources) → DuckDB. Both are roughly tied on speed and memory at 10-100 GB scale.
  • Larger-than-memory. Polars collect(streaming=True) and DuckDB out-of-core spill both handle it. Pandas does not — shard manually or pick another tool.
  • Plot from the small frame. Never plot 60 M rows. Always aggregate first; then call .to_pandas() and plot the small result.

Frequently asked questions

Is Polars always faster than Pandas?

No. For small workloads (< 1 GB) on a notebook, the wall-time difference is usually invisible — and the ecosystem ergonomics (matplotlib, scikit-learn, statsmodels) often favour Pandas. Polars's structural advantages — lazy plan, parallel cores, Arrow columnar layout — pay off when the dataset is large enough for pushdown and parallelism to matter, typically above 5 GB and / or when the workload reads parquet with a selective filter. On polars vs pandas benchmarks at TPC-H scale, Polars is 10-30× faster on group-by, join, and window workloads; on a 100 MB CSV with a apply(lambda) chain, both finish in a second.

Should I migrate all my Pandas code to Polars?

Usually no — migrate the workloads where the structural advantages matter and leave the rest. Nightly ETL on multi-GB parquet, low-latency data app filters on warehouse extracts, and long-running batch jobs are the obvious wins. Notebook EDA, time-series resample, sklearn / statsmodels hand-off, and any code where the result is plotted directly with matplotlib should stay Pandas. The Arrow-backed zero-copy df.to_pandas(use_pyarrow_extension_array=True) hand-off means the two engines can co-exist in the same script without serialisation cost.

DuckDB vs Polars for analytics — which one should I pick?

Pick by API shape, not by speed — both are roughly tied at 10-100 GB scale. DuckDB wins when the query is SQL-shaped: heterogeneous sources (parquet + CSV + in-memory frame), window functions with complex OVER clauses, recursive CTEs, GROUPING SETS, MATCH_RECOGNIZE. Polars wins when the query is an imperative chain: scan → filter → group_by → agg → join, with column-name typos caught at plan time by the strict expression DSL. For ad-hoc analyst probes, default to DuckDB. For long-lived production ETL, default to Polars.

Does Polars replace SQL?

No — and trying to make it replace SQL is the wrong framing. Polars is a strict expression DSL that is excellent at chained transforms; SQL is a declarative query language with decades of ergonomic muscle memory. On polars vs duckdb for the same workload, Polars often feels more natural in a Python pipeline (imports, types, tests), while DuckDB SQL feels more natural for analyst probes and joins across heterogeneous sources. Most senior teams use both — Polars in the production code path, DuckDB as the SQL surface for ad-hoc work.

Can I share a frame between Polars, Pandas, and DuckDB without copying?

Yes, via Apache Arrow. All three engines speak Arrow columnar format natively. Polars stores its frame as Arrow buffers; Pandas 2.x extension arrays wrap Arrow buffers; DuckDB reads Arrow batches as zero-copy tables. The mechanics: polars_df.to_pandas(use_pyarrow_extension_array=True) returns a Pandas frame that shares bytes with the Polars frame; duckdb.sql("SELECT ... FROM polars_df") registers the Polars frame as an Arrow view and runs the query in place; duckdb_result.pl() / .df() / .arrow() return the result as Polars / Pandas / Arrow with no serialisation. This is what makes the three-engine workflow practical.

What about Modin / Dask / Vaex — are they relevant?

For single-node DataFrame work in 2026, no. Modin (Ray-backed Pandas) and Dask DataFrame (distributed Pandas) target multi-node scale-out, but for that scale most teams now reach for Spark or BigQuery anyway — and on a single node, Polars and DuckDB beat both on raw throughput because they were designed columnar from day one. Vaex was the early Polars-style alternative but has not kept up with the Arrow ecosystem. If your workload exceeds a single node's RAM and disk budget, the right move is usually Spark / BigQuery / Snowflake — not Modin / Dask. For everything under that ceiling, Polars and DuckDB are the canonical 2026 pandas alternative.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every Polars expression, Pandas idiom, and DuckDB SELECT above ships with hands-on practice rooms where you write the lazy scan, the join, and the windowed aggregation against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `polars vs pandas` choice actually wins on the workload you ship to production.

Practice data manipulation now →
Aggregation drills →

Top comments (0)