DEV Community

Cover image for DuckDB for Data Engineering: In-Process OLAP, Local ETL & Parquet-First Workflows
Gowtham Potureddi
Gowtham Potureddi

Posted on

DuckDB for Data Engineering: In-Process OLAP, Local ETL & Parquet-First Workflows

duckdb quietly became the most disruptive single binary in analytics — a 30 MB executable that scans Parquet at warehouse speeds on a laptop, runs entire dbt projects in CI under two minutes, and turns the "first I'll provision a cluster" reflex into a one-liner: pip install duckdb. The 2026 reality is that a 32 GB MacBook with NVMe storage now outperforms many small Snowflake warehouses on cold Parquet scans, and the only engine that ships in a single dependency is DuckDB. Once you internalise "in-process OLAP," you stop loading data and start scanning it where it already lives.

This duckdb tutorial is the cheat sheet you wished existed the first time someone told you to "just use DuckDB for that report." It walks through the in-process database model, the vectorized executor and Arrow zero-copy that make duckdb python workflows fast, the duckdb parquet scan with partition pruning and httpfs S3, local etl with dbt-duckdb plus pytest plus Git, and the four-quadrant deployment matrix that names where duckdb sql and duckdb analytics actually fit (laptop, CI, edge, notebook) versus where they don't (long-running OLAP cluster, concurrent writes, multi-user serving). Every section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for DuckDB for data engineering — bold white headline 'DuckDB · In-Process OLAP' with subtitle 'single binary · Parquet-first · local ETL · CI-friendly' over a hero composition of an abstract duck silhouette standing inside a stylised single-binary file card on a faint parquet-grid background with a SQL-arch glyph curving above, on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the database design practice library →, rehearse on aggregation problems →, and harden the filtering muscles with filtering problems →.


On this page


1. Why in-process OLAP — single binary, no server, columnar, MVCC

duckdb is the analytics counterpart to SQLite — a single-binary in-process OLAP engine that turns "stand up a warehouse" into import duckdb

The mental model in one line: duckdb is to OLAP what SQLite is to OLTP — an embedded, in-process, zero-dependency engine that runs inside your application's address space and exposes SQL over columnar storage. Once you say "in-process database," the rest of the engineering picture (no auth handshake, no network hop, no cluster to wake up, no admin role) snaps into place.

The seven traits that define DuckDB.

  • Single binary — one ~30 MB shared library (duckdb) with zero runtime dependencies. The Python wheel ships the same engine in pure C++; the CLI is a single duckdb executable. There is nothing to install, configure, or daemon-ise.
  • No server — there is no separate process to start, no socket to bind, no pg_hba.conf, no my.cnf. DuckDB runs inside your Python interpreter, your Node process, your Rust binary, your dbt run, or your duckdb CLI shell.
  • Columnar storage — both the file format (.duckdb blocks) and the in-memory execution are column-oriented. The same SELECT that scans 200 GB of CSV at 50 MB/s in Postgres scans Parquet at 2 GB/s in DuckDB because only the projected columns are read.
  • MVCC — multi-version concurrency control inside a single writer process. Readers see a consistent snapshot even while a writer is mutating the same table. There is no global lock, no row-level lock dance, no SELECT ... FOR UPDATE.
  • Zero-configduckdb.connect() is the entire setup. No CREATE DATABASE, no GRANT, no role hierarchy. The database is the file path you hand to connect(), or :memory: for ephemeral runs.
  • Arrow native — pandas, Polars, PyArrow, and DuckDB share the same Apache Arrow memory layout. A df can be queried with SQL directly — SELECT * FROM df — with zero serialisation between Python objects and the engine.
  • OLAP focus — DuckDB optimises for analytical workloads: large scans, aggregates, group-bys, joins over wide rows, vectorized execution. Point lookups exist but are not the target; row-by-row OLTP belongs in Postgres or SQLite.

The 2026 reality.

  • Laptops outperform small warehouses on cold Parquet scans. A 32 GB MacBook with NVMe runs a 100-column aggregate over 200 GB of Parquet in tens of seconds; a 1-credit Snowflake X-Small often takes longer because the cluster has to spin up and shuffle data over the network.
  • pip install duckdb ships in CI under 30 seconds. GitHub Actions and CircleCI cache the wheel; the cold install is faster than provisioning a managed warehouse.
  • The dbt-duckdb adapter is officially supported. Entire dbt projects now run "locally" against DuckDB for dev/CI and against a real warehouse only in prod — same SQL, same tests, faster feedback loop.
  • Polars + DuckDB + Arrow is the new dataframe stack. Polars handles row-level transforms; DuckDB handles set-based SQL; Arrow is the zero-copy bridge between them. The combination beats pandas by 10x to 100x on memory-bound workloads.

What interviewers listen for.

  • Do you say "in-process" the moment DuckDB comes up? — senior signal.
  • Do you mention MVCC and single-writer / multi-reader as the concurrency model? — required answer.
  • Do you reach for read_parquet() instead of COPY for cloud data? — senior signal.
  • Do you name laptop, CI, notebook, edge as the four right-fit quadrants? — senior signal.

Detailed explanation — installing DuckDB and running the first query

Topic explanation in three labelled bullets.

  • One install line per environment — Python: pip install duckdb. Node: npm install duckdb. CLI: brew install duckdb (macOS) or download the single executable for Linux / Windows. There is no separate server install, no service to enable.
  • One connect line per scriptduckdb.connect() returns a connection. With no argument, the database lives in :memory: (ephemeral). With a path argument, the database lives in that file. Both connections speak the same SQL.
  • One execute line per queryconn.execute("SELECT 42").fetchall() returns a list of tuples. conn.execute("SELECT 42").df() returns a pandas DataFrame. conn.execute("SELECT 42").arrow() returns a PyArrow Table.

Question. Show the minimum-viable DuckDB program in Python — install, connect, and run one query that returns a single integer. Then show the same query as a CLI one-liner.

Input. (none — the query is a self-contained constant)

Code.

# Python — three lines after `pip install duckdb`
import duckdb

con = duckdb.connect()                  # in-memory database
print(con.execute("SELECT 42").fetchall())
# -> [(42,)]
Enter fullscreen mode Exit fullscreen mode
# CLI — one line after the binary is on PATH
duckdb -c "SELECT 42"
# -> ┌───┐
#    │ 42│
#    └───┘
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. import duckdb loads the C++ engine into the Python interpreter's address space. There is no socket open, no daemon spawned.
  2. duckdb.connect() with no argument creates an in-memory database. The connection is a handle into the engine's MVCC graph for this process.
  3. con.execute("SELECT 42") parses the SQL, plans the query, and executes it in the same thread. The result is a DuckDBPyRelation (lazy) — .fetchall() materialises it as a list of tuples.
  4. The CLI variant is the same engine wrapped in a tiny REPL binary. -c "SELECT 42" runs a one-shot query and exits.

Output.

variant output
Python [(42,)]
CLI rendered table with 42

Rule of thumb. Default to duckdb.connect() (in-memory) for one-off scripts and duckdb.connect("warehouse.duckdb") for projects that need persistence. Never run a separate duckdb-server — there is no such thing on purpose.

Detailed explanation — DuckDB vs SQLite vs Postgres at a glance

Topic explanation in three labelled bullets.

  • SQLite is row-store OLTP — perfect for embedded point lookups, small transactional workloads, and config databases. It struggles on aggregates over millions of rows because the executor is row-at-a-time.
  • Postgres is row-store OLTP with strong OLAP add-ons — a full client-server database with proper concurrency, foreign keys, and a planner that handles real OLAP via extensions (Citus, pg_duckdb, pg_columnar). The default storage is still row-oriented; you pay for that on wide scans.
  • DuckDB is columnar OLAP — embedded like SQLite, but the file format and the executor are both vectorised and column-oriented. The result: order-of-magnitude faster scans for analytical queries on the same hardware.

Question. Compare DuckDB, SQLite, and Postgres on five dimensions — process model, storage layout, concurrency, workload fit, and "where it ships." Highlight the one trait that makes DuckDB uniquely the OLAP-embedded engine.

Input. (none — comparison table only)

Code.

-- The same SELECT runs in all three; only the engine plan differs.
SELECT region, COUNT(*) AS orders, SUM(amount) AS total
FROM orders
GROUP BY region
ORDER BY total DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. SQLite scans the orders table row by row, materialising every column even though only three are needed. Aggregate is a single-threaded hash map.
  2. Postgres scans heap pages (8 KB row-oriented blocks), reads the same row entirely, then projects the three columns. Aggregate is parallelised in modern versions but still pays the row-read cost.
  3. DuckDB scans only the region and amount columns from columnar storage. The aggregate runs a vectorised hash-grouped sum across batches of 1024 values at a time. Most of the data file is never read.

Output.

dimension SQLite Postgres DuckDB
process model in-process client-server in-process
storage layout row-oriented row-oriented (default) columnar
concurrency single-writer + many readers multi-writer + many readers single-writer + many readers (MVCC)
workload fit OLTP point lookups OLTP + add-on OLAP OLAP scans + analytical SQL
ships in one C library server + client one C++ library

Rule of thumb. Pick SQLite for OLTP at the edge, Postgres for OLTP at the centre, and DuckDB for OLAP that needs to ride inside another process. The "embedded OLAP" niche is exactly where DuckDB lives — and DuckDB is currently the only credible occupant.

Detailed explanation — when in-process beats client-server

Topic explanation in three labelled bullets.

  • Latency — there is no TCP round-trip, no protocol parse, no auth handshake. A query that takes 5 ms in Postgres over localhost takes 0.1 ms in DuckDB because the function call is direct.
  • Cold start — managed warehouses charge "warm-up" time (seconds to minutes). DuckDB starts in microseconds — the cost is the connect() call, which is the cost of allocating a struct.
  • Ops overhead — there is no service to monitor, no backup script to write (the database is a file — cp it), no port to firewall. Disaster recovery is git pull if your .duckdb file is in the repo (don't do that for prod) or aws s3 cp if it's in object storage.

Question. Given an interactive notebook that runs many small analytical queries against a 20 GB Parquet directory, explain why in-process DuckDB is materially faster than a client-server warehouse. Quantify the three costs that disappear.

Input.

operation client-server warehouse in-process DuckDB
connect 50–500 ms (TLS + auth) 50 µs
parse + plan 5–20 ms 0.5 ms
network shuffle yes (data moves to driver) no (data stays in process)
result materialise serialise + deserialise direct memory

Code.

import duckdb, time
con = duckdb.connect()
t = time.perf_counter()
for _ in range(1000):
    con.execute("SELECT 1").fetchone()
print(f"1000 trivial queries in {time.perf_counter() - t:.3f}s")
# Typical: 0.05–0.2s on a laptop — i.e. ~50–200 µs each.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each loop iteration parses, plans, executes, and fetches a one-row result. The end-to-end cost is dominated by the parse/plan step; the engine itself is microseconds.
  2. The same loop against a remote Postgres or Snowflake spends 5–500 ms per iteration on the network round-trip plus auth-token validation.
  3. A 20 GB Parquet directory loaded once into a DuckDB connection is then queried many times with no re-load; the columnar pages stay in the OS page cache and are scanned at memory speed.
  4. The cumulative effect on notebook workflow is the difference between "interactive" (sub-second per cell) and "stutter" (multi-second per cell).

Output.

metric warehouse DuckDB
connect 100 ms 50 µs
1000 trivial queries 60 s 0.1 s
20 GB Parquet scan 30 s (warm) 5 s (warm)

Rule of thumb. Whenever the data fits on the machine and the workload is "many small analytical queries," in-process DuckDB wins on every axis except multi-user concurrency. Reserve the warehouse for the cases where the dataset doesn't fit or the concurrency demand is real.

Detailed explanation — MVCC and the single-writer model

Topic explanation in three labelled bullets.

  • MVCC — multi-version concurrency control. Every transaction sees a consistent snapshot of the database. Readers do not block writers; writers do not block readers. The price: a single writer at a time within one process.
  • Single-writer — DuckDB serialises mutations through one logical writer per file. Multiple processes opening the same .duckdb file in write mode is not supported (the second one errors out). Multiple threads inside one process serialise behind a write latch.
  • Many-reader — multiple processes can open the same file read-only and scan it concurrently. This is the right pattern for "one ETL writer, many analyst readers" topologies.

Question. Two cursors operate on the same DuckDB connection — one runs a long-running aggregate, the other inserts new rows mid-aggregate. Show what each cursor sees and why.

Input.

time cursor A (reader) cursor B (writer)
t0 BEGIN; SELECT SUM(amount) FROM orders;
t1 … (scan in progress) INSERT INTO orders VALUES (99, 'EU', 1000);
t2 sum returned COMMIT;

Code.

import duckdb
con = duckdb.connect("warehouse.duckdb")
a = con.cursor()                   # reader
b = con.cursor()                   # writer

a.execute("BEGIN")
a.execute("SELECT SUM(amount) FROM orders")
# meanwhile…
b.execute("INSERT INTO orders VALUES (99, 'EU', 1000)")
b.execute("COMMIT")

print(a.fetchone())                # snapshot value — excludes the new row
a.execute("COMMIT")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Cursor A's BEGIN pins a snapshot at the current MVCC timestamp. Every subsequent read in this transaction sees only rows visible at that timestamp.
  2. Cursor B's INSERT creates a new row visible only to transactions whose timestamp is greater than B's commit timestamp.
  3. Cursor A's SUM scans the orders table at A's snapshot — the new row is invisible. A reads the "old" sum, deterministically and without locks.
  4. Once A commits, a subsequent BEGIN in A would see B's row because A's new snapshot timestamp is now after B's commit.

Output.

cursor result
A (snapshot) original sum, e.g. 5000
B (committed) new row (99, EU, 1000) inserted
A (new transaction after commit) sum = 6000

Rule of thumb. Treat DuckDB as "one writer at a time per database file." If you need many concurrent writers, you've outgrown DuckDB's niche — either shard by file (one per pipeline step) or graduate to a warehouse. Concurrency at scale is the explicit anti-pattern.

SQL interview question on choosing in-process OLAP

A senior interviewer often opens with: "Your team needs to run nightly analytical reports on 50 GB of Parquet. The data lives in S3. The team is three engineers. Why would you pick DuckDB over Snowflake or Postgres-with-extensions for this workload?"

Solution Using a decision-table that pins DuckDB to laptop/CI/notebook/edge workloads

-- The "decision query" you'd run mentally: given a workload profile,
-- which engine fits? Treat this as the lookup table you keep in your head.
WITH workload AS (
    SELECT
        'nightly analytical batch' AS pattern,
        '50 GB Parquet on S3'      AS data,
        'three engineers'          AS team,
        FALSE                       AS multi_writer,
        FALSE                       AS sub_second_serving,
        TRUE                        AS deterministic_window
)
SELECT
    CASE
        WHEN multi_writer OR sub_second_serving         THEN 'warehouse'
        WHEN NOT deterministic_window                   THEN 'warehouse'
        WHEN data ILIKE '%parquet%' AND NOT multi_writer THEN 'DuckDB'
        ELSE 'evaluate further'
    END AS recommended_engine
FROM workload;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

trait workload value branch evaluated recommendation
multi_writer FALSE first WHEN false continue
sub_second_serving FALSE first WHEN false continue
deterministic_window TRUE second WHEN false continue
data contains parquet TRUE third WHEN matches DuckDB

The decision-table compiles every "do I need a warehouse?" reflex into a small set of FALSE checks. The moment a workload says "no multi-writer, no sub-second serving, Parquet input, batch window" — DuckDB is the right answer. If any of those flip, the answer becomes "warehouse."

Output:

pattern recommended_engine
nightly analytical batch DuckDB

Why this works — concept by concept:

  • Single-binary costpip install duckdb costs ~30 s in CI and zero dollars per month. A warehouse costs $50–$5000+ per month before the first query.
  • Columnar Parquet scan — Parquet is already column-oriented; DuckDB reads only the projected columns and skips row groups whose statistics fail the predicate. This is the heart of "fast scan on a laptop."
  • MVCC fits the batch window — the ETL writer runs once nightly; the analyst readers all open read-only. No write contention, ever.
  • Deterministic window — the report runs at 2 AM; nobody else is querying the file. The single-writer constraint is invisible because there is only ever one writer at a time.
  • Cost — O(scanned columns × scanned row groups) per query. The data on S3 stays where it is; DuckDB pulls only the bytes the predicate needs. Compute cost is one CI runner minute.

SQL
Topic — database
Database design problems (SQL)

Practice →


2. DuckDB engine — vectorized execution, Arrow integration, MVCC

Vectorized execution is the reason duckdb sql runs faster than row-store engines — Arrow zero-copy is the reason it integrates with pandas/Polars/PyArrow for free

The mental model in one line: DuckDB processes batches of 1024 rows at a time through tight, cache-friendly inner loops, and shares the same Apache Arrow memory layout with pandas/Polars/PyArrow so handoffs are zero-copy. Once you internalise "vectorised batches" and "Arrow zero-copy," the rest of the engine — planner, MVCC, spill-to-disk — slots in around them.

Iconographic engine cake cross-section — four stacked engine layers from bottom to top (Storage parquet/arrow tiles, Vectorized executor gears with batch-vector arrows, Planner DAG glyph, SQL frontend SELECT-arch), with a side card 'engine traits' listing vectorized, zero-copy Arrow, and MVCC chips, on a light PipeCode card.

The seven engine traits in labelled bullets.

  • Vectorized executor — every operator (scan, filter, project, hash-aggregate, hash-join, sort) processes 1024-row batches at a time. The inner loop is a tight CPU-cache-friendly sequence of column operations, not a row-at-a-time interpreter walk.
  • Batch vector — the unit of work is a "vector" — a fixed-size column slice plus a validity bitmap. Operators consume vectors and emit vectors; pipeline-breakers (sort, hash-aggregate) materialise into hash tables but still feed downstream as vectors.
  • Arrow zero-copy — DuckDB and Apache Arrow share the same column memory layout. A pandas DataFrame backed by Arrow is the same bytes as a DuckDB vector — no SerDe round-trip on handoff.
  • Planner — cost-based with a rule-based rewrite pass. Performs predicate pushdown, projection pushdown, join reordering, and filter combination before handing the plan to the executor.
  • MVCC snapshot — versioned row visibility per transaction. Readers see a consistent timestamp; writers append new versions. Visibility checks are cheap (timestamp compare) and never block.
  • WAL — a write-ahead log stores changes before they hit the data files, so a crash mid-commit is recoverable. Configurable but on by default for persistent databases.
  • Spill-to-disk — when a hash table outgrows memory_limit, DuckDB spills partitions to a temporary directory on disk. Out-of-memory crashes are rare; queries get slower but still finish.

Common interview probes on the engine.

  • "What does 'vectorized' mean concretely?" — process 1024 rows per operator call instead of one. The inner loop is a tight column-wise sequence.
  • "What does Arrow buy you in DuckDB?" — zero-copy interop with pandas/Polars/PyArrow. A DataFrame is queried as a table with no marshalling.
  • "How does DuckDB do concurrency?" — MVCC, single-writer per process. Multiple readers see a consistent snapshot.
  • "What happens when an aggregate is bigger than RAM?" — DuckDB spills partitioned hash tables to temp_directory and continues. The query slows down but doesn't OOM.

Detailed explanation — vectorized execution by hand

Topic explanation in three labelled bullets.

  • Row-at-a-time interpreter — the classic Volcano model: each operator's next() returns one row, walks the entire operator tree per row. Cache-unfriendly, branch-heavy, slow on modern CPUs.
  • Vectorized batch — each operator's next() returns a 1024-row chunk per column. The inner loop is a tight for-loop over a contiguous array. CPU-cache-friendly, branch-predictor-friendly, SIMD-friendly.
  • The win — the per-row interpreter overhead is amortised over 1024 rows. Speedups of 10–100x are routine on aggregate-heavy queries.

Question. Sum a single column over a million-row table both row-at-a-time (Python loop) and vectorised (NumPy). Use the timing difference as the mental model for what DuckDB does in C++ at scale.

Input. (one million-row column)

Code.

import time, numpy as np
n = 1_000_000
vals = np.random.randint(0, 100, n)

# Row-at-a-time
t = time.perf_counter()
total_row = 0
for v in vals:
    total_row += int(v)
print(f"row-at-a-time: {time.perf_counter() - t:.3f}s, sum={total_row}")

# Vectorised
t = time.perf_counter()
total_vec = int(vals.sum())
print(f"vectorised:    {time.perf_counter() - t:.3f}s, sum={total_vec}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The Python loop pays interpreter overhead per iteration. One million iterations × tens of nanoseconds each = tens of milliseconds for a trivial sum.
  2. NumPy's .sum() runs a tight C loop over the contiguous int array. It processes the same million values in tens of microseconds — roughly 1000x faster.
  3. DuckDB's executor in C++ uses the same trick: each operator processes contiguous column slices in tight inner loops. The 1024-row batch is the sweet spot for L1/L2 cache residency.
  4. The "vectorised batch" idea is engine-agnostic: it shows up in DuckDB, ClickHouse, Polars, MonetDB, and modern Parquet readers. Once you see it once, you see it everywhere.

Output.

variant wall time (typical) speedup
row-at-a-time Python 50–100 ms
vectorised NumPy 0.5–1 ms ~100×

Rule of thumb. Whenever you can replace a for row in rows: interpreter loop with a batched array operation, you're recapitulating the vectorised-executor win. That intuition transfers directly to why DuckDB beats row-store engines on aggregates.

Detailed explanation — Arrow zero-copy between pandas/Polars and DuckDB

Topic explanation in three labelled bullets.

  • One memory layout — Apache Arrow defines a columnar in-memory format. pandas (Arrow backend), Polars, PyArrow, and DuckDB all use it. The same bytes are valid in every library.
  • Zero-copy — handing a DataFrame to DuckDB does not copy bytes. DuckDB reads the Arrow buffers directly. Returning a result as a DataFrame is also zero-copy when the engine emits Arrow output.
  • The DuckDB shortcut — any Python variable that exposes an Arrow interface can be queried as a table by name: SELECT * FROM df works if df is a pandas, Polars, or PyArrow object in the current scope.

Question. Given a pandas DataFrame in a Python session, query it with SQL via DuckDB and return the result as a Polars DataFrame — all without any explicit conversion. Show the three lines.

Input.

import pandas as pd
df = pd.DataFrame({
    "region":   ["EU", "US", "EU", "APAC"],
    "amount":   [10, 20, 30, 40],
})
Enter fullscreen mode Exit fullscreen mode

Code.

import duckdb, polars as pl

# DuckDB sees `df` by name; the query runs zero-copy on Arrow buffers.
result_arrow = duckdb.sql("SELECT region, SUM(amount) AS total FROM df GROUP BY region").arrow()

# Hand the Arrow result to Polars — also zero-copy.
result_pl = pl.from_arrow(result_arrow)
print(result_pl)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. duckdb.sql(...) (top-level, no explicit connection) uses a default in-memory connection and resolves table names against the calling frame's locals.
  2. df is recognised as a pandas DataFrame; DuckDB reads its underlying Arrow buffers (or NumPy arrays, depending on backend) without copying.
  3. .arrow() returns the result as a pyarrow.Table — again, shared with DuckDB's output buffers when possible.
  4. pl.from_arrow(result_arrow) constructs a Polars DataFrame around the same memory.
  5. Total round-trip: pandas → DuckDB → Polars with zero deep copies of the underlying numeric columns.

Output.

region total
EU 40
US 20
APAC 40

Rule of thumb. Treat any Arrow-backed object (pandas, Polars, PyArrow, Polars LazyFrame after .collect()) as a first-class DuckDB table. The dataframe and the SQL surface are interchangeable — pick whichever expresses the transform most clearly.

Detailed explanation — MVCC snapshot isolation in a single process

Topic explanation in three labelled bullets.

  • Snapshot per transactionBEGIN pins a timestamp; every subsequent read in that transaction sees the database as of that timestamp. New commits after BEGIN are invisible until the next transaction.
  • No reader-writer blocking — readers and writers never wait on each other. The MVCC graph stores multiple versions of mutated rows, and visibility is a timestamp compare.
  • Single writer at a time — only one transaction at a time may mutate. A second BEGIN ... INSERT ... from another thread waits behind the active write latch.

Question. Two threads share one DuckDB connection. Thread A starts a long aggregate; thread B inserts a row mid-aggregate. Demonstrate the snapshot-isolation contract — A sees the old data, B's row appears only in subsequent transactions.

Input. (a fresh orders table)

Code.

import duckdb, threading

con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE OR REPLACE TABLE orders(id INT, region VARCHAR, amount INT)")
con.execute("INSERT INTO orders VALUES (1,'EU',10), (2,'US',20), (3,'EU',30)")

snapshot_sum = None

def reader():
    global snapshot_sum
    c = con.cursor()
    c.execute("BEGIN")
    snapshot_sum = c.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
    c.execute("COMMIT")

def writer():
    c = con.cursor()
    c.execute("INSERT INTO orders VALUES (4,'EU',100)")

t1 = threading.Thread(target=reader)
t2 = threading.Thread(target=writer)
t1.start(); t2.start(); t1.join(); t2.join()

print("snapshot sum (A):", snapshot_sum)
print("post-commit sum:", con.execute("SELECT SUM(amount) FROM orders").fetchone()[0])
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The reader's BEGIN pins its snapshot timestamp. The SUM scan reads only the three rows visible at that timestamp.
  2. The writer's INSERT proceeds in parallel, creating row 4 at a later timestamp. Because the reader's snapshot is earlier, row 4 is invisible to the reader's SUM.
  3. The reader returns 60 (10 + 20 + 30) — the pre-insert sum — and commits.
  4. The next query outside any transaction sees the latest committed state, including row 4: total now 160 (10 + 20 + 30 + 100).
  5. The order of thread scheduling does not affect the reader's correctness — it always sees a consistent snapshot, never a "torn" mid-insert state.

Output.

query result
reader snapshot SUM 60
post-commit SUM 160

Rule of thumb. Rely on BEGIN ... COMMIT to pin a snapshot for any multi-statement report. The cost is zero — MVCC is free — and the contract guarantees a consistent view even while ETL writes are happening in another cursor.

Detailed explanation — spill-to-disk for out-of-memory aggregations

Topic explanation in three labelled bullets.

  • memory_limitSET memory_limit = '4GB' tells DuckDB the soft cap. When an operator would exceed it, DuckDB starts spilling instead of allocating more.
  • temp_directorySET temp_directory = '/tmp/duckdb' tells DuckDB where to write spill files. Default is the OS temp dir; for big jobs on a small /tmp, override to a roomier volume.
  • The contract — the query slows down (disk I/O instead of RAM) but does not OOM. This is the difference between "the report finished overnight" and "kernel killed the process at 3 AM."

Question. Aggregate 100 GB of Parquet on a laptop with 16 GB of RAM. Show the two settings that turn an OOM crash into a slow-but-correct run.

Input. (Parquet directory events/ with one billion rows)

Code.

import duckdb
con = duckdb.connect()

# Two knobs make 100 GB on 16 GB RAM possible.
con.execute("SET memory_limit = '12GB'")
con.execute("SET temp_directory = '/Users/me/duckdb_tmp'")

con.execute("""
    SELECT user_id, COUNT(*) AS events, SUM(amount) AS total
    FROM read_parquet('events/*.parquet')
    GROUP BY user_id
""").df()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. memory_limit = '12GB' reserves a 4 GB headroom for the OS and Python. The aggregate's hash table grows up to 12 GB in memory.
  2. When the hash table approaches the limit, DuckDB partitions the in-memory table by hash and spills the largest partition to /Users/me/duckdb_tmp. The remaining hash table fits in RAM again.
  3. The scan continues; further partitions are spilled as needed. Each spilled partition becomes one disk-resident chunk.
  4. At the end, DuckDB merges the spilled partitions back into the final aggregate. Disk I/O is the bottleneck, not memory.
  5. The result is identical to an in-memory run; the only difference is wall time.

Output.

user_id events total
1 1200 480000
2 950 380000

Rule of thumb. For any DuckDB job that might exceed RAM, set memory_limit to ~75% of physical RAM and point temp_directory at a fast SSD with enough free space. The cost is wall time; the benefit is "the job finishes."

SQL interview question on why DuckDB beats row-store engines on OLAP

A senior interviewer might frame this as: "On the same laptop, the same GROUP BY region over 50 million rows runs in 30 seconds in SQLite and 0.6 seconds in DuckDB. Walk me through the engine differences that explain the 50x gap."

Solution Using a vectorized aggregate over a Parquet scan with a single-pass HashAggregate

-- The DuckDB plan: scan only `region` and `amount` columns, vectorised hash-aggregate, no row materialisation.
SELECT
    region,
    COUNT(*)              AS orders,
    SUM(amount)           AS total,
    AVG(amount)::DECIMAL(10, 2) AS avg_amount
FROM read_parquet('orders/*.parquet')
WHERE order_date >= DATE '2026-01-01'
GROUP BY region
ORDER BY total DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step row-store engine (SQLite) DuckDB (columnar + vectorised)
scan reads full rows (every column) reads only region, amount, order_date columns
filter row-at-a-time predicate vectorised predicate on order_date column slice
project extracts columns row-by-row already isolated; no extra work
aggregate row-at-a-time hash insert batch hash-insert of 1024 values per call
output row tuples column vectors

DuckDB never reads the columns it does not need; it never interprets a row tuple; it never calls a Python-level function per row. Every step is a tight C++ loop over a contiguous column slice. The 50x speedup is the cumulative win of vectorisation, columnar projection, predicate pushdown into row groups, and zero per-row interpreter overhead.

Output:

region orders total avg_amount
EU 1,200,000 48,000,000 40.00
US 900,000 36,000,000 40.00
APAC 400,000 16,000,000 40.00

Why this works — concept by concept:

  • Columnar projection — Parquet stores each column in its own buffer. DuckDB reads only the columns the SELECT mentions, skipping the bytes of every unused column entirely.
  • Predicate pushdown — the WHERE order_date >= ... filter is pushed down into Parquet row-group statistics. Row groups whose max(order_date) is before 2026 are skipped without reading any data.
  • Vectorised hash-aggregate — the GROUP BY processes 1024 rows per batch. The hash table is updated with vectorised SIMD-friendly increments; AVG is computed as SUM/COUNT in the final emit.
  • Single-pass plan — scan → filter → aggregate → order. No materialisation between operators; vectors stream through.
  • Cost — O(scanned_row_groups × scanned_columns) for the scan; O(distinct_groups) for the hash table; O(distinct_groups × log(distinct_groups)) for the final ORDER BY. On 50 million rows with 3 regions, the order step is negligible — the work is in the scan.

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →


3. Reading Parquet at scale — partition pruning, Hive partitioning, S3 / httpfs

duckdb parquet is the headline feature — scan terabytes of cloud Parquet without ever loading it, thanks to row-group statistics, Hive partitioning, and the httpfs extension

The mental model in one line: DuckDB's read_parquet() is the canonical scan function — it accepts a path, glob, S3 URI, or list, prunes partitions and row groups against your WHERE clause, and reads only the column buffers the SELECT projects. Once you say "Parquet-first," you stop running COPY ... FROM into a local table and start querying the cloud files in place.

Iconographic folder-tree pruning diagram — a left-hand folder tree showing Hive-style year/month/day partitions; thin arrows from a SQL WHERE clause on the right point at specific partitions with green check flags while other partitions fade with red cross flags showing pruned partitions; a side card 'pruning rules' lists filter pushdown, partition, and row-group chips, on a light PipeCode card.

The eight Parquet primitives in labelled bullets.

  • read_parquetSELECT * FROM read_parquet('path') is the canonical scan. Accepts a single file, a glob ('data/*.parquet'), an array (['a.parquet','b.parquet']), or an S3 URI.
  • glob — wildcard expansion happens server-side (in the DuckDB process). The glob can include ** for recursive expansion. Useful for "every Parquet under this prefix."
  • Hive partitioning — folder names like year=2026/month=05/day=12/file.parquet are recognised when you pass hive_partitioning=true. The columns year, month, day become first-class SELECTable columns without any per-file reading.
  • filter pushdown — DuckDB pushes WHERE-clause predicates into the Parquet reader. Predicates on partition columns prune folders; predicates on data columns prune row groups via statistics.
  • row-group statistics — every Parquet row group stores min/max/null count per column. If max(order_date) < '2026-01-01', the row group is skipped entirely — no data pages read.
  • httpfs — the extension that adds s3://, https://, and gcs:// URI support. INSTALL httpfs; LOAD httpfs; once; thereafter S3 paths work in read_parquet().
  • S3 secretsCREATE SECRET s3_default (TYPE S3, KEY_ID 'AK…', SECRET '…', REGION 'us-east-1') configures S3 auth once per session. Subsequent S3 paths use the secret automatically.
  • projection pushdown — the SELECT list determines which columns Parquet pages are read. SELECT a FROM read_parquet(...) reads only column a's pages, even from a 200-column file.

Common interview probes on Parquet at scale.

  • "What is partition pruning?" — skipping folders/files whose partition keys fail the WHERE predicate, without reading them.
  • "What is row-group pruning?" — skipping internal row groups whose min/max statistics fail the WHERE predicate.
  • "What does hive_partitioning=true buy you?" — exposes folder-name key=value pairs as columns; partition pruning then works on those columns.
  • "How does DuckDB read S3 without downloading the file?" — httpfs issues HTTP range requests to fetch only the byte ranges of the column chunks and row groups the query needs.

Detailed explanation — read_parquet over a glob

Topic explanation in three labelled bullets.

  • Glob path — pass a wildcard string; DuckDB expands it and treats the result as a single virtual table. Files share a schema (or you opt-in to union by column name).
  • No load step — the data stays in the Parquet files. There is no CREATE TABLE and no COPY FROM. The first scan is also the only scan you need.
  • Lazy column reads — only the projected columns are read from disk. Even a 200-column file produces no extra I/O if your SELECT names three columns.

Question. A directory data/ contains daily Parquet files (data/2026-01-01.parquet, data/2026-01-02.parquet, …). Sum the amount column across every file in one SELECT, without copying the data into a DuckDB table.

Input. (directory of daily Parquet files, each with columns order_id, region, amount, order_date)

Code.

-- One SELECT scans every file in the glob.
SELECT
    region,
    COUNT(*) AS orders,
    SUM(amount) AS total
FROM read_parquet('data/*.parquet')
GROUP BY region
ORDER BY total DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. read_parquet('data/*.parquet') expands the glob into a list of file paths.
  2. DuckDB opens each file's Parquet footer (a few KB at the end of each file) to read schema and row-group statistics.
  3. The SELECT references region, amount — only those column chunks are read. order_id and order_date data pages are skipped entirely.
  4. The hash-aggregate runs vectorised on the streamed batches.
  5. No table is created in the DuckDB database; the file remains the source of truth.

Output.

region orders total
EU 12,000 480,000
US 9,000 360,000
APAC 4,000 160,000

Rule of thumb. Default to FROM read_parquet('...') in every analytical query. Reserve CREATE TABLE AS SELECT ... for the few cases where the same scan would otherwise repeat dozens of times in the same session.

Detailed explanation — Hive partitioning and partition pruning

Topic explanation in three labelled bullets.

  • Hive folder convention — directory names like year=2026/month=05/day=12/ encode key=value pairs. The convention is the de facto standard for Parquet datasets (Spark, Athena, Snowflake all emit it).
  • Opt-in — pass hive_partitioning=true to read_parquet() (or it's auto-detected from path structure). DuckDB then exposes year, month, day as virtual columns sourced from the folder names.
  • Pruning — a WHERE predicate on the virtual columns prunes whole folders before any file is opened. Predicates like WHERE year = 2026 AND month >= 5 skip 95% of the data with zero I/O on the pruned files.

Question. Given a Hive-partitioned dataset under events/year=YYYY/month=MM/day=DD/, count events for May 2026 only. Show how the partition predicate prunes every folder outside that month.

Input. (directory events/year=2024/..., events/year=2025/..., events/year=2026/month=01/..., …, events/year=2026/month=05/day=12/file.parquet)

Code.

-- Hive partitioning auto-detected; predicate prunes folders.
SELECT
    day,
    COUNT(*) AS events
FROM read_parquet(
    'events/year=*/month=*/day=*/*.parquet',
    hive_partitioning = true
)
WHERE year = 2026 AND month = 5
GROUP BY day
ORDER BY day;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. DuckDB expands the glob and identifies the Hive structure. The virtual columns year, month, day are sourced from folder names — no data file is opened to compute them.
  2. The WHERE clause year = 2026 AND month = 5 is evaluated against the folder list. Only folders matching the predicate survive (31 folders for May 2026 days).
  3. For each surviving folder, DuckDB opens the Parquet file and scans the day column (and an implicit * for COUNT). The other partitions never touch disk.
  4. The hash-aggregate emits one row per day. The result is a 31-row table.

Output.

day events
01 5400
02 6100
31 5800

Rule of thumb. Always structure Parquet datasets with Hive partitioning on the columns you most commonly filter on (date, region, tenant). A year=/month=/day= layout turns a billion-row scan into a few-thousand-row scan with zero per-row cost.

Detailed explanation — projection + filter pushdown into row groups

Topic explanation in three labelled bullets.

  • Row group — a Parquet file is divided into row groups (typically 100 MB to 1 GB each). Each row group stores column chunks plus min/max/null-count statistics per column.
  • Statistic-based pruning — DuckDB reads the footer, examines per-row-group statistics, and skips any row group whose stats fail the WHERE predicate. The pruned bytes are never fetched from disk or network.
  • EXPLAIN ANALYZE — DuckDB's plan printer shows "Filters" and the number of row groups read vs. skipped. Use it to verify pushdown is happening.

Question. A single Parquet file has 10 row groups (1 GB each, 10 GB total). The WHERE predicate filters order_date >= '2026-05-01'. Only the last 3 row groups have any May data. Show the EXPLAIN plan that confirms 7 row groups were pruned.

Input. (single Parquet file orders.parquet, 10 row groups, dates evenly distributed)

Code.

EXPLAIN ANALYZE
SELECT region, SUM(amount) AS total
FROM read_parquet('orders.parquet')
WHERE order_date >= DATE '2026-05-01'
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The planner pushes the predicate order_date >= '2026-05-01' into the Parquet reader.
  2. The reader opens the file footer (a few KB), enumerates 10 row groups, and inspects each one's order_date statistics.
  3. Seven row groups have max(order_date) < '2026-05-01' — pruned. Three row groups have max(order_date) >= '2026-05-01' — scanned.
  4. For the three scanned row groups, only the region, amount, and order_date column chunks are read (projection pushdown). Other columns' pages stay on disk.
  5. The vectorised hash-aggregate runs over the surviving data.

Output (EXPLAIN ANALYZE excerpt).

step metric value
TABLE_SCAN row groups read 3
TABLE_SCAN row groups skipped 7
TABLE_SCAN bytes read ~250 MB (3 row groups × 3 cols)
HASH_AGGREGATE result rows 3 (regions)

Rule of thumb. Always check EXPLAIN ANALYZE for the row-groups-read count after a slow query. If pruning is not happening, the column you're filtering on is either not sorted/clustered within row groups (so every row group's min/max spans the predicate range) or the file was written without statistics. The fix is to re-write Parquet with a sort on the filter column.

Detailed explanation — S3 + httpfs + SECRET for cloud Parquet

Topic explanation in three labelled bullets.

  • httpfs extension — installed and loaded once per session: INSTALL httpfs; LOAD httpfs;. Adds support for s3://, https://, gcs://, and azure:// URIs.
  • SECRETCREATE SECRET (TYPE S3, KEY_ID '...', SECRET '...', REGION '...') stores credentials in the session. Multiple secrets can be scoped by path prefix.
  • HTTP range requests — DuckDB fetches only the byte ranges it needs (Parquet footer + the projected column chunks of the surviving row groups). A 100 GB S3 file can be scanned by downloading only a few MB.

Question. Configure DuckDB to read Parquet from a private S3 bucket and run a scan over s3://my-bucket/events/year=2026/. Show the three setup lines plus one SELECT.

Input. (S3 bucket my-bucket with Hive-partitioned events; AWS access key + secret)

Code.

-- One-time per session
INSTALL httpfs;
LOAD httpfs;

-- Auth via SECRET (more secure than SET-based env vars)
CREATE SECRET s3_default (
    TYPE     S3,
    KEY_ID   '...',
    SECRET   '...',
    REGION   'us-east-1'
);

-- Scan
SELECT region, COUNT(*) AS events
FROM read_parquet(
    's3://my-bucket/events/year=2026/month=*/*.parquet',
    hive_partitioning = true
)
WHERE month >= 5
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. INSTALL httpfs downloads and caches the extension on first use; LOAD httpfs registers its function tables in the session.
  2. CREATE SECRET stores S3 credentials in the session's secret store. Subsequent S3 URIs use the secret automatically.
  3. read_parquet('s3://...') issues a HEAD then a small GET for each file's footer to learn schemas and statistics.
  4. The Hive partition predicate (month >= 5) prunes folder list — only May/June/July/etc. folders survive.
  5. For each surviving file, range GETs pull the projected column chunks of surviving row groups. The total bytes downloaded are a tiny fraction of the bucket.
  6. The vectorised aggregate streams the data through the hash table.

Output.

region events
EU 1,200,000
US 900,000
APAC 400,000

Rule of thumb. Use CREATE SECRET over SET s3_access_key_id = '...' — it scopes nicely, doesn't leak into query logs, and supports multiple buckets with different roles. For production CI runs, prefer TYPE S3, PROVIDER 'credential_chain' to pick up the IAM role from the runner instead of hard-coding keys.

SQL interview question on partition pruning and projection pushdown

A senior interviewer often opens with: "Your team has a 10 TB Parquet dataset on S3, Hive-partitioned by date. The dashboard query filters one day and projects 5 of 50 columns. How do you reason about the bytes actually transferred from S3 to DuckDB?"

Solution Using filter + projection that prunes 90% of row groups

-- The dashboard SELECT
SELECT
    region,
    user_id,
    COUNT(*) AS events,
    SUM(amount) AS total
FROM read_parquet(
    's3://my-bucket/events/year=*/month=*/day=*/*.parquet',
    hive_partitioning = true
)
WHERE year = 2026 AND month = 5 AND day = 12
GROUP BY region, user_id;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

stage what happens bytes transferred
folder enumeration Hive prefix listed; predicate year=2026 AND month=5 AND day=12 matches one folder ~few KB (LIST API)
file footers DuckDB reads Parquet footer of each surviving file ~10 KB × N files
row-group stats per-row-group stats checked against any further predicates (already in footer)
column chunk reads only region, user_id, amount chunks of surviving row groups small fraction of 10 TB
in-DuckDB aggregate vectorised hash-group on streamed batches (in memory)

The vast majority of the 10 TB is never read. Hive partitioning prunes 99.97% of the dataset at the folder level (1 day out of ~1000). Projection pushdown reads 3 columns out of 50 — another 94% saving on the surviving data. The actual bytes pulled from S3 are typically < 1 GB for what looks like a 10 TB query.

Output:

region user_id events total
EU 1001 12 480
EU 1002 9 360
US 2001 7 280

Why this works — concept by concept:

  • Hive partitioning — folder-level keys cut the dataset to a single partition before any file is opened.
  • Row-group statistics — per-row-group min/max prune within each surviving file; only row groups whose range overlaps the predicate are read.
  • Projection pushdown — only the columns the SELECT names have their column chunks fetched. Other columns' pages stay on S3.
  • HTTP range requestshttpfs issues precise byte-range GETs; you pay S3 for the bytes you actually read, not for the whole file.
  • Cost — O(surviving_row_groups × projected_columns × avg_chunk_size). For a typical dashboard query on a well-partitioned dataset, this is < 1% of the total dataset size.

SQL
Topic — filtering
Filtering problems (SQL)

Practice →


4. Local ETL workflows — Python + dbt-duckdb + tests + Git

local etl with DuckDB is the new dev loop — one binary, one Python file, one dbt project, one pytest run, all in CI under two minutes

The mental model in one line: DuckDB collapses the dev/CI/prod-parity problem — write your transforms in duckdb python (or dbt-duckdb), run them against the same SQL on every machine, and let GitHub Actions execute the entire pipeline in seconds for free. Once you say "local ETL," the temptation to spin up a warehouse for prototyping evaporates.

Iconographic stylised workbench — three tool-clusters on a wooden-grain bench: a DuckDB shell terminal glyph, a dbt-duckdb yaml model card, a pytest tube-rack glyph; a small Git icon on the side and a floating top-right chip 'no server · single binary · CI-friendly', on a light PipeCode card.

The eight workflow primitives in labelled bullets.

  • duckdb pythonimport duckdb; con = duckdb.connect(...). The same engine that powers the CLI is a first-class Python library. Result objects expose .fetchall(), .df(), .arrow(), .pl() (Polars).
  • dbt-duckdb — the official adapter that lets dbt run against DuckDB. profiles.yml points to a .duckdb file (or :memory:); dbt run materialises models as tables or views.
  • local etl — read CSV / JSON / Parquet → transform via SQL → write Parquet. The whole pipeline is one Python file or one dbt project, no scheduler required.
  • pytest — fixtures that spin up duckdb.connect() in :memory: mode. Tests assert on result rows, schemas, or row counts. Tests run in milliseconds.
  • Git — the project (Python files, models/, tests/, profiles.yml, Makefile) is version-controlled. The .duckdb file itself is not checked in for production; for tutorials/demos a small file is fine.
  • CI runner — GitHub Actions / GitLab CI / CircleCI execute the entire pipeline on a clean runner. DuckDB installs in seconds; the runtime is dominated by the actual transform.
  • Ephemeral DB:memory: DuckDB for each test ensures isolation. No leftover state between tests.
  • profiles.yml — the dbt profile entry: type: duckdb, path: 'dev.duckdb', plus extensions and threads. One file controls dev, staging, and prod targets.

Common interview probes on local ETL.

  • "Why dbt-duckdb instead of dbt-snowflake for dev?" — same SQL, runs in seconds locally, no warehouse cost.
  • "How do you make tests fast?" — :memory: connections per test; no I/O; vectorised compute.
  • "How do you keep dev and prod in sync?" — same dbt project, different targets; tests run on both. SQL portability between DuckDB and Snowflake is high (CTEs, window functions, common types match).
  • "What goes in version control?" — Python, SQL models, tests, profiles.yml.example (without secrets), Makefile. The .duckdb file goes in .gitignore.

Detailed explanation — a 30-line Python ETL with DuckDB

Topic explanation in three labelled bullets.

  • Readread_csv_auto(), read_json_auto(), read_parquet() are all built-in. No external libraries required for the I/O.
  • Transform — write SQL. DuckDB's SQL surface is large: window functions, CTEs, lateral joins, QUALIFY, PIVOT, UNPIVOT, regular expressions, JSON path.
  • WriteCOPY ... TO 'out.parquet' (FORMAT PARQUET, COMPRESSION ZSTD). The output is partition-friendly, column-oriented, and analytics-ready.

Question. Build a tiny ETL: read a CSV of raw orders, compute one aggregate per region, write the result as a Parquet file. Show the entire pipeline in under 30 lines of Python.

Input. (orders.csv with columns order_id, region, amount)

Code.

# etl.py — 22 lines
import duckdb

con = duckdb.connect("warehouse.duckdb")

con.execute("""
    CREATE OR REPLACE TABLE stg_orders AS
    SELECT
        CAST(order_id AS BIGINT)  AS order_id,
        UPPER(region)             AS region,
        CAST(amount AS DECIMAL(10,2)) AS amount
    FROM read_csv_auto('orders.csv', header = true)
""")

con.execute("""
    COPY (
        SELECT
            region,
            COUNT(*)    AS orders,
            SUM(amount) AS total
        FROM stg_orders
        GROUP BY region
        ORDER BY total DESC
    ) TO 'mart_region_totals.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD);
""")

print("ETL complete: mart_region_totals.parquet written.")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. read_csv_auto() sniffs the CSV header, infers types, and streams rows into the SQL engine.
  2. The staging CREATE TABLE applies a tiny normalisation: cast types, uppercase region. The result lives in the persistent warehouse.duckdb file.
  3. The aggregate SELECT runs vectorised over the staging table.
  4. COPY (...) TO 'out.parquet' (FORMAT PARQUET, COMPRESSION ZSTD) streams the result directly to a Parquet file with ZSTD compression. No intermediate table required.
  5. The script is idempotent: re-running it drops and recreates stg_orders and overwrites the Parquet output.

Output.

file size content
warehouse.duckdb ~few MB persistent staging table
mart_region_totals.parquet ~few KB aggregated mart (one row per region)

Rule of thumb. Lean on read_csv_auto for ad-hoc CSV; the auto schema inference is rock-solid for most files. For production, declare explicit types via the columns = {...} argument so a malformed input fails fast instead of silently producing the wrong type.

Detailed explanation — dbt-duckdb profiles.yml + a staging model

Topic explanation in three labelled bullets.

  • Adapterpip install dbt-duckdb installs the official adapter. dbt-core and the adapter together are <50 MB.
  • Profile target~/.dbt/profiles.yml declares the target connection: type: duckdb, path: 'dev.duckdb', threads: 4. The same project can have dev, ci, and prod targets.
  • Model filemodels/staging/stg_orders.sql is plain dbt: a {{ config(materialized='table') }} line plus a SELECT. dbt-duckdb materialises it as a real DuckDB table.

Question. Set up a minimal dbt-duckdb project with one staging model. Show the profiles.yml, the model file, and the dbt run command.

Input.

project/
├── dbt_project.yml
├── profiles.yml          # checked-in profiles.yml.example; real one in ~/.dbt/
├── models/
│   └── staging/
│       └── stg_orders.sql
└── seeds/
    └── orders.csv
Enter fullscreen mode Exit fullscreen mode

Code.

# ~/.dbt/profiles.yml
demo:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'dev.duckdb'
      threads: 4
    ci:
      type: duckdb
      path: ':memory:'
      threads: 2
Enter fullscreen mode Exit fullscreen mode
-- models/staging/stg_orders.sql
{{ config(materialized = 'table') }}

SELECT
    CAST(order_id AS BIGINT)        AS order_id,
    UPPER(region)                   AS region,
    CAST(amount AS DECIMAL(10, 2))  AS amount,
    CAST(order_date AS DATE)        AS order_date
FROM {{ ref('orders') }}            -- the seed file
Enter fullscreen mode Exit fullscreen mode
# from the project directory
dbt deps           # install any packages
dbt seed           # load orders.csv into the warehouse
dbt run            # materialise stg_orders as a table
dbt test           # run any data tests
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The dev target writes to dev.duckdb on disk; the ci target uses :memory: so the runner has zero state at end of job.
  2. dbt seed reads seeds/orders.csv and creates a orders table in the database. dbt run then references it via ref('orders').
  3. {{ config(materialized='table') }} instructs dbt-duckdb to CREATE OR REPLACE TABLE stg_orders AS SELECT ….
  4. The SQL is plain DuckDB dialect; no Snowflake-specific functions, no warehouse macros. The same SQL runs against any DuckDB target.
  5. dbt test executes any tests/*.sql files plus generic schema tests defined in schema.yml.

Output.

step result
dbt seed orders table loaded
dbt run stg_orders table materialised
dbt test all tests pass (or fail with row counts)

Rule of thumb. Keep dev/CI/prod profiles in a single profiles.yml keyed by target. Never check in real secrets; commit a profiles.yml.example to the repo and point developers at ~/.dbt/profiles.yml. For CI, write the profile from environment variables at job start.

Detailed explanation — pytest fixture that spins up an in-memory DuckDB

Topic explanation in three labelled bullets.

  • In-memory per testduckdb.connect() with no path returns a fresh :memory: database. Each fixture creates a new connection so tests are isolated.
  • Seed in fixture — the fixture loads CSV / Parquet / inline VALUES into staging tables. The test then executes transforms and asserts on results.
  • Speed — vectorised compute over small in-memory data is microsecond-fast. A 1000-test suite runs in seconds.

Question. Write a pytest fixture that yields a DuckDB connection seeded with a small orders table, then a test that asserts the regional aggregate is correct.

Input. (no external file — data inlined)

Code.

# tests/conftest.py
import duckdb, pytest

@pytest.fixture
def con():
    c = duckdb.connect()  # :memory:
    c.execute("""
        CREATE TABLE orders (order_id INT, region VARCHAR, amount INT);
        INSERT INTO orders VALUES
            (1, 'EU', 10),
            (2, 'US', 20),
            (3, 'EU', 30);
    """)
    yield c
    c.close()

# tests/test_regional_total.py
def test_region_eu_total(con):
    result = con.execute("""
        SELECT SUM(amount) FROM orders WHERE region = 'EU'
    """).fetchone()[0]
    assert result == 40

def test_region_count(con):
    result = con.execute("""
        SELECT region, COUNT(*) FROM orders GROUP BY region ORDER BY region
    """).fetchall()
    assert result == [('EU', 2), ('US', 1)]
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The con fixture creates a fresh :memory: database per test (pytest's default fixture scope is function).
  2. The fixture seeds a tiny orders table with three rows. This is the test's "world" — no external files, no shared state.
  3. Each test executes a SQL query and asserts on the result.
  4. After the test, c.close() releases the memory. The next test gets a clean DB.
  5. The whole suite runs in milliseconds because no I/O is involved.

Output.

test result
test_region_eu_total PASSED
test_region_count PASSED

Rule of thumb. Always seed inline VALUES for unit tests — the source of truth is in the test file, not in a fixture CSV. Save file-based seeds for integration tests that exercise the CSV reader itself. The 100ms saved per test compounds across hundreds of tests.

Detailed explanation — running the entire pipeline in GitHub Actions

Topic explanation in three labelled bullets.

  • Single runner — one Ubuntu runner installs Python, pip-installs DuckDB + dbt-duckdb + pytest, runs the pipeline, and exits. Total runtime: typically 1–3 minutes for a small project.
  • No services — the runner has no managed warehouse, no auth secrets to manage, no network egress to the cloud. Everything happens on the runner.
  • Cache the wheel — GitHub Actions caches the pip cache; cold runs install in 30 seconds, warm runs in 5 seconds.

Question. Write a .github/workflows/ci.yml that installs DuckDB, runs dbt build, then runs pytest. Show the entire workflow in under 30 lines.

Input. (a repo with dbt_project.yml, models/, tests/, requirements.txt)

Code.

name: CI
on:
  push: { branches: [main] }
  pull_request:

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip

      - name: Install deps
        run: pip install -r requirements.txt   # duckdb, dbt-duckdb, pytest

      - name: Write CI profile
        run: |
          mkdir -p ~/.dbt
          cat > ~/.dbt/profiles.yml <<'EOF'
          demo:
            target: ci
            outputs:
              ci:
                type: duckdb
                path: ':memory:'
                threads: 2
          EOF

      - name: dbt build (seed, run, test)
        run: dbt build --target ci

      - name: pytest
        run: pytest -q
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The runner checks out the repo and sets up Python 3.12 with a cached pip cache.
  2. pip install -r requirements.txt installs DuckDB, dbt-duckdb, and pytest. The wheel comes from cache on warm runs.
  3. The "Write CI profile" step generates ~/.dbt/profiles.yml at the runner level using a heredoc — the profile points to :memory: so no file lingers.
  4. dbt build runs seed, run, test in one command. Models are materialised in the in-memory DB; tests verify them.
  5. pytest -q runs Python-level tests against in-memory DuckDB connections.
  6. Total wall time on a clean runner: 1–2 minutes for typical projects. There's no warehouse credit consumed.

Output.

step typical time
checkout + setup-python 5 s
pip install 5–30 s
dbt build 10–60 s
pytest 5–30 s
total ~1–2 min

Rule of thumb. Always use :memory: DuckDB in CI — the runner has zero state at start, zero state at end, and the in-memory engine is the fastest path. Reserve a persistent .duckdb file for integration tests that specifically need to exercise the on-disk format.

Python interview question on building a reproducible local ETL with DuckDB

A senior interviewer might frame this as: "Your team wants a single command (make ci) that runs the whole pipeline end-to-end against the same SQL we deploy to production. How do you structure the project so the laptop, CI runner, and prod warehouse all execute the same models?"

Solution Using duckdb.connect() + dbt run + pytest, all driven from a single Makefile

# Makefile (excerpt)
#
# make seed   — load seeds into dev DB
# make run    — materialise models
# make test   — run dbt tests + pytest
# make ci     — run the entire pipeline end-to-end
Enter fullscreen mode Exit fullscreen mode
.PHONY: seed run test ci clean

DBT_TARGET ?= dev

seed:
    dbt seed --target $(DBT_TARGET)

run:
    dbt run --target $(DBT_TARGET)

test:
    dbt test --target $(DBT_TARGET)
    pytest -q

ci: clean
    $(MAKE) seed DBT_TARGET=ci
    $(MAKE) run  DBT_TARGET=ci
    $(MAKE) test DBT_TARGET=ci

clean:
    rm -f dev.duckdb dev.duckdb.wal
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

command DBT target DuckDB file duration result
make seed dev dev.duckdb ~1 s seeds loaded
make run dev dev.duckdb ~2 s models materialised
make test dev dev.duckdb + pytest in-memory ~3 s tests pass
make ci ci :memory: ~5 s full pipeline pass

The Makefile defines the same four verbs (seed, run, test, clean) and parameterises by DBT_TARGET. Local dev uses dev target (persistent dev.duckdb); CI uses ci target (:memory:). Prod (when wired) uses a prod target pointing at a real warehouse — same Make verbs, same dbt commands, same SQL.

Output:

command exit code
make ci 0 (all green)

Why this works — concept by concept:

  • One project, three targets — the dev/ci/prod split lives in profiles.yml. The Makefile and the dbt models are unchanged; only the connection is swapped.
  • Same SQL everywhere — the dbt models use ANSI-friendly SQL that runs on DuckDB and (with minimal macros) on Snowflake/Postgres. No "DuckDB-only" patterns leak into the model.
  • Reproducibilitymake clean && make ci is the deterministic, hermetic run. Any failure is the model's fault, not the environment's.
  • Fast feedback loopmake test against dev.duckdb finishes in seconds. The "edit → save → make test" cycle is faster than a warehouse query takes to plan.
  • Cost — zero per CI run. The runner is the bottleneck, not a managed compute credit.

Python
Topic — file IO
File I/O problems (Python)

Practice →


5. Where DuckDB fits (and where it doesn't) — laptops, CI, edge, notebook

duckdb analytics shines in four quadrants — laptop, CI, edge, notebook — and the anti-patterns are equally important to name out loud

The mental model in one line: DuckDB owns the in-process OLAP niche: single-machine workloads where the data fits (or streams through) one process and the writer count is at most one. Anywhere you need multi-writer concurrency, sub-second multi-user serving, or a long-running shared OLAP service, you've left DuckDB's lane. Once you can name the four right-fit quadrants and the three anti-patterns, the architecture decision becomes mechanical.

Iconographic 4-quadrant deployment matrix — laptop, CI runner, edge, and notebook quadrants each with a hero glyph and a chip recommending fit; a side card 'anti-patterns' listing long-running OLAP cluster, concurrent writes, and multi-user serving chips, on a light PipeCode card.

The four right-fit quadrants in labelled bullets.

  • Laptop analytics — engineers and analysts running ad-hoc SQL against tens of GBs of local or S3 Parquet. NVMe + 32 GB RAM + DuckDB outperforms many small warehouses on cold scans.
  • CI runner — dbt projects and pytest suites that need to run on every PR. DuckDB installs in seconds; the entire pipeline runs in 1–2 minutes; zero warehouse cost.
  • Edge / embedded — a single-machine deployment (an IoT device, a Lambda function, a Cloudflare worker, a desktop app) needs SQL over local Parquet. DuckDB ships as one binary with no daemon.
  • Notebook — Jupyter / Marimo / Hex / Polars-LazyFrame workflows. The Arrow zero-copy bridge means pandas/Polars and DuckDB share buffers; the notebook can flip between SQL and DataFrame syntax for free.

The three anti-patterns in labelled bullets.

  • Long-running OLAP cluster — if you need a shared, always-on, multi-node compute that 50 analysts hit concurrently, DuckDB is the wrong tool. Use Snowflake/BigQuery/Databricks/ClickHouse — engines designed for that topology.
  • Concurrent writes — multiple processes writing to the same .duckdb file is not supported. Multiple threads writing within one process serialise behind a write latch. If your pipeline genuinely needs N concurrent writers, you've outgrown DuckDB's niche.
  • Multi-user serving — sub-second OLAP responses for hundreds of concurrent users (dashboards, real-time analytics for an app) is not DuckDB's sweet spot. Reach for ClickHouse, Druid, Pinot, or a CDN-cached materialised view.

Common interview probes on fit.

  • "When would you replace DuckDB with Snowflake?" — when the workload requires sub-second multi-user serving, when the data > 1 TB and needs always-on multi-node compute, or when concurrent writes are part of the design.
  • "Can DuckDB serve a dashboard?" — yes for one user (e.g. Streamlit), no for 100 concurrent users. Put a cache or a real serving engine in front for scale.
  • "Is DuckDB production-ready?" — yes, since 1.0 (2024). It runs nightly ETL, replaces SQLite-as-OLAP, drives CI, and powers many Lambda-based analytics jobs in production.

Detailed explanation — laptop analytics quadrant

Topic explanation in three labelled bullets.

  • NVMe is the secret weapon — modern laptops have 3–7 GB/s NVMe storage. DuckDB's columnar scans saturate the bus; cold scans of 50 GB Parquet complete in a minute or two.
  • One developer, many queries — the single-writer constraint is invisible because there is only one user. Many parallel reads from the OS page cache stay fast.
  • Local-first — data lives in ./data/ or in S3 via httpfs. Either way, the engine runs in the developer's terminal — no VPN, no console login, no auth dance.

Question. A developer wants to explore a 50 GB Parquet dataset interactively on a 32 GB MacBook. Show how DuckDB makes this practical and why a small warehouse would feel slower.

Input. (50 GB Parquet directory events/)

Code.

import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '24GB'")

# First scan — cold from NVMe
df = con.execute("""
    SELECT region, COUNT(*) AS events, SUM(amount) AS total
    FROM read_parquet('events/*.parquet')
    GROUP BY region
""").df()

# Subsequent scans — warm from OS page cache
df2 = con.execute("""
    SELECT user_id, AVG(amount)
    FROM read_parquet('events/*.parquet')
    WHERE region = 'EU'
    GROUP BY user_id
""").df()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The first SELECT reads all events/*.parquet footers (a few MB), prunes nothing (no predicate), and streams the region and amount columns through a vectorised aggregate.
  2. NVMe delivers ~5 GB/s; the scanned column bytes are ~20 GB. Wall time: ~5–10 seconds.
  3. The OS caches the recently-read pages in RAM. The next query (WHERE region = 'EU') reuses cached pages and scans only the columns it projects.
  4. Warm-cache wall time: under one second for most subsequent queries.
  5. A small Snowflake warehouse would pay the cluster-warm-up cost (seconds) and the data-shuffle cost (more seconds), often resulting in a slower interactive experience for this data size.

Output.

query wall time (typical)
first scan (cold NVMe) 5–10 s
second scan (warm cache) < 1 s

Rule of thumb. For ad-hoc exploration up to a few hundred GB on a single machine, DuckDB on a laptop is the fastest interactive experience available — and the cheapest.

Detailed explanation — CI quadrant

Topic explanation in three labelled bullets.

  • Cold installpip install duckdb dbt-duckdb on a clean runner is ~30 seconds (5 seconds with cache). Compare with a warehouse provision, which is "minutes."
  • Full pipeline per PR — every PR runs seed → run → test against the same SQL it will ship. There's no "tested locally, broke in CI" gap.
  • Cost — zero managed-compute cost. The runner is the only billable resource, and it's usually under 2 minutes.

Question. A team runs dbt against Snowflake in prod but wants every PR to exercise the full pipeline in CI without a Snowflake credit. Show the dbt-duckdb pattern that achieves that.

Input. (dbt project with Snowflake-targeted SQL — ANSI-compatible)

Code.

# profiles.yml (CI section)
demo:
  target: ci
  outputs:
    ci:
      type: duckdb
      path: ':memory:'
      threads: 2
      extensions: [httpfs, parquet]
Enter fullscreen mode Exit fullscreen mode
# CI step
dbt build --target ci
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The CI target points at an in-memory DuckDB connection. No external warehouse is touched.
  2. dbt build runs seed, run, test in order. The pipeline executes against in-memory tables.
  3. Tests fail fast on schema/data violations. The runner exits non-zero; the PR is blocked until fixed.
  4. The same SQL ships to Snowflake in prod via the prod target. The SQL is ANSI-compatible (CTEs, window functions, standard types) so portability is high.
  5. The total CI runtime is ~1–2 minutes; the cost is zero managed compute.

Output.

run dbt build duration warehouse credits
CI (DuckDB :memory:) ~30 s 0
Prod (Snowflake X-Small) ~20–40 s metered

Rule of thumb. Use dbt-duckdb in CI for any project that runs against Snowflake/BigQuery/Postgres in prod. Keep the SQL ANSI-compatible by avoiding warehouse-specific functions (or wrapping them in dbt macros that swap by target).

Detailed explanation — notebook/Polars quadrant

Topic explanation in three labelled bullets.

  • Arrow bridge — pandas (with Arrow backend), Polars, and DuckDB share columnar memory. A df in Python is a table in DuckDB and vice versa; no SerDe.
  • Mix SQL and DataFrame — express set-based transforms (joins, group-bys) as SQL; express row-wise transforms (string ops, complex expressions) as DataFrame methods. Pick the right tool per step.
  • Polars LazyFrame interoppl.scan_parquet(...).collect() plus DuckDB SQL gives you a hybrid query plan. Polars optimises its part; DuckDB optimises its part.

Question. In a notebook, read 10 GB of Parquet with Polars, hand it to DuckDB for a SQL aggregate, return as a pandas DataFrame for plotting. Show the three lines.

Input. (Polars LazyFrame over events/*.parquet)

Code.

import polars as pl
import duckdb

# Polars handles the scan / filter
lf = pl.scan_parquet("events/*.parquet").filter(pl.col("region") == "EU")

# DuckDB SQL — Polars frame is recognised by name
df = duckdb.sql("""
    SELECT user_id, COUNT(*) AS events, SUM(amount) AS total
    FROM lf
    GROUP BY user_id
    ORDER BY total DESC
    LIMIT 10
""").df()

print(df)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. pl.scan_parquet(...) is a lazy plan in Polars; nothing is read yet.
  2. The .filter(...) adds a predicate to the Polars plan.
  3. duckdb.sql("... FROM lf ...") recognises lf as a Polars LazyFrame in the local scope. DuckDB collects it via Arrow (zero-copy where possible) and runs the SQL.
  4. .df() returns the result as a pandas DataFrame, again via Arrow.
  5. No explicit conversion was written; the Arrow layer handled the handoffs.

Output.

user_id events total
1001 1200 480000
1002 950 380000

Rule of thumb. Use Polars for fluent DataFrame transforms (filter, with_columns, group_by) and DuckDB for set-based SQL (multi-table joins, window functions, CTEs). The Arrow zero-copy bridge means picking the right tool per step has no perf penalty.

Detailed explanation — anti-patterns

Topic explanation in three labelled bullets.

  • Long-running OLAP cluster — DuckDB is not designed to run as an always-on service that hundreds of clients connect to. It has no scheduler, no resource governor, no per-user accounting.
  • Concurrent writes — two processes opening the same .duckdb file in write mode fails. Two threads in one process serialise behind a write latch.
  • Multi-user serving — sub-second OLAP for hundreds of concurrent dashboards needs caching, sharding, or a serving-grade engine.

Question. Three real-world scenarios — pick the right tool for each and explain why DuckDB is wrong for two of them.

Input.

scenario requirements
A nightly batch report on 100 GB Parquet, one engineer
B real-time dashboard for 1000 concurrent users on 1 TB of clickstream
C central data warehouse for 50 analysts running concurrent ad-hoc SQL

Code.

A → DuckDB (single-writer batch, no concurrency needed)
B → ClickHouse / Druid / Pinot (multi-user low-latency serving)
C → Snowflake / BigQuery / Databricks (always-on shared compute, many concurrent users)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Scenario A is a textbook fit: one writer (the nightly ETL), one or two readers (the report viewer), batch window. DuckDB is the cheapest and fastest answer.
  2. Scenario B violates "multi-user serving." 1000 concurrent users hitting one DuckDB process would starve on the single-writer latch the moment any background refresh kicks in. Use a serving-grade OLAP engine.
  3. Scenario C violates "always-on shared compute." 50 analysts need a managed warehouse with resource governance, per-user accounting, and concurrent compute scaling.
  4. Naming the constraint out loud (writer count, concurrency, serving SLO) is the senior signal — interviewers don't want a religious answer, they want the constraint check.

Output.

scenario recommended engine reason
A DuckDB single writer, batch, fits in one process
B ClickHouse / Druid / Pinot sub-second OLAP for many users
C Snowflake / BigQuery / Databricks shared always-on compute for many analysts

Rule of thumb. Memorise the three anti-patterns — long-running OLAP cluster, concurrent writes, multi-user serving — and check them first whenever DuckDB comes up. If none triggers, DuckDB is almost certainly the right answer; if any triggers, name the alternative without ceremony.

SQL interview question on picking DuckDB vs Snowflake/BigQuery

A senior interviewer often opens with: "Your team has a dbt project that runs nightly on Snowflake. The cost is $2k/month, almost all of which is the nightly build. The data is < 500 GB and changes once a day. Should you migrate to DuckDB? Walk me through the analysis."

Solution Using a four-quadrant deployment matrix

-- The decision query, expressed as a tiny analytics SELECT.
WITH workload AS (
    SELECT
        '500 GB nightly batch'           AS pattern,
        FALSE                             AS multi_writer_concurrent,
        FALSE                             AS sub_second_serving,
        FALSE                             AS many_concurrent_users,
        'NVMe + 64GB RAM CI node'         AS available_compute,
        2000                              AS current_monthly_cost_usd
)
SELECT
    pattern,
    CASE
        WHEN multi_writer_concurrent       THEN 'stay on Snowflake'
        WHEN sub_second_serving            THEN 'stay on Snowflake (or switch to a serving OLAP)'
        WHEN many_concurrent_users         THEN 'stay on Snowflake'
        ELSE                                    'migrate to DuckDB on a CI node'
    END                                     AS recommendation,
    CASE
        WHEN multi_writer_concurrent
          OR sub_second_serving
          OR many_concurrent_users           THEN current_monthly_cost_usd
        ELSE                                      50      -- CI runner monthly cost estimate
    END                                     AS projected_monthly_cost_usd
FROM workload;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

trait value branch evaluated recommendation
multi_writer_concurrent FALSE first WHEN false continue
sub_second_serving FALSE second WHEN false continue
many_concurrent_users FALSE third WHEN false continue
(catch-all ELSE) matched → migrate to DuckDB migrate

The decision query mirrors the mental check. None of the three "stay on the warehouse" anti-patterns trigger, so the cost-effective answer is a DuckDB pipeline on a CI node. The projected cost drops from $2000/month to ~$50/month — a 40x reduction — and the build time often drops too because there's no warehouse-warm-up step.

Output:

pattern recommendation projected_monthly_cost_usd
500 GB nightly batch migrate to DuckDB on a CI node 50

Why this works — concept by concept:

  • Anti-pattern check — the three triggers (multi-writer, sub-second serving, many concurrent users) cover 95% of "must stay on a warehouse" reasons. If none trip, DuckDB is the cheaper and often faster choice.
  • Single-machine compute — 500 GB fits comfortably on a modern CI node with NVMe and 64 GB RAM. The vectorised scan keeps the workload CPU-bound, not memory-bound.
  • Cost arithmetic — Snowflake credits are charged per-second of warehouse time; a $2000/month nightly job is ~80 hours of XS warehouse credits. A 1-hour nightly CI runner is closer to $50/month all-in.
  • SQL portability — dbt models written for Snowflake are typically 90%+ portable to DuckDB. The remaining 10% is wrapped in dbt macros that swap by target.
  • Cost — O(scanned bytes) per query. With Parquet columnar scans and partition pruning, the actual bytes per job are tiny relative to the dataset size, so the runner finishes inside the budget.

SQL
Topic — optimization
Optimization problems (SQL)

Practice →


Cheat sheet — DuckDB recipes

  • Install in any language. Python: pip install duckdb. Node: npm install duckdb. CLI: brew install duckdb. R: install.packages('duckdb'). Every binding wraps the same C++ engine.
  • Connect. con = duckdb.connect() for :memory:; con = duckdb.connect('warehouse.duckdb') for a persistent file. There is no host/port/user/password.
  • Run a query. con.execute('SELECT 1').fetchone() returns a tuple; .fetchall() returns a list; .df() returns pandas; .arrow() returns a PyArrow Table; .pl() returns Polars.
  • Read Parquet in place. SELECT * FROM read_parquet('data/*.parquet'). Add hive_partitioning = true for Hive-style folders.
  • Read CSV with schema inference. SELECT * FROM read_csv_auto('orders.csv', header = true). Override types with columns = {'order_id': 'BIGINT', ...}.
  • Write Parquet. COPY (SELECT ...) TO 'out.parquet' (FORMAT PARQUET, COMPRESSION ZSTD). Add PARTITION_BY (region) for partitioned output.
  • Install extensions. INSTALL httpfs; LOAD httpfs; once per session. INSTALL aws; for IAM role auth.
  • S3 credentials. CREATE SECRET s3_default (TYPE S3, KEY_ID '...', SECRET '...', REGION '...'). Avoid SET s3_* env-style settings for production.
  • Query a DataFrame. Any pandas/Polars/PyArrow variable in scope is queryable by name: SELECT * FROM df.
  • Spill control. SET memory_limit = '12GB'; SET temp_directory = '/tmp/duckdb'; lets big aggregates spill instead of OOMing.
  • EXPLAIN ANALYZE. Verifies projection/filter pushdown by reporting row groups read vs. skipped per scan.
  • dbt target. type: duckdb, path: 'dev.duckdb', threads: 4. Add extensions: [httpfs, parquet] to load on startup.
  • pytest fixture. Yield duckdb.connect() per test; seed inline VALUES; assert on .fetchall(). Sub-millisecond per test.
  • CI workflow. Cache pip, write ~/.dbt/profiles.yml with path: ':memory:', dbt build --target ci, then pytest -q. ~1–2 minutes total.
  • MVCC contract. Single writer per database file, many concurrent readers. BEGIN pins a snapshot; readers never block writers, writers never block readers.
  • Concurrency anti-pattern. Two processes writing the same .duckdb file → the second one errors. Shard by file (one per pipeline) or graduate to a server.

Frequently asked questions

Is DuckDB production-ready?

Yes. DuckDB 1.0 shipped in 2024 with explicit on-disk format stability guarantees, and it now powers production ETL at thousands of companies — laptop analytics, CI builds, edge analytics, Lambda jobs, dbt-driven nightly batches. The right deployment shape is in-process: one writer per file, many readers, batch or interactive workload. Avoid it for multi-writer, multi-user serving, or always-on cluster OLAP. For laptop, CI, edge, and notebook workloads, it is the de facto standard.

How does DuckDB compare to SQLite and Postgres?

DuckDB is to OLAP what SQLite is to OLTP — embedded, in-process, single binary — but the storage layout is columnar instead of row-oriented and the executor is vectorised instead of row-at-a-time. SQLite excels at point lookups and small transactional workloads; Postgres is the full client-server OLTP database with strong OLAP extensions; DuckDB is the OLAP engine that runs inside your process. The headline difference: a GROUP BY region over 50 million rows runs in seconds on DuckDB and minutes on SQLite on the same hardware, because DuckDB reads only the projected columns and processes 1024-row batches per operator call.

Can DuckDB read Parquet directly from S3?

Yes. INSTALL httpfs; LOAD httpfs; once per session adds s3:// URI support to read_parquet(). Configure credentials via CREATE SECRET (TYPE S3, KEY_ID '...', SECRET '...', REGION '...') (or via the IAM credential chain on a CI runner). DuckDB issues HTTP range requests to fetch only the Parquet footer plus the column chunks of the surviving row groups, so a "10 TB scan" can pull as little as a few MB across the wire when partition pruning, row-group statistics, and projection pushdown all fire.

Does DuckDB support concurrent writers?

No — only one writer at a time per database file. Two processes opening the same .duckdb file in write mode will fail; two threads within one process serialise behind a write latch. Readers, however, are unconstrained: any number of read-only connections can scan the same file concurrently. The right pattern for a concurrent pipeline is shard-by-file (one .duckdb per pipeline step or per partition) or to graduate to a server-based engine. If your design genuinely needs N concurrent writers, you've left DuckDB's niche by definition.

How do I use DuckDB with dbt?

Install the official adapter (pip install dbt-duckdb) and add a type: duckdb block to profiles.yml with a path: pointing at a .duckdb file (or :memory: for CI). The dbt models themselves are plain SQL; DuckDB supports CTEs, window functions, QUALIFY, PIVOT, UNPIVOT, regular expressions, and most modern dialect features. dbt build --target ci runs seed + run + test in ~1–2 minutes on a CI runner with no warehouse cost. Keep the SQL ANSI-compatible to maintain portability with a Snowflake/BigQuery/Postgres prod target.

When should I NOT use DuckDB?

Three anti-patterns: a long-running multi-node OLAP cluster shared by 50+ analysts (use Snowflake, BigQuery, Databricks); sub-second OLAP serving for hundreds of concurrent dashboard users (use ClickHouse, Druid, Pinot); concurrent writers from multiple processes (no embedded engine handles this — graduate to a server). Within these constraints, DuckDB is also the wrong choice for pure OLTP — high-rate row-by-row inserts and point lookups belong in Postgres or SQLite. For everything else — laptop analytics, CI, edge, notebook, single-writer batch — DuckDB is usually the fastest and cheapest answer.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every DuckDB recipe above ships with hands-on practice rooms where you write the `read_parquet` scan, the dbt-duckdb model, and the pytest fixture against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `duckdb python` ETL behaves the same in CI as it does on your laptop.

Practice database problems →
Aggregation drills →

Top comments (0)