duckdb quietly became the most disruptive single binary in analytics — a 30 MB executable that scans Parquet at warehouse speeds on a laptop, runs entire dbt projects in CI under two minutes, and turns the "first I'll provision a cluster" reflex into a one-liner: pip install duckdb. The 2026 reality is that a 32 GB MacBook with NVMe storage now outperforms many small Snowflake warehouses on cold Parquet scans, and the only engine that ships in a single dependency is DuckDB. Once you internalise "in-process OLAP," you stop loading data and start scanning it where it already lives.
This duckdb tutorial is the cheat sheet you wished existed the first time someone told you to "just use DuckDB for that report." It walks through the in-process database model, the vectorized executor and Arrow zero-copy that make duckdb python workflows fast, the duckdb parquet scan with partition pruning and httpfs S3, local etl with dbt-duckdb plus pytest plus Git, and the four-quadrant deployment matrix that names where duckdb sql and duckdb analytics actually fit (laptop, CI, edge, notebook) versus where they don't (long-running OLAP cluster, concurrent writes, multi-user serving). Every section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the database design practice library →, rehearse on aggregation problems →, and harden the filtering muscles with filtering problems →.
On this page
- Why in-process OLAP — single binary, no server, columnar, MVCC
- DuckDB engine — vectorized execution, Arrow integration, MVCC
- Reading Parquet at scale — partition pruning, Hive partitioning, S3 / httpfs
- Local ETL workflows — Python + dbt-duckdb + tests + Git
- Where DuckDB fits (and where it doesn't) — laptops, CI, edge, notebook
- Cheat sheet — DuckDB recipes
- Frequently asked questions
- Practice on PipeCode
1. Why in-process OLAP — single binary, no server, columnar, MVCC
duckdb is the analytics counterpart to SQLite — a single-binary in-process OLAP engine that turns "stand up a warehouse" into import duckdb
The mental model in one line: duckdb is to OLAP what SQLite is to OLTP — an embedded, in-process, zero-dependency engine that runs inside your application's address space and exposes SQL over columnar storage. Once you say "in-process database," the rest of the engineering picture (no auth handshake, no network hop, no cluster to wake up, no admin role) snaps into place.
The seven traits that define DuckDB.
-
Single binary — one ~30 MB shared library (
duckdb) with zero runtime dependencies. The Python wheel ships the same engine in pure C++; the CLI is a singleduckdbexecutable. There is nothing to install, configure, or daemon-ise. -
No server — there is no separate process to start, no socket to bind, no
pg_hba.conf, nomy.cnf. DuckDB runs inside your Python interpreter, your Node process, your Rust binary, your dbt run, or yourduckdbCLI shell. -
Columnar storage — both the file format (
.duckdbblocks) and the in-memory execution are column-oriented. The same SELECT that scans 200 GB of CSV at 50 MB/s in Postgres scans Parquet at 2 GB/s in DuckDB because only the projected columns are read. -
MVCC — multi-version concurrency control inside a single writer process. Readers see a consistent snapshot even while a writer is mutating the same table. There is no global lock, no row-level lock dance, no
SELECT ... FOR UPDATE. -
Zero-config —
duckdb.connect()is the entire setup. NoCREATE DATABASE, noGRANT, no role hierarchy. The database is the file path you hand toconnect(), or:memory:for ephemeral runs. -
Arrow native — pandas, Polars, PyArrow, and DuckDB share the same Apache Arrow memory layout. A
dfcan be queried with SQL directly —SELECT * FROM df— with zero serialisation between Python objects and the engine. - OLAP focus — DuckDB optimises for analytical workloads: large scans, aggregates, group-bys, joins over wide rows, vectorized execution. Point lookups exist but are not the target; row-by-row OLTP belongs in Postgres or SQLite.
The 2026 reality.
- Laptops outperform small warehouses on cold Parquet scans. A 32 GB MacBook with NVMe runs a 100-column aggregate over 200 GB of Parquet in tens of seconds; a 1-credit Snowflake X-Small often takes longer because the cluster has to spin up and shuffle data over the network.
-
pip install duckdbships in CI under 30 seconds. GitHub Actions and CircleCI cache the wheel; the cold install is faster than provisioning a managed warehouse. - The dbt-duckdb adapter is officially supported. Entire dbt projects now run "locally" against DuckDB for dev/CI and against a real warehouse only in prod — same SQL, same tests, faster feedback loop.
- Polars + DuckDB + Arrow is the new dataframe stack. Polars handles row-level transforms; DuckDB handles set-based SQL; Arrow is the zero-copy bridge between them. The combination beats pandas by 10x to 100x on memory-bound workloads.
What interviewers listen for.
- Do you say "in-process" the moment DuckDB comes up? — senior signal.
- Do you mention MVCC and single-writer / multi-reader as the concurrency model? — required answer.
- Do you reach for
read_parquet()instead ofCOPYfor cloud data? — senior signal. - Do you name laptop, CI, notebook, edge as the four right-fit quadrants? — senior signal.
Detailed explanation — installing DuckDB and running the first query
Topic explanation in three labelled bullets.
-
One install line per environment — Python:
pip install duckdb. Node:npm install duckdb. CLI:brew install duckdb(macOS) or download the single executable for Linux / Windows. There is no separate server install, no service to enable. -
One connect line per script —
duckdb.connect()returns a connection. With no argument, the database lives in:memory:(ephemeral). With a path argument, the database lives in that file. Both connections speak the same SQL. -
One execute line per query —
conn.execute("SELECT 42").fetchall()returns a list of tuples.conn.execute("SELECT 42").df()returns a pandas DataFrame.conn.execute("SELECT 42").arrow()returns a PyArrow Table.
Question. Show the minimum-viable DuckDB program in Python — install, connect, and run one query that returns a single integer. Then show the same query as a CLI one-liner.
Input. (none — the query is a self-contained constant)
Code.
# Python — three lines after `pip install duckdb`
import duckdb
con = duckdb.connect() # in-memory database
print(con.execute("SELECT 42").fetchall())
# -> [(42,)]
# CLI — one line after the binary is on PATH
duckdb -c "SELECT 42"
# -> ┌───┐
# │ 42│
# └───┘
Step-by-step explanation.
-
import duckdbloads the C++ engine into the Python interpreter's address space. There is no socket open, no daemon spawned. -
duckdb.connect()with no argument creates an in-memory database. The connection is a handle into the engine's MVCC graph for this process. -
con.execute("SELECT 42")parses the SQL, plans the query, and executes it in the same thread. The result is aDuckDBPyRelation(lazy) —.fetchall()materialises it as a list of tuples. - The CLI variant is the same engine wrapped in a tiny REPL binary.
-c "SELECT 42"runs a one-shot query and exits.
Output.
| variant | output |
|---|---|
| Python | [(42,)] |
| CLI | rendered table with 42
|
Rule of thumb. Default to duckdb.connect() (in-memory) for one-off scripts and duckdb.connect("warehouse.duckdb") for projects that need persistence. Never run a separate duckdb-server — there is no such thing on purpose.
Detailed explanation — DuckDB vs SQLite vs Postgres at a glance
Topic explanation in three labelled bullets.
- SQLite is row-store OLTP — perfect for embedded point lookups, small transactional workloads, and config databases. It struggles on aggregates over millions of rows because the executor is row-at-a-time.
- Postgres is row-store OLTP with strong OLAP add-ons — a full client-server database with proper concurrency, foreign keys, and a planner that handles real OLAP via extensions (Citus, pg_duckdb, pg_columnar). The default storage is still row-oriented; you pay for that on wide scans.
- DuckDB is columnar OLAP — embedded like SQLite, but the file format and the executor are both vectorised and column-oriented. The result: order-of-magnitude faster scans for analytical queries on the same hardware.
Question. Compare DuckDB, SQLite, and Postgres on five dimensions — process model, storage layout, concurrency, workload fit, and "where it ships." Highlight the one trait that makes DuckDB uniquely the OLAP-embedded engine.
Input. (none — comparison table only)
Code.
-- The same SELECT runs in all three; only the engine plan differs.
SELECT region, COUNT(*) AS orders, SUM(amount) AS total
FROM orders
GROUP BY region
ORDER BY total DESC;
Step-by-step explanation.
- SQLite scans the
orderstable row by row, materialising every column even though only three are needed. Aggregate is a single-threaded hash map. - Postgres scans heap pages (8 KB row-oriented blocks), reads the same row entirely, then projects the three columns. Aggregate is parallelised in modern versions but still pays the row-read cost.
- DuckDB scans only the
regionandamountcolumns from columnar storage. The aggregate runs a vectorised hash-grouped sum across batches of 1024 values at a time. Most of the data file is never read.
Output.
| dimension | SQLite | Postgres | DuckDB |
|---|---|---|---|
| process model | in-process | client-server | in-process |
| storage layout | row-oriented | row-oriented (default) | columnar |
| concurrency | single-writer + many readers | multi-writer + many readers | single-writer + many readers (MVCC) |
| workload fit | OLTP point lookups | OLTP + add-on OLAP | OLAP scans + analytical SQL |
| ships in | one C library | server + client | one C++ library |
Rule of thumb. Pick SQLite for OLTP at the edge, Postgres for OLTP at the centre, and DuckDB for OLAP that needs to ride inside another process. The "embedded OLAP" niche is exactly where DuckDB lives — and DuckDB is currently the only credible occupant.
Detailed explanation — when in-process beats client-server
Topic explanation in three labelled bullets.
- Latency — there is no TCP round-trip, no protocol parse, no auth handshake. A query that takes 5 ms in Postgres over localhost takes 0.1 ms in DuckDB because the function call is direct.
-
Cold start — managed warehouses charge "warm-up" time (seconds to minutes). DuckDB starts in microseconds — the cost is the
connect()call, which is the cost of allocating a struct. -
Ops overhead — there is no service to monitor, no backup script to write (the database is a file —
cpit), no port to firewall. Disaster recovery isgit pullif your.duckdbfile is in the repo (don't do that for prod) oraws s3 cpif it's in object storage.
Question. Given an interactive notebook that runs many small analytical queries against a 20 GB Parquet directory, explain why in-process DuckDB is materially faster than a client-server warehouse. Quantify the three costs that disappear.
Input.
| operation | client-server warehouse | in-process DuckDB |
|---|---|---|
| connect | 50–500 ms (TLS + auth) | 50 µs |
| parse + plan | 5–20 ms | 0.5 ms |
| network shuffle | yes (data moves to driver) | no (data stays in process) |
| result materialise | serialise + deserialise | direct memory |
Code.
import duckdb, time
con = duckdb.connect()
t = time.perf_counter()
for _ in range(1000):
con.execute("SELECT 1").fetchone()
print(f"1000 trivial queries in {time.perf_counter() - t:.3f}s")
# Typical: 0.05–0.2s on a laptop — i.e. ~50–200 µs each.
Step-by-step explanation.
- Each loop iteration parses, plans, executes, and fetches a one-row result. The end-to-end cost is dominated by the parse/plan step; the engine itself is microseconds.
- The same loop against a remote Postgres or Snowflake spends 5–500 ms per iteration on the network round-trip plus auth-token validation.
- A 20 GB Parquet directory loaded once into a DuckDB connection is then queried many times with no re-load; the columnar pages stay in the OS page cache and are scanned at memory speed.
- The cumulative effect on notebook workflow is the difference between "interactive" (sub-second per cell) and "stutter" (multi-second per cell).
Output.
| metric | warehouse | DuckDB |
|---|---|---|
| connect | 100 ms | 50 µs |
| 1000 trivial queries | 60 s | 0.1 s |
| 20 GB Parquet scan | 30 s (warm) | 5 s (warm) |
Rule of thumb. Whenever the data fits on the machine and the workload is "many small analytical queries," in-process DuckDB wins on every axis except multi-user concurrency. Reserve the warehouse for the cases where the dataset doesn't fit or the concurrency demand is real.
Detailed explanation — MVCC and the single-writer model
Topic explanation in three labelled bullets.
- MVCC — multi-version concurrency control. Every transaction sees a consistent snapshot of the database. Readers do not block writers; writers do not block readers. The price: a single writer at a time within one process.
-
Single-writer — DuckDB serialises mutations through one logical writer per file. Multiple processes opening the same
.duckdbfile in write mode is not supported (the second one errors out). Multiple threads inside one process serialise behind a write latch. - Many-reader — multiple processes can open the same file read-only and scan it concurrently. This is the right pattern for "one ETL writer, many analyst readers" topologies.
Question. Two cursors operate on the same DuckDB connection — one runs a long-running aggregate, the other inserts new rows mid-aggregate. Show what each cursor sees and why.
Input.
| time | cursor A (reader) | cursor B (writer) |
|---|---|---|
| t0 | BEGIN; SELECT SUM(amount) FROM orders; |
— |
| t1 | … (scan in progress) | INSERT INTO orders VALUES (99, 'EU', 1000); |
| t2 | sum returned | COMMIT; |
Code.
import duckdb
con = duckdb.connect("warehouse.duckdb")
a = con.cursor() # reader
b = con.cursor() # writer
a.execute("BEGIN")
a.execute("SELECT SUM(amount) FROM orders")
# meanwhile…
b.execute("INSERT INTO orders VALUES (99, 'EU', 1000)")
b.execute("COMMIT")
print(a.fetchone()) # snapshot value — excludes the new row
a.execute("COMMIT")
Step-by-step explanation.
- Cursor A's
BEGINpins a snapshot at the current MVCC timestamp. Every subsequent read in this transaction sees only rows visible at that timestamp. - Cursor B's INSERT creates a new row visible only to transactions whose timestamp is greater than B's commit timestamp.
- Cursor A's SUM scans the orders table at A's snapshot — the new row is invisible. A reads the "old" sum, deterministically and without locks.
- Once A commits, a subsequent
BEGINin A would see B's row because A's new snapshot timestamp is now after B's commit.
Output.
| cursor | result |
|---|---|
| A (snapshot) | original sum, e.g. 5000 |
| B (committed) | new row (99, EU, 1000) inserted |
| A (new transaction after commit) | sum = 6000 |
Rule of thumb. Treat DuckDB as "one writer at a time per database file." If you need many concurrent writers, you've outgrown DuckDB's niche — either shard by file (one per pipeline step) or graduate to a warehouse. Concurrency at scale is the explicit anti-pattern.
SQL interview question on choosing in-process OLAP
A senior interviewer often opens with: "Your team needs to run nightly analytical reports on 50 GB of Parquet. The data lives in S3. The team is three engineers. Why would you pick DuckDB over Snowflake or Postgres-with-extensions for this workload?"
Solution Using a decision-table that pins DuckDB to laptop/CI/notebook/edge workloads
-- The "decision query" you'd run mentally: given a workload profile,
-- which engine fits? Treat this as the lookup table you keep in your head.
WITH workload AS (
SELECT
'nightly analytical batch' AS pattern,
'50 GB Parquet on S3' AS data,
'three engineers' AS team,
FALSE AS multi_writer,
FALSE AS sub_second_serving,
TRUE AS deterministic_window
)
SELECT
CASE
WHEN multi_writer OR sub_second_serving THEN 'warehouse'
WHEN NOT deterministic_window THEN 'warehouse'
WHEN data ILIKE '%parquet%' AND NOT multi_writer THEN 'DuckDB'
ELSE 'evaluate further'
END AS recommended_engine
FROM workload;
Step-by-step trace.
| trait | workload value | branch evaluated | recommendation |
|---|---|---|---|
| multi_writer | FALSE | first WHEN false | continue |
| sub_second_serving | FALSE | first WHEN false | continue |
| deterministic_window | TRUE | second WHEN false | continue |
| data contains parquet | TRUE | third WHEN matches | DuckDB |
The decision-table compiles every "do I need a warehouse?" reflex into a small set of FALSE checks. The moment a workload says "no multi-writer, no sub-second serving, Parquet input, batch window" — DuckDB is the right answer. If any of those flip, the answer becomes "warehouse."
Output:
| pattern | recommended_engine |
|---|---|
| nightly analytical batch | DuckDB |
Why this works — concept by concept:
-
Single-binary cost —
pip install duckdbcosts ~30 s in CI and zero dollars per month. A warehouse costs $50–$5000+ per month before the first query. - Columnar Parquet scan — Parquet is already column-oriented; DuckDB reads only the projected columns and skips row groups whose statistics fail the predicate. This is the heart of "fast scan on a laptop."
- MVCC fits the batch window — the ETL writer runs once nightly; the analyst readers all open read-only. No write contention, ever.
- Deterministic window — the report runs at 2 AM; nobody else is querying the file. The single-writer constraint is invisible because there is only ever one writer at a time.
- Cost — O(scanned columns × scanned row groups) per query. The data on S3 stays where it is; DuckDB pulls only the bytes the predicate needs. Compute cost is one CI runner minute.
SQL
Topic — database
Database design problems (SQL)
2. DuckDB engine — vectorized execution, Arrow integration, MVCC
Vectorized execution is the reason duckdb sql runs faster than row-store engines — Arrow zero-copy is the reason it integrates with pandas/Polars/PyArrow for free
The mental model in one line: DuckDB processes batches of 1024 rows at a time through tight, cache-friendly inner loops, and shares the same Apache Arrow memory layout with pandas/Polars/PyArrow so handoffs are zero-copy. Once you internalise "vectorised batches" and "Arrow zero-copy," the rest of the engine — planner, MVCC, spill-to-disk — slots in around them.
The seven engine traits in labelled bullets.
- Vectorized executor — every operator (scan, filter, project, hash-aggregate, hash-join, sort) processes 1024-row batches at a time. The inner loop is a tight CPU-cache-friendly sequence of column operations, not a row-at-a-time interpreter walk.
- Batch vector — the unit of work is a "vector" — a fixed-size column slice plus a validity bitmap. Operators consume vectors and emit vectors; pipeline-breakers (sort, hash-aggregate) materialise into hash tables but still feed downstream as vectors.
- Arrow zero-copy — DuckDB and Apache Arrow share the same column memory layout. A pandas DataFrame backed by Arrow is the same bytes as a DuckDB vector — no SerDe round-trip on handoff.
- Planner — cost-based with a rule-based rewrite pass. Performs predicate pushdown, projection pushdown, join reordering, and filter combination before handing the plan to the executor.
- MVCC snapshot — versioned row visibility per transaction. Readers see a consistent timestamp; writers append new versions. Visibility checks are cheap (timestamp compare) and never block.
- WAL — a write-ahead log stores changes before they hit the data files, so a crash mid-commit is recoverable. Configurable but on by default for persistent databases.
-
Spill-to-disk — when a hash table outgrows
memory_limit, DuckDB spills partitions to a temporary directory on disk. Out-of-memory crashes are rare; queries get slower but still finish.
Common interview probes on the engine.
- "What does 'vectorized' mean concretely?" — process 1024 rows per operator call instead of one. The inner loop is a tight column-wise sequence.
- "What does Arrow buy you in DuckDB?" — zero-copy interop with pandas/Polars/PyArrow. A DataFrame is queried as a table with no marshalling.
- "How does DuckDB do concurrency?" — MVCC, single-writer per process. Multiple readers see a consistent snapshot.
- "What happens when an aggregate is bigger than RAM?" — DuckDB spills partitioned hash tables to
temp_directoryand continues. The query slows down but doesn't OOM.
Detailed explanation — vectorized execution by hand
Topic explanation in three labelled bullets.
-
Row-at-a-time interpreter — the classic Volcano model: each operator's
next()returns one row, walks the entire operator tree per row. Cache-unfriendly, branch-heavy, slow on modern CPUs. -
Vectorized batch — each operator's
next()returns a 1024-row chunk per column. The inner loop is a tight for-loop over a contiguous array. CPU-cache-friendly, branch-predictor-friendly, SIMD-friendly. - The win — the per-row interpreter overhead is amortised over 1024 rows. Speedups of 10–100x are routine on aggregate-heavy queries.
Question. Sum a single column over a million-row table both row-at-a-time (Python loop) and vectorised (NumPy). Use the timing difference as the mental model for what DuckDB does in C++ at scale.
Input. (one million-row column)
Code.
import time, numpy as np
n = 1_000_000
vals = np.random.randint(0, 100, n)
# Row-at-a-time
t = time.perf_counter()
total_row = 0
for v in vals:
total_row += int(v)
print(f"row-at-a-time: {time.perf_counter() - t:.3f}s, sum={total_row}")
# Vectorised
t = time.perf_counter()
total_vec = int(vals.sum())
print(f"vectorised: {time.perf_counter() - t:.3f}s, sum={total_vec}")
Step-by-step explanation.
- The Python loop pays interpreter overhead per iteration. One million iterations × tens of nanoseconds each = tens of milliseconds for a trivial sum.
- NumPy's
.sum()runs a tight C loop over the contiguous int array. It processes the same million values in tens of microseconds — roughly 1000x faster. - DuckDB's executor in C++ uses the same trick: each operator processes contiguous column slices in tight inner loops. The 1024-row batch is the sweet spot for L1/L2 cache residency.
- The "vectorised batch" idea is engine-agnostic: it shows up in DuckDB, ClickHouse, Polars, MonetDB, and modern Parquet readers. Once you see it once, you see it everywhere.
Output.
| variant | wall time (typical) | speedup |
|---|---|---|
| row-at-a-time Python | 50–100 ms | 1× |
| vectorised NumPy | 0.5–1 ms | ~100× |
Rule of thumb. Whenever you can replace a for row in rows: interpreter loop with a batched array operation, you're recapitulating the vectorised-executor win. That intuition transfers directly to why DuckDB beats row-store engines on aggregates.
Detailed explanation — Arrow zero-copy between pandas/Polars and DuckDB
Topic explanation in three labelled bullets.
- One memory layout — Apache Arrow defines a columnar in-memory format. pandas (Arrow backend), Polars, PyArrow, and DuckDB all use it. The same bytes are valid in every library.
- Zero-copy — handing a DataFrame to DuckDB does not copy bytes. DuckDB reads the Arrow buffers directly. Returning a result as a DataFrame is also zero-copy when the engine emits Arrow output.
-
The DuckDB shortcut — any Python variable that exposes an Arrow interface can be queried as a table by name:
SELECT * FROM dfworks ifdfis a pandas, Polars, or PyArrow object in the current scope.
Question. Given a pandas DataFrame in a Python session, query it with SQL via DuckDB and return the result as a Polars DataFrame — all without any explicit conversion. Show the three lines.
Input.
import pandas as pd
df = pd.DataFrame({
"region": ["EU", "US", "EU", "APAC"],
"amount": [10, 20, 30, 40],
})
Code.
import duckdb, polars as pl
# DuckDB sees `df` by name; the query runs zero-copy on Arrow buffers.
result_arrow = duckdb.sql("SELECT region, SUM(amount) AS total FROM df GROUP BY region").arrow()
# Hand the Arrow result to Polars — also zero-copy.
result_pl = pl.from_arrow(result_arrow)
print(result_pl)
Step-by-step explanation.
-
duckdb.sql(...)(top-level, no explicit connection) uses a default in-memory connection and resolves table names against the calling frame's locals. -
dfis recognised as a pandas DataFrame; DuckDB reads its underlying Arrow buffers (or NumPy arrays, depending on backend) without copying. -
.arrow()returns the result as apyarrow.Table— again, shared with DuckDB's output buffers when possible. -
pl.from_arrow(result_arrow)constructs a Polars DataFrame around the same memory. - Total round-trip: pandas → DuckDB → Polars with zero deep copies of the underlying numeric columns.
Output.
| region | total |
|---|---|
| EU | 40 |
| US | 20 |
| APAC | 40 |
Rule of thumb. Treat any Arrow-backed object (pandas, Polars, PyArrow, Polars LazyFrame after .collect()) as a first-class DuckDB table. The dataframe and the SQL surface are interchangeable — pick whichever expresses the transform most clearly.
Detailed explanation — MVCC snapshot isolation in a single process
Topic explanation in three labelled bullets.
-
Snapshot per transaction —
BEGINpins a timestamp; every subsequent read in that transaction sees the database as of that timestamp. New commits afterBEGINare invisible until the next transaction. - No reader-writer blocking — readers and writers never wait on each other. The MVCC graph stores multiple versions of mutated rows, and visibility is a timestamp compare.
-
Single writer at a time — only one transaction at a time may mutate. A second
BEGIN ... INSERT ...from another thread waits behind the active write latch.
Question. Two threads share one DuckDB connection. Thread A starts a long aggregate; thread B inserts a row mid-aggregate. Demonstrate the snapshot-isolation contract — A sees the old data, B's row appears only in subsequent transactions.
Input. (a fresh orders table)
Code.
import duckdb, threading
con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE OR REPLACE TABLE orders(id INT, region VARCHAR, amount INT)")
con.execute("INSERT INTO orders VALUES (1,'EU',10), (2,'US',20), (3,'EU',30)")
snapshot_sum = None
def reader():
global snapshot_sum
c = con.cursor()
c.execute("BEGIN")
snapshot_sum = c.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
c.execute("COMMIT")
def writer():
c = con.cursor()
c.execute("INSERT INTO orders VALUES (4,'EU',100)")
t1 = threading.Thread(target=reader)
t2 = threading.Thread(target=writer)
t1.start(); t2.start(); t1.join(); t2.join()
print("snapshot sum (A):", snapshot_sum)
print("post-commit sum:", con.execute("SELECT SUM(amount) FROM orders").fetchone()[0])
Step-by-step explanation.
- The reader's
BEGINpins its snapshot timestamp. The SUM scan reads only the three rows visible at that timestamp. - The writer's INSERT proceeds in parallel, creating row 4 at a later timestamp. Because the reader's snapshot is earlier, row 4 is invisible to the reader's SUM.
- The reader returns 60 (
10 + 20 + 30) — the pre-insert sum — and commits. - The next query outside any transaction sees the latest committed state, including row 4: total now 160 (
10 + 20 + 30 + 100). - The order of thread scheduling does not affect the reader's correctness — it always sees a consistent snapshot, never a "torn" mid-insert state.
Output.
| query | result |
|---|---|
| reader snapshot SUM | 60 |
| post-commit SUM | 160 |
Rule of thumb. Rely on BEGIN ... COMMIT to pin a snapshot for any multi-statement report. The cost is zero — MVCC is free — and the contract guarantees a consistent view even while ETL writes are happening in another cursor.
Detailed explanation — spill-to-disk for out-of-memory aggregations
Topic explanation in three labelled bullets.
-
memory_limit —
SET memory_limit = '4GB'tells DuckDB the soft cap. When an operator would exceed it, DuckDB starts spilling instead of allocating more. -
temp_directory —
SET temp_directory = '/tmp/duckdb'tells DuckDB where to write spill files. Default is the OS temp dir; for big jobs on a small/tmp, override to a roomier volume. - The contract — the query slows down (disk I/O instead of RAM) but does not OOM. This is the difference between "the report finished overnight" and "kernel killed the process at 3 AM."
Question. Aggregate 100 GB of Parquet on a laptop with 16 GB of RAM. Show the two settings that turn an OOM crash into a slow-but-correct run.
Input. (Parquet directory events/ with one billion rows)
Code.
import duckdb
con = duckdb.connect()
# Two knobs make 100 GB on 16 GB RAM possible.
con.execute("SET memory_limit = '12GB'")
con.execute("SET temp_directory = '/Users/me/duckdb_tmp'")
con.execute("""
SELECT user_id, COUNT(*) AS events, SUM(amount) AS total
FROM read_parquet('events/*.parquet')
GROUP BY user_id
""").df()
Step-by-step explanation.
-
memory_limit = '12GB'reserves a 4 GB headroom for the OS and Python. The aggregate's hash table grows up to 12 GB in memory. - When the hash table approaches the limit, DuckDB partitions the in-memory table by hash and spills the largest partition to
/Users/me/duckdb_tmp. The remaining hash table fits in RAM again. - The scan continues; further partitions are spilled as needed. Each spilled partition becomes one disk-resident chunk.
- At the end, DuckDB merges the spilled partitions back into the final aggregate. Disk I/O is the bottleneck, not memory.
- The result is identical to an in-memory run; the only difference is wall time.
Output.
| user_id | events | total |
|---|---|---|
| 1 | 1200 | 480000 |
| 2 | 950 | 380000 |
| … | … | … |
Rule of thumb. For any DuckDB job that might exceed RAM, set memory_limit to ~75% of physical RAM and point temp_directory at a fast SSD with enough free space. The cost is wall time; the benefit is "the job finishes."
SQL interview question on why DuckDB beats row-store engines on OLAP
A senior interviewer might frame this as: "On the same laptop, the same GROUP BY region over 50 million rows runs in 30 seconds in SQLite and 0.6 seconds in DuckDB. Walk me through the engine differences that explain the 50x gap."
Solution Using a vectorized aggregate over a Parquet scan with a single-pass HashAggregate
-- The DuckDB plan: scan only `region` and `amount` columns, vectorised hash-aggregate, no row materialisation.
SELECT
region,
COUNT(*) AS orders,
SUM(amount) AS total,
AVG(amount)::DECIMAL(10, 2) AS avg_amount
FROM read_parquet('orders/*.parquet')
WHERE order_date >= DATE '2026-01-01'
GROUP BY region
ORDER BY total DESC;
Step-by-step trace.
| step | row-store engine (SQLite) | DuckDB (columnar + vectorised) |
|---|---|---|
| scan | reads full rows (every column) | reads only region, amount, order_date columns |
| filter | row-at-a-time predicate | vectorised predicate on order_date column slice |
| project | extracts columns row-by-row | already isolated; no extra work |
| aggregate | row-at-a-time hash insert | batch hash-insert of 1024 values per call |
| output | row tuples | column vectors |
DuckDB never reads the columns it does not need; it never interprets a row tuple; it never calls a Python-level function per row. Every step is a tight C++ loop over a contiguous column slice. The 50x speedup is the cumulative win of vectorisation, columnar projection, predicate pushdown into row groups, and zero per-row interpreter overhead.
Output:
| region | orders | total | avg_amount |
|---|---|---|---|
| EU | 1,200,000 | 48,000,000 | 40.00 |
| US | 900,000 | 36,000,000 | 40.00 |
| APAC | 400,000 | 16,000,000 | 40.00 |
Why this works — concept by concept:
- Columnar projection — Parquet stores each column in its own buffer. DuckDB reads only the columns the SELECT mentions, skipping the bytes of every unused column entirely.
-
Predicate pushdown — the
WHERE order_date >= ...filter is pushed down into Parquet row-group statistics. Row groups whosemax(order_date)is before 2026 are skipped without reading any data. - Vectorised hash-aggregate — the GROUP BY processes 1024 rows per batch. The hash table is updated with vectorised SIMD-friendly increments; AVG is computed as SUM/COUNT in the final emit.
- Single-pass plan — scan → filter → aggregate → order. No materialisation between operators; vectors stream through.
- Cost — O(scanned_row_groups × scanned_columns) for the scan; O(distinct_groups) for the hash table; O(distinct_groups × log(distinct_groups)) for the final ORDER BY. On 50 million rows with 3 regions, the order step is negligible — the work is in the scan.
SQL
Topic — aggregation
Aggregation problems (SQL)
3. Reading Parquet at scale — partition pruning, Hive partitioning, S3 / httpfs
duckdb parquet is the headline feature — scan terabytes of cloud Parquet without ever loading it, thanks to row-group statistics, Hive partitioning, and the httpfs extension
The mental model in one line: DuckDB's read_parquet() is the canonical scan function — it accepts a path, glob, S3 URI, or list, prunes partitions and row groups against your WHERE clause, and reads only the column buffers the SELECT projects. Once you say "Parquet-first," you stop running COPY ... FROM into a local table and start querying the cloud files in place.
The eight Parquet primitives in labelled bullets.
-
read_parquet —
SELECT * FROM read_parquet('path')is the canonical scan. Accepts a single file, a glob ('data/*.parquet'), an array (['a.parquet','b.parquet']), or an S3 URI. -
glob — wildcard expansion happens server-side (in the DuckDB process). The glob can include
**for recursive expansion. Useful for "every Parquet under this prefix." -
Hive partitioning — folder names like
year=2026/month=05/day=12/file.parquetare recognised when you passhive_partitioning=true. The columnsyear,month,daybecome first-class SELECTable columns without any per-file reading. - filter pushdown — DuckDB pushes WHERE-clause predicates into the Parquet reader. Predicates on partition columns prune folders; predicates on data columns prune row groups via statistics.
-
row-group statistics — every Parquet row group stores min/max/null count per column. If
max(order_date) < '2026-01-01', the row group is skipped entirely — no data pages read. -
httpfs — the extension that adds
s3://,https://, andgcs://URI support.INSTALL httpfs; LOAD httpfs;once; thereafter S3 paths work inread_parquet(). -
S3 secrets —
CREATE SECRET s3_default (TYPE S3, KEY_ID 'AK…', SECRET '…', REGION 'us-east-1')configures S3 auth once per session. Subsequent S3 paths use the secret automatically. -
projection pushdown — the SELECT list determines which columns Parquet pages are read.
SELECT a FROM read_parquet(...)reads only columna's pages, even from a 200-column file.
Common interview probes on Parquet at scale.
- "What is partition pruning?" — skipping folders/files whose partition keys fail the WHERE predicate, without reading them.
- "What is row-group pruning?" — skipping internal row groups whose min/max statistics fail the WHERE predicate.
- "What does
hive_partitioning=truebuy you?" — exposes folder-name key=value pairs as columns; partition pruning then works on those columns. - "How does DuckDB read S3 without downloading the file?" —
httpfsissues HTTP range requests to fetch only the byte ranges of the column chunks and row groups the query needs.
Detailed explanation — read_parquet over a glob
Topic explanation in three labelled bullets.
- Glob path — pass a wildcard string; DuckDB expands it and treats the result as a single virtual table. Files share a schema (or you opt-in to union by column name).
-
No load step — the data stays in the Parquet files. There is no
CREATE TABLEand noCOPY FROM. The first scan is also the only scan you need. - Lazy column reads — only the projected columns are read from disk. Even a 200-column file produces no extra I/O if your SELECT names three columns.
Question. A directory data/ contains daily Parquet files (data/2026-01-01.parquet, data/2026-01-02.parquet, …). Sum the amount column across every file in one SELECT, without copying the data into a DuckDB table.
Input. (directory of daily Parquet files, each with columns order_id, region, amount, order_date)
Code.
-- One SELECT scans every file in the glob.
SELECT
region,
COUNT(*) AS orders,
SUM(amount) AS total
FROM read_parquet('data/*.parquet')
GROUP BY region
ORDER BY total DESC;
Step-by-step explanation.
-
read_parquet('data/*.parquet')expands the glob into a list of file paths. - DuckDB opens each file's Parquet footer (a few KB at the end of each file) to read schema and row-group statistics.
- The SELECT references
region,amount— only those column chunks are read.order_idandorder_datedata pages are skipped entirely. - The hash-aggregate runs vectorised on the streamed batches.
- No table is created in the DuckDB database; the file remains the source of truth.
Output.
| region | orders | total |
|---|---|---|
| EU | 12,000 | 480,000 |
| US | 9,000 | 360,000 |
| APAC | 4,000 | 160,000 |
Rule of thumb. Default to FROM read_parquet('...') in every analytical query. Reserve CREATE TABLE AS SELECT ... for the few cases where the same scan would otherwise repeat dozens of times in the same session.
Detailed explanation — Hive partitioning and partition pruning
Topic explanation in three labelled bullets.
-
Hive folder convention — directory names like
year=2026/month=05/day=12/encode key=value pairs. The convention is the de facto standard for Parquet datasets (Spark, Athena, Snowflake all emit it). -
Opt-in — pass
hive_partitioning=truetoread_parquet()(or it's auto-detected from path structure). DuckDB then exposesyear,month,dayas virtual columns sourced from the folder names. -
Pruning — a WHERE predicate on the virtual columns prunes whole folders before any file is opened. Predicates like
WHERE year = 2026 AND month >= 5skip 95% of the data with zero I/O on the pruned files.
Question. Given a Hive-partitioned dataset under events/year=YYYY/month=MM/day=DD/, count events for May 2026 only. Show how the partition predicate prunes every folder outside that month.
Input. (directory events/year=2024/..., events/year=2025/..., events/year=2026/month=01/..., …, events/year=2026/month=05/day=12/file.parquet)
Code.
-- Hive partitioning auto-detected; predicate prunes folders.
SELECT
day,
COUNT(*) AS events
FROM read_parquet(
'events/year=*/month=*/day=*/*.parquet',
hive_partitioning = true
)
WHERE year = 2026 AND month = 5
GROUP BY day
ORDER BY day;
Step-by-step explanation.
- DuckDB expands the glob and identifies the Hive structure. The virtual columns
year,month,dayare sourced from folder names — no data file is opened to compute them. - The WHERE clause
year = 2026 AND month = 5is evaluated against the folder list. Only folders matching the predicate survive (31 folders for May 2026 days). - For each surviving folder, DuckDB opens the Parquet file and scans the
daycolumn (and an implicit*for COUNT). The other partitions never touch disk. - The hash-aggregate emits one row per day. The result is a 31-row table.
Output.
| day | events |
|---|---|
| 01 | 5400 |
| 02 | 6100 |
| … | … |
| 31 | 5800 |
Rule of thumb. Always structure Parquet datasets with Hive partitioning on the columns you most commonly filter on (date, region, tenant). A year=/month=/day= layout turns a billion-row scan into a few-thousand-row scan with zero per-row cost.
Detailed explanation — projection + filter pushdown into row groups
Topic explanation in three labelled bullets.
- Row group — a Parquet file is divided into row groups (typically 100 MB to 1 GB each). Each row group stores column chunks plus min/max/null-count statistics per column.
- Statistic-based pruning — DuckDB reads the footer, examines per-row-group statistics, and skips any row group whose stats fail the WHERE predicate. The pruned bytes are never fetched from disk or network.
- EXPLAIN ANALYZE — DuckDB's plan printer shows "Filters" and the number of row groups read vs. skipped. Use it to verify pushdown is happening.
Question. A single Parquet file has 10 row groups (1 GB each, 10 GB total). The WHERE predicate filters order_date >= '2026-05-01'. Only the last 3 row groups have any May data. Show the EXPLAIN plan that confirms 7 row groups were pruned.
Input. (single Parquet file orders.parquet, 10 row groups, dates evenly distributed)
Code.
EXPLAIN ANALYZE
SELECT region, SUM(amount) AS total
FROM read_parquet('orders.parquet')
WHERE order_date >= DATE '2026-05-01'
GROUP BY region;
Step-by-step explanation.
- The planner pushes the predicate
order_date >= '2026-05-01'into the Parquet reader. - The reader opens the file footer (a few KB), enumerates 10 row groups, and inspects each one's
order_datestatistics. - Seven row groups have
max(order_date) < '2026-05-01'— pruned. Three row groups havemax(order_date) >= '2026-05-01'— scanned. - For the three scanned row groups, only the
region,amount, andorder_datecolumn chunks are read (projection pushdown). Other columns' pages stay on disk. - The vectorised hash-aggregate runs over the surviving data.
Output (EXPLAIN ANALYZE excerpt).
| step | metric | value |
|---|---|---|
| TABLE_SCAN | row groups read | 3 |
| TABLE_SCAN | row groups skipped | 7 |
| TABLE_SCAN | bytes read | ~250 MB (3 row groups × 3 cols) |
| HASH_AGGREGATE | result rows | 3 (regions) |
Rule of thumb. Always check EXPLAIN ANALYZE for the row-groups-read count after a slow query. If pruning is not happening, the column you're filtering on is either not sorted/clustered within row groups (so every row group's min/max spans the predicate range) or the file was written without statistics. The fix is to re-write Parquet with a sort on the filter column.
Detailed explanation — S3 + httpfs + SECRET for cloud Parquet
Topic explanation in three labelled bullets.
-
httpfs extension — installed and loaded once per session:
INSTALL httpfs; LOAD httpfs;. Adds support fors3://,https://,gcs://, andazure://URIs. -
SECRET —
CREATE SECRET (TYPE S3, KEY_ID '...', SECRET '...', REGION '...')stores credentials in the session. Multiple secrets can be scoped by path prefix. - HTTP range requests — DuckDB fetches only the byte ranges it needs (Parquet footer + the projected column chunks of the surviving row groups). A 100 GB S3 file can be scanned by downloading only a few MB.
Question. Configure DuckDB to read Parquet from a private S3 bucket and run a scan over s3://my-bucket/events/year=2026/. Show the three setup lines plus one SELECT.
Input. (S3 bucket my-bucket with Hive-partitioned events; AWS access key + secret)
Code.
-- One-time per session
INSTALL httpfs;
LOAD httpfs;
-- Auth via SECRET (more secure than SET-based env vars)
CREATE SECRET s3_default (
TYPE S3,
KEY_ID '...',
SECRET '...',
REGION 'us-east-1'
);
-- Scan
SELECT region, COUNT(*) AS events
FROM read_parquet(
's3://my-bucket/events/year=2026/month=*/*.parquet',
hive_partitioning = true
)
WHERE month >= 5
GROUP BY region;
Step-by-step explanation.
-
INSTALL httpfsdownloads and caches the extension on first use;LOAD httpfsregisters its function tables in the session. -
CREATE SECRETstores S3 credentials in the session's secret store. Subsequent S3 URIs use the secret automatically. -
read_parquet('s3://...')issues aHEADthen a smallGETfor each file's footer to learn schemas and statistics. - The Hive partition predicate (
month >= 5) prunes folder list — only May/June/July/etc. folders survive. - For each surviving file, range
GETs pull the projected column chunks of surviving row groups. The total bytes downloaded are a tiny fraction of the bucket. - The vectorised aggregate streams the data through the hash table.
Output.
| region | events |
|---|---|
| EU | 1,200,000 |
| US | 900,000 |
| APAC | 400,000 |
Rule of thumb. Use CREATE SECRET over SET s3_access_key_id = '...' — it scopes nicely, doesn't leak into query logs, and supports multiple buckets with different roles. For production CI runs, prefer TYPE S3, PROVIDER 'credential_chain' to pick up the IAM role from the runner instead of hard-coding keys.
SQL interview question on partition pruning and projection pushdown
A senior interviewer often opens with: "Your team has a 10 TB Parquet dataset on S3, Hive-partitioned by date. The dashboard query filters one day and projects 5 of 50 columns. How do you reason about the bytes actually transferred from S3 to DuckDB?"
Solution Using filter + projection that prunes 90% of row groups
-- The dashboard SELECT
SELECT
region,
user_id,
COUNT(*) AS events,
SUM(amount) AS total
FROM read_parquet(
's3://my-bucket/events/year=*/month=*/day=*/*.parquet',
hive_partitioning = true
)
WHERE year = 2026 AND month = 5 AND day = 12
GROUP BY region, user_id;
Step-by-step trace.
| stage | what happens | bytes transferred |
|---|---|---|
| folder enumeration | Hive prefix listed; predicate year=2026 AND month=5 AND day=12 matches one folder |
~few KB (LIST API) |
| file footers | DuckDB reads Parquet footer of each surviving file | ~10 KB × N files |
| row-group stats | per-row-group stats checked against any further predicates | (already in footer) |
| column chunk reads | only region, user_id, amount chunks of surviving row groups |
small fraction of 10 TB |
| in-DuckDB aggregate | vectorised hash-group on streamed batches | (in memory) |
The vast majority of the 10 TB is never read. Hive partitioning prunes 99.97% of the dataset at the folder level (1 day out of ~1000). Projection pushdown reads 3 columns out of 50 — another 94% saving on the surviving data. The actual bytes pulled from S3 are typically < 1 GB for what looks like a 10 TB query.
Output:
| region | user_id | events | total |
|---|---|---|---|
| EU | 1001 | 12 | 480 |
| EU | 1002 | 9 | 360 |
| US | 2001 | 7 | 280 |
| … | … | … | … |
Why this works — concept by concept:
- Hive partitioning — folder-level keys cut the dataset to a single partition before any file is opened.
- Row-group statistics — per-row-group min/max prune within each surviving file; only row groups whose range overlaps the predicate are read.
- Projection pushdown — only the columns the SELECT names have their column chunks fetched. Other columns' pages stay on S3.
-
HTTP range requests —
httpfsissues precise byte-rangeGETs; you pay S3 for the bytes you actually read, not for the whole file. - Cost — O(surviving_row_groups × projected_columns × avg_chunk_size). For a typical dashboard query on a well-partitioned dataset, this is < 1% of the total dataset size.
SQL
Topic — filtering
Filtering problems (SQL)
4. Local ETL workflows — Python + dbt-duckdb + tests + Git
local etl with DuckDB is the new dev loop — one binary, one Python file, one dbt project, one pytest run, all in CI under two minutes
The mental model in one line: DuckDB collapses the dev/CI/prod-parity problem — write your transforms in duckdb python (or dbt-duckdb), run them against the same SQL on every machine, and let GitHub Actions execute the entire pipeline in seconds for free. Once you say "local ETL," the temptation to spin up a warehouse for prototyping evaporates.
The eight workflow primitives in labelled bullets.
-
duckdb python —
import duckdb; con = duckdb.connect(...). The same engine that powers the CLI is a first-class Python library. Result objects expose.fetchall(),.df(),.arrow(),.pl()(Polars). -
dbt-duckdb — the official adapter that lets dbt run against DuckDB.
profiles.ymlpoints to a.duckdbfile (or:memory:);dbt runmaterialises models as tables or views. - local etl — read CSV / JSON / Parquet → transform via SQL → write Parquet. The whole pipeline is one Python file or one dbt project, no scheduler required.
-
pytest — fixtures that spin up
duckdb.connect()in:memory:mode. Tests assert on result rows, schemas, or row counts. Tests run in milliseconds. -
Git — the project (Python files,
models/,tests/,profiles.yml,Makefile) is version-controlled. The.duckdbfile itself is not checked in for production; for tutorials/demos a small file is fine. - CI runner — GitHub Actions / GitLab CI / CircleCI execute the entire pipeline on a clean runner. DuckDB installs in seconds; the runtime is dominated by the actual transform.
-
Ephemeral DB —
:memory:DuckDB for each test ensures isolation. No leftover state between tests. -
profiles.yml — the dbt profile entry:
type: duckdb,path: 'dev.duckdb', plus extensions and threads. One file controls dev, staging, and prod targets.
Common interview probes on local ETL.
- "Why dbt-duckdb instead of dbt-snowflake for dev?" — same SQL, runs in seconds locally, no warehouse cost.
- "How do you make tests fast?" —
:memory:connections per test; no I/O; vectorised compute. - "How do you keep dev and prod in sync?" — same dbt project, different targets; tests run on both. SQL portability between DuckDB and Snowflake is high (CTEs, window functions, common types match).
- "What goes in version control?" — Python, SQL models, tests,
profiles.yml.example(without secrets),Makefile. The.duckdbfile goes in.gitignore.
Detailed explanation — a 30-line Python ETL with DuckDB
Topic explanation in three labelled bullets.
-
Read —
read_csv_auto(),read_json_auto(),read_parquet()are all built-in. No external libraries required for the I/O. -
Transform — write SQL. DuckDB's SQL surface is large: window functions, CTEs, lateral joins,
QUALIFY,PIVOT,UNPIVOT, regular expressions, JSON path. -
Write —
COPY ... TO 'out.parquet' (FORMAT PARQUET, COMPRESSION ZSTD). The output is partition-friendly, column-oriented, and analytics-ready.
Question. Build a tiny ETL: read a CSV of raw orders, compute one aggregate per region, write the result as a Parquet file. Show the entire pipeline in under 30 lines of Python.
Input. (orders.csv with columns order_id, region, amount)
Code.
# etl.py — 22 lines
import duckdb
con = duckdb.connect("warehouse.duckdb")
con.execute("""
CREATE OR REPLACE TABLE stg_orders AS
SELECT
CAST(order_id AS BIGINT) AS order_id,
UPPER(region) AS region,
CAST(amount AS DECIMAL(10,2)) AS amount
FROM read_csv_auto('orders.csv', header = true)
""")
con.execute("""
COPY (
SELECT
region,
COUNT(*) AS orders,
SUM(amount) AS total
FROM stg_orders
GROUP BY region
ORDER BY total DESC
) TO 'mart_region_totals.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
""")
print("ETL complete: mart_region_totals.parquet written.")
Step-by-step explanation.
-
read_csv_auto()sniffs the CSV header, infers types, and streams rows into the SQL engine. - The staging CREATE TABLE applies a tiny normalisation: cast types, uppercase region. The result lives in the persistent
warehouse.duckdbfile. - The aggregate SELECT runs vectorised over the staging table.
-
COPY (...) TO 'out.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)streams the result directly to a Parquet file with ZSTD compression. No intermediate table required. - The script is idempotent: re-running it drops and recreates
stg_ordersand overwrites the Parquet output.
Output.
| file | size | content |
|---|---|---|
warehouse.duckdb |
~few MB | persistent staging table |
mart_region_totals.parquet |
~few KB | aggregated mart (one row per region) |
Rule of thumb. Lean on read_csv_auto for ad-hoc CSV; the auto schema inference is rock-solid for most files. For production, declare explicit types via the columns = {...} argument so a malformed input fails fast instead of silently producing the wrong type.
Detailed explanation — dbt-duckdb profiles.yml + a staging model
Topic explanation in three labelled bullets.
-
Adapter —
pip install dbt-duckdbinstalls the official adapter.dbt-coreand the adapter together are <50 MB. -
Profile target —
~/.dbt/profiles.ymldeclares the target connection:type: duckdb,path: 'dev.duckdb',threads: 4. The same project can havedev,ci, andprodtargets. -
Model file —
models/staging/stg_orders.sqlis plain dbt: a{{ config(materialized='table') }}line plus a SELECT. dbt-duckdb materialises it as a real DuckDB table.
Question. Set up a minimal dbt-duckdb project with one staging model. Show the profiles.yml, the model file, and the dbt run command.
Input.
project/
├── dbt_project.yml
├── profiles.yml # checked-in profiles.yml.example; real one in ~/.dbt/
├── models/
│ └── staging/
│ └── stg_orders.sql
└── seeds/
└── orders.csv
Code.
# ~/.dbt/profiles.yml
demo:
target: dev
outputs:
dev:
type: duckdb
path: 'dev.duckdb'
threads: 4
ci:
type: duckdb
path: ':memory:'
threads: 2
-- models/staging/stg_orders.sql
{{ config(materialized = 'table') }}
SELECT
CAST(order_id AS BIGINT) AS order_id,
UPPER(region) AS region,
CAST(amount AS DECIMAL(10, 2)) AS amount,
CAST(order_date AS DATE) AS order_date
FROM {{ ref('orders') }} -- the seed file
# from the project directory
dbt deps # install any packages
dbt seed # load orders.csv into the warehouse
dbt run # materialise stg_orders as a table
dbt test # run any data tests
Step-by-step explanation.
- The
devtarget writes todev.duckdbon disk; thecitarget uses:memory:so the runner has zero state at end of job. -
dbt seedreadsseeds/orders.csvand creates aorderstable in the database.dbt runthen references it viaref('orders'). -
{{ config(materialized='table') }}instructs dbt-duckdb toCREATE OR REPLACE TABLE stg_orders AS SELECT …. - The SQL is plain DuckDB dialect; no Snowflake-specific functions, no warehouse macros. The same SQL runs against any DuckDB target.
-
dbt testexecutes anytests/*.sqlfiles plus generic schema tests defined inschema.yml.
Output.
| step | result |
|---|---|
dbt seed |
orders table loaded |
dbt run |
stg_orders table materialised |
dbt test |
all tests pass (or fail with row counts) |
Rule of thumb. Keep dev/CI/prod profiles in a single profiles.yml keyed by target. Never check in real secrets; commit a profiles.yml.example to the repo and point developers at ~/.dbt/profiles.yml. For CI, write the profile from environment variables at job start.
Detailed explanation — pytest fixture that spins up an in-memory DuckDB
Topic explanation in three labelled bullets.
-
In-memory per test —
duckdb.connect()with no path returns a fresh:memory:database. Each fixture creates a new connection so tests are isolated. - Seed in fixture — the fixture loads CSV / Parquet / inline VALUES into staging tables. The test then executes transforms and asserts on results.
- Speed — vectorised compute over small in-memory data is microsecond-fast. A 1000-test suite runs in seconds.
Question. Write a pytest fixture that yields a DuckDB connection seeded with a small orders table, then a test that asserts the regional aggregate is correct.
Input. (no external file — data inlined)
Code.
# tests/conftest.py
import duckdb, pytest
@pytest.fixture
def con():
c = duckdb.connect() # :memory:
c.execute("""
CREATE TABLE orders (order_id INT, region VARCHAR, amount INT);
INSERT INTO orders VALUES
(1, 'EU', 10),
(2, 'US', 20),
(3, 'EU', 30);
""")
yield c
c.close()
# tests/test_regional_total.py
def test_region_eu_total(con):
result = con.execute("""
SELECT SUM(amount) FROM orders WHERE region = 'EU'
""").fetchone()[0]
assert result == 40
def test_region_count(con):
result = con.execute("""
SELECT region, COUNT(*) FROM orders GROUP BY region ORDER BY region
""").fetchall()
assert result == [('EU', 2), ('US', 1)]
Step-by-step explanation.
- The
confixture creates a fresh:memory:database per test (pytest's default fixture scope isfunction). - The fixture seeds a tiny
orderstable with three rows. This is the test's "world" — no external files, no shared state. - Each test executes a SQL query and asserts on the result.
- After the test,
c.close()releases the memory. The next test gets a clean DB. - The whole suite runs in milliseconds because no I/O is involved.
Output.
| test | result |
|---|---|
test_region_eu_total |
PASSED |
test_region_count |
PASSED |
Rule of thumb. Always seed inline VALUES for unit tests — the source of truth is in the test file, not in a fixture CSV. Save file-based seeds for integration tests that exercise the CSV reader itself. The 100ms saved per test compounds across hundreds of tests.
Detailed explanation — running the entire pipeline in GitHub Actions
Topic explanation in three labelled bullets.
- Single runner — one Ubuntu runner installs Python, pip-installs DuckDB + dbt-duckdb + pytest, runs the pipeline, and exits. Total runtime: typically 1–3 minutes for a small project.
- No services — the runner has no managed warehouse, no auth secrets to manage, no network egress to the cloud. Everything happens on the runner.
- Cache the wheel — GitHub Actions caches the pip cache; cold runs install in 30 seconds, warm runs in 5 seconds.
Question. Write a .github/workflows/ci.yml that installs DuckDB, runs dbt build, then runs pytest. Show the entire workflow in under 30 lines.
Input. (a repo with dbt_project.yml, models/, tests/, requirements.txt)
Code.
name: CI
on:
push: { branches: [main] }
pull_request:
jobs:
build-and-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
cache: pip
- name: Install deps
run: pip install -r requirements.txt # duckdb, dbt-duckdb, pytest
- name: Write CI profile
run: |
mkdir -p ~/.dbt
cat > ~/.dbt/profiles.yml <<'EOF'
demo:
target: ci
outputs:
ci:
type: duckdb
path: ':memory:'
threads: 2
EOF
- name: dbt build (seed, run, test)
run: dbt build --target ci
- name: pytest
run: pytest -q
Step-by-step explanation.
- The runner checks out the repo and sets up Python 3.12 with a cached pip cache.
-
pip install -r requirements.txtinstalls DuckDB, dbt-duckdb, and pytest. The wheel comes from cache on warm runs. - The "Write CI profile" step generates
~/.dbt/profiles.ymlat the runner level using a heredoc — the profile points to:memory:so no file lingers. -
dbt buildrunsseed,run,testin one command. Models are materialised in the in-memory DB; tests verify them. -
pytest -qruns Python-level tests against in-memory DuckDB connections. - Total wall time on a clean runner: 1–2 minutes for typical projects. There's no warehouse credit consumed.
Output.
| step | typical time |
|---|---|
| checkout + setup-python | 5 s |
| pip install | 5–30 s |
| dbt build | 10–60 s |
| pytest | 5–30 s |
| total | ~1–2 min |
Rule of thumb. Always use :memory: DuckDB in CI — the runner has zero state at start, zero state at end, and the in-memory engine is the fastest path. Reserve a persistent .duckdb file for integration tests that specifically need to exercise the on-disk format.
Python interview question on building a reproducible local ETL with DuckDB
A senior interviewer might frame this as: "Your team wants a single command (make ci) that runs the whole pipeline end-to-end against the same SQL we deploy to production. How do you structure the project so the laptop, CI runner, and prod warehouse all execute the same models?"
Solution Using duckdb.connect() + dbt run + pytest, all driven from a single Makefile
# Makefile (excerpt)
#
# make seed — load seeds into dev DB
# make run — materialise models
# make test — run dbt tests + pytest
# make ci — run the entire pipeline end-to-end
.PHONY: seed run test ci clean
DBT_TARGET ?= dev
seed:
dbt seed --target $(DBT_TARGET)
run:
dbt run --target $(DBT_TARGET)
test:
dbt test --target $(DBT_TARGET)
pytest -q
ci: clean
$(MAKE) seed DBT_TARGET=ci
$(MAKE) run DBT_TARGET=ci
$(MAKE) test DBT_TARGET=ci
clean:
rm -f dev.duckdb dev.duckdb.wal
Step-by-step trace.
| command | DBT target | DuckDB file | duration | result |
|---|---|---|---|---|
make seed |
dev | dev.duckdb |
~1 s | seeds loaded |
make run |
dev | dev.duckdb |
~2 s | models materialised |
make test |
dev |
dev.duckdb + pytest in-memory |
~3 s | tests pass |
make ci |
ci | :memory: |
~5 s | full pipeline pass |
The Makefile defines the same four verbs (seed, run, test, clean) and parameterises by DBT_TARGET. Local dev uses dev target (persistent dev.duckdb); CI uses ci target (:memory:). Prod (when wired) uses a prod target pointing at a real warehouse — same Make verbs, same dbt commands, same SQL.
Output:
| command | exit code |
|---|---|
make ci |
0 (all green) |
Why this works — concept by concept:
-
One project, three targets — the
dev/ci/prodsplit lives inprofiles.yml. The Makefile and the dbt models are unchanged; only the connection is swapped. - Same SQL everywhere — the dbt models use ANSI-friendly SQL that runs on DuckDB and (with minimal macros) on Snowflake/Postgres. No "DuckDB-only" patterns leak into the model.
-
Reproducibility —
make clean && make ciis the deterministic, hermetic run. Any failure is the model's fault, not the environment's. -
Fast feedback loop —
make testagainstdev.duckdbfinishes in seconds. The "edit → save → make test" cycle is faster than a warehouse query takes to plan. - Cost — zero per CI run. The runner is the bottleneck, not a managed compute credit.
Python
Topic — file IO
File I/O problems (Python)
5. Where DuckDB fits (and where it doesn't) — laptops, CI, edge, notebook
duckdb analytics shines in four quadrants — laptop, CI, edge, notebook — and the anti-patterns are equally important to name out loud
The mental model in one line: DuckDB owns the in-process OLAP niche: single-machine workloads where the data fits (or streams through) one process and the writer count is at most one. Anywhere you need multi-writer concurrency, sub-second multi-user serving, or a long-running shared OLAP service, you've left DuckDB's lane. Once you can name the four right-fit quadrants and the three anti-patterns, the architecture decision becomes mechanical.
The four right-fit quadrants in labelled bullets.
- Laptop analytics — engineers and analysts running ad-hoc SQL against tens of GBs of local or S3 Parquet. NVMe + 32 GB RAM + DuckDB outperforms many small warehouses on cold scans.
- CI runner — dbt projects and pytest suites that need to run on every PR. DuckDB installs in seconds; the entire pipeline runs in 1–2 minutes; zero warehouse cost.
- Edge / embedded — a single-machine deployment (an IoT device, a Lambda function, a Cloudflare worker, a desktop app) needs SQL over local Parquet. DuckDB ships as one binary with no daemon.
- Notebook — Jupyter / Marimo / Hex / Polars-LazyFrame workflows. The Arrow zero-copy bridge means pandas/Polars and DuckDB share buffers; the notebook can flip between SQL and DataFrame syntax for free.
The three anti-patterns in labelled bullets.
- Long-running OLAP cluster — if you need a shared, always-on, multi-node compute that 50 analysts hit concurrently, DuckDB is the wrong tool. Use Snowflake/BigQuery/Databricks/ClickHouse — engines designed for that topology.
-
Concurrent writes — multiple processes writing to the same
.duckdbfile is not supported. Multiple threads writing within one process serialise behind a write latch. If your pipeline genuinely needs N concurrent writers, you've outgrown DuckDB's niche. - Multi-user serving — sub-second OLAP responses for hundreds of concurrent users (dashboards, real-time analytics for an app) is not DuckDB's sweet spot. Reach for ClickHouse, Druid, Pinot, or a CDN-cached materialised view.
Common interview probes on fit.
- "When would you replace DuckDB with Snowflake?" — when the workload requires sub-second multi-user serving, when the data > 1 TB and needs always-on multi-node compute, or when concurrent writes are part of the design.
- "Can DuckDB serve a dashboard?" — yes for one user (e.g. Streamlit), no for 100 concurrent users. Put a cache or a real serving engine in front for scale.
- "Is DuckDB production-ready?" — yes, since 1.0 (2024). It runs nightly ETL, replaces SQLite-as-OLAP, drives CI, and powers many Lambda-based analytics jobs in production.
Detailed explanation — laptop analytics quadrant
Topic explanation in three labelled bullets.
- NVMe is the secret weapon — modern laptops have 3–7 GB/s NVMe storage. DuckDB's columnar scans saturate the bus; cold scans of 50 GB Parquet complete in a minute or two.
- One developer, many queries — the single-writer constraint is invisible because there is only one user. Many parallel reads from the OS page cache stay fast.
-
Local-first — data lives in
./data/or in S3 via httpfs. Either way, the engine runs in the developer's terminal — no VPN, no console login, no auth dance.
Question. A developer wants to explore a 50 GB Parquet dataset interactively on a 32 GB MacBook. Show how DuckDB makes this practical and why a small warehouse would feel slower.
Input. (50 GB Parquet directory events/)
Code.
import duckdb
con = duckdb.connect()
con.execute("SET memory_limit = '24GB'")
# First scan — cold from NVMe
df = con.execute("""
SELECT region, COUNT(*) AS events, SUM(amount) AS total
FROM read_parquet('events/*.parquet')
GROUP BY region
""").df()
# Subsequent scans — warm from OS page cache
df2 = con.execute("""
SELECT user_id, AVG(amount)
FROM read_parquet('events/*.parquet')
WHERE region = 'EU'
GROUP BY user_id
""").df()
Step-by-step explanation.
- The first SELECT reads all
events/*.parquetfooters (a few MB), prunes nothing (no predicate), and streams theregionandamountcolumns through a vectorised aggregate. - NVMe delivers ~5 GB/s; the scanned column bytes are ~20 GB. Wall time: ~5–10 seconds.
- The OS caches the recently-read pages in RAM. The next query (
WHERE region = 'EU') reuses cached pages and scans only the columns it projects. - Warm-cache wall time: under one second for most subsequent queries.
- A small Snowflake warehouse would pay the cluster-warm-up cost (seconds) and the data-shuffle cost (more seconds), often resulting in a slower interactive experience for this data size.
Output.
| query | wall time (typical) |
|---|---|
| first scan (cold NVMe) | 5–10 s |
| second scan (warm cache) | < 1 s |
Rule of thumb. For ad-hoc exploration up to a few hundred GB on a single machine, DuckDB on a laptop is the fastest interactive experience available — and the cheapest.
Detailed explanation — CI quadrant
Topic explanation in three labelled bullets.
-
Cold install —
pip install duckdb dbt-duckdbon a clean runner is ~30 seconds (5 seconds with cache). Compare with a warehouse provision, which is "minutes." -
Full pipeline per PR — every PR runs
seed → run → testagainst the same SQL it will ship. There's no "tested locally, broke in CI" gap. - Cost — zero managed-compute cost. The runner is the only billable resource, and it's usually under 2 minutes.
Question. A team runs dbt against Snowflake in prod but wants every PR to exercise the full pipeline in CI without a Snowflake credit. Show the dbt-duckdb pattern that achieves that.
Input. (dbt project with Snowflake-targeted SQL — ANSI-compatible)
Code.
# profiles.yml (CI section)
demo:
target: ci
outputs:
ci:
type: duckdb
path: ':memory:'
threads: 2
extensions: [httpfs, parquet]
# CI step
dbt build --target ci
Step-by-step explanation.
- The CI target points at an in-memory DuckDB connection. No external warehouse is touched.
-
dbt buildrunsseed,run,testin order. The pipeline executes against in-memory tables. - Tests fail fast on schema/data violations. The runner exits non-zero; the PR is blocked until fixed.
- The same SQL ships to Snowflake in prod via the
prodtarget. The SQL is ANSI-compatible (CTEs, window functions, standard types) so portability is high. - The total CI runtime is ~1–2 minutes; the cost is zero managed compute.
Output.
| run | dbt build duration | warehouse credits |
|---|---|---|
| CI (DuckDB :memory:) | ~30 s | 0 |
| Prod (Snowflake X-Small) | ~20–40 s | metered |
Rule of thumb. Use dbt-duckdb in CI for any project that runs against Snowflake/BigQuery/Postgres in prod. Keep the SQL ANSI-compatible by avoiding warehouse-specific functions (or wrapping them in dbt macros that swap by target).
Detailed explanation — notebook/Polars quadrant
Topic explanation in three labelled bullets.
-
Arrow bridge — pandas (with Arrow backend), Polars, and DuckDB share columnar memory. A
dfin Python is a table in DuckDB and vice versa; no SerDe. - Mix SQL and DataFrame — express set-based transforms (joins, group-bys) as SQL; express row-wise transforms (string ops, complex expressions) as DataFrame methods. Pick the right tool per step.
-
Polars LazyFrame interop —
pl.scan_parquet(...).collect()plus DuckDB SQL gives you a hybrid query plan. Polars optimises its part; DuckDB optimises its part.
Question. In a notebook, read 10 GB of Parquet with Polars, hand it to DuckDB for a SQL aggregate, return as a pandas DataFrame for plotting. Show the three lines.
Input. (Polars LazyFrame over events/*.parquet)
Code.
import polars as pl
import duckdb
# Polars handles the scan / filter
lf = pl.scan_parquet("events/*.parquet").filter(pl.col("region") == "EU")
# DuckDB SQL — Polars frame is recognised by name
df = duckdb.sql("""
SELECT user_id, COUNT(*) AS events, SUM(amount) AS total
FROM lf
GROUP BY user_id
ORDER BY total DESC
LIMIT 10
""").df()
print(df)
Step-by-step explanation.
-
pl.scan_parquet(...)is a lazy plan in Polars; nothing is read yet. - The
.filter(...)adds a predicate to the Polars plan. -
duckdb.sql("... FROM lf ...")recogniseslfas a Polars LazyFrame in the local scope. DuckDB collects it via Arrow (zero-copy where possible) and runs the SQL. -
.df()returns the result as a pandas DataFrame, again via Arrow. - No explicit conversion was written; the Arrow layer handled the handoffs.
Output.
| user_id | events | total |
|---|---|---|
| 1001 | 1200 | 480000 |
| 1002 | 950 | 380000 |
| … | … | … |
Rule of thumb. Use Polars for fluent DataFrame transforms (filter, with_columns, group_by) and DuckDB for set-based SQL (multi-table joins, window functions, CTEs). The Arrow zero-copy bridge means picking the right tool per step has no perf penalty.
Detailed explanation — anti-patterns
Topic explanation in three labelled bullets.
- Long-running OLAP cluster — DuckDB is not designed to run as an always-on service that hundreds of clients connect to. It has no scheduler, no resource governor, no per-user accounting.
-
Concurrent writes — two processes opening the same
.duckdbfile in write mode fails. Two threads in one process serialise behind a write latch. - Multi-user serving — sub-second OLAP for hundreds of concurrent dashboards needs caching, sharding, or a serving-grade engine.
Question. Three real-world scenarios — pick the right tool for each and explain why DuckDB is wrong for two of them.
Input.
| scenario | requirements |
|---|---|
| A | nightly batch report on 100 GB Parquet, one engineer |
| B | real-time dashboard for 1000 concurrent users on 1 TB of clickstream |
| C | central data warehouse for 50 analysts running concurrent ad-hoc SQL |
Code.
A → DuckDB (single-writer batch, no concurrency needed)
B → ClickHouse / Druid / Pinot (multi-user low-latency serving)
C → Snowflake / BigQuery / Databricks (always-on shared compute, many concurrent users)
Step-by-step explanation.
- Scenario A is a textbook fit: one writer (the nightly ETL), one or two readers (the report viewer), batch window. DuckDB is the cheapest and fastest answer.
- Scenario B violates "multi-user serving." 1000 concurrent users hitting one DuckDB process would starve on the single-writer latch the moment any background refresh kicks in. Use a serving-grade OLAP engine.
- Scenario C violates "always-on shared compute." 50 analysts need a managed warehouse with resource governance, per-user accounting, and concurrent compute scaling.
- Naming the constraint out loud (writer count, concurrency, serving SLO) is the senior signal — interviewers don't want a religious answer, they want the constraint check.
Output.
| scenario | recommended engine | reason |
|---|---|---|
| A | DuckDB | single writer, batch, fits in one process |
| B | ClickHouse / Druid / Pinot | sub-second OLAP for many users |
| C | Snowflake / BigQuery / Databricks | shared always-on compute for many analysts |
Rule of thumb. Memorise the three anti-patterns — long-running OLAP cluster, concurrent writes, multi-user serving — and check them first whenever DuckDB comes up. If none triggers, DuckDB is almost certainly the right answer; if any triggers, name the alternative without ceremony.
SQL interview question on picking DuckDB vs Snowflake/BigQuery
A senior interviewer often opens with: "Your team has a dbt project that runs nightly on Snowflake. The cost is $2k/month, almost all of which is the nightly build. The data is < 500 GB and changes once a day. Should you migrate to DuckDB? Walk me through the analysis."
Solution Using a four-quadrant deployment matrix
-- The decision query, expressed as a tiny analytics SELECT.
WITH workload AS (
SELECT
'500 GB nightly batch' AS pattern,
FALSE AS multi_writer_concurrent,
FALSE AS sub_second_serving,
FALSE AS many_concurrent_users,
'NVMe + 64GB RAM CI node' AS available_compute,
2000 AS current_monthly_cost_usd
)
SELECT
pattern,
CASE
WHEN multi_writer_concurrent THEN 'stay on Snowflake'
WHEN sub_second_serving THEN 'stay on Snowflake (or switch to a serving OLAP)'
WHEN many_concurrent_users THEN 'stay on Snowflake'
ELSE 'migrate to DuckDB on a CI node'
END AS recommendation,
CASE
WHEN multi_writer_concurrent
OR sub_second_serving
OR many_concurrent_users THEN current_monthly_cost_usd
ELSE 50 -- CI runner monthly cost estimate
END AS projected_monthly_cost_usd
FROM workload;
Step-by-step trace.
| trait | value | branch evaluated | recommendation |
|---|---|---|---|
| multi_writer_concurrent | FALSE | first WHEN false | continue |
| sub_second_serving | FALSE | second WHEN false | continue |
| many_concurrent_users | FALSE | third WHEN false | continue |
| (catch-all ELSE) | matched | → migrate to DuckDB | migrate |
The decision query mirrors the mental check. None of the three "stay on the warehouse" anti-patterns trigger, so the cost-effective answer is a DuckDB pipeline on a CI node. The projected cost drops from $2000/month to ~$50/month — a 40x reduction — and the build time often drops too because there's no warehouse-warm-up step.
Output:
| pattern | recommendation | projected_monthly_cost_usd |
|---|---|---|
| 500 GB nightly batch | migrate to DuckDB on a CI node | 50 |
Why this works — concept by concept:
- Anti-pattern check — the three triggers (multi-writer, sub-second serving, many concurrent users) cover 95% of "must stay on a warehouse" reasons. If none trip, DuckDB is the cheaper and often faster choice.
- Single-machine compute — 500 GB fits comfortably on a modern CI node with NVMe and 64 GB RAM. The vectorised scan keeps the workload CPU-bound, not memory-bound.
- Cost arithmetic — Snowflake credits are charged per-second of warehouse time; a $2000/month nightly job is ~80 hours of XS warehouse credits. A 1-hour nightly CI runner is closer to $50/month all-in.
- SQL portability — dbt models written for Snowflake are typically 90%+ portable to DuckDB. The remaining 10% is wrapped in dbt macros that swap by target.
- Cost — O(scanned bytes) per query. With Parquet columnar scans and partition pruning, the actual bytes per job are tiny relative to the dataset size, so the runner finishes inside the budget.
SQL
Topic — optimization
Optimization problems (SQL)
Cheat sheet — DuckDB recipes
-
Install in any language. Python:
pip install duckdb. Node:npm install duckdb. CLI:brew install duckdb. R:install.packages('duckdb'). Every binding wraps the same C++ engine. -
Connect.
con = duckdb.connect()for:memory:;con = duckdb.connect('warehouse.duckdb')for a persistent file. There is no host/port/user/password. -
Run a query.
con.execute('SELECT 1').fetchone()returns a tuple;.fetchall()returns a list;.df()returns pandas;.arrow()returns a PyArrow Table;.pl()returns Polars. -
Read Parquet in place.
SELECT * FROM read_parquet('data/*.parquet'). Addhive_partitioning = truefor Hive-style folders. -
Read CSV with schema inference.
SELECT * FROM read_csv_auto('orders.csv', header = true). Override types withcolumns = {'order_id': 'BIGINT', ...}. -
Write Parquet.
COPY (SELECT ...) TO 'out.parquet' (FORMAT PARQUET, COMPRESSION ZSTD). AddPARTITION_BY (region)for partitioned output. -
Install extensions.
INSTALL httpfs; LOAD httpfs;once per session.INSTALL aws;for IAM role auth. -
S3 credentials.
CREATE SECRET s3_default (TYPE S3, KEY_ID '...', SECRET '...', REGION '...'). AvoidSET s3_*env-style settings for production. -
Query a DataFrame. Any pandas/Polars/PyArrow variable in scope is queryable by name:
SELECT * FROM df. -
Spill control.
SET memory_limit = '12GB'; SET temp_directory = '/tmp/duckdb';lets big aggregates spill instead of OOMing. - EXPLAIN ANALYZE. Verifies projection/filter pushdown by reporting row groups read vs. skipped per scan.
-
dbt target.
type: duckdb,path: 'dev.duckdb',threads: 4. Addextensions: [httpfs, parquet]to load on startup. -
pytest fixture. Yield
duckdb.connect()per test; seed inline VALUES; assert on.fetchall(). Sub-millisecond per test. -
CI workflow. Cache pip, write
~/.dbt/profiles.ymlwithpath: ':memory:',dbt build --target ci, thenpytest -q. ~1–2 minutes total. -
MVCC contract. Single writer per database file, many concurrent readers.
BEGINpins a snapshot; readers never block writers, writers never block readers. -
Concurrency anti-pattern. Two processes writing the same
.duckdbfile → the second one errors. Shard by file (one per pipeline) or graduate to a server.
Frequently asked questions
Is DuckDB production-ready?
Yes. DuckDB 1.0 shipped in 2024 with explicit on-disk format stability guarantees, and it now powers production ETL at thousands of companies — laptop analytics, CI builds, edge analytics, Lambda jobs, dbt-driven nightly batches. The right deployment shape is in-process: one writer per file, many readers, batch or interactive workload. Avoid it for multi-writer, multi-user serving, or always-on cluster OLAP. For laptop, CI, edge, and notebook workloads, it is the de facto standard.
How does DuckDB compare to SQLite and Postgres?
DuckDB is to OLAP what SQLite is to OLTP — embedded, in-process, single binary — but the storage layout is columnar instead of row-oriented and the executor is vectorised instead of row-at-a-time. SQLite excels at point lookups and small transactional workloads; Postgres is the full client-server OLTP database with strong OLAP extensions; DuckDB is the OLAP engine that runs inside your process. The headline difference: a GROUP BY region over 50 million rows runs in seconds on DuckDB and minutes on SQLite on the same hardware, because DuckDB reads only the projected columns and processes 1024-row batches per operator call.
Can DuckDB read Parquet directly from S3?
Yes. INSTALL httpfs; LOAD httpfs; once per session adds s3:// URI support to read_parquet(). Configure credentials via CREATE SECRET (TYPE S3, KEY_ID '...', SECRET '...', REGION '...') (or via the IAM credential chain on a CI runner). DuckDB issues HTTP range requests to fetch only the Parquet footer plus the column chunks of the surviving row groups, so a "10 TB scan" can pull as little as a few MB across the wire when partition pruning, row-group statistics, and projection pushdown all fire.
Does DuckDB support concurrent writers?
No — only one writer at a time per database file. Two processes opening the same .duckdb file in write mode will fail; two threads within one process serialise behind a write latch. Readers, however, are unconstrained: any number of read-only connections can scan the same file concurrently. The right pattern for a concurrent pipeline is shard-by-file (one .duckdb per pipeline step or per partition) or to graduate to a server-based engine. If your design genuinely needs N concurrent writers, you've left DuckDB's niche by definition.
How do I use DuckDB with dbt?
Install the official adapter (pip install dbt-duckdb) and add a type: duckdb block to profiles.yml with a path: pointing at a .duckdb file (or :memory: for CI). The dbt models themselves are plain SQL; DuckDB supports CTEs, window functions, QUALIFY, PIVOT, UNPIVOT, regular expressions, and most modern dialect features. dbt build --target ci runs seed + run + test in ~1–2 minutes on a CI runner with no warehouse cost. Keep the SQL ANSI-compatible to maintain portability with a Snowflake/BigQuery/Postgres prod target.
When should I NOT use DuckDB?
Three anti-patterns: a long-running multi-node OLAP cluster shared by 50+ analysts (use Snowflake, BigQuery, Databricks); sub-second OLAP serving for hundreds of concurrent dashboard users (use ClickHouse, Druid, Pinot); concurrent writers from multiple processes (no embedded engine handles this — graduate to a server). Within these constraints, DuckDB is also the wrong choice for pure OLTP — high-rate row-by-row inserts and point lookups belong in Postgres or SQLite. For everything else — laptop analytics, CI, edge, notebook, single-writer batch — DuckDB is usually the fastest and cheapest answer.
Practice on PipeCode
- Drill the database design practice library → for the schema-and-engine choice probes.
- Rehearse on aggregation problems → for the COUNT / SUM / GROUP BY family that DuckDB vectorises so well.
- Sharpen filtering drills → for the partition-pruning and predicate-pushdown intuition.
- Layer the file I/O library → for the Python side of local ETL.
- Stack the optimization practice library → for the EXPLAIN ANALYZE and plan-reading muscles.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the SQL axis with the SQL for data engineering interviews course →.
- Layer the Python axis with the Python for data engineering interviews course →.
- For end-to-end pipeline craft, work through ETL system design for data engineering interviews →.
Pipecode.ai is Leetcode for Data Engineering — every DuckDB recipe above ships with hands-on practice rooms where you write the `read_parquet` scan, the dbt-duckdb model, and the pytest fixture against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `duckdb python` ETL behaves the same in CI as it does on your laptop.





Top comments (0)