This embedded database runs SQL on DataFrames: meet DuckDB

Why every Python developer should have DuckDB in their toolkit (especially if you’re tired of juggling Pandas and SQL separately)

What if you could run SQL directly on your Pandas DataFrames without setting up a full database server?
DuckDB makes that possible: it’s an embedded, analytical SQL engine that works like SQLite but speaks fluent Pandas. Whether you’re analyzing CSVs, querying Parquet files, or just tired of writing awkward DataFrame filtering logic, DuckDB lets you write clean SQL inside your Python scripts. In this article, I’ll show you how it works, what it can do, and why it’s worth a shot.

What this article covers:

  1. What is DuckDB?
  2. SQL on DataFrames: the magic explained
  3. In-memory vs persistent mode
  4. Using DuckDB in notebooks, threads, and real code
  5. Extended SQL features you’ll actually use
  6. Final thoughts: When to use it and why

What is DuckDB?

Think of DuckDB as SQLite’s nerdy cousin who’s obsessed with analytics.

It’s an embedded, relational database designed for OLAP (analytics) workloads: filtering huge CSVs, aggregating metrics, or quickly exploring Parquet files. Just like SQLite, it runs inside your program. No servers. No setup. Just import and go.

But unlike SQLite, DuckDB is built to handle columnar data, vectorized execution, and modern formats like:

  • CSV
  • Parquet
  • JSON
  • Even cloud storage (S3, HTTP)
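
For instance, you can point SQL straight at files on disk. Here’s a minimal sketch (the file names are made up for the example):

import duckdb

# Query a Parquet file directly; "events.parquet" is a hypothetical path
duckdb.query("SELECT COUNT(*) AS n FROM 'events.parquet'").to_df()

# Same idea for CSVs; DuckDB infers column names and types for you
duckdb.query("SELECT * FROM read_csv_auto('events.csv') LIMIT 5").to_df()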

It supports a rich SQL dialect, plus some nice extras (more on that later), and has APIs for Python, R, and more. The core library is MIT-licensed and actively developed by DuckDB Labs with support from the DuckDB Foundation; the project grew out of the database research group at CWI in Amsterdam.

You don’t need Docker, you don’t need Postgres, and you definitely don’t need to copy-paste Pandas code five times in a row. DuckDB handles real analytics workloads without the usual pain.

SQL on DataFrames: the magic explained

Let’s be honest: querying data with Pandas can get messy fast. .loc, .query(), .groupby(), .apply()… it’s powerful, but when you’re just trying to answer a question like “what are the top 5 earners?”, it feels like overkill. Enter DuckDB.

DuckDB lets you write actual SQL right on top of your Pandas DataFrames. No exporting to Postgres. No spinning up SQLite. You just write SQL like this:

import duckdb
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "salary": [100000, 120000, 140000]
})
result = duckdb.query("SELECT * FROM df WHERE salary > 110000").to_df()
print(result)

You don’t need to “register” the DataFrame: DuckDB just looks at the current scope, finds the df variable, and lets SQL do its thing. It feels like sorcery at first, but it’s incredibly useful.

Now, there’s some Python magic going on here. Behind the scenes, DuckDB inspects your local variables, grabs df, and treats it like a table. Cool, right?

Well… your IDE doesn’t always agree. Pylance and linters might scream “unused variable!” even though DuckDB is using it via string-based SQL. This is where explicit registration helps:

conn = duckdb.connect()
conn.register("employees", df)  # expose the DataFrame under an explicit table name
result = conn.execute("SELECT * FROM employees").fetchdf()

Clean. IDE-friendly. No mysterious variable lookups. Use this pattern in real projects; it saves headaches when refactoring later.


In-memory vs persistent mode

DuckDB gives you two modes to work with: in-memory (fast and temporary) or persistent (saved to disk).

If you don’t specify anything, it defaults to in-memory. That’s awesome for:

  • quick data experiments
  • testing pipelines
  • throwing away state after every run

But there’s a catch: once the script ends, everything’s gone, including any tables you created or rows you inserted.

conn = duckdb.connect()  # in-memory
conn.execute("CREATE TABLE users (id INT, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Sonic')")
# Close this conn? Boom. Table gone.

If you want to keep your data between runs, just give it a path:

conn = duckdb.connect("users.db.duckdb")

Now it’s backed by a file. You can:

  • Query it later
  • Insert new rows
  • Even inspect it with CLI tools or notebooks

And yeah, it’s still embedded: no server is spun up in the background.
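
Here’s a rough sketch of that round trip (same users table as above; the file name is just an example):

import duckdb

# First run: create and fill a table in a file-backed database
conn = duckdb.connect("users.db.duckdb")
conn.execute("CREATE TABLE IF NOT EXISTS users (id INT, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Sonic')")
conn.close()

# A later run (or a different script): the data is still there
conn = duckdb.connect("users.db.duckdb")
print(conn.execute("SELECT * FROM users").fetchall())  # [(1, 'Sonic')]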

⚠️ Heads up: if you’re loading huge CSVs into memory each time, in-memory mode might slow down your script. For serious workloads, go persistent and thank yourself later.


Using DuckDB in notebooks, threads, and real code

DuckDB fits naturally into your workflow whether you’re prototyping in Jupyter, running scripts, or scaling with threads and processes.

In Jupyter Notebooks

Just call duckdb.connect() in a cell and you’re ready to go:

import duckdb
conn = duckdb.connect()

Load a CSV, register it, and run SQL in the next cell; it’s that simple. You’ll see query results right there in the notebook like magic.
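
Something like this, split across a couple of cells (the CSV name and columns are hypothetical, swap in your own):

import duckdb
import pandas as pd

conn = duckdb.connect()

# "sales.csv", "region", and "amount" are made-up examples
df = pd.read_csv("sales.csv")
conn.register("sales", df)
conn.execute("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").fetchdf()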

Pro tip: Always run the connection cell first. DuckDB isn’t psychic; the connection needs to exist before you query it.

Threading and multiprocessing

Need to load data and run queries at the same time? DuckDB supports:

  • Multithreading: multiple threads sharing one database (give each thread its own cursor via conn.cursor())
  • Multiprocessing: separate processes reading the same database file (many readers, or a single read-write process at a time)

from threading import Thread
import duckdb

# One shared connection; each thread should use its own cursor
conn = duckdb.connect()
conn.execute("CREATE TABLE employees (name TEXT, salary INT)")
conn.execute("INSERT INTO employees VALUES ('Alice', 100000), ('Bob', 120000)")

def run_query():
    conn.cursor().execute("SELECT * FROM employees").fetchall()

t1 = Thread(target=run_query)
t2 = Thread(target=run_query)
t1.start(); t2.start()
t1.join(); t2.join()

For CPU-heavy stuff, multiprocessing is better. For IO-bound tasks like querying CSVs or Parquet, threading usually wins.

Extended SQL features you’ll actually use

DuckDB doesn’t just do basic SQL; it sneaks in some surprisingly powerful extras that make plain Pandas feel ancient.

Here are 3 features I actually use:


DESCRIBE: quick schema look

Need to know what columns you’re working with?

DESCRIBE employees;

It’ll show column names, types, and nullability, faster than digging through the CSV or printing df.head() 20 times.
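
From Python it’s just another query; assuming the employees table registered earlier:

# One row per column: name, type, nullability, and so on
conn.execute("DESCRIBE employees").fetchdf()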


EXPLAIN: query plan breakdown

Want to see how DuckDB runs your SQL under the hood?

conn.execute("EXPLAIN SELECT * FROM employees WHERE salary > 120000").fetchall()

It shows you how the query filters, scans, and groups; great for catching inefficient queries or just flexing like a database wizard.

Better yet, use .explain() in Python to format it nicely:

conn.query("SELECT * FROM employees").explain()


SUMMARIZE: column stats in one line

Get quick insights on your data: min, max, percentiles, counts, and nulls, all in one command.

SUMMARIZE employees;

Way faster than calling df.describe() and parsing the output with your eyes.
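
Same deal from Python, assuming the employees table registered earlier:

# One row per column with min, max, approximate unique counts, null percentage, and more
conn.execute("SUMMARIZE employees").fetchdf()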


Final thoughts: when to use it and why

DuckDB is what I reach for when I want:

  • SQL on DataFrames without setting up a server
  • Clean analytics scripts that don’t fight me
  • One tool that speaks both Python and SQL fluently

It’s not trying to replace Postgres or your full data warehouse. But for local analytics, prototyping, testing, and quick exploration, it’s ridiculously good.

If you’re juggling Pandas, SQL, and CSVs, and still writing spaghetti queries by hand, DuckDB will clean up your life a bit. It’s fast, it feels modern, and most importantly… it just works.

