What if you could run SQL on a CSV file without importing it into a database first? That's DuckDB.
## What is DuckDB?

DuckDB is an in-process analytical database. Think of it as "SQLite for analytics" — it runs inside your application with zero external dependencies, but it's optimized for analytical queries on millions of rows.
## Why DuckDB Is Everywhere Now

### 1. Query Any File Format

```sql
-- Query a CSV directly
SELECT * FROM 'sales_2025.csv' WHERE revenue > 10000;

-- Query Parquet
SELECT product, SUM(revenue) FROM 'data/*.parquet' GROUP BY product;

-- Query JSON
SELECT * FROM read_json('events.json');

-- Query multiple files with a glob pattern
SELECT * FROM 'logs/2025-*.csv' WHERE level = 'ERROR';
```
No ETL. No import. Just point and query.
### 2. Blazing Fast Analytics

```sql
-- Aggregate 100M rows
SELECT
    date_trunc('month', created_at) AS month,
    COUNT(*) AS total,
    AVG(amount) AS avg_amount,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95
FROM read_parquet('transactions/*.parquet')
GROUP BY 1
ORDER BY 1;
-- Completes in ~2 seconds on a laptop
```
Columnar storage + vectorized execution = analytical queries at blazing speed.
### 3. Zero Dependencies

```python
import duckdb

# That's it. No server, no config, no docker-compose.
con = duckdb.connect()

# Query a pandas DataFrame
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [95, 87]})
result = con.execute("SELECT * FROM df WHERE score > 90").fetchdf()
```

The same zero-setup story holds in Node.js:

```javascript
import { Database } from 'duckdb-async';

const db = await Database.create(':memory:');
const result = await db.all("SELECT 42 as answer");
```
### 4. Pandas Integration

```python
import duckdb
import pandas as pd

# Load data
sales = pd.read_csv('sales.csv')

# Query the DataFrame with SQL
result = duckdb.sql("""
    SELECT
        category,
        SUM(revenue) AS total_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM sales
    WHERE date >= '2025-01-01'
    GROUP BY category
    HAVING total_revenue > 100000
    ORDER BY total_revenue DESC
""").df()

# Often dramatically faster than the equivalent pandas groupby on large datasets
```
### 5. Export to Any Format

```sql
-- CSV
COPY (SELECT * FROM analysis) TO 'output.csv';

-- Parquet (compressed, columnar)
COPY (SELECT * FROM analysis) TO 'output.parquet' (FORMAT PARQUET);

-- JSON
COPY (SELECT * FROM analysis) TO 'output.json' (FORMAT JSON);

-- Excel (via the spatial extension's GDAL xlsx driver)
INSTALL spatial;
LOAD spatial;
COPY (SELECT * FROM analysis) TO 'output.xlsx' (FORMAT GDAL, DRIVER 'xlsx');
```
## DuckDB vs SQLite vs Pandas

| | DuckDB | SQLite | Pandas |
|---|---|---|---|
| Optimized for | Analytics (OLAP) | Transactions (OLTP) | Data manipulation |
| 100M-row aggregation | ~2 seconds | 60+ seconds | 30+ seconds |
| Direct file queries | CSV, Parquet, JSON, globs | No | No (loads files fully into memory) |
| SQL support | Full analytical SQL | Standard SQL | None (DataFrame operations) |
| Memory usage | Streaming (low) | Low | High (loads all data) |
## Use Cases
- Data analysis — Replace Pandas for SQL-friendly analytics
- ETL pipelines — Transform CSV → Parquet with SQL
- Log analysis — Query log files directly
- BI dashboards — Backend for analytical queries
- Data validation — Check data quality with SQL assertions
## Getting Started

```shell
# Python
pip install duckdb

# Node.js
npm install duckdb

# CLI (via Homebrew)
brew install duckdb

# CLI usage
duckdb
> SELECT * FROM 'mydata.csv' LIMIT 10;
```
## The Bottom Line
DuckDB is the missing piece between "I have data files" and "I have insights." Run SQL on anything, anywhere, with zero setup.
Need data extraction tools? I build web scraping solutions. Check my Apify actors or email spinov001@gmail.com.