DuckDB just changed how I process CSV files. No more pandas for simple analysis. No more importing into PostgreSQL.
You just query the file directly.
## The Basics

```sql
-- Query a CSV file. No import, no schema definition.
SELECT * FROM read_csv_auto('sales.csv') LIMIT 5;
```
DuckDB auto-detects column types, handles headers, and deals with messy data. It takes 0.3 seconds for a 1M row file. pandas takes 4 seconds for the same file.
## Install

```bash
pip install duckdb
```

Or use the CLI:

```bash
brew install duckdb  # macOS
# Then just run: duckdb
```
## 7 Queries That Replace pandas
### 1. Basic Aggregation

```sql
-- pandas: df.groupby('country')['revenue'].sum().sort_values(ascending=False)
SELECT country, SUM(revenue) AS total
FROM read_csv_auto('sales.csv')
GROUP BY country
ORDER BY total DESC;
```
### 2. Filter and Transform

```sql
-- pandas: df[df['status'] == 'active'].assign(tax=df['price'] * 0.2)
SELECT *, price * 0.2 AS tax
FROM read_csv_auto('products.csv')
WHERE status = 'active';
```
### 3. Join Two CSVs

```sql
-- pandas: pd.merge(orders, customers, on='customer_id')
SELECT o.*, c.name, c.email
FROM read_csv_auto('orders.csv') o
JOIN read_csv_auto('customers.csv') c ON o.customer_id = c.id;
```
### 4. Window Functions

```sql
-- pandas: df['rank'] = df.groupby('category')['sales'].rank(ascending=False)
SELECT *,
       RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS rank
FROM read_csv_auto('products.csv');
```
### 5. Export to Parquet

```sql
-- Convert CSV to Parquet (10x smaller, 100x faster to query later)
COPY (SELECT * FROM read_csv_auto('huge_file.csv'))
TO 'output.parquet' (FORMAT PARQUET);
```
### 6. Query JSON Files

```sql
SELECT title, score
FROM read_json_auto('hackernews.json')
WHERE score > 100
ORDER BY score DESC
LIMIT 20;
```
### 7. Query Remote Files

```sql
-- Query a CSV from a URL directly
SELECT COUNT(*)
FROM read_csv_auto('https://raw.githubusercontent.com/datasets/co2-ppm/master/data/co2-mm-mlo.csv');
```

HTTP(S) reads go through DuckDB's httpfs extension, which recent versions load automatically.
## Python Integration

```python
import duckdb
import pandas as pd

# Query a file directly
result = duckdb.sql("SELECT * FROM read_csv_auto('data.csv') WHERE amount > 1000").fetchdf()

# DuckDB can also query an existing DataFrame by its variable name
df = pd.read_csv("data.csv")
result = duckdb.sql("SELECT country, AVG(amount) FROM df GROUP BY country").fetchdf()
```
## When to Use DuckDB vs pandas
| Task | DuckDB | pandas |
|---|---|---|
| Quick CSV analysis | Use DuckDB | Overkill |
| Data transformation pipeline | Use DuckDB | Fine |
| ML feature engineering | Fine | Use pandas |
| Files > 1GB | Use DuckDB | pandas crashes |
| Complex joins across files | Use DuckDB | Painful |
| Interactive exploration | Use DuckDB CLI | Jupyter |
## Performance
On a 10M row CSV (1.2 GB):
- DuckDB: 0.8 seconds for GROUP BY aggregation
- pandas: 12 seconds (if it doesn't crash)
- SQLite: 45 seconds (after importing)
DuckDB is columnar and uses vectorized execution. For analytical queries, it destroys everything else in its weight class.
📧 spinov001@gmail.com — I build data pipelines and analytics tools.
Related: 10 Dev Tools I Use Daily | 150+ Free APIs | 77 Scrapers on a Schedule