DuckDB Runs SQL on CSV and Parquet Without a Server
You have a 5GB CSV file. Pandas tries to load the whole thing into memory and can run out of RAM. DuckDB queries the same file with SQL, streaming it from disk in chunks, so memory use stays far below the file size.
What Makes DuckDB Special
- In-process: runs inside your Python/Node/R script
- No server: zero setup, no external dependencies
- Columnar engine: vectorized execution for fast analytics
- Direct file queries: SQL on CSV, Parquet, and JSON, plus Excel through an extension
- PostgreSQL-style SQL: a familiar dialect that closely follows PostgreSQL syntax
- Extensions: httpfs, spatial, iceberg, delta
Quick Start
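import duckdb

# Aggregate a CSV in place; DuckDB infers the schema from the file.
result = duckdb.sql("""
    SELECT city, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM 'orders.csv'
    GROUP BY city
    ORDER BY revenue DESC
    LIMIT 10
""").fetchdf()  # returns a pandas DataFrame, so pandas must be installed

# Query Parquet on S3 (needs the httpfs extension and S3 credentials)
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")
duckdb.sql("SELECT * FROM read_parquet('s3://bucket/data/*.parquet')").show()

# Query JSON
duckdb.sql("SELECT * FROM read_json_auto('events.json')").show()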
DuckDB vs Pandas
| Aspect | DuckDB | Pandas |
|---|---|---|
| 5GB CSV aggregation | About 3 sec (varies with hardware) | Can run out of memory |
| Memory usage | Streams the file in chunks | Loads the full file |
| Syntax | SQL | Method chains |
| Joins | Multi-threaded hash join | Single-threaded `merge` |
📧 spinov001@gmail.com — Data engineering consulting