DuckDB is an in-process analytical database — run complex SQL queries on files (CSV, Parquet, JSON) without any server.
## What You Get for Free
- No server — embedded database, runs in your process
- File queries — SQL directly on CSV, Parquet, JSON, Excel files
- Blazing fast — columnar engine optimized for analytics
- Python/R/Node — native bindings for data science languages
- Standard SQL — window functions, CTEs, subqueries, joins
- Arrow integration — zero-copy data exchange with Pandas/Polars
- Extensions — PostgreSQL scanner, HTTP, spatial, and more
- WASM — runs in the browser via WebAssembly
## Quick Start
```shell
brew install duckdb
duckdb
```
```sql
-- Query CSV directly (no import!)
SELECT country, COUNT(*) as users, AVG(age) as avg_age
FROM 'users.csv'
GROUP BY country
ORDER BY users DESC;

-- Query Parquet from S3
SELECT * FROM 's3://my-bucket/events/*.parquet'
WHERE event_date > '2026-01-01';

-- Query JSON
SELECT json_extract(data, '$.name') as name
FROM read_json_auto('data.json');
```
## Why Developers Switch from Pandas
Pandas loads everything into memory and makes you learn its own API. DuckDB gives you:
- SQL — query with SQL, not method chains
- Larger-than-RAM — streaming execution for big files
- Faster — 10-100x faster than Pandas for aggregations
- No import step — query files directly
A data scientist's Pandas notebook took 15 minutes to load and process a 5GB CSV. With DuckDB, the same query ran in 30 seconds, without ever loading the entire file into memory.
## Need Custom Data Solutions?
I build production-grade scrapers and data pipelines for startups, agencies, and research teams.
Browse 88+ ready-made scrapers on Apify — Reddit, HN, LinkedIn, Google, Amazon, and more.
Custom project? Email me: spinov001@gmail.com — fast turnaround, fair pricing.