Alex Spinov
DuckDB Is a Free Embedded Analytics Database — Run SQL on CSV, Parquet, and JSON Files

What if you could run SQL on a CSV file without importing it into a database first? That's DuckDB.

What is DuckDB?

DuckDB is an in-process analytical database. Think of it as "SQLite for analytics" — it runs inside your application with zero external dependencies, but it's optimized for analytical queries on millions of rows.

Why DuckDB Is Everywhere Now

1. Query Any File Format

-- Query CSV directly
SELECT * FROM 'sales_2025.csv' WHERE revenue > 10000;

-- Query Parquet
SELECT product, SUM(revenue) FROM 'data/*.parquet' GROUP BY product;

-- Query JSON
SELECT * FROM read_json('events.json');

-- Query multiple files with glob
SELECT * FROM 'logs/2025-*.csv' WHERE level = 'ERROR';

No ETL. No import. Just point and query.

2. Blazing Fast Analytics

-- Aggregating 100M rows
SELECT
  date_trunc('month', created_at) as month,
  COUNT(*) as total,
  AVG(amount) as avg_amount,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) as p95
FROM read_parquet('transactions/*.parquet')
GROUP BY 1
ORDER BY 1;
-- Completes in ~2 seconds on a laptop

Columnar storage + vectorized execution = analytical queries at blazing speed.

3. Zero Dependencies

import duckdb

# That's it. No server, no config, no docker-compose
con = duckdb.connect()

# Query a pandas DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [95, 87]})
result = con.execute("SELECT * FROM df WHERE score > 90").fetchdf()
The same zero-setup story applies in Node.js, here via the duckdb-async wrapper:
import { Database } from 'duckdb-async';
const db = await Database.create(':memory:');
const result = await db.all("SELECT 42 as answer");

4. Pandas Integration

import duckdb
import pandas as pd

# Load data
sales = pd.read_csv('sales.csv')

# Query DataFrame with SQL
result = duckdb.sql("""
    SELECT
        category,
        SUM(revenue) as total_revenue,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM sales
    WHERE date >= '2025-01-01'
    GROUP BY category
    HAVING total_revenue > 100000
    ORDER BY total_revenue DESC
""").df()

# Often far faster than an equivalent pandas groupby on large datasets

5. Export to Any Format

-- CSV
COPY (SELECT * FROM analysis) TO 'output.csv';

-- Parquet (compressed, columnar)
COPY (SELECT * FROM analysis) TO 'output.parquet' (FORMAT PARQUET);

-- JSON
COPY (SELECT * FROM analysis) TO 'output.json' (FORMAT JSON);

-- Excel
INSTALL spatial; LOAD spatial;
COPY (SELECT * FROM analysis) TO 'output.xlsx' (FORMAT GDAL, DRIVER 'xlsx');

DuckDB vs SQLite vs Pandas

                      DuckDB              SQLite               Pandas
Optimized for         Analytics (OLAP)    Transactions (OLTP)  Data manipulation
100M row aggregation  2 seconds           60+ seconds          30+ seconds
File queries          CSV, Parquet, JSON  No                   CSV only
SQL support           Full analytical     Standard             No (df operations)
Memory usage          Streaming (low)     Low                  High (loads all data)

Use Cases

  • Data analysis — Replace Pandas for SQL-friendly analytics
  • ETL pipelines — Transform CSV → Parquet with SQL
  • Log analysis — Query log files directly
  • BI dashboards — Backend for analytical queries
  • Data validation — Check data quality with SQL assertions

Getting Started

# Python
pip install duckdb

# Node.js
npm install duckdb

# CLI
brew install duckdb
# CLI usage
duckdb
> SELECT * FROM 'mydata.csv' LIMIT 10;

The Bottom Line

DuckDB is the missing piece between "I have data files" and "I have insights." Run SQL on anything, anywhere, with zero setup.


Need data extraction tools? I build web scraping solutions. Check my Apify actors or email spinov001@gmail.com.
