Alex Spinov

DuckDB Just Made SQLite Look Slow — Run Analytics on CSVs Without a Database

The Problem with Data Analysis in 2026

You download a CSV. You open Python. You write pd.read_csv(). Your laptop freezes because the file is 2GB.

DuckDB solves this. It's an embedded analytical database that:

  • Runs SQL directly on CSV/Parquet/JSON files
  • Runs aggregations typically 5-20x faster than pandas on large datasets
  • Uses zero configuration — just pip install duckdb
  • Works in Python, R, Node.js, Java, Rust, Go, and CLI

Quick Start

pip install duckdb

import duckdb

# Query a CSV file directly — no import step needed
result = duckdb.sql("""
    SELECT country, COUNT(*) as users, AVG(revenue) as avg_revenue
    FROM 'users.csv'
    GROUP BY country
    ORDER BY avg_revenue DESC
    LIMIT 10
""")

print(result)

That's it. No CREATE TABLE. No COPY. No schema definition. DuckDB reads the CSV, infers types, and runs your query.

DuckDB vs Pandas — When Speed Matters

import duckdb
import pandas as pd
import time

# Generate test data (1M rows)
import random
with open("/tmp/test.csv", "w") as f:
    f.write("id,category,value\n")
    for i in range(1_000_000):
        f.write(f"{i},{random.choice(['A','B','C'])},{random.random()*100:.2f}\n")

# Pandas
start = time.time()
df = pd.read_csv("/tmp/test.csv")
result = df.groupby("category")["value"].agg(["mean", "count", "sum"])
pandas_time = time.time() - start

# DuckDB: fetchall() forces the lazy relation to actually run inside the timer
start = time.time()
result = duckdb.sql("""
    SELECT category, AVG(value), COUNT(*), SUM(value)
    FROM '/tmp/test.csv'
    GROUP BY category
""").fetchall()
duckdb_time = time.time() - start

print(f"Pandas: {pandas_time:.2f}s")
print(f"DuckDB: {duckdb_time:.2f}s")
print(f"DuckDB is {pandas_time/duckdb_time:.1f}x faster")

Typical result: DuckDB is 5-20x faster for aggregation queries on CSV files. The gap widens with larger files.

Query Parquet Files (Even from S3)

# Local Parquet
duckdb.sql("SELECT * FROM 'data.parquet' LIMIT 10")

# Remote Parquet (S3, HTTP)
duckdb.sql("""
    SELECT * FROM read_parquet(
        'https://example.com/dataset.parquet'
    ) LIMIT 10
""")

# Multiple files with glob
duckdb.sql("SELECT * FROM 'data/logs_*.csv'")

Join CSV with JSON

# Note: DuckDB auto-detects the JSON schema; if metadata is inferred
# as a STRUCT (consistent keys), use j.metadata.source instead of ->>
duckdb.sql("""
    SELECT c.name, j.metadata->>'source' as source, c.revenue
    FROM 'customers.csv' c
    JOIN 'enrichment.json' j ON c.id = j.customer_id
    WHERE c.revenue > 1000
    ORDER BY c.revenue DESC
""")

Export Results Anywhere

# To pandas DataFrame
df = duckdb.sql("SELECT * FROM 'data.csv' WHERE value > 50").df()

# To Parquet (often around 10x smaller than CSV; format inferred from the extension)
duckdb.sql("COPY (SELECT * FROM 'data.csv') TO 'data.parquet'")

# To new CSV
duckdb.sql("COPY (SELECT * FROM 'data.csv' WHERE category='A') TO 'filtered.csv'")

# To JSON
duckdb.sql("COPY (SELECT * FROM 'data.csv') TO 'data.json' (FORMAT JSON)")

CLI Mode — SQL in Your Terminal

# Install CLI
brew install duckdb  # or download from duckdb.org

# Query CSV from terminal
duckdb -c "SELECT COUNT(*) FROM 'access.log.csv' WHERE status >= 400"

# Interactive mode
duckdb
D SELECT * FROM 'sales.csv' WHERE date >= '2026-01-01' LIMIT 5;

Real Use Cases

  1. Log analysis — Query gigabytes of CSV logs with SQL
  2. Data cleaning — Filter, transform, deduplicate without pandas memory issues
  3. ETL pipelines — Convert CSV → Parquet with transformations
  4. Ad-hoc analytics — Quick questions about any data file
  5. Jupyter notebooks — Faster alternative to pandas for large datasets

When to Use What

Task                         Best Tool
Quick CSV exploration        DuckDB
Large file aggregation       DuckDB
Complex data manipulation    pandas
Machine learning prep        pandas + DuckDB
Production database          PostgreSQL
Embedded storage             SQLite
Embedded analytics           DuckDB

What's your go-to tool for working with large CSV files? Still using pandas? Tried DuckDB? Share your experience.

I write about practical developer tools every week. Follow for more.


More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs
