# The Problem with Data Analysis in 2026
You download a CSV. You open Python. You call `pd.read_csv()`. Your laptop freezes because the file is 2 GB.
DuckDB solves this. It's an embedded analytical database that:
- Runs SQL directly on CSV/Parquet/JSON files
- Processes data 10-100x faster than pandas for large datasets
- Needs zero configuration: just `pip install duckdb`
- Works in Python, R, Node.js, Java, Rust, Go, and the CLI
## Quick Start
```bash
pip install duckdb
```
```python
import duckdb

# Query a CSV file directly — no import step needed
result = duckdb.sql("""
    SELECT country, COUNT(*) AS users, AVG(revenue) AS avg_revenue
    FROM 'users.csv'
    GROUP BY country
    ORDER BY avg_revenue DESC
    LIMIT 10
""")
print(result)
```
That's it. No CREATE TABLE. No COPY. No schema definition. DuckDB reads the CSV, infers types, and runs your query.
## DuckDB vs pandas: When Speed Matters
```python
import random
import time

import duckdb
import pandas as pd

# Generate test data (1M rows)
with open("/tmp/test.csv", "w") as f:
    f.write("id,category,value\n")
    for i in range(1_000_000):
        f.write(f"{i},{random.choice(['A','B','C'])},{random.random()*100:.2f}\n")

# pandas
start = time.time()
df = pd.read_csv("/tmp/test.csv")
result = df.groupby("category")["value"].agg(["mean", "count", "sum"])
pandas_time = time.time() - start

# DuckDB (fetchall() forces the lazy relation to actually execute,
# so we time real work rather than just query construction)
start = time.time()
result = duckdb.sql("""
    SELECT category, AVG(value), COUNT(*), SUM(value)
    FROM '/tmp/test.csv'
    GROUP BY category
""").fetchall()
duckdb_time = time.time() - start

print(f"pandas: {pandas_time:.2f}s")
print(f"DuckDB: {duckdb_time:.2f}s")
print(f"DuckDB is {pandas_time/duckdb_time:.1f}x faster")
```
Typical result: DuckDB is 5-20x faster for aggregation queries on CSV files, and the gap widens with larger files.
## Query Parquet Files (Even from S3)
```python
# Local Parquet
duckdb.sql("SELECT * FROM 'data.parquet' LIMIT 10")

# Remote Parquet over S3 or HTTP (uses the httpfs extension,
# which recent DuckDB versions load automatically)
duckdb.sql("""
    SELECT * FROM read_parquet('https://example.com/dataset.parquet') LIMIT 10
""")

# Multiple files with a glob pattern
duckdb.sql("SELECT * FROM 'data/logs_*.csv'")
```
## Join CSV with JSON
```python
duckdb.sql("""
    SELECT c.name, j.metadata->>'source' AS source, c.revenue
    FROM 'customers.csv' c
    JOIN 'enrichment.json' j ON c.id = j.customer_id
    WHERE c.revenue > 1000
    ORDER BY c.revenue DESC
""")
```
## Export Results Anywhere
```python
# To a pandas DataFrame
df = duckdb.sql("SELECT * FROM 'data.csv' WHERE value > 50").df()

# To Parquet (compressed and columnar, typically several times smaller than CSV)
duckdb.sql("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET)")

# To a new CSV
duckdb.sql("COPY (SELECT * FROM 'data.csv' WHERE category = 'A') TO 'filtered.csv'")

# To JSON
duckdb.sql("COPY (SELECT * FROM 'data.csv') TO 'data.json' (FORMAT JSON)")
```
## CLI Mode: SQL in Your Terminal
```bash
# Install the CLI
brew install duckdb   # or download from duckdb.org

# Query a CSV from the terminal
duckdb -c "SELECT COUNT(*) FROM 'access.log.csv' WHERE status >= 400"

# Interactive mode
duckdb
D SELECT * FROM 'sales.csv' WHERE date >= '2026-01-01' LIMIT 5;
```
## Real Use Cases
- Log analysis — Query gigabytes of CSV logs with SQL
- Data cleaning — Filter, transform, deduplicate without pandas memory issues
- ETL pipelines — Convert CSV → Parquet with transformations
- Ad-hoc analytics — Quick questions about any data file
- Jupyter notebooks — Faster alternative to pandas for large datasets
## When to Use What
| Task | Best Tool |
|---|---|
| Quick CSV exploration | DuckDB |
| Large file aggregation | DuckDB |
| Complex data manipulation | pandas |
| Machine learning prep | pandas + DuckDB |
| Production database | PostgreSQL |
| Embedded storage | SQLite |
| Embedded analytics | DuckDB |
What's your go-to tool for working with large CSV files? Still using pandas? Tried DuckDB? Share your experience.
I write about practical developer tools every week. Follow for more.
More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs