DEV Community

Alex Spinov

DuckDB Changed How I Process CSV Files — 7 Queries That Replace pandas

DuckDB just changed how I process CSV files. No more pandas for simple analysis. No more importing into PostgreSQL.

You just query the file directly.

The Basics

-- Query a CSV file. No import, no schema definition.
SELECT * FROM read_csv_auto('sales.csv') LIMIT 5;

DuckDB auto-detects column types, handles headers, and copes with messy data. On my machine it scans a 1M-row file in about 0.3 seconds; pandas takes around 4 seconds to load the same file.

Install

pip install duckdb

Or use the CLI:

brew install duckdb  # macOS
# Then just run: duckdb

7 Queries That Replace pandas

1. Basic Aggregation

-- pandas: df.groupby('country')['revenue'].sum().sort_values(ascending=False)
SELECT country, SUM(revenue) as total
FROM read_csv_auto('sales.csv')
GROUP BY country
ORDER BY total DESC;

2. Filter and Transform

-- pandas: df[df['status'] == 'active'].assign(tax=lambda d: d['price'] * 0.2)
SELECT *, price * 0.2 as tax
FROM read_csv_auto('products.csv')
WHERE status = 'active';

3. Join Two CSVs

-- pandas: pd.merge(orders, customers, on='customer_id')
SELECT o.*, c.name, c.email
FROM read_csv_auto('orders.csv') o
JOIN read_csv_auto('customers.csv') c ON o.customer_id = c.id;

4. Window Functions

-- pandas: df['rank'] = df.groupby('category')['sales'].rank(ascending=False)
SELECT *,
    RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS "rank"
FROM read_csv_auto('products.csv');

5. Export to Parquet

-- Convert CSV to Parquet (10x smaller, 100x faster to query later)
COPY (SELECT * FROM read_csv_auto('huge_file.csv'))
TO 'output.parquet' (FORMAT PARQUET);

6. Query JSON Files

SELECT title, score
FROM read_json_auto('hackernews.json')
WHERE score > 100
ORDER BY score DESC
LIMIT 20;

7. Query Remote Files

-- Query a CSV straight from a URL (recent DuckDB versions auto-load the httpfs extension for https paths)
SELECT COUNT(*) FROM read_csv_auto('https://raw.githubusercontent.com/datasets/co2-ppm/master/data/co2-mm-mlo.csv');

Python Integration

import duckdb

# Query directly
result = duckdb.sql("SELECT * FROM read_csv_auto('data.csv') WHERE amount > 1000").fetchdf()

# Works with existing DataFrames too
import pandas as pd
df = pd.read_csv("data.csv")
result = duckdb.sql("SELECT country, AVG(amount) FROM df GROUP BY country").fetchdf()

When to Use DuckDB vs pandas

Task                          | DuckDB          | pandas
Quick CSV analysis            | Use DuckDB      | Overkill
Data transformation pipeline  | Use DuckDB      | Fine
ML feature engineering        | Fine            | Use pandas
Files > 1 GB                  | Use DuckDB      | Often runs out of memory
Complex joins across files    | Use DuckDB      | Painful
Interactive exploration       | Use DuckDB CLI  | Jupyter

Performance

On a 10M row CSV (1.2 GB):

  • DuckDB: 0.8 seconds for GROUP BY aggregation
  • pandas: 12 seconds (if it doesn't crash)
  • SQLite: 45 seconds (after importing)

DuckDB is columnar and uses vectorized execution. For analytical queries, it destroys everything else in its weight class.


📧 spinov001@gmail.com — I build data pipelines and analytics tools.

Related: 10 Dev Tools I Use Daily | 150+ Free APIs | 77 Scrapers on a Schedule
