DEV Community

Alex Spinov

DuckDB Changed How I Process CSV Files — 7 Queries That Replace pandas

DuckDB just changed how I process CSV files. No more pandas for simple analysis. No more importing into PostgreSQL.

You just query the file directly.

The Basics

-- Query a CSV file. No import, no schema definition.
SELECT * FROM read_csv_auto('sales.csv') LIMIT 5;

DuckDB auto-detects column types, handles headers, and copes with messy data. On my machine it scans a 1M-row file in about 0.3 seconds; pandas takes around 4 seconds to load the same file.

Install

pip install duckdb

Or use the CLI:

brew install duckdb  # macOS
# Then just run: duckdb

7 Queries That Replace pandas

1. Basic Aggregation

-- pandas: df.groupby('country')['revenue'].sum().sort_values(ascending=False)
SELECT country, SUM(revenue) as total
FROM read_csv_auto('sales.csv')
GROUP BY country
ORDER BY total DESC;

2. Filter and Transform

-- pandas: df[df['status'] == 'active'].assign(tax=lambda d: d['price'] * 0.2)
SELECT *, price * 0.2 as tax
FROM read_csv_auto('products.csv')
WHERE status = 'active';

3. Join Two CSVs

-- pandas: pd.merge(orders, customers, on='customer_id')
SELECT o.*, c.name, c.email
FROM read_csv_auto('orders.csv') o
JOIN read_csv_auto('customers.csv') c ON o.customer_id = c.id;

4. Window Functions

-- pandas: df['rank'] = df.groupby('category')['sales'].rank(ascending=False)
SELECT *,
    RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS "rank"
FROM read_csv_auto('products.csv');

5. Export to Parquet

-- Convert CSV to Parquet (10x smaller, 100x faster to query later)
COPY (SELECT * FROM read_csv_auto('huge_file.csv'))
TO 'output.parquet' (FORMAT PARQUET);

6. Query JSON Files

SELECT title, score
FROM read_json_auto('hackernews.json')
WHERE score > 100
ORDER BY score DESC
LIMIT 20;

7. Query Remote Files

-- Query a CSV straight from a URL (recent DuckDB versions auto-load the httpfs extension for https paths)
SELECT COUNT(*) FROM read_csv_auto('https://raw.githubusercontent.com/datasets/co2-ppm/master/data/co2-mm-mlo.csv');

Python Integration

import duckdb

# Query directly
result = duckdb.sql("SELECT * FROM read_csv_auto('data.csv') WHERE amount > 1000").fetchdf()

# Works with existing DataFrames too
import pandas as pd
df = pd.read_csv("data.csv")
result = duckdb.sql("SELECT country, AVG(amount) FROM df GROUP BY country").fetchdf()

When to Use DuckDB vs pandas

Task                          | DuckDB          | pandas
Quick CSV analysis            | Use DuckDB      | Overkill
Data transformation pipeline  | Use DuckDB      | Fine
ML feature engineering        | Fine            | Use pandas
Files > 1 GB                  | Use DuckDB      | Often runs out of memory
Complex joins across files    | Use DuckDB      | Painful
Interactive exploration       | Use DuckDB CLI  | Jupyter

Performance

On a 10M row CSV (1.2 GB):

  • DuckDB: 0.8 seconds for GROUP BY aggregation
  • pandas: 12 seconds (if it doesn't crash)
  • SQLite: 45 seconds (after importing)

DuckDB is columnar and uses vectorized execution. For analytical queries, it destroys everything else in its weight class.


📧 spinov001@gmail.com — I build data pipelines and analytics tools.

Related: 10 Dev Tools I Use Daily | 150+ Free APIs | 77 Scrapers on a Schedule
