
TildAlice

Originally published at tildalice.io

Pandas vs Polars vs Dask on 10M Rows: Real Benchmarks

Polars beat Pandas by 8x on aggregations. Dask crashed twice.

I ran the same data pipeline on 10 million rows three times — once with Pandas, once with Polars, once with Dask. The gap between "fast enough" and "production ready" showed up in the profiler, not the docs.

This isn't a toy benchmark. The data is generated, but it's realistic e-commerce transaction data: timestamps, user IDs, product categories, prices, and a few messy nulls. The kind of dataset you'd actually wrangle at work. The operations were mundane (groupby aggregations, window functions, joins, string parsing), but at 10M rows, implementation details matter.
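To make "implementation details matter" concrete: the same transformation written as a row-wise `apply` versus a vectorized column operation can differ by an order of magnitude at this row count. A minimal pandas sketch (the column name and data here are illustrative, not the benchmark's actual schema):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.uniform(1, 100, size=100_000)})

# Row-wise apply: one Python function call per row -- painfully slow at 10M rows
taxed_slow = df.apply(lambda row: row["price"] * 1.08, axis=1)

# Vectorized: a single NumPy multiplication over the whole column
taxed_fast = df["price"] * 1.08

assert np.allclose(taxed_slow, taxed_fast)
```

Both produce identical results; only the second stays on the fast C path, which is exactly the kind of detail that separates the three libraries at scale.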

Here's what I learned: Polars is genuinely faster, but only if you write Polars-native code. Dask parallelizes beautifully until it doesn't. Pandas is still the safest bet for most teams, even when it's slower.

*Photo by Snow Chang on Pexels: a giant panda sitting calmly in its zoo habitat.*

The Dataset: 10M Transactions, 1.2GB CSV

I generated a synthetic e-commerce dataset with the following schema:


```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

np.random.seed(42)
```
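The snippet is cut off right after the seed in this excerpt. As a hedged sketch only, a generator matching the schema described above might look like the following; the column names, cardinalities, and distributions are my assumptions, not the article's actual values:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
n = 10_000_000  # 10M rows; roughly 1.2GB once written out as CSV

df = pd.DataFrame({
    # one year of second-resolution timestamps
    "timestamp": pd.Timestamp("2024-01-01")
    + pd.to_timedelta(np.random.randint(0, 365 * 24 * 3600, n), unit="s"),
    "user_id": np.random.randint(1, 1_000_000, n),
    "category": np.random.choice(
        ["electronics", "clothing", "home", "books"], n
    ),
    "price": np.round(np.random.exponential(scale=40.0, size=n), 2),
})

# sprinkle in the "few messy nulls": blank out 1% of categories
null_idx = np.random.choice(n, size=n // 100, replace=False)
df.loc[null_idx, "category"] = None

# df.to_csv("transactions.csv", index=False)  # writes the ~1.2GB file
```

Scale `n` down for a quick local experiment; the distributions only need to be messy enough to exercise the groupbys, joins, and null handling the benchmark measures.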

---

*Continue reading the full article on [TildAlice](https://tildalice.io/pandas-polars-dask-10m-rows-benchmark/)*
