The $247 Cloud Bill That Made Me Question Everything
I ran the same 1TB data aggregation job on AWS three times — once with PySpark, once with Dask, and once with Polars. The total cloud compute cost across all three runs was $247. The speed difference was 18x. The winner wasn't what I expected.
Most benchmark posts compare these frameworks on synthetic data or convenient datasets that fit in RAM. I wanted to know what happens when you're processing a real 1TB CSV dump of server logs — the kind where you're burning through EC2 credits and questioning your career choices. This isn't about which framework is "better." It's about which one costs less when you're paying by the hour.
The Dataset and The Problem
I used a 1TB anonymized server access log dataset (think nginx combined logs, but bigger). The task: compute daily active users, aggregate request counts by endpoint, and calculate 95th percentile response times per day. Standard analytics workload.
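To pin down what those three metrics mean, here is a minimal standard-library sketch run on a tiny in-memory sample. It is not the benchmark code. The column names (timestamp, user_id, endpoint, response_ms), the sample rows, and the nearest-rank percentile definition are all assumptions made for illustration; the real log schema may differ.

```python
import csv
import io
import math
from collections import Counter, defaultdict

# Hypothetical sample: column names and values are illustrative only,
# not taken from the real 1TB dataset.
SAMPLE_CSV = """timestamp,user_id,endpoint,response_ms
2024-01-01T00:00:01,u1,/home,120
2024-01-01T00:05:00,u2,/home,80
2024-01-01T09:30:00,u1,/api/items,200
2024-01-02T10:00:00,u3,/home,50
2024-01-02T11:00:00,u3,/api/items,300
"""


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(rows):
    """Compute the three metrics in a single pass over an iterable of row dicts."""
    users_by_day = defaultdict(set)       # day -> distinct user ids
    requests_by_endpoint = Counter()      # endpoint -> request count
    latencies_by_day = defaultdict(list)  # day -> response times in ms

    for row in rows:
        day = row["timestamp"][:10]  # assumes ISO 8601 timestamps (YYYY-MM-DD...)
        users_by_day[day].add(row["user_id"])
        requests_by_endpoint[row["endpoint"]] += 1
        latencies_by_day[day].append(float(row["response_ms"]))

    dau = {day: len(users) for day, users in users_by_day.items()}
    p95 = {day: percentile_nearest_rank(v, 95) for day, v in latencies_by_day.items()}
    return dau, dict(requests_by_endpoint), p95


dau, requests_by_endpoint, p95 = aggregate(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(dau)
print(requests_by_endpoint)
print(p95)
```

This version holds every day's response times in memory to get an exact percentile, which is fine for five rows and hopeless at 1TB. That gap is why the job needs a framework that can stream or partition the data.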
Continue reading the full article on TildAlice