The 4GB CSV That Ate My Laptop
You load a 4GB CSV with pd.read_csv() and watch htop climb to 28GB of RAM before your kernel kills the process. This isn't a Pandas bug — it's the default behavior.
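Before optimizing anything, it helps to see the damage without triggering the OOM kill. Here's a minimal sketch that estimates the full in-RAM footprint from a sample; the filename is a hypothetical placeholder, and `nrows` and `memory_usage(deep=True)` are standard Pandas APIs:

```python
import pandas as pd

# Read only the first 100k rows so the estimate itself can't OOM.
sample = pd.read_csv("big.csv", nrows=100_000)  # hypothetical file

# deep=True walks the Python strings behind object columns instead of
# counting just their 8-byte pointers, so the number reflects reality.
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
print(f"~{bytes_per_row:,.0f} bytes/row; multiply by your total row count")
```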
Most tutorials tell you to "just use Dask" or "switch to Polars." But you don't need a new library. Pandas has built-in memory optimization that can compress your DataFrame to 10% of its original size without losing a single value. The catch? You have to opt in manually, and the decisions aren't obvious from the docs.
I'm going to show you how to load that same 4GB CSV in under 400MB of RAM, query it faster than the bloated version, and understand exactly which dtypes to use when. We'll start with the nuclear option (categorical downcast + chunked loading), then work backwards to see why the naive approach fails.
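To make the nuclear option concrete, here's a sketch of what it looks like, not a drop-in solution: the column names and dtype choices (`user_id`, `country`, `amount`) are hypothetical, and you'd map them to your own schema.

```python
import pandas as pd

# Hypothetical schema: substitute your own columns. The dtype map is the
# manual opt-in step; Pandas won't pick these narrower types for you.
dtypes = {
    "user_id": "int32",     # downcast from the int64 default
    "country": "category",  # repeated strings become small integer codes
    "amount": "float32",    # half the width of the float64 default
}

# Stream the file in 1M-row chunks so peak RAM stays near one chunk.
chunks = pd.read_csv("big.csv", dtype=dtypes, chunksize=1_000_000)
df = pd.concat(chunks, ignore_index=True)

# Gotcha: concatenating categoricals whose per-chunk category sets differ
# silently falls back to object dtype, so re-cast once to keep the savings.
df["country"] = df["country"].astype("category")

print(f"{df.memory_usage(deep=True).sum() / 1e9:.2f} GB in RAM")
```

The categorical columns are also where the speedup comes from: comparisons and groupbys run over the integer codes instead of millions of Python string objects.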
Why Pandas Eats 7x More RAM Than Your CSV
The file on disk is 4GB. Pandas loads it into 28GB of RAM. Where did the extra 24GB go?
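The quickest way to answer that is a per-column breakdown of the naive load. A sketch, again with a hypothetical filename:

```python
import pandas as pd

df = pd.read_csv("big.csv")  # hypothetical file, naive defaults

# Per-column byte counts. The object (string) columns usually dominate:
# every cell is a full Python str, carrying roughly 50 bytes of object
# overhead in CPython before the text itself, plus an 8-byte pointer.
print(df.memory_usage(deep=True).sort_values(ascending=False))
print(df.dtypes)
```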