DEV Community: Herbert Tzekian

When pandas runs out of RAM, I change one import line

Herbert Tzekian — Wed, 24 Jun 2026 15:01:00 +0000

You know the moment. You've got a Parquet file or a fat CSV, you write the obvious pd.read_parquet(...), and Python sits there. Then either it finishes in its own sweet time, or you get the message everyone who's done data work in Python has seen at least once:

MemoryError: Unable to allocate ... for an array

The file is bigger than your RAM, pandas wants the whole thing in memory at once, and your laptop says no.

I used to solve this by chunking, or by giving up and rewriting the whole notebook in SQL against some database just to run one GROUP BY. Both are annoying, and the second one means I now maintain two versions of the same analysis. What I do now is change one import line and keep writing the exact same pandas I already had. No server, no connection string, and crucially no SQL, because I never leave the DataFrame API.

The one-line change

The library is chDB. It's ClickHouse compiled as an in-process engine you pip install, basically the ClickHouse query engine as a Python library, with nothing to run in Docker and nothing to connect to. If you want the internals, it's ClickHouse compiled as an in-process engine, but the part that changed my day-to-day is its datastore module.

pip install chdb

# Change this:
import pandas as pd
# To this:
import chdb.datastore as pd
# Everything else stays the same.

That's it. Your existing pandas code keeps working, with the same method names, the same arguments, and the same DataFrames coming back. The one thing that changes is how you point at a file too big to open: instead of pd.read_parquet, you hand the file to a DataStore, which streams it instead of loading it whole. From there it's the pandas API you already know. The difference underneath is that operations are lazy and compile down to the ClickHouse engine, so nothing executes until you actually need a result (a print(), a len(), plotting). The giant file never gets slurped into RAM in one go.

The query I'd otherwise have crashed on

Here's a real example. Events Parquet, I want the top user actions in the last 30 days. This is the textbook read_parquet (boom) followed by a filter and a groupby. Written as plain pandas, except it doesn't fall over:

import chdb.datastore as pd

events = pd.DataStore.from_file("events.parquet")

recent = events[events["event_date"] > "2026-05-21"]
top = (recent
       .groupby("action")
       .size()
       .sort_values(ascending=False)
       .head(20))

print(top)

That's the same code I'd have written against pandas. I didn't drop into SQL, I didn't chunk anything, I didn't think about memory. Because it's lazy and columnar underneath, reading three columns out of forty doesn't pay for the other thirty-seven, and the filter gets pushed down into the file scan so rows I don't want never get decoded. That's the difference between "loads 12 GB" and "reads the slice you asked for."

It's the same pandas all the way through

You don't get a cut-down subset that handles groupby and bails on everything else. The whole DataFrame API is here: filtering, column selection, computed columns, the string and datetime accessors, joins:

import chdb.datastore as pd

reqs = pd.DataStore.from_file("requests.parquet")

reqs = reqs.assign(hour=reqs["ts"].dt.floor("h"))   # datetime accessor, computed column
hourly = (reqs
          .groupby("hour")["latency_ms"]
          .mean()
          .sort_index())

hourly.plot()   # tiny result now, plots like any pandas Series

The big file got crunched down to a few hundred rows before anything had to fit in memory.

Joining across files (and databases) without leaving pandas

The bit I didn't expect to use as much as I do: the same DataFrame can be backed by a file or a remote source, and you join them with a normal .join(). So a Parquet file on disk and a table in Postgres or MySQL line up in one expression.

import chdb.datastore as pd

orders = pd.DataStore.from_file("orders.parquet")
customers = pd.DataStore.uri("mysql://root:pass@db:3306/crm/customers")

by_country = (orders
              .join(customers, left_on="customer_id", right_on="id")
              .groupby("country")["amount"]
              .sum()
              .sort_values(ascending=False))

print(by_country)

It also reads CSV, TSV, JSON and friends the same way, so the same approach works for a giant CSV you'd otherwise be afraid to open, and for Parquet as shown above.

Why this became my default

The honest reason I keep using it isn't the speed, though it is faster. It's that I stopped maintaining two versions of my analysis.

Before, a notebook experiment used pandas, and then if it needed to run for real on the full dataset it got rewritten into SQL, against some warehouse, with a different dialect and a deployment story. With chdb.datastore the exploration code is the production code. It's pandas while I'm poking at a file on my laptop, and it's still pandas when the file is too big to open the normal way. I scale up by changing what the DataFrame points at, not by rewriting it.

So my rule of thumb now: if a file is small and friendly, sure, regular pandas is fine. The second it's big enough to make Python sweat, I swap the import line. The code doesn't change, it doesn't run out of RAM, and the throwaway version turns out not to be throwaway.

Next time you hit MemoryError, try changing import pandas as pd to import chdb.datastore as pd and rerun the cell. Usually that's the whole fix.

I stopped writing throwaway scripts for messy CSVs and just use SQL now

Herbert Tzekian — Sat, 20 Jun 2026 13:41:23 +0000

Someone sends you a CSV. Then a folder of CSVs. Then a CSV that's actually tab-separated but named .csv, with a stray header row and a column that's a number on most rows and the string N/A on the rest.

For years my answer to "can you pull a quick number out of this?" was a throwaway Python script. Read it in, fight pandas about dtypes, groupby, print, delete the script, forget everything, repeat next week. It worked. It was also slow and I never kept any of it.

These days I just point SQL at the file. I want to show you the exact workflow because it's embarrassingly simple and it's saved me a lot of evenings.

The one binary I actually use

The tool is clickhouse-local. It's a single binary, the ClickHouse engine minus the server. You download it and run SQL against files on your disk.

curl https://clickhouse.com/ | sh

That gives you a clickhouse binary in the current directory. Now you can do this:

./clickhouse local -q "SELECT count() FROM file('orders.csv')"

That's it. It read the file, sniffed the format and the column types, and counted the rows. No setup.

Querying the thing like it's a table

Say I've got orders.csv and I want revenue by country, top 10. Normally that's a few lines of pandas. Here it's the query you'd write anyway:

./clickhouse local -q "
  SELECT country, round(sum(amount), 2) AS revenue
  FROM file('orders.csv')
  GROUP BY country
  ORDER BY revenue DESC
  LIMIT 10
"

The file() function is the whole trick. It reads the file and gives you something you can SELECT from. It auto-detects CSV, TSV, JSON, Parquet and a pile of others from the extension and contents, and it infers column names and types from the header and the data. The example above is honestly 90% of what you need.

When the file is "somebody else's CSV"

Real files are messy, so here's where this stops being a toy.

It's actually tab-separated. Override the format instead of renaming the file:

./clickhouse local -q "SELECT * FROM file('weird.csv', 'TSV') LIMIT 5"

A column has N/A mixed in with numbers. Read it as text and clean it inline, no preprocessing pass:

SELECT avg(toFloat64OrNull(amount)) AS avg_amount
FROM file('orders.csv')

toFloat64OrNull turns the junk into NULL instead of blowing up, and avg skips nulls. I use the *OrNull and *OrZero functions constantly for this exact reason.

A whole folder of files. Glob them and query all at once, still one query:

./clickhouse local -q "
  SELECT _file, count()
  FROM file('exports/*.csv')
  GROUP BY _file
"

_file is a virtual column telling you which file each row came from. Great for "which of these 40 exports is missing data."

Turn the slow file into a fast file

If I'm going to keep poking at the same CSV, the first thing I do is convert it to Parquet once. Columnar, compressed, types baked in, so every query after that is faster and smaller on disk:

./clickhouse local -q "
  SELECT * FROM file('orders.csv')
  INTO OUTFILE 'orders.parquet'
  FORMAT Parquet
"

Then query orders.parquet from then on. This one habit alone made my repeated ad-hoc queries feel instant.

Why I stuck with this one

Two reasons, and the second is the one that surprised me.

First, the obvious one: it's fast and there's no ceremony. A multi-GB CSV that made my old pandas script swap is a sub-second GROUP BY here, because the engine is columnar and uses all my cores without me asking.

Second, and this is why I didn't just bounce to the next shiny CLI tool, it's the same SQL and the same engine whether the data is a 5 MB CSV on my laptop or billions of rows in a real ClickHouse cluster. When a "quick look at a file" turns into "okay we actually need to run this every hour over a year of data," I'm not rewriting anything. Same file(), same functions, same query, it just moves to a server and keeps going. I've been burned before by prototyping in one tool and then re-implementing everything for production. Not having to do that is worth a lot.

So now the answer to "can you pull a number out of this?" is thirty seconds and a SQL query, and if it turns out to matter, the thirty-second version is already the production version.

Give the messy-CSV thing a try next time one lands in your inbox. You'll stop writing the throwaway script too.