I stopped writing throwaway scripts for messy CSVs and just use SQL now

#database

Someone sends you a CSV. Then a folder of CSVs. Then a CSV that's actually tab-separated but named .csv, with a stray header row and a column that's a number on most rows and the string N/A on the rest.

For years my answer to "can you pull a quick number out of this?" was a throwaway Python script. Read it in, fight pandas about dtypes, groupby, print, delete the script, forget everything, repeat next week. It worked. It was also slow and I never kept any of it.

These days I just point SQL at the file. I want to show you the exact workflow because it's embarrassingly simple and it's saved me a lot of evenings.

The one binary I actually use

The tool is clickhouse-local. It's a single binary, the ClickHouse engine minus the server. You download it and run SQL against files on your disk.

curl https://clickhouse.com/ | sh

That gives you a clickhouse binary in the current directory. Now you can do this:

./clickhouse local -q "SELECT count() FROM file('orders.csv')"

That's it. It read the file, sniffed the format and the column types, and counted the rows. No setup.

Querying the thing like it's a table

Say I've got orders.csv and I want revenue by country, top 10. Normally that's a few lines of pandas. Here it's the query you'd write anyway:

./clickhouse local -q "
  SELECT country, round(sum(amount), 2) AS revenue
  FROM file('orders.csv')
  GROUP BY country
  ORDER BY revenue DESC
  LIMIT 10
"

The file() function is the whole trick. It reads the file and gives you something you can SELECT from. It auto-detects CSV, TSV, JSON, Parquet and a pile of others from the extension and contents, and it infers column names and types from the header and the data. The example above is honestly 90% of what you need.

When the file is "somebody else's CSV"

Real files are messy, so here's where this stops being a toy.

It's actually tab-separated. Override the format instead of renaming the file:

./clickhouse local -q "SELECT * FROM file('weird.csv', 'TSV') LIMIT 5"

A column has N/A mixed in with numbers. Read it as text and clean it inline, no preprocessing pass:

SELECT avg(toFloat64OrNull(amount)) AS avg_amount
FROM file('orders.csv')

toFloat64OrNull turns the junk into NULL instead of blowing up, and avg skips nulls. I use the *OrNull and *OrZero functions constantly for this exact reason.

A whole folder of files. Glob them and query all at once, still one query:

./clickhouse local -q "
  SELECT _file, count()
  FROM file('exports/*.csv')
  GROUP BY _file
"

_file is a virtual column telling you which file each row came from. Great for "which of these 40 exports is missing data."

Turn the slow file into a fast file

If I'm going to keep poking at the same CSV, the first thing I do is convert it to Parquet once. Columnar, compressed, types baked in, so every query after that is faster and smaller on disk:

./clickhouse local -q "
  SELECT * FROM file('orders.csv')
  INTO OUTFILE 'orders.parquet'
  FORMAT Parquet
"

Then query orders.parquet from then on. This one habit alone made my repeated ad-hoc queries feel instant.

Why I stuck with this one

Two reasons, and the second is the one that surprised me.

First, the obvious one: it's fast and there's no ceremony. A multi-GB CSV that made my old pandas script swap is a sub-second GROUP BY here, because the engine is columnar and uses all my cores without me asking.

Second, and this is why I didn't just bounce to the next shiny CLI tool, it's the same SQL and the same engine whether the data is a 5 MB CSV on my laptop or billions of rows in a real ClickHouse cluster. When a "quick look at a file" turns into "okay we actually need to run this every hour over a year of data," I'm not rewriting anything. Same file(), same functions, same query, it just moves to a server and keeps going. I've been burned before by prototyping in one tool and then re-implementing everything for production. Not having to do that is worth a lot.

So now the answer to "can you pull a number out of this?" is thirty seconds and a SQL query, and if it turns out to matter, the thirty-second version is already the production version.

Give the messy-CSV thing a try next time one lands in your inbox. You'll stop writing the throwaway script too.