Someone sends you a CSV. Then a folder of CSVs. Then a CSV that's actually tab-separated but named .csv, with a stray header row and a column that's a number on most rows and the string N/A on the rest.
For years my answer to "can you pull a quick number out of this?" was a throwaway Python script. Read it in, fight pandas about dtypes, groupby, print, delete the script, forget everything, repeat next week. It worked. It was also slow and I never kept any of it.
These days I just point SQL at the file. I want to show you the exact workflow because it's embarrassingly simple and it's saved me a lot of evenings.
The one binary I actually use
The tool is clickhouse-local. It's a single binary, the ClickHouse engine minus the server. You download it and run SQL against files on your disk.
curl https://clickhouse.com/ | sh
That gives you a clickhouse binary in the current directory. Now you can do this:
./clickhouse local -q "SELECT count() FROM file('orders.csv')"
That's it. It read the file, sniffed the format and the column types, and counted the rows. No setup.
Querying the thing like it's a table
Say I've got orders.csv and I want revenue by country, top 10. Normally that's a few lines of pandas. Here it's the query you'd write anyway:
./clickhouse local -q "
SELECT country, round(sum(amount), 2) AS revenue
FROM file('orders.csv')
GROUP BY country
ORDER BY revenue DESC
LIMIT 10
"
The file() function is the whole trick. It reads the file and gives you something you can SELECT from. It auto-detects CSV, TSV, JSON, Parquet and a pile of others from the extension and contents, and it infers column names and types from the header and the data. The example above is honestly 90% of what you need.
When the file is "somebody else's CSV"
Real files are messy, so here's where this stops being a toy.
It's actually tab-separated. Override the format instead of renaming the file:
./clickhouse local -q "SELECT * FROM file('weird.csv', 'TSV') LIMIT 5"
A column has N/A mixed in with numbers. Read it as text and clean it inline, no preprocessing pass:
SELECT avg(toFloat64OrNull(amount)) AS avg_amount
FROM file('orders.csv')
toFloat64OrNull turns the junk into NULL instead of blowing up, and avg skips nulls. I use the *OrNull and *OrZero functions constantly for this exact reason.
A whole folder of files. Glob them and query all at once, still one query:
./clickhouse local -q "
SELECT _file, count()
FROM file('exports/*.csv')
GROUP BY _file
"
_file is a virtual column telling you which file each row came from. Great for "which of these 40 exports is missing data."
Turn the slow file into a fast file
If I'm going to keep poking at the same CSV, the first thing I do is convert it to Parquet once. Columnar, compressed, types baked in, so every query after that is faster and smaller on disk:
./clickhouse local -q "
SELECT * FROM file('orders.csv')
INTO OUTFILE 'orders.parquet'
FORMAT Parquet
"
Then query orders.parquet from then on. This one habit alone made my repeated ad-hoc queries feel instant.
Why I stuck with this one
Two reasons, and the second is the one that surprised me.
First, the obvious one: it's fast and there's no ceremony. A multi-GB CSV that made my old pandas script swap is a sub-second GROUP BY here, because the engine is columnar and uses all my cores without me asking.
Second, and this is why I didn't just bounce to the next shiny CLI tool, it's the same SQL and the same engine whether the data is a 5 MB CSV on my laptop or billions of rows in a real ClickHouse cluster. When a "quick look at a file" turns into "okay we actually need to run this every hour over a year of data," I'm not rewriting anything. Same file(), same functions, same query, it just moves to a server and keeps going. I've been burned before by prototyping in one tool and then re-implementing everything for production. Not having to do that is worth a lot.
So now the answer to "can you pull a number out of this?" is thirty seconds and a SQL query, and if it turns out to matter, the thirty-second version is already the production version.
Give the messy-CSV thing a try next time one lands in your inbox. You'll stop writing the throwaway script too.
Top comments (0)