DEV Community

Herbert Tzekian
Herbert Tzekian

Posted on

I stopped writing throwaway scripts for messy CSVs and just use SQL now

Someone sends you a CSV. Then a folder of CSVs. Then a CSV that's actually tab-separated but named .csv, with a stray header row and a column that's a number on most rows and the string N/A on the rest.

For years my answer to "can you pull a quick number out of this?" was a throwaway Python script. Read it in, fight pandas about dtypes, groupby, print, delete the script, forget everything, repeat next week. It worked. It was also slow and I never kept any of it.

These days I just point SQL at the file. I want to show you the exact workflow because it's embarrassingly simple and it's saved me a lot of evenings.

The one binary I actually use

The tool is clickhouse-local. It's a single binary, the ClickHouse engine minus the server. You download it and run SQL against files on your disk.

curl https://clickhouse.com/ | sh
Enter fullscreen mode Exit fullscreen mode

That gives you a clickhouse binary in the current directory. Now you can do this:

./clickhouse local -q "SELECT count() FROM file('orders.csv')"
Enter fullscreen mode Exit fullscreen mode

That's it. It read the file, sniffed the format and the column types, and counted the rows. No setup.

Querying the thing like it's a table

Say I've got orders.csv and I want revenue by country, top 10. Normally that's a few lines of pandas. Here it's the query you'd write anyway:

./clickhouse local -q "
  SELECT country, round(sum(amount), 2) AS revenue
  FROM file('orders.csv')
  GROUP BY country
  ORDER BY revenue DESC
  LIMIT 10
"
Enter fullscreen mode Exit fullscreen mode

The file() function is the whole trick. It reads the file and gives you something you can SELECT from. It auto-detects CSV, TSV, JSON, Parquet and a pile of others from the extension and contents, and it infers column names and types from the header and the data. The example above is honestly 90% of what you need.

When the file is "somebody else's CSV"

Real files are messy, so here's where this stops being a toy.

It's actually tab-separated. Override the format instead of renaming the file:

./clickhouse local -q "SELECT * FROM file('weird.csv', 'TSV') LIMIT 5"
Enter fullscreen mode Exit fullscreen mode

A column has N/A mixed in with numbers. Read it as text and clean it inline, no preprocessing pass:

SELECT avg(toFloat64OrNull(amount)) AS avg_amount
FROM file('orders.csv')
Enter fullscreen mode Exit fullscreen mode

toFloat64OrNull turns the junk into NULL instead of blowing up, and avg skips nulls. I use the *OrNull and *OrZero functions constantly for this exact reason.

A whole folder of files. Glob them and query all at once, still one query:

./clickhouse local -q "
  SELECT _file, count()
  FROM file('exports/*.csv')
  GROUP BY _file
"
Enter fullscreen mode Exit fullscreen mode

_file is a virtual column telling you which file each row came from. Great for "which of these 40 exports is missing data."

Turn the slow file into a fast file

If I'm going to keep poking at the same CSV, the first thing I do is convert it to Parquet once. Columnar, compressed, types baked in, so every query after that is faster and smaller on disk:

./clickhouse local -q "
  SELECT * FROM file('orders.csv')
  INTO OUTFILE 'orders.parquet'
  FORMAT Parquet
"
Enter fullscreen mode Exit fullscreen mode

Then query orders.parquet from then on. This one habit alone made my repeated ad-hoc queries feel instant.

Why I stuck with this one

Two reasons, and the second is the one that surprised me.

First, the obvious one: it's fast and there's no ceremony. A multi-GB CSV that made my old pandas script swap is a sub-second GROUP BY here, because the engine is columnar and uses all my cores without me asking.

Second, and this is why I didn't just bounce to the next shiny CLI tool, it's the same SQL and the same engine whether the data is a 5 MB CSV on my laptop or billions of rows in a real ClickHouse cluster. When a "quick look at a file" turns into "okay we actually need to run this every hour over a year of data," I'm not rewriting anything. Same file(), same functions, same query, it just moves to a server and keeps going. I've been burned before by prototyping in one tool and then re-implementing everything for production. Not having to do that is worth a lot.

So now the answer to "can you pull a number out of this?" is thirty seconds and a SQL query, and if it turns out to matter, the thirty-second version is already the production version.

Give the messy-CSV thing a try next time one lands in your inbox. You'll stop writing the throwaway script too.

Top comments (0)