<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Herbert Tzekian</title>
    <description>The latest articles on DEV Community by Herbert Tzekian (@herbze).</description>
    <link>https://dev.to/herbze</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3994159%2Fbc81abd7-7e30-4076-9943-74ac34dc700a.png</url>
      <title>DEV Community: Herbert Tzekian</title>
      <link>https://dev.to/herbze</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/herbze"/>
    <language>en</language>
    <item>
      <title>I stopped writing throwaway scripts for messy CSVs and just use SQL now</title>
      <dc:creator>Herbert Tzekian</dc:creator>
      <pubDate>Sat, 20 Jun 2026 13:41:23 +0000</pubDate>
      <link>https://dev.to/herbze/i-stopped-writing-throwaway-scripts-for-messy-csvs-and-just-use-sql-now-p6h</link>
      <guid>https://dev.to/herbze/i-stopped-writing-throwaway-scripts-for-messy-csvs-and-just-use-sql-now-p6h</guid>
      <description>&lt;p&gt;Someone sends you a CSV. Then a folder of CSVs. Then a CSV that's actually tab-separated but named &lt;code&gt;.csv&lt;/code&gt;, with a stray header row and a column that's a number on most rows and the string &lt;code&gt;N/A&lt;/code&gt; on the rest.&lt;/p&gt;

&lt;p&gt;For years my answer to "can you pull a quick number out of this?" was a throwaway Python script. Read it in, fight pandas about dtypes, &lt;code&gt;groupby&lt;/code&gt;, print, delete the script, forget everything, repeat next week. It worked. It was also slow and I never kept any of it.&lt;/p&gt;

&lt;p&gt;These days I just point SQL at the file. I want to show you the exact workflow because it's embarrassingly simple and it's saved me a lot of evenings.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one binary I actually use
&lt;/h2&gt;

&lt;p&gt;The tool is &lt;a href="https://clickhouse.com/resources/engineering/what-is-clickhouse-local" rel="noopener noreferrer"&gt;&lt;code&gt;clickhouse-local&lt;/code&gt;&lt;/a&gt;. It's a single binary, the ClickHouse engine minus the server. You download it and run SQL against files on your disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://clickhouse.com/ | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives you a &lt;code&gt;clickhouse&lt;/code&gt; binary in the current directory. Now you can do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./clickhouse &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"SELECT count() FROM file('orders.csv')"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. It read the file, sniffed the format and the column types, and counted the rows. No setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying the thing like it's a table
&lt;/h2&gt;

&lt;p&gt;Say I've got &lt;code&gt;orders.csv&lt;/code&gt; and I want revenue by country, top 10. Normally that's a few lines of pandas. Here it's the query you'd write anyway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./clickhouse &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"
  SELECT country, round(sum(amount), 2) AS revenue
  FROM file('orders.csv')
  GROUP BY country
  ORDER BY revenue DESC
  LIMIT 10
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;file()&lt;/code&gt; function is the whole trick. It &lt;a href="https://clickhouse.com/resources/engineering/run-sql-on-csv-file" rel="noopener noreferrer"&gt;reads the file and gives you something you can &lt;code&gt;SELECT&lt;/code&gt; from&lt;/a&gt;. It auto-detects CSV, TSV, JSON, Parquet and a pile of others from the extension and contents, and it infers column names and types from the header and the data. The example above is honestly 90% of what you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the file is "somebody else's CSV"
&lt;/h2&gt;

&lt;p&gt;Real files are messy, so here's where this stops being a toy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's actually tab-separated.&lt;/strong&gt; Override the format instead of renaming the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./clickhouse &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM file('weird.csv', 'TSV') LIMIT 5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A column has &lt;code&gt;N/A&lt;/code&gt; mixed in with numbers.&lt;/strong&gt; Read it as text and clean it inline, no preprocessing pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toFloat64OrNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'orders.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;toFloat64OrNull&lt;/code&gt; turns the junk into &lt;code&gt;NULL&lt;/code&gt; instead of blowing up, and &lt;code&gt;avg&lt;/code&gt; skips nulls. I use the &lt;code&gt;*OrNull&lt;/code&gt; and &lt;code&gt;*OrZero&lt;/code&gt; functions constantly for this exact reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A whole folder of files.&lt;/strong&gt; Glob them and query all at once, still one query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./clickhouse &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"
  SELECT _file, count()
  FROM file('exports/*.csv')
  GROUP BY _file
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;_file&lt;/code&gt; is a virtual column telling you which file each row came from. Great for "which of these 40 exports is missing data."&lt;/p&gt;

&lt;h2&gt;
  
  
  Turn the slow file into a fast file
&lt;/h2&gt;

&lt;p&gt;If I'm going to keep poking at the same CSV, the first thing I do is &lt;a href="https://clickhouse.com/resources/engineering/convert-csv-to-parquet" rel="noopener noreferrer"&gt;convert it to Parquet&lt;/a&gt; once. Columnar, compressed, types baked in, so every query after that is faster and smaller on disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./clickhouse &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"
  SELECT * FROM file('orders.csv')
  INTO OUTFILE 'orders.parquet'
  FORMAT Parquet
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then query &lt;code&gt;orders.parquet&lt;/code&gt; from then on. This one habit alone made my repeated ad-hoc queries feel instant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I stuck with this one
&lt;/h2&gt;

&lt;p&gt;Two reasons, and the second is the one that surprised me.&lt;/p&gt;

&lt;p&gt;First, the obvious one: it's fast and there's no ceremony. A multi-GB CSV that made my old pandas script swap is a sub-second &lt;code&gt;GROUP BY&lt;/code&gt; here, because the engine is columnar and uses all my cores without me asking.&lt;/p&gt;

&lt;p&gt;Second, and this is why I didn't just bounce to the next shiny CLI tool, it's &lt;em&gt;the same SQL and the same engine&lt;/em&gt; whether the data is a 5 MB CSV on my laptop or billions of rows in a real ClickHouse cluster. When a "quick look at a file" turns into "okay we actually need to run this every hour over a year of data," I'm not rewriting anything. Same &lt;code&gt;file()&lt;/code&gt;, same functions, same query, it just moves to a server and keeps going. I've been burned before by prototyping in one tool and then re-implementing everything for production. Not having to do that is worth a lot.&lt;/p&gt;

&lt;p&gt;So now the answer to "can you pull a number out of this?" is thirty seconds and a SQL query, and if it turns out to matter, the thirty-second version is already the production version.&lt;/p&gt;

&lt;p&gt;Give the messy-CSV thing a try next time one lands in your inbox. You'll stop writing the throwaway script too.&lt;/p&gt;

</description>
      <category>database</category>
    </item>
  </channel>
</rss>
