Posted on Jun 30

csvtidy: merge and clean CSV files from the terminal, with reusable recipes

#datascience #showdev #opensource #python

Every month it's the same CSV exports. Same system, same columns, same handful of cleanup steps before the data is usable: drop duplicates, trim whitespace, fix the date column, and stack a dozen files into one.

Each of the usual options has a catch. Copy-paste is slow, and eventually you paste a header row into the middle of your data and don't notice. A Python script works great until a column moves or the next person on the team can't run it. Power Query is powerful but it's a lot of clicking, and it lives inside Excel.

So I built csvtidy — a small command-line tool for exactly this: clean and merge messy CSV files, and save the steps so you can re-run them on next month's files without rebuilding anything.

Install

pip install csvtidy

The basics

Merge every CSV in a folder into one file, tagging each row with the file it came from:

csvtidy merge ./exports --output combined.csv --source-column file

Remove duplicate rows:

csvtidy dedupe data.csv --output clean.csv

Basic cleanup — trim whitespace, normalize dates, drop empty rows:

csvtidy clean data.csv --output clean.csv

It reads and writes stdin/stdout, so you can chain steps the Unix way:

csvtidy merge ./exports | csvtidy dedupe - | csvtidy clean -

Recipes: build once, re-run forever

This is the part I actually care about. Instead of retyping the same steps every month, you save them as a recipe — a small YAML file:

steps:
  - clean: { trim: true, fix_dates: true }
  - dedupe: { subset: [email] }
  - merge: { source_column: file }

Then run it against any folder:

csvtidy recipe monthly.yaml ./exports --output combined.csv

Next month, drop the new files in the folder and run the same command. That's the whole idea: the cleanup is reproducible, not redone.

Why DuckDB under the hood

The thing that makes this more than a pandas wrapper is that csvtidy runs on DuckDB. It streams files instead of loading them fully into memory, so it handles CSVs bigger than your RAM — the ones that make Excel hang and a naive pd.read_csv fall over. For a tool whose entire job is "merge a folder of large exports," that's the difference between working and crashing on exactly the files that matter most.

It's fast, and it's all local — nothing leaves your machine, which matters when the files are client, payroll, or finance data.

It's open source

csvtidy is MIT-licensed. Code and docs are here:

github.com/abhishekrai43/csvtidy

Side note: I also build Kramata, a desktop app for the same kind of repeatable spreadsheet cleanup — but with a visual recipe builder, for the non-technical people on a team who'd rather click than type. csvtidy is the open-source CLI cousin of that engine.

Top comments (1)

Ebony Martin • Jul 15

Jumping in on this older post because I'm curious if anyone has tried csvtidy with other file formats, like TSV or JSON? Wondering if there are plans to expand its capabilities. Also, does anyone have tips on creating effective YAML recipes for large datasets? Sharing success stories or challenges could be super helpful for those diving into this tool!