Every month it's the same CSV exports. Same system, same columns, same handful of cleanup steps before the data is usable: drop duplicates, trim whitespace, fix the date column, and stack a dozen files into one.
Each of the usual options has a catch. Copy-paste is slow, and eventually you paste a header row into the middle of your data and don't notice. A Python script works great until a column moves or the next person on the team can't run it. Power Query is powerful but it's a lot of clicking, and it lives inside Excel.
So I built csvtidy — a small command-line tool for exactly this: clean and merge messy CSV files, and save the steps so you can re-run them on next month's files without rebuilding anything.
Install
pip install csvtidy
The basics
Merge every CSV in a folder into one file, tagging each row with the file it came from:
csvtidy merge ./exports --output combined.csv --source-column file
Remove duplicate rows:
csvtidy dedupe data.csv --output clean.csv
Basic cleanup — trim whitespace, normalize dates, drop empty rows:
csvtidy clean data.csv --output clean.csv
It reads and writes stdin/stdout, so you can chain steps the Unix way:
csvtidy merge ./exports | csvtidy dedupe - | csvtidy clean -
Recipes: build once, re-run forever
This is the part I actually care about. Instead of retyping the same steps every month, you save them as a recipe — a small YAML file:
steps:
- clean: { trim: true, fix_dates: true }
- dedupe: { subset: [email] }
- merge: { source_column: file }
Then run it against any folder:
csvtidy recipe monthly.yaml ./exports --output combined.csv
Next month, drop the new files in the folder and run the same command. That's the whole idea: the cleanup is reproducible, not redone.
Why DuckDB under the hood
The thing that makes this more than a pandas wrapper is that csvtidy runs on DuckDB. It streams files instead of loading them fully into memory, so it handles CSVs bigger than your RAM — the ones that make Excel hang and a naive pd.read_csv fall over. For a tool whose entire job is "merge a folder of large exports," that's the difference between working and crashing on exactly the files that matter most.
It's fast, and it's all local — nothing leaves your machine, which matters when the files are client, payroll, or finance data.
It's open source
csvtidy is MIT-licensed. Code and docs are here:
github.com/abhishekrai43/csvtidy
Side note: I also build Kramata, a desktop app for the same kind of repeatable spreadsheet cleanup — but with a visual recipe builder, for the non-technical people on a team who'd rather click than type. csvtidy is the open-source CLI cousin of that engine.
Top comments (0)