Phantasm0009

I wasted 200+ hours parsing client CSVs. So I built a library that does it in one line.

Every data person has the same nightmare.

German client sends a CSV with semicolons, DD.MM.YYYY dates, and European number formatting. French client sends commas with DD/MM/YYYY. US client sends MM/DD/YYYY. You open each one with pandas.read_csv() and it silently corrupts all three.

I spent two years writing the same "detect encoding, guess delimiter, figure out date format" code for every new client. Last month I finally snapped and built a library.

The problem

Here's a typical European export file:

Kunden-Nr;Name;Datum;Umsatz;Aktiv
00742;Müller GmbH;01.03.2025;1.234,56;Ja
00123;Schäfer AG;15.07.2024;789,00;Nein
00456;Böhm & Co;25.12.2024;12.345,67;Ja

Semicolons as delimiters. Dots in dates. Commas as decimal separators. Leading zeros in IDs. Ja/Nein instead of True/False.

Watch what pandas does to it:

>>> import pandas as pd
>>> df = pd.read_csv("german_export.csv")
>>> print(df)
                                Kunden-Nr;Name;Datum;Umsatz;Aktiv
00742;Müller GmbH;01.03.2025;1.234                         56;Ja
00123;Schäfer AG;15.07.2024;789                          00;Nein
00456;Böhm & Co;25.12.2024;12.345                          67;Ja

One giant mangled column. Pandas assumed commas, so the semicolons became part of the values. The European decimal comma in 1.234,56 split the number across columns. Every single row is corrupted.

You can fix this. Pass sep=";". Pass dtype=str for the ID column. Write a custom date parser. Replace commas in numbers. Map Ja to True.

That's 15 lines of boilerplate. Per file. Per client. Per locale.

The solution

>>> import csvmedic
>>> df = csvmedic.read("german_export.csv")
>>> print(df)
  Kunden-Nr         Name      Datum    Umsatz  Aktiv
0     00742  Müller GmbH 2025-03-01   1234.56   True
1     00123   Schäfer AG 2024-07-15    789.00  False
2     00456    Böhm & Co 2024-12-25  12345.67   True

One line. csvmedic auto-detected the semicolon delimiter, parsed DD.MM.YYYY dates correctly, normalized European numbers, preserved leading zeros as strings, and converted Ja/Nein to booleans.

Every decision is logged:

>>> print(df.diagnosis)
csvmedic Diagnosis (0.17s)
  Encoding: utf_8 (confidence: 62%)
  Delimiter: ';'
  Shape: 3 rows × 5 columns

   Kunden-Nr: string (confidence: 100%, preserved)
      reason: leading_zeros_detected
  · Name: string (confidence: 100%, skipped)
   Datum: date (confidence: 100%, converted)
      format_detected: %d.%m.%Y
      dayfirst: True
      ambiguous_count: 1
      unambiguous_count: 2
   Umsatz: float (confidence: 100%, converted)
      locale_detected: de_DE
      decimal_separator: ,
      thousands_separator: .
   Aktiv: boolean (confidence: 100%, converted)
      true_variants: ['ja']
      false_variants: ['nein']

No silent corruption. Every transformation is visible and auditable. In the diagnosis, each column is marked as preserved (kept as a string, here because leading zeros were detected), skipped (the · marker: already a clean string, nothing to do), or successfully converted.

The hard part: DD/MM vs MM/DD

Encoding and delimiter detection are solved problems. The genuinely hard part — and the reason I built this — is date disambiguation.

When you see 03/04/2025, is that March 4 or April 3? There's no way to know from a single value. And pandas just guesses (or gives up and leaves it as a string).

csvmedic doesn't guess. It uses the data itself.

The core insight: you don't need every value to be unambiguous. You need one. If a column contains 03/04/2025, 05/06/2025, and 25/03/2025, that last value has a first component of 25. Since 25 > 12, it can't be a month. So the first component must be the day. One unambiguous value resolves the entire column.
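The rule is small enough to sketch in a few lines. This is a minimal illustration of the idea, not csvmedic's actual code:

```python
# One value whose first or second component exceeds 12 resolves
# day-first vs month-first for the entire column.
def infer_dayfirst(values):
    """Return True (day-first), False (month-first), or None (ambiguous)."""
    for v in values:
        first, second, _year = v.split("/")
        if int(first) > 12:
            return True   # first component can't be a month, so it's the day
        if int(second) > 12:
            return False  # second component can't be a month, so it's the day
    return None           # every value could be either format

print(infer_dayfirst(["03/04/2025", "05/06/2025", "25/03/2025"]))  # True
print(infer_dayfirst(["03/04/2025", "05/06/2025"]))                # None
```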

You can see this in the diagnosis output above: ambiguous_count: 1 (the value 01.03.2025 where both 01 and 03 could be day or month) and unambiguous_count: 2 (the values where the day component was > 12, which locked in the format for the whole column).

When there's no single unambiguous value (every date has both components ≤ 12), csvmedic falls through a chain of strategies: cross-column inference (if another date column in the same file was already resolved, assume the same locale), separator hints (dot separator like 01.03.2025 is strongly associated with European day-first formats), and sequential monotonicity (try both orderings and see which produces sorted dates).
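Of those fallbacks, sequential monotonicity is the easiest to show. A simplified sketch (not the library's implementation), using only the stdlib:

```python
from datetime import datetime

def monotonic_format(values):
    """Return the ordering whose parse yields sorted dates, or None if ambiguous."""
    candidates = []
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            parsed = [datetime.strptime(v, fmt) for v in values]
        except ValueError:
            continue  # this ordering doesn't even parse
        if parsed == sorted(parsed):
            candidates.append(fmt)
    # Only trust the result if exactly one ordering produces sorted dates.
    return candidates[0] if len(candidates) == 1 else None

# Day-first reads as Jan 5 < Jan 10 < Feb 3 (sorted); month-first reads
# as May 1, Oct 1, Mar 2 (not sorted), so day-first wins.
print(monotonic_format(["05/01/2025", "10/01/2025", "03/02/2025"]))  # %d/%m/%Y
```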

If none of these produce a confident answer, csvmedic doesn't guess. It leaves the column as a string and flags it as ambiguous in the diagnosis. Silent corruption is worse than no conversion.

Proving it: csvmedic.diff()

If you're skeptical (you should be), csvmedic has a built-in proof tool:

>>> result = csvmedic.diff("leading_zeros.csv")
>>> print(result.summary())
csvmedic.diff(): pandas vs csvmedic
  File: leading_zeros.csv
  Shape: pandas (3, 4) vs csvmedic (3, 4)
  Columns with value differences: ['product_id', 'zip_code']
  Sample value differences (row, column, pandas, csvmedic):
    (0, 'product_id'): np.int64(742) vs '00742'
    (1, 'product_id'): np.int64(123) vs '00123'
    (2, 'product_id'): np.int64(456) vs '00456'
    (0, 'zip_code'): np.int64(1234) vs '01234'
    (1, 'zip_code'): np.int64(90210) vs '90210'

Pandas converted 00742 to 742, 01234 to 1234. Those are customer IDs and zip codes — they're not numbers, they're strings that happen to contain digits. csvmedic detects the leading zeros and preserves them.

This is the output you show your team lead when they ask "why do we need another dependency?"

Schema pinning for recurring files

Most real-world data work involves the same file format every month. csvmedic lets you save the detected schema and reuse it:

# First time: detect everything
df = csvmedic.read("monthly_export.csv")
csvmedic.save_schema(
    df.attrs["diagnosis"].file_profile,
    "monthly_export.csvmedic.json"
)

# Every time after: skip detection, apply cached schema
df = csvmedic.read("monthly_export.csv", schema="monthly_export.csvmedic.json")

The schema file is plain JSON — you can commit it to your repo, review it in PRs, and share it across your team.

Batch reads with consensus

When you're reading 12 monthly CSVs that should all have the same format, use consensus mode:

dfs = csvmedic.read_batch(
    ["jan.csv", "feb.csv", "mar.csv"],
    use_consensus=True
)

csvmedic samples each file, runs encoding and delimiter detection on all of them, and uses the majority result for every file. If 11 out of 12 files are UTF-8 with semicolons and one is mislabeled, consensus wins.
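The majority vote itself is simple. A toy sketch with hypothetical per-file detection results (not real csvmedic output):

```python
from collections import Counter

# Twelve monthly files: one file's encoding was misdetected,
# but the majority result carries it.
detections = [("utf-8", ";")] * 11 + [("latin-1", ";")]

encoding = Counter(enc for enc, _ in detections).most_common(1)[0][0]
delimiter = Counter(sep for _, sep in detections).most_common(1)[0][0]
print(encoding, delimiter)  # utf-8 ;
```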

What it doesn't do

csvmedic is not trying to be a data cleaning framework. It does exactly one thing: get a messy CSV into a clean DataFrame with correct types. It doesn't clean column names, remove duplicates, or handle missing value imputation. It's the first line of your pipeline, not the whole pipeline.

Technical decisions

A few choices that might matter to you:

Everything is read as dtype=str first. csvmedic never lets pandas do type inference. It reads everything as strings, runs its own detection, and then applies conversions explicitly. This is the only way to prevent silent data loss (like leading zeros).
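The difference is easy to reproduce with pandas alone (sample data assumed):

```python
import io
import pandas as pd

raw = "product_id,zip_code\n00742,01234\n90210,90210\n"

naive = pd.read_csv(io.StringIO(raw))            # pandas infers int, zeros vanish
safe = pd.read_csv(io.StringIO(raw), dtype=str)  # everything stays a string

print(naive["product_id"].tolist())  # [742, 90210]
print(safe["product_id"].tolist())   # ['00742', '90210']
```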

No dateutil.parser. It's too permissive (hand it a bare "12" and it happily returns a full date, borrowing today's month and year) and too slow for batch column analysis. csvmedic uses strptime with explicit format strings, which is 50-100x faster.
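The strictness half of that trade-off is easy to demonstrate with the stdlib alone:

```python
from datetime import datetime

# strptime with an explicit format either matches the expected shape
# or raises ValueError. There is no silent guessing.
parsed = datetime.strptime("01.03.2025", "%d.%m.%Y").date()
print(parsed)  # 2025-03-01

try:
    datetime.strptime("hello", "%d.%m.%Y")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```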

Confidence thresholds. Every detection has a confidence score from 0 to 1. Below 0.75 (configurable), the column stays as a string. You can tighten this to 0.9 if you want csvmedic to be more conservative, or loosen it to 0.6 if you trust the heuristics.

Try it

pip install csvmedic
import csvmedic

df = csvmedic.read("your_messy_file.csv")
print(df.diagnosis)

MIT licensed. Pure Python. Two dependencies (pandas + charset-normalizer).

I'd love feedback — especially on edge cases that the date disambiguation misses. Open an issue or drop a comment here.
