
SEN LLC


What's Actually In This CSV? A Zero-Dep Python Profiler

The first question you ask a CSV that just landed in Slack is "what's in this file?". Pandas makes you answer that twice. Here's a tiny stdlib-only CLI that does it in one go — and the thing it gets right is stubborn, deliberate type inference.

📦 GitHub: https://github.com/sen-ltd/csv-profile


The problem I kept having

Someone sends you a CSV. You don't know the schema, you don't know the encoding, you don't know whether it's comma- or tab-separated, and the first thing you want is a one-shot summary:

  • How many rows?
  • What are the columns and what type is each?
  • Which columns have nulls and how many?
  • Min, max, mean for numerics. Min/max length for strings. Min/max date for dates.
  • A couple of example values so you can eyeball the shape.

Pandas almost does this. df.describe() gives you count / mean / std / min / 25% / 50% / 75% / max for numeric columns, and count / unique / top / freq for object columns. Two different tables. Non-numeric columns are skipped unless you pass include="all". And you had to import pandas, which on a cold Python is about a second of startup before anything happens.
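
For reference, the two-table split looks like this with plain pandas (nothing csv-profile-specific; any small frame shows it):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "name": ["alice", "bob", "carol"]})

# Numeric columns only: count / mean / std / min / quartiles / max.
print(df.describe())

# Object columns included too, with NaN padding in the rows that don't apply.
print(df.describe(include="all"))
```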

csvkit's csvstat is closer in spirit — it's even a CLI — but csvkit has been effectively unmaintained for years, predates Python 3's nicer stdlib csv, and pulls in a dependency graph nobody wants in a one-shot tool.

So I wrote csv-profile. Pure Python, stdlib only, runs in a 57 MB Alpine container. Drop it on a file and it tells you what's in there.

rows=4  cols=5  delimiter=comma  encoding=utf-8-sig  nulls=2  bytes≈120

  id      int     nulls=0/4 (0%)  distinct=4
    min=1, max=4, mean=2.5, median=2.5, stdev=1.118
    examples=['1', '2', '3']
  name    string  nulls=0/4 (0%)  distinct=4
    len=3..5
    examples=['alice', 'bob', 'carol']
  age     int     nulls=1/4 (25%)  distinct=3
    min=25, max=45, mean=33.33, median=30, stdev=8.498
  active  bool    nulls=0/4 (0%)  distinct=2
    most_common='true' (×3)
  signup  date    nulls=1/4 (25%)  distinct=3
    min=2024-01-15, max=2024-03-10

That's the shape. Now the interesting part: how it decides signup is a date and not a string.

Type inference is a heuristic, not a solver

Every CSV profiler has to answer the same question for each column: "what type are these values?" And there are two honest answers you could design toward.

The probabilistic answer: "this column is 95% integer, 5% string — it's mostly int". That's what pandas does when you read_csv with default settings; it'll happily coerce the column to a numeric dtype and silently turn the NA-looking outliers into NaN. Technically correct, subtly terrifying.

The all-or-nothing answer: "every single non-null value in this column parsed cleanly as int, so it's int. If even one didn't, it's a string." Less clever, more boring, but it never hides data corruption.

csv-profile picks the second one. This is the rule I want a profiler to enforce, because the whole reason I'm running a profiler is to find the weird rows — and "mostly int" is exactly how you lose them. The value of the profile is that it tells the truth about the data, not a flattering summary of it.

Here's the entire inference core:

TYPE_LADDER = ("int", "float", "bool", "date", "datetime", "string")

def infer_column_type(values, *, empty_as_null=True):
    alive = {"int", "float", "bool", "date", "datetime"}
    any_non_null = False

    for raw in values:
        if is_null(raw, empty_as_null=empty_as_null):
            continue
        any_non_null = True

        if "int" in alive and not _try(parse_int, raw):
            alive.discard("int")
        if "float" in alive and not _try(parse_float, raw):
            alive.discard("float")
        if "bool" in alive and not _try(parse_bool, raw):
            alive.discard("bool")
        if "date" in alive and not _try(parse_date, raw):
            alive.discard("date")
        if "datetime" in alive and not _try(parse_datetime, raw):
            alive.discard("datetime")

        if not alive:
            return "string"

    if not any_non_null:
        return "string"

    for t in TYPE_LADDER:
        if t in alive:
            return t
    return "string"

Walk the column once. Keep a set of candidate types that are still alive. For each non-null value, try to parse it with every candidate that's still alive — if the parse fails, eliminate that candidate. The moment the alive set empties out we short-circuit to "string" and stop reading. At the end, return the strictest type still alive, following the ladder.

A few consequences worth calling out:

  • Ints parse as floats too, so an all-int column has both int and float alive at the end. The ladder picks int because it comes first — i.e. it's stricter. Good.
  • 0 and 1 parse as both int and bool. Ladder again: int wins, because a column of pure 0/1 is much more commonly a numeric flag than a boolean in my experience. The tradeoff is: if you have a column labelled active and it's all 0/1, you'll see it as int. Re-running with --columns active and reading the most_common tells you the shape. Fine.
  • An empty column (every value null) reports as string β€” there's no signal to do otherwise, and string is the safest default for downstream tooling.
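
To make the first two bullets concrete, here's a self-contained toy version of the elimination loop with deliberately simplified regex parsers (not the real ones), showing how the ladder breaks ties:

```python
import re

_CHECKS = {
    "int": lambda v: re.fullmatch(r"-?\d+", v) is not None,
    "float": lambda v: re.fullmatch(r"-?\d+(\.\d+)?", v) is not None,
    "bool": lambda v: v.lower() in {"true", "false", "0", "1"},
}

def toy_infer(values):
    alive = set(_CHECKS)
    for v in values:
        alive = {t for t in alive if _CHECKS[t](v)}
        if not alive:
            return "string"  # short-circuit, like the real loop
    for t in ("int", "float", "bool"):  # ladder: strictest first
        if t in alive:
            return t
    return "string"

print(toy_infer(["0", "1", "1"]))    # int: beats bool and float via ladder order
print(toy_infer(["1.0", "2.5"]))     # float
print(toy_infer(["true", "false"]))  # bool
print(toy_infer(["1", "oops"]))      # string
```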

The parsers themselves are strict and blunt. Here's parse_int:

def parse_int(value: str) -> int:
    stripped = value.strip()
    if not stripped:
        raise ValueError("empty")
    if "." in stripped or "e" in stripped.lower() or "_" in stripped:
        raise ValueError("not an int")
    return int(stripped)

Python's int() is too permissive for this job. int("0x10") doesn't work — good — but int(" 42 ") does, and we already stripped. More to the point, I want parse_int("1.0") to fail, because a column of 1.0 2.0 3.0 is a float column with no decimal info, not an int column with some decoration. Letting those through would mean the profile says int when the file says something else, and that's exactly the silent lie the all-or-nothing rule is trying to prevent.
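
A few edge cases that fall out of those guards (reproducing parse_int from above so the snippet runs standalone):

```python
def parse_int(value: str) -> int:
    stripped = value.strip()
    if not stripped:
        raise ValueError("empty")
    if "." in stripped or "e" in stripped.lower() or "_" in stripped:
        raise ValueError("not an int")
    return int(stripped)

def accepts(v):
    # Convenience wrapper: did parse_int take the value?
    try:
        parse_int(v)
        return True
    except ValueError:
        return False

print(accepts(" 42 "))   # True: surrounding whitespace is stripped first
print(accepts("-7"))     # True: signs are fine
print(accepts("1.0"))    # False: float-shaped, not an int
print(accepts("1e3"))    # False: scientific notation rejected by the guard
print(accepts("1_000"))  # False: underscore literals rejected by the guard
print(accepts("0x10"))   # False: int() itself refuses hex strings
```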

parse_float similarly refuses nan / inf. Real CSV exports almost never contain them; when they do, it's almost always a bug upstream, and I'd rather the profile say "string" so you notice.
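
parse_float isn't shown in the post; here's a sketch consistent with that description (the body is an assumption, only the nan/inf refusal is stated):

```python
import math

def parse_float(value: str) -> float:
    stripped = value.strip()
    if not stripped:
        raise ValueError("empty")
    parsed = float(stripped)  # raises ValueError on non-numeric input
    # float() happily accepts "nan", "inf", "-Infinity"; refuse them so a
    # buggy upstream export surfaces as a string column instead.
    if math.isnan(parsed) or math.isinf(parsed):
        raise ValueError("nan/inf rejected")
    return parsed
```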

parse_date only accepts zero-padded YYYY-MM-DD. strptime would quietly accept 2024-1-1 if I let it, but then round-tripping through the profile would silently normalize it to 2024-01-01, and I don't want the profile mutating its inputs even in intermediate state.
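
A sketch of parse_date under that rule (assumed implementation): a regex gate on the exact zero-padded shape, before any real date parsing gets a chance to be lenient.

```python
import re
from datetime import date

_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def parse_date(value: str) -> date:
    stripped = value.strip()
    # Gate on the exact zero-padded shape first; strptime alone would
    # accept "2024-1-1" and silently normalize it.
    if not _DATE_RE.fullmatch(stripped):
        raise ValueError("not YYYY-MM-DD")
    return date.fromisoformat(stripped)  # still validates month/day ranges
```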

Encoding detection, honestly

CSVs come in every encoding that was ever a good idea and a few that weren't. Standard solutions involve chardet or charset-normalizer, which pull in real dependencies and a whole heuristic scoring pass over the bytes. For a one-shot profiler, that's overkill.

csv-profile does the cheap thing:

_FALLBACK_ENCODINGS = ("utf-8-sig", "utf-8", "latin-1")

def _decode_bytes(raw, encoding):
    if encoding is not None:
        return raw.decode(encoding), encoding
    for enc in _FALLBACK_ENCODINGS:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ReaderError("...")

Try utf-8-sig first — it strips the BOM Excel adds on export, and for non-BOM files it behaves identically to utf-8. Then plain utf-8 (kept for symmetry). Finally, latin-1 as the no-fail bottom of the ladder: every byte 0x00–0xFF is a valid code point in latin-1, so it always "succeeds". If the file was actually Shift-JIS or GBK, the decoded text will be mojibake — but it won't crash, and the encoding=latin-1 line in the profile output is a clear signal that something is off.

This is the honest tradeoff: I will not try to detect Shift-JIS vs GBK vs Windows-1252 without adding a dependency. If you need that, the --encoding flag is your escape hatch. csv-profile users.csv --encoding shift-jis and you're done. I'd rather expose the knob clearly than pretend the tool is magic.

I hit one subtle bug in this while writing the tests. My first draft had the fallback order as ("utf-8", "utf-8-sig", "latin-1"), on the theory that plain utf-8 was "cheaper". The BOM test failed — not because utf-8 errored on a BOM'd file, but because it succeeded, with the BOM glued onto the front of the first header as \ufeff. That's why utf-8-sig goes first: it's the only one that consumes the BOM properly, and on non-BOM files there's no cost. Small lesson: "fallback" ordering needs to be sorted by strictness of success, not by alphabetical taste.
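
The failure is easy to reproduce in an interpreter: plain utf-8 decodes BOM'd bytes successfully but leaves the BOM attached to the text.

```python
raw = b"\xef\xbb\xbfid,name\n1,alice\n"  # UTF-8 BOM + a tiny CSV

as_utf8 = raw.decode("utf-8")
as_sig = raw.decode("utf-8-sig")

print(repr(as_utf8[:3]))  # '\ufeffid' -- BOM glued onto the first header
print(repr(as_sig[:2]))   # 'id'      -- BOM consumed

# On non-BOM input the two codecs agree, so utf-8-sig-first costs nothing.
assert b"id,name".decode("utf-8-sig") == b"id,name".decode("utf-8")
```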

Delimiter sniffing

The stdlib csv.Sniffer handles this. Give it the first ~1 KB and it tries a handful of delimiters (comma, tab, semicolon, pipe) against the sample, picking the one that produces the most consistent column count. It's not perfect, but for well-formed files it nails the answer, and for weird files the --delimiter override takes over.

def _sniff_delimiter(sample):
    if not sample.strip():
        return ","
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=",\t;|")
        return dialect.delimiter
    except csv.Error:
        return ","

Empty-sample default and sniff-failure default are both comma, because CSV is the name of the format. If I'm wrong about that, --delimiter is there.
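
A quick sanity check of the sniffer against a tab-separated sample, stdlib only:

```python
import csv

tsv = "id\tname\tage\n1\talice\t30\n2\tbob\t25\n"

# Restricting the candidate set, as the tool does, keeps the sniff predictable.
dialect = csv.Sniffer().sniff(tsv, delimiters=",\t;|")
print(repr(dialect.delimiter))  # '\t'
```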

The markdown formatter (for PR comments)

I added a markdown output mode mostly because I kept pasting profile results into PR descriptions and formatting them by hand. Now:

def format_markdown(result):
    out = []
    out.append(f"**{result.row_count} rows × {result.column_count} cols** — ...")
    out.append("")
    out.append("| Column | Type | Nulls | Distinct | Min | Max | Example |")
    out.append("|---|---|---|---|---|---|---|")
    for c in result.columns:
        example = (c.examples[0] if c.examples else "").replace("|", "\\|")
        out.append(
            f"| `{c.name}` | {c.inferred_type} | "
            f"{c.null_count}/{c.count} | {c.distinct_count} | "
            f"{_fmt_min(c)} | {_fmt_max(c)} | `{example}` |"
        )
    return "\n".join(out)

The subtle bit is escaping pipes in example values. If the actual data contains a|b, pasting that into a markdown table breaks the row. The fix is a one-liner .replace("|", "\\|") on every user-controlled field that lands in a table cell. I almost forgot it, then wrote the test and caught it immediately:

def test_markdown_escapes_pipe_in_example():
    c = compute_column_stats("x", "string", ["a|b", "c|d"])
    ...
    assert "a\\|b" in format_markdown(r)

This is a class of bug I only ever catch via tests, because in manual usage you don't think about it until something renders wrong and you're already embarrassed in a PR thread.
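
The failure mode, in miniature: an unescaped pipe in a cell splits the row into an extra column when the table renders.

```python
row = ["x", "string", "a|b"]

naive = "| " + " | ".join(row) + " |"
escaped = "| " + " | ".join(v.replace("|", "\\|") for v in row) + " |"

print(naive)    # | x | string | a|b |   -> markdown sees four cells
print(escaped)  # | x | string | a\|b |  -> three cells, as intended
```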

What csv-profile doesn't do (the honest tradeoffs)

  • No "mostly numeric" column detection. By design. If your column has one non-numeric value, you want to see that reflected, not buried in a round percentage.
  • No preservation of quoted string-y numerics. A column of US ZIP codes ("02139", "90210") will profile as int. csv-profile reads the CSV the same way everything does: it throws away the quoting information and looks at the values. If you need zero-padded ZIPs to stay strings, you needed a stricter serialization format upstream β€” this tool won't save you.
  • No timezone-aware datetime support. 2024-01-15T10:30:00Z parses. 2024-01-15T10:30:00+09:00 does not. Full RFC 3339 parsing either means hand-rolling a parser (fragile) or pulling in dateutil (dependency). I picked "document the gap and move on".
  • In-memory. Columns are materialized in a first pass before stats get computed in a second pass. For multi-GB files, use --sample N. A true streaming implementation would need a second pass over the file anyway (because distinct-count and median aren't single-pass), so the simplicity of materializing is a real win until you hit memory pressure.
  • The stdev figures come from statistics.pstdev: population standard deviation, not sample. For a profile that's what you want: you're describing this file, not estimating a distribution.
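
That last point is quick to verify against the sample output above: the age column's non-null values are 25, 30, 45, and its stdev=8.498 matches the population formula.

```python
import statistics

ages = [25, 30, 45]  # non-null values of the age column in the sample file

print(round(statistics.pstdev(ages), 3))  # 8.498  -- divides by n
print(round(statistics.stdev(ages), 3))   # 10.408 -- divides by n-1
```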

Try it in 30 seconds

docker build -t csv-profile https://github.com/sen-ltd/csv-profile.git

mkdir -p /tmp/demo
cat > /tmp/demo/users.csv << 'CSV'
id,name,age,active,signup
1,alice,30,true,2024-01-15
2,bob,25,false,2024-02-01
3,carol,,true,2024-03-10
4,dave,45,true,
CSV

docker run --rm -v /tmp/demo:/work csv-profile users.csv
docker run --rm -v /tmp/demo:/work csv-profile users.csv --format markdown
docker run --rm -v /tmp/demo:/work csv-profile users.csv --format json | python3 -m json.tool

The image is 57 MB, non-root, and has zero third-party Python dependencies — just csv, statistics, datetime, json, argparse, and collections.Counter from the stdlib. 61 tests ship in the image; run them with docker run --rm --entrypoint pytest csv-profile -q.

Closing

Entry #130 in a 100+ portfolio series by SEN LLC. Sibling entries in the same vein:

  • csvdiff — semantic CSV diff with key-based row matching
  • imdiff — perceptual image diff with three complementary metrics

The pattern I keep hitting in these tools is the same: a small, opinionated CLI is more useful than a large permissive library, as long as the opinions are the right ones. For a CSV profiler, the right opinion is "don't lie about what's in the file". All-or-nothing inference is the concrete form of that.

Feedback welcome.
