I built a data-contract validator in pure Python (no pandas, no PyYAML) and it caught a 30% revenue ghost

#python #datascience #opensource #dataengineering

A few months ago I spent the better part of a day chasing a bug that turned out not to be a bug at all. A downstream dashboard showed revenue had jumped 30% overnight. No deploys, no schema changes, nothing in the logs. After far too long I found it: an upstream system had started sending a total column that no longer equaled subtotal + tax. The pipeline didn't crash. The data just lied, quietly, and everything downstream believed it.

That's the thing about data bugs. They rarely throw exceptions. A status field grows a new typo'd value. A join key starts producing orphans. A nullable column that was "never actually null in practice" suddenly is. None of it crashes anything — it just rots the numbers people make decisions on.

So I built DataPact: a small framework for writing down what your data is supposed to look like, and then enforcing it. It's a data quality and data-contract validation tool, and the whole thing runs on the Python standard library. No pandas, no PyYAML, no network calls.

Live demo report: https://hajirufai.github.io/datapact/report.html
Landing page: https://hajirufai.github.io/datapact/
Source: https://github.com/hajirufai/datapact

The idea: contracts, not assertions scattered everywhere

Most teams already validate data — but it's usually a pile of ad-hoc assert df["x"].notna().all() lines buried in notebooks and DAGs. Nobody can answer "what are the rules for the orders table?" without grepping three repos.

A data contract flips that. You write the rules down in one declarative document — column types, null rules, ranges, allowed sets, regexes, cross-column math, referential integrity — version it in git, and let producers and consumers share it. DataPact then validates any batch against that contract and tells you, precisely, what broke.

Here's a contract in DataPact's YAML-lite format:

name: orders
version: 1.0
strictness: lenient
columns:
  - name: order_id
    type: int
    nullable: false
    checks:
      - kind: column_values_unique
        severity: error
  - name: status
    type: str
    checks:
      - kind: column_values_in_set
        kwargs: { values: [new, paid, shipped, refunded] }
expectations:
  - kind: multicolumn_sum_to_equal
    kwargs: { columns: [subtotal, tax], total_column: total, tolerance: 0.01 }

That last expectation is the exact rule that would have caught my 30% revenue ghost. subtotal + tax must equal total, within a cent.
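To make the rule concrete, here is a minimal sketch of how a sum-to-equal check can be computed with nothing but the standard library. This is an illustration, not DataPact's actual implementation: the function name `rows_violating_sum` and the sample rows are hypothetical, and it assumes every value is present and numeric.

```python
def rows_violating_sum(rows, columns, total_column, tolerance=0.01):
    """Return the indexes of rows where sum(columns) is not within tolerance of total_column."""
    bad = []
    for i, row in enumerate(rows):
        expected = sum(float(row[c]) for c in columns)
        actual = float(row[total_column])
        # The tiny epsilon absorbs binary floating-point noise right at the boundary.
        if abs(expected - actual) > tolerance + 1e-9:
            bad.append(i)
    return bad


# Hypothetical sample data: CSV-style rows where every value is a string.
rows = [
    {"subtotal": "100.00", "tax": "8.00", "total": "108.00"},  # consistent
    {"subtotal": "100.00", "tax": "8.00", "total": "140.40"},  # total inflated by 30%
]
print(rows_violating_sum(rows, ["subtotal", "tax"], "total"))
```

A real check would also need a policy for nulls and non-numeric values, which this sketch leaves out.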

"Zero dependencies" wasn't a vanity thing

I want to be honest about why this is stdlib-only, because it sounds like a flex and it mostly isn't. Two real reasons:

First, a lot of data platforms are locked down. You can't always pip install half of PyPI on the box where the pipeline runs. A validation tool that drops in with nothing but Python is genuinely easier to adopt than one that drags pandas + pyarrow + a YAML parser behind it.

Second, I wanted to actually understand the problem instead of gluing libraries together. Writing my own YAML reader and type-inference ladder taught me more about the messy reality of "what type is this column" than any wrapper would have.

The downside is I had to write a YAML parser. Which brings me to the most annoying bug of the whole project.

The escape-sequence bug that broke every email

DataPact ships its own tiny YAML reader — a strict subset: maps, lists, scalars, comments, quotes, flow lists. No arbitrary-object deserialization, which is a nice security property for free.

My email validation regex in the contract looked like this:

checks:
  - kind: column_values_match_regex
    kwargs:
      pattern: "^[^@ ]+@[^@ ]+\\.[^@ ]+$"

When I ran it, every single email failed, including obviously valid ones. 14 out of 14, 100%. My first naive parser just stripped the surrounding quotes and handed back the raw string — so \\. stayed as a literal backslash-backslash-dot. In the compiled regex that means "a literal backslash followed by any character," and no email on earth has a backslash in it.

The fix was to make the parser do what real YAML does: process escape sequences inside double-quoted strings, while leaving single-quoted strings literal.

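# Escape sequences recognized inside double-quoted scalars (a small subset of YAML's).
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", '"': '"', "\\": "\\", "/": "/", "0": "\0"}

def _unescape_double(s: str) -> str:
    """Resolve backslash escapes in the body of a double-quoted string."""
    out, i = [], 0
    while i < len(s):
        ch = s[i]
        if ch == "\\" and i + 1 < len(s):
            # Known escape: emit its value. Unknown escape: keep the backslash and the character.
            out.append(_ESCAPES.get(s[i + 1], "\\" + s[i + 1]))
            i += 2
        else:
            # Ordinary character, or a lone trailing backslash.
            out.append(ch)
            i += 1
    return "".join(out)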

After that, the regex compiled to \. and the dirty email got flagged on its own (not-an-email has no @, so it correctly fails) while real addresses passed. The lesson, for the hundredth time in my career: escaping rules are never as simple as you hope, and the bug always shows up as "100% of things fail" rather than something subtle.
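The difference is easy to reproduce with the standard `re` module alone. The sketch below compares the pattern as the naive parser returned it (two backslashes, then a dot) with the pattern after escape processing (one backslash, then a dot). The variable names and the sample address `ada@example.com` are made up for illustration.

```python
import re

# What the naive parser handed back: backslash, backslash, dot.
raw_from_yaml = r"^[^@ ]+@[^@ ]+\\.[^@ ]+$"
# What the parser returns once escapes are processed: backslash, dot.
unescaped = r"^[^@ ]+@[^@ ]+\.[^@ ]+$"

for label, pattern in [("naive", raw_from_yaml), ("unescaped", unescaped)]:
    regex = re.compile(pattern)
    for candidate in ["ada@example.com", "not-an-email"]:
        print(label, candidate, bool(regex.match(candidate)))
```

With the naive pattern, the valid address is rejected because the regex demands a literal backslash after the domain. With the unescaped pattern it matches, and `not-an-email` is rejected in both cases because it has no @.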

Type inference, and why "1" is not a boolean

The other rabbit hole was type inference. DataPact reads CSVs where everything is a string, so it has to guess: is "42" an int, is "4.5" a float, is "2026-01-04" a date?

I built an inference ladder — try bool, then int, then float, then datetime, then date, then fall back to string. And immediately got bitten. My parse_bool accepted "1" and "0" as True/False (reasonable when you're explicitly coercing). But during inference, that meant a column full of 1s and 0s got classified as boolean — and then my stdev check refused to run on it because "this isn't a numeric column."

The fix was to make inference conservative. Only unambiguous words (true, false, yes, no) infer as boolean. "1" and "0" are far more likely to be integers or category codes, so they infer as int, while "t" and "f" fall through the ladder to string:

if isinstance(value, str):
    if value.strip().lower() in ("true", "false", "yes", "no"):
        return "bool"
    if parse_int(value) is not None:
        return "int"
    if parse_float(value) is not None:
        return "float"
    ...

parse_bool is still lenient when you explicitly tell DataPact a column is boolean — but it no longer guesses bool from a digit. Small change, but it's the kind of default that quietly saves users from a confusing afternoon.
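Here is a rough sketch of how a conservative ladder like this can be lifted from single values to a whole column. It is a simplification under stated assumptions, not DataPact's code: `infer_value_type` and `infer_column_type` are hypothetical names, the datetime and date rungs are omitted, empty strings are treated as nulls, and Python's built-in `int()` and `float()` stand in for the real parsers.

```python
def infer_value_type(value: str) -> str:
    """Classify one CSV cell; only unambiguous words count as booleans."""
    text = value.strip()
    if text.lower() in ("true", "false", "yes", "no"):
        return "bool"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "str"


def infer_column_type(values) -> str:
    """Pick one type for a column, ignoring empty cells."""
    kinds = {infer_value_type(v) for v in values if v.strip() != ""}
    if not kinds:
        return "str"            # nothing to go on
    if len(kinds) == 1:
        return kinds.pop()      # every non-empty cell agrees
    if kinds == {"int", "float"}:
        return "float"          # ints widen safely to floats
    return "str"                # any other mix falls back to string


print(infer_column_type(["1", "0", "1"]))    # a 0/1 flag column stays numeric
print(infer_column_type(["yes", "no", ""]))  # unambiguous words infer as bool
```

One caveat with the stand-in parsers: Python's `float()` also accepts strings such as "nan" and "inf", so a stricter parser is probably what you want in practice.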

The architecture

The whole thing is a handful of small, pure pieces:

flowchart LR
    A[CSV / JSON / JSONL<br/>SQLite / records] --> B[Dataset<br/>+ type inference]
    C[Contract<br/>YAML / JSON / builder] --> D[Validation Engine]
    B --> D
    D --> E[ValidationReport]
    E --> F[CLI exit code]
    E --> G[HTML report]
    E --> H[guard / raise]
    B --> P[Profiler] --> C

Sources normalize CSV, JSON, JSONL, SQLite and plain lists-of-dicts into one Dataset view.
Expectations are a registry of pure functions — one per check kind. There are 23 of them across column-level (not_null, unique, between, in_set, match_regex, mean_between...), table-level (row_count_between, compound_columns_unique...) and cross-column (a > b, sum_to_equal, referential integrity).
The engine runs every expectation, applies strictness rules for unexpected columns, and builds a structured ValidationReport.
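A registry of pure functions keyed by check kind can be sketched in a few lines. The version below is a generic illustration of the pattern, not DataPact's internals: the `expectation` decorator, the `run` helper, the kind name `column_values_not_null`, and the convention of returning failing row indexes are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Maps a check kind (the string used in a contract) to the function that implements it.
EXPECTATIONS: Dict[str, Callable[..., List[int]]] = {}


def expectation(kind: str):
    """Decorator that registers a pure check function under the given kind."""
    def register(fn):
        EXPECTATIONS[kind] = fn
        return fn
    return register


@expectation("column_values_not_null")
def not_null(rows, column):
    # Returns the indexes of rows where the column is missing or empty.
    return [i for i, r in enumerate(rows) if r.get(column) in (None, "")]


@expectation("column_values_in_set")
def in_set(rows, column, values):
    allowed = set(values)
    return [i for i, r in enumerate(rows) if r.get(column) not in allowed]


def run(kind, rows, **kwargs):
    """Look up a check by kind and apply it with the contract's kwargs."""
    return EXPECTATIONS[kind](rows, **kwargs)


rows = [{"status": "paid"}, {"status": "payed"}, {"status": ""}]
print(run("column_values_in_set", rows, column="status", values=["new", "paid"]))
print(run("column_values_not_null", rows, column="status"))
```

Because each check is a pure function of the rows and its kwargs, adding a new kind is one decorated function and nothing else changes.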

Every column-level check supports a mostly= tolerance, so you can say "this should be non-null in at least 99% of rows" instead of demanding perfection — real data is messy and a single bad row shouldn't always fail a 10-million-row batch.
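The arithmetic behind a tolerance like this is small enough to show. This is a sketch of the general idea, assuming `mostly` means "the minimum fraction of rows that must pass". The function name `passes_with_mostly` is hypothetical, and treating an empty batch as passing is a choice made here, not something taken from DataPact.

```python
def passes_with_mostly(failing_rows: int, total_rows: int, mostly: float = 1.0) -> bool:
    """Return True when at least `mostly` (a fraction from 0 to 1) of the rows pass."""
    if total_rows == 0:
        return True  # assumption: an empty batch cannot violate a per-row rule
    passing_fraction = (total_rows - failing_rows) / total_rows
    return passing_fraction >= mostly


# One bad row in ten million is fine at mostly=0.99, but fails the default of 1.0.
print(passes_with_mostly(1, 10_000_000, mostly=0.99))
print(passes_with_mostly(1, 10_000_000))
```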

Using it: three ways

As a library, validating against a contract file:

import datapact as dp

report = dp.validate("orders.csv", dp.load_contract("orders.yaml"))
print(report.success, report.passed, report.failed)
for r in report.results:
    if not r.success:
        print(r.expectation.label(), "→", r.message)

As a pipeline gate, with a decorator that raises before bad data escapes:

from datapact import guard, DataContractError

@guard(contract)
def load_orders():
    return fetch_rows_from_somewhere()

try:
    rows = load_orders()
except DataContractError as exc:
    alert(exc.report)   # the full report is attached to the exception
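For readers who have not written a validating decorator before, the general pattern looks roughly like this. It is a standalone sketch and deliberately uses different names (`guard_with`, `ContractError`, `no_negative_subtotals`) so it is not mistaken for DataPact's implementation. It assumes the validator is any callable that returns a list of failure messages.

```python
import functools


class ContractError(Exception):
    """Raised when the wrapped function returns rows that fail validation."""

    def __init__(self, failures):
        super().__init__(f"{len(failures)} contract check(s) failed")
        self.failures = failures  # keep the details for the caller to inspect


def guard_with(validate):
    """Build a decorator that validates a function's return value before handing it on."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            rows = fn(*args, **kwargs)
            failures = validate(rows)
            if failures:
                raise ContractError(failures)
            return rows
        return wrapper
    return decorator


def no_negative_subtotals(rows):
    # Hypothetical validator: one message per offending row.
    return [f"row {i}: negative subtotal" for i, r in enumerate(rows) if r["subtotal"] < 0]


@guard_with(no_negative_subtotals)
def load_orders():
    return [{"subtotal": 10.0}, {"subtotal": -5.0}]


try:
    load_orders()
except ContractError as exc:
    print(exc, exc.failures)
```

The point of the pattern is that bad rows never reach the caller: the function either returns validated data or raises with the evidence attached.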

As a CI check, where a contract breach fails the build like a unit test:

datapact validate orders.csv --contract orders.yaml --fail-on error
echo $?   # 1 on breach

That --fail-on flag is the part I'm most happy with. It makes data quality a gate, not a dashboard nobody looks at. My GitHub Actions workflow actually runs two jobs: one proves the clean sample passes, and one proves the dirty sample fails — because a gate that never rejects anything is worse than no gate at all.
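The gate logic itself reduces to comparing severities against a threshold. The sketch below shows one way to map failed checks to a process exit code. Only the `error` severity and the `--fail-on error` usage come from the examples above; the `info` and `warning` levels, their ordering, and the `exit_code` function are assumptions for illustration.

```python
# Assumed ordering of severities, lowest to highest. Only "error" appears in the contract above.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2}


def exit_code(failed_severities, fail_on="error"):
    """Return 1 if any failed check is at or above the fail_on severity, else 0."""
    threshold = SEVERITY_RANK[fail_on]
    breached = any(SEVERITY_RANK[s] >= threshold for s in failed_severities)
    return 1 if breached else 0


print(exit_code(["warning"], fail_on="error"))           # warnings alone do not fail the build
print(exit_code(["warning", "error"], fail_on="error"))  # one error is enough
```

In a CLI entry point, the result would be passed to `sys.exit()` so the CI runner sees the failure.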

The report is the part people actually see

A validation result is only useful if a human can read it. So every run renders to a single self-contained HTML file — no external assets, no JavaScript framework — with failures sorted to the top and a full data profile (null rates, distinct counts, distributions) underneath.

I built the live demo report from a deliberately dirty orders file with ten injected problems: a duplicate order ID, a null email, a malformed email, an unknown status value, a bad date format, a negative subtotal, a quantity of 40, an invalid country code, and three rows where the totals don't add up. DataPact catches all ten — 41 expectations, 31 passed, 10 failed — and the report lays them out so you can see exactly which rows and values broke each rule.

Where it fits

DataPact is rule-based contract validation. It answers one question well: "does this batch obey the rules we agreed on?" That's deliberately different from statistical drift detection (is the distribution shifting?) or ETL orchestration (move the data around). It complements both — you'd run DataPact as the gate between an extract step and a load step, or in CI on your fixtures.

It's about 3,000 lines of pure Python with 141 tests, all stdlib unittest, no test dependencies either. If you've ever lost an afternoon to data that lied to you, give it a look — and if you find a rule it can't express yet, that's exactly the kind of issue I want to see.