I built a CLI that hashes your ML accuracy claims before the experiment runs
Last month, a customer told me our model's accuracy on their data was 71%, not the 94% we had shipped on the landing page.
I went back to the eval notebook. The threshold was still 0.94. The test set was named the same thing. But somewhere in the last three weeks, somebody had "refreshed" the test set, somebody else had tightened the metric definition, and the original 94% was now unreproducible. Not anybody's fault, exactly — just nobody had written down the contract before running the experiment.
That night I started building falsify. Three days later I shipped it.
This post is what I built, why I built it that small, and the one Python function that does most of the work.
The problem in one sentence
If you can change the spec after seeing the result, your accuracy claim is not falsifiable. And if it is not falsifiable, it is not really a claim — it is marketing.
Psychology and medicine figured this out the hard way and invented pre-registration. You write down the prediction, the threshold, and the analysis plan, hash it, timestamp it, and you cannot move it later without everyone knowing.
ML never adopted any of this. A git commit is the closest thing most teams have, and git commit --amend followed by a force-push will quietly erase the receipt.
So I wrote a CLI that does the smallest possible version of pre-registration: canonicalize a YAML spec, SHA-256 it, lock the hash, and refuse to let it move.
What "the smallest possible version" actually looks like
# falsify.yaml
claim:
  metric: accuracy
  threshold: 0.94
  dataset: customer_eval_v3
  dataset_sha256: 4f1a8b2c...
  model: ranker-7b-2026q1
  test_n: 1200
  created_at: 2026-04-28T19:45:00Z
That is the contract. The CLI workflow is three commands:
pip install falsify
falsify lock falsify.yaml # writes a .lock file with the hash
falsify check falsify.yaml --result actual_accuracy=0.91
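One spec field is worth calling out: dataset_sha256 pins the eval data itself, so a quietly "refreshed" test set can't hide behind an unchanged name. Filling it in is a few lines of stdlib Python (a sketch, not part of falsify; the dataset filename here is made up):

import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file so large eval sets never have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(file_sha256("customer_eval_v3.jsonl"))  # hypothetical filename; paste into dataset_sha256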
Exit codes are the API:
- 0 — claim verified
- 10 — claim falsified (you missed the threshold, but cleanly)
- 3 — tamper detected (someone edited the spec after lock)
- 11 — spec invalid
10 and 3 being different exit codes is the whole point. "We didn't hit the number" is a different thing from "we moved the number."
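Because the exit code is the whole interface, a gate script never has to parse output. Here is a minimal wrapper sketch (not falsify source, just standard subprocess around the documented codes):

import subprocess
import sys

# Exit codes as documented above: 0 verified, 10 falsified, 3 tampered, 11 invalid.
MEANINGS = {
    0: "claim verified",
    10: "claim falsified: missed the threshold, cleanly",
    3: "tamper detected: spec edited after lock",
    11: "spec invalid",
}

proc = subprocess.run(
    ["falsify", "check", "falsify.yaml", "--result", "actual_accuracy=0.91"]
)
print(MEANINGS.get(proc.returncode, f"unknown exit code {proc.returncode}"))
sys.exit(proc.returncode)  # propagate, so CI fails on anything nonzero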
The one function that matters
The reason this works at all is YAML canonicalization. JSON looks canonical but isn't — key order, whitespace, and unicode forms can all drift while the document stays "the same." YAML is worse by default, but easy to canonicalize once you commit to a few rules.
Here is the actual hashing function from the source. It is small on purpose:
import hashlib
import unicodedata

import yaml  # PyYAML


def canonical_sha256(spec_path: str) -> str:
    """Return SHA-256 of a canonicalized YAML spec.

    Canonicalization rules:
    - Parse the document, drop comments and anchors
    - Recursively sort all mapping keys
    - Normalize all strings to NFC unicode
    - Re-emit as UTF-8 with LF line endings, no trailing whitespace
    - Hash the bytes
    """
    with open(spec_path, "rb") as f:
        data = yaml.safe_load(f)

    def normalize(node):
        if isinstance(node, dict):
            return {
                unicodedata.normalize("NFC", k): normalize(v)
                for k, v in sorted(node.items())
            }
        if isinstance(node, list):
            return [normalize(x) for x in node]
        if isinstance(node, str):
            return unicodedata.normalize("NFC", node)
        return node

    canonical = yaml.safe_dump(
        normalize(data),
        sort_keys=True,
        allow_unicode=True,
        default_flow_style=False,
        line_break="\n",
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
That is the entire trust primitive. Everything else in the 3925-line file — the lock file format, the CI integration, the tamper detection, the schema validation — is plumbing around this one function.
The reason it has to be exactly this strict: any wiggle room (key order, trailing whitespace, BOM, unicode form) is a place where someone can quietly change the spec and produce a "matching" hash. Canonicalize once, hash once, never look back.
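If you want to see the property rather than take my word for it, paste canonical_sha256 into a file and run this sketch: two byte-different specs that say the same thing collide on purpose, and a moved number does not.

import os
import tempfile

def write_spec(text: str) -> str:
    # Demo helper only: dump a YAML string to a temp file and return its path.
    fd, path = tempfile.mkstemp(suffix=".yaml")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(text)
    return path

a = write_spec("claim:\n  threshold: 0.94\n  metric: accuracy\n")  # keys in one order
b = write_spec("claim:\n  metric: accuracy\n  threshold: 0.94\n")  # same claim, reordered
c = write_spec("claim:\n  metric: accuracy\n  threshold: 0.93\n")  # a real change

assert canonical_sha256(a) == canonical_sha256(b)  # formatting drift: same hash
assert canonical_sha256(a) != canonical_sha256(c)  # moved threshold: different hash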
The CI moment
The point of all of this is the moment a teammate edits the spec after lock. Maybe they have a good reason. Maybe they don't. Either way, you want the system to notice.
# .github/workflows/eval.yml
- name: verify accuracy claim
  run: |
    falsify check falsify.yaml --result-file results.json
If anyone touches falsify.yaml after the lock, the action exits with code 3 and the PR cannot merge. The lie is blocked at the filesystem level, not by trust.
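Under the hood, the tamper check can only be one thing: recompute the canonical hash and compare it to the locked one. A conceptual sketch (not the real source; the one-hash-per-line lock format is invented here for illustration):

import sys

def check_lock(spec_path: str, lock_path: str) -> None:
    # Hypothetical lock format: the locked hex digest on a single line.
    with open(lock_path, encoding="utf-8") as f:
        locked = f.read().strip()
    if canonical_sha256(spec_path) != locked:
        print("tamper detected: spec no longer matches its lock", file=sys.stderr)
        sys.exit(3)  # distinct from 10: the number moved, not the model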
What I learned in three days
A few things surprised me while building this:
YAML canonicalization is most of the value. I spent way more time on the canonicalizer than on anything else. Every "clever" optimization I tried later turned out to be a place where two byte-different YAMLs produced the same hash. Boring is correct.
Exit codes are an API. I almost shipped with just 0 and 1. Splitting "falsified" from "tampered" was the single biggest jump in how teams reacted to it. People immediately understood the difference.
One file is a feature. I kept resisting the urge to split it into a package. Auditors and skeptical SREs read single-file Python CLIs in one sitting. They do not read packages.
Dogfooding is non-negotiable. falsify locks its own test claims with falsify. The honesty badge on the README is generated by the tool itself, on its own metrics. If a tool that locks claims cannot lock its own, why would you trust it?
Agents change what one person can ship in a weekend. I built this solo in three days with Claude Opus 4.7 in the loop — pair programming, eval generation, doc drafting, the whole pipeline. The 518 tests and the YAML canonicalizer corner cases would have been a two-week solo grind without it. The actual design decisions were still mine; the agent just made the cost of being thorough a lot lower.
Try it
pip install falsify
falsify init
- Repo: https://github.com/sk8ordie84/falsify
- 90-second demo: https://youtu.be/vVZTNeak5PA
- Site: https://falsify.dev
- PyPI: https://pypi.org/project/falsify/
Single file, MIT, Python 3.11+, stdlib plus pyyaml. If you ship any number followed by a percent sign, lock it before the experiment runs. It costs 30 seconds and saves the meeting where someone has to explain why the number changed.