
sk8ordie84

Why ML accuracy numbers are unfalsifiable, and what a 1287-line Python tool does about it

A few weeks ago I was reading a model card for an open-weight code model. It claimed pass@1 = 67% on HumanEval. I tried to reproduce it. I got 54%.

I went back to the model card. The metric was named, the dataset was named, the model checkpoint hash was published. Everything looked reproducible.

Except: which version of HumanEval? The original 164 problems, or the de-contaminated 161? What temperature? What seed for nucleus sampling? What was the threshold the team committed to before they ran the eval, and how do I know the published 67% is not the best of three runs at three temperatures?

I read the paper. I read the README. I read the eval harness source. I could not answer any of those questions from the published artifacts. I could only ask the authors, and they could only tell me what they remembered. And I had no way to distinguish what they remembered from what they wished they had done.

This is not a problem about that specific model card or those specific authors. It is a problem about every published ML accuracy number I have ever read.

Five failure modes that current reporting practices cannot detect

A claim like "our model achieves 91.3% accuracy on benchmark X" can be wrong, in published form, in at least these five ways, none of which leave a forensic trace:

  1. Threshold drift. The team picked the threshold after running the experiment, by looking at where their model happened to land, and reported that as if it had been the original target.
  2. Slice selection. The evaluation set was filtered after results were observed (e.g., dropping the 12 hardest examples because "they were mislabeled").
  3. Silent re-runs. Five seeds were tried; only the seed that passed was reported.
  4. Metric ambiguity. "F1" without specifying micro vs macro. "Accuracy" without specifying top-k. "Pass@1" without specifying temperature.
  5. Dataset drift. The benchmark hosted at the canonical URL changed between the experiment date and the publication date, and the team did not pin the bytes.

Each of these is consistent with current best-practice reporting. Each leaves the published number unfalsifiable: a reader cannot, even in principle, distinguish honest reporting from any of the above.

Why no infrastructure exists

Pre-registration solved this exact problem in adjacent fields: clinical trial registration before enrollment has been a condition of publication in major medical journals since 2005, and psychology adopted pre-registered Registered Reports after the replication crisis.

ML never got the equivalent. The closest thing — the ML Reproducibility Challenge — is an annual peer-driven effort to re-run published experiments. It produces excellent post-hoc analysis but does not change the publication-time commitment surface.

The 2026 regulatory window is the part that matters most for builders. The EU AI Act Article 12 requires automatic logging of evaluation events for high-risk systems. Article 18 requires 10-year retention. Both enter into force August 2, 2026. NIST AI RMF references content-addressed audit trails as a recommended control. ISO/IEC 42001:2023 mandates documented information practices that a pre-registered manifest format directly satisfies.

In other words: there is now a regulatory deadline by which "we have a tradition of reporting these numbers honestly" stops being a sufficient answer.

PRML in plain English

I drafted a small format, working draft v0.1, currently under public review. It is called PRML — Pre-Registered ML Manifest. The whole spec fits in a single YAML schema:

version: "prml/0.1"
claim_id: "01900000-0000-7000-8000-000000000000"
created_at: "2026-05-01T12:00:00Z"
metric: "accuracy"
comparator: ">="
threshold: 0.85
dataset:
  id: "imagenet-val-2012"
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
seed: 42
producer:
  id: "studio-11.co"

That is the entire required surface. Nine top-level fields. Plain text. UTF-8. YAML 1.2 strict subset (block style only, lexicographic key ordering, no comments, no flow collections).

The format defines a deterministic canonicalization. Given any logical YAML mapping with these fields, there is exactly one canonical UTF-8 byte sequence. The SHA-256 of those bytes is the manifest hash.
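The idea can be sketched in a few lines of Python. This is not the reference implementation: the exact quoting, number formatting, and trailing-newline rules are fixed by the spec, and the choices below are illustrative assumptions. What it does show is the property that matters, namely that sorted keys plus a deterministic serializer give one byte sequence, hence one hash, per logical mapping:

```python
import hashlib

def canonicalize(mapping: dict, indent: int = 0) -> str:
    # Sketch of a PRML-style canonicalizer: block style only,
    # lexicographic key ordering, no comments, no flow collections.
    lines = []
    for key in sorted(mapping):          # key insertion order never matters
        value = mapping[key]
        pad = "  " * indent
        if isinstance(value, dict):      # nested mappings stay in block style
            lines.append(f"{pad}{key}:")
            lines.append(canonicalize(value, indent + 1))
        elif isinstance(value, str):
            lines.append(f'{pad}{key}: "{value}"')
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)

def manifest_hash(mapping: dict) -> str:
    # Exactly one canonical UTF-8 byte sequence per logical mapping,
    # so the SHA-256 is well defined regardless of how the YAML was typed.
    canonical = canonicalize(mapping) + "\n"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Feeding the same manifest in with its keys shuffled produces the same digest; changing the threshold from 0.85 to 0.86 produces a different one. That is the whole trick.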

The hash is published before the experiment runs. After the experiment, an independent verifier can:

  1. Re-canonicalize the manifest.
  2. Recompute SHA-256.
  3. Compare against the published sidecar hash. If they differ, the manifest has been edited post-lock — exit code 3 (TAMPERED).
  4. Load the dataset by its content hash. Verify byte integrity.
  5. Run the metric computation under the seed. Compare against threshold.
  6. Emit 0 (PASS), 10 (FAIL), or one of the diagnostic codes.

There is no trust in the producer required at verification time. Anyone with the manifest, the dataset, and the model can reproduce the verdict offline.
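The verdict logic itself is almost trivially small, which is the point. A minimal sketch, assuming only the exit codes named above (0 PASS, 10 FAIL, 3 TAMPERED) and a plain comparator table; the function names are mine, not the falsify API:

```python
import operator

# Exit codes as described in the post: 0 PASS, 10 FAIL, 3 TAMPERED.
PASS, FAIL, TAMPERED = 0, 10, 3

COMPARATORS = {
    ">=": operator.ge, ">": operator.gt,
    "<=": operator.le, "<": operator.lt,
    "==": operator.eq,
}

def verdict(manifest: dict, recorded_hash: str,
            current_hash: str, observed: float) -> int:
    # The hash check comes first: an edited spec never
    # yields a PASS or FAIL verdict, only TAMPERED.
    if current_hash != recorded_hash:
        return TAMPERED
    cmp = COMPARATORS[manifest["comparator"]]
    return PASS if cmp(observed, manifest["threshold"]) else FAIL
```

Everything judgment-like (which metric, which threshold) was frozen at lock time; verification reduces to hash equality and one comparison.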

Honest amendments — "we found 12 mislabeled examples and re-ran" — do not overwrite. They append. Each new manifest carries a prior_hash field pointing to the manifest it amends. The chain is the audit log. When a regulator or reviewer asks "what was committed when?", the answer is one hash, and from that hash the entire history is recoverable.
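Recovering that history is a pointer walk. A sketch, assuming some lookup from manifest hash to parsed manifest (for example, a directory of canonical files named by their hash); only the prior_hash field itself comes from the format:

```python
def history(head_hash: str, store: dict) -> list:
    # Walk an amendment chain from the newest manifest hash back to the
    # original commitment. `store` maps manifest hash -> manifest dict;
    # its shape is an assumption for this sketch, not part of the spec.
    chain = []
    h = head_hash
    while h is not None:
        chain.append(h)
        h = store[h].get("prior_hash")   # absent field ends the chain
    return chain                         # newest first, original claim last
```

Because every link is a content hash, an amendment cannot quietly rewrite what it amends; it can only point at it.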

A worked example with the reference implementation

The reference implementation is a single-file Python CLI called falsify, MIT-licensed, 1287 lines. Install it the usual way:

pip install falsify

Initialize a claim:

falsify init imagenet-87

This writes .falsify/imagenet-87/spec.yaml with the required PRML fields as placeholders. Edit the file with your real values:

version: "prml/0.1"
claim_id: "01900000-0000-7000-8000-000000000010"
created_at: "2026-05-01T14:00:00Z"
metric: "accuracy"
comparator: ">="
threshold: 0.87
dataset:
  id: "imagenet-val-2012"
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
seed: 42
producer:
  id: "your-org.example"

Lock it:

$ falsify lock imagenet-87
locked: yes (sha256:1a3466cc08ee, locked_at 2026-05-01T14:00:00Z)

Now the spec is hash-bound. If anyone — including you — edits the YAML, the next falsify verify exits 3 and refuses to produce a verdict.

Run the experiment, capture the metric value (let us say 0.876), and verify:

$ falsify verify imagenet-87 --observed 0.876
PASS  metric=accuracy observed=0.876 >= threshold=0.87
exit 0

If the locked threshold had actually been 0.88 and the team silently lowered it to 0.87 after seeing the result:

$ falsify verify imagenet-87 --observed 0.876
TAMPERED  spec hash drift detected
recorded: 1a3466cc08ee...
current:  7b2c9a5d1e4f...
exit 3

The CI pipeline halts. The deploy does not happen. There is no judgment call.

How do you know the canonicalization actually works?

The most reasonable skeptical question about a content-addressed format is: what guarantees that two implementations produce the same canonical bytes for the same input?

For v0.1 we publish 12 conformance test vectors. Each vector defines:

  • An input manifest (logical YAML, key order irrelevant).
  • The exact UTF-8 byte sequence the canonicalizer must produce.
  • The exact lowercase-hex SHA-256 of those bytes.

The vectors exercise:

Test Property
TV-001 Minimal valid manifest
TV-002 Key-ordering invariance — random insertion order produces same hash
TV-003 Single-bit-of-content sensitivity — 0.85 vs 0.86 produces different hash
TV-004 Optional fields populated (model.id, model.hash, dataset.uri)
TV-005 Unicode handling in producer.id
TV-006 Maximum seed value (2⁶⁴ − 1)
TV-007 Minimum seed (0)
TV-008 Equality comparator with integer-valued threshold
TV-009 Amendment with prior_hash linkage
TV-010 pass@k metric for code generation
TV-011 AUROC with strict comparator
TV-012 Regression metric with <= comparator

A new implementation in Rust, Go, or TypeScript is conformant only if it reproduces all 12 vectors exactly. The reference implementation has 28 unittest assertions in CI that lock in the v0.1 hash contract; any code change that breaks a vector forces a v0.2 spec bump.
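A vector runner for a candidate implementation can be this small. Each vector is assumed to carry the input mapping, the expected canonical bytes, and the expected lowercase-hex SHA-256, matching the three bullets above; the dict layout and function names here are hypothetical, not the published vector file format:

```python
import hashlib

def run_vectors(canonicalize, vectors: dict) -> list:
    # `canonicalize` is the candidate implementation under test: it takes a
    # logical mapping and must return the canonical UTF-8 bytes.
    failures = []
    for name, vec in sorted(vectors.items()):
        got = canonicalize(vec["input"])
        if got != vec["bytes"]:
            failures.append(name)        # wrong canonical byte sequence
        elif hashlib.sha256(got).hexdigest() != vec["sha256"]:
            failures.append(name)        # bytes match but published digest does not
    return failures                      # conformant iff this list is empty
```

Byte equality is checked before the digest so a failing implementation gets the more diagnostic error: a hash mismatch with matching bytes would indicate a bad vector, not a bad canonicalizer.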

What it is not

PRML does not establish whether a claimed metric is correct, fair, or sufficient. It establishes only that the claim was committed before it was tested. Specifically:

  • Not a model card replacement. PRML manifests sit underneath model cards as the cryptographic floor.
  • Not a benchmark. PRML does not pick metrics for you.
  • Not a reproducibility framework. PRML does not ship code or data.
  • Not a tool. PRML is a format. falsify is one implementation. A second implementation in any language passes if it reproduces the test vectors.
  • Not a compliance product. It is a primitive that makes named regulatory obligations satisfiable with arithmetic verification rather than process attestation.

What it costs

The cost of adopting PRML at the experiment level is one hash function call. SHA-256 is specified in FIPS 180-4 and available in every mainstream standard library written since 2002. The format is UTF-8 plain text, readable in 2046 by any tool that can read text.

The cost of not adopting it scales with deployment scope. For a personal project, zero. For a research paper, growing pressure as reviewers begin to ask. For a product subject to EU AI Act Annex III obligations, measurable in regulatory exposure plus legal review hours. For a foundation model that will be cited in safety cases for a decade, the cost is roughly the credibility of every accuracy claim you have ever shipped.

What I am asking for

This is a working draft. v0.2 freeze is targeted 2026-05-22. Three concrete asks:

  1. Format review. Is the canonical serialization in §3 of the spec unambiguous? Are there YAML 1.2 edge cases the spec misses?
  2. Threat-model gaps. §6 of the spec enumerates six adversaries. What is missing?
  3. Compliance correctness. The AI Act mapping binds PRML fields to Articles 12, 17, 18, 50, 72, and 73. Compliance lawyers and engineers in EU AI Act-adjacent roles: are the bindings defensible?

Discussion thread: github.com/sk8ordie84/falsify/discussions/6.

TL;DR

  • Most published ML accuracy numbers are unfalsifiable in practice.
  • A small spec — nine fields, one hash function, one canonical serialization — gives published claims a cryptographic floor.
  • Reference implementation in Python, MIT, single file. Spec under CC BY 4.0.
  • v0.2 freeze in 3 weeks. Reviews, ambiguity reports, and threat-model critiques are welcome.

Spec: spec.falsify.dev/v0.1
Code: github.com/sk8ordie84/falsify
Discussion: github.com/sk8ordie84/falsify/discussions/6
