A few weeks ago, I shipped the first reference implementation of a small specification I'd been working on. Eight YAML fields, a SHA-256 hash, and the rule that the hash gets computed before the experiment runs. The point was modest: if you're going to publish an ML accuracy claim, you ought to be able to prove that the threshold you wrote down was the threshold you committed to, not the one you settled on after seeing the test set.
The spec is called PRML — Pre-Registered ML Manifest. The Python reference implementation took a weekend.
Then I made the mistake of writing it in JavaScript. Then in Go. Then in Rust.
Each one found a different bug. Not in the implementations. In the spec.
## What the spec is supposed to do
A PRML manifest is a tiny YAML document that locks an evaluation claim before the experiment runs:
```yaml
version: prml/0.1
claim_id: 01900000-0000-7000-8000-000000000099
created_at: '2026-05-01T09:00:00Z'
metric: auroc
comparator: '>='
threshold: 0.85
dataset:
  id: credit-default-2026
  hash: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
seed: 314159
producer:
  id: studio-11.co
```
You write the manifest. You compute SHA-256 over its canonical bytes. You commit that hash somewhere public. Then you run the experiment. Anyone with the manifest, the dataset, and the model can recompute the hash, recompute the metric, and check that you didn't move the goalposts.
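The hash-and-check half of that protocol is tiny. Here's a sketch using only the standard library, with hypothetical helper names, not the reference implementation's actual API:

```python
import hashlib


def commit_hash(canonical_bytes: bytes) -> str:
    # SHA-256 over the manifest's canonical bytes -- this hex digest is
    # what the producer publishes before running the experiment.
    return hashlib.sha256(canonical_bytes).hexdigest()


def verify(canonical_bytes: bytes, published_hex: str) -> bool:
    # Anyone with the manifest recomputes the digest and compares.
    # A mismatch means the manifest changed after the hash was committed.
    return commit_hash(canonical_bytes) == published_hex
```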
The cryptographic primitive is boring. The hard part, it turns out, is "canonical bytes."
## The Python implementation
PyYAML is friendly. `yaml.safe_dump(manifest, sort_keys=True, default_flow_style=False)` gives you something readable, lexicographically key-sorted, with a stable line layout. You hash the bytes. You're done.
The whole thing fit in 1,287 lines of single-file Python. CLI verbs, conformance suite, manifest loader, hash sidecar generator, signature stub. The spec said "canonical YAML, lexicographic key order, UTF-8" and PyYAML basically did that for free.
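As a sketch, the whole canonicalize-and-hash step comes down to this (assuming PyYAML is installed; `canonical_hash` is an illustrative name, not the CLI's actual entry point):

```python
import hashlib

import yaml  # PyYAML


def canonical_hash(manifest: dict) -> str:
    # sort_keys=True sorts keys lexicographically at every nesting level;
    # default_flow_style=False keeps block layout. Both are PyYAML's
    # opinions about what "canonical" means -- which is exactly what
    # bites later.
    text = yaml.safe_dump(manifest, sort_keys=True, default_flow_style=False)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Key order of the input dict doesn't matter: `canonical_hash({"a": 1, "b": 2})` and `canonical_hash({"b": 2, "a": 1})` produce the same digest.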
Then someone asked the obvious question: would another language produce the same bytes?
## JavaScript said no
The first port was JavaScript with js-yaml. Same fields, same input, hash off by everything.
The diff:
```yaml
# Python (PyYAML, sort_keys=True)
threshold: 0.85

# JavaScript (js-yaml default)
threshold: 0.85
```
Looks identical. The bytes weren't.
js-yaml was emitting a trailing space after some scalars where PyYAML wasn't. Or rather, both libraries thought they were emitting "compact" YAML and disagreed about what that meant for floats with trailing zeros. The fix in JavaScript was a custom serializer that normalized whitespace by post-processing the output. Ugly, but it worked.
I told myself this was a JavaScript-specific oddity. Three more languages would clarify it.
## Go made it worse
`gopkg.in/yaml.v3` is a careful library. It also iterates map keys in random order by default, because Go maps are explicitly unordered.
So the canonical form rule "sort keys lexicographically" had been doing all the heavy lifting in Python, where dicts preserve insertion order, and silently in JavaScript, where objects also preserve insertion order in practice. In Go, every run produced a different byte sequence until I sorted the keys explicitly into a slice and serialized in that order.
That was a real spec bug, not an implementation bug. The spec said "lexicographic key order" but never said "MUST NOT depend on the language's map iteration order." The Python implementation had been compliant by accident.
Spec patch: §3.2 now says the algorithm is "extract keys, sort with byte-order comparison, serialize in that order, recurse into nested maps." Not "lexicographic order" — that was ambiguous between byte-order and Unicode collation. Byte-order it is.
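In Python, the patched §3.2 algorithm comes out as a short recursion. This is a sketch of the rule, not the spec's normative text; the important detail is that the sort key is the UTF-8 encoding, so comparison is byte-order, not locale-aware collation:

```python
def order_keys(node):
    # Extract keys, sort by byte-order comparison of their UTF-8
    # encodings, keep values in that order, recurse into nested maps.
    if isinstance(node, dict):
        return [(k, order_keys(node[k]))
                for k in sorted(node, key=lambda k: k.encode("utf-8"))]
    if isinstance(node, list):
        return [order_keys(v) for v in node]
    return node
```

Serializing from the resulting ordered pairs sidesteps map iteration order entirely, which is essentially what the Go port does with its sorted slice.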
## Rust caught the float bug
The Rust port uses serde_yaml, which has its own opinions about how to render numbers.
The canonical form rule said "render integers as integers, floats with their full decimal representation." Python rendered 0.000001 as `1.0e-06`. Rust rendered the same value as `0.000001`. JavaScript rendered it as `1e-6`. Three different byte sequences, three different hashes, all "valid YAML" to their respective parsers.
This wasn't a sorting problem. The spec simply didn't say what canonical YAML float rendering looked like.
I wrote it down: scientific notation for magnitudes outside [1e-4, 1e+15], decimal notation inside, mantissa with explicit .0 for integer-valued floats, exponent zero-padded to 2 digits. Implemented in all four languages. The TV-018 conformance vector tests this specifically. It now passes byte-for-byte everywhere.
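Spelled out in Python, the rule looks roughly like this (`render_float` is an illustrative name; the conformance vectors are the real arbiter):

```python
def render_float(x: float) -> str:
    # Canonical float rendering per the rule above: decimal notation for
    # magnitudes in [1e-4, 1e+15), scientific notation outside, explicit
    # .0 for integer-valued floats, exponent zero-padded to two digits.
    if x == 0.0:
        return "0.0"
    if 1e-4 <= abs(x) < 1e15:
        s = repr(x)
        return s if "." in s else s + ".0"
    mantissa, _, exp = f"{x:e}".partition("e")
    mantissa = mantissa.rstrip("0").rstrip(".")
    if "." not in mantissa:
        mantissa += ".0"
    e = int(exp)
    return f"{mantissa}e{'-' if e < 0 else '+'}{abs(e):02d}"
```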
## What the four implementations actually proved
The spec wasn't precise. It was "precise enough for one library." The minute the second library had a different opinion about whitespace, key order, or float rendering, the SHA-256 hashes diverged. The protocol was correct. The format was underspecified.
I now think this is the actual lesson: the spec is the second implementation.
You don't know what your specification says until at least two independent implementations have to agree on what byte-for-byte equivalence means. Specs written by single-implementation teams are not specs. They're "PyYAML output, plus some prose."
## The honest caveat
PRML v0.1.3 is not done. Even with byte-equivalent canonical output across four implementations, there's a structural gap I documented in §8.1 of the spec rather than hiding it.
A producer can run an evaluation, get a result they don't like, and just not publish the manifest. Then they can re-run with different parameters, lock a different manifest, and publish only the favorable one. The cryptographic protocol is satisfied. The regulatory purpose — "you committed before you knew" — is not.
The format itself can't fix this. It's closed only at the deployment layer: publish-before-run timestamps, sequential claim_id allocation by an external registrar, or external pre-registration anchoring (OSF, blockchain, whatever).
v0.2 normatively adopts the third option for the high-risk producer tier. v0.1 ships with the gap documented and three named mitigations.
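For the first mitigation, the deployment-layer check is almost embarrassingly small. A sketch with hypothetical names; where the publication timestamp comes from (registrar, transparency log, OSF record) is exactly the part the format can't specify:

```python
from datetime import datetime, timezone


def committed_before_run(published_at: datetime, run_started_at: datetime) -> bool:
    # Deployment-layer check, not part of the manifest format: the
    # manifest's public commitment must strictly precede the experiment.
    return published_at < run_started_at


# Example with made-up timestamps:
published = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
run_start = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
```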
I think open documentation of a real failure mode is worth more than papered-over silence. Specs that pretend to be airtight invite worse trust than specs that say "here's what we don't yet do."
## Where it is
- Spec (CC BY 4.0): https://spec.falsify.dev/v0.1
- Repo (MIT): https://github.com/studio-11-co/falsify
- v0.1.3 release notes: https://github.com/studio-11-co/falsify/releases/tag/v0.1.3
- v0.2 RFC roadmap (freeze 2026-05-22): in the repo's `spec/v0.2/` folder
- All four reference implementations: in `impl/`
If you've ever published an ML benchmark and wished there were a way to prove later you didn't tune the threshold post-hoc, this is the substrate. If you've ever written a spec and discovered it was secretly "your library's output," I would genuinely value notes on what you did about it.
Working draft, v0.1.3. The format will probably change in v0.2 — review window is open.