DEV Community: sk8ordie84

Model cards vs pre-registration: what counts as evidence under the EU AI Act

sk8ordie84 — Fri, 29 May 2026 13:20:14 +0000

When the EU AI Act's high-risk obligations bind on 2 August 2026, a provider of a high-risk system has to do two unglamorous things about its accuracy numbers: declare them (Article 15) and keep records that let them be checked (Article 12, with the technical documentation in Annex IV). The Act tells you the obligation. It does not tell you which artifact satisfies it.

So teams reach for the artifact they already know: the model card, or an internal evaluation report. That instinct is right that documentation is required. It is wrong about what the documentation has to prove.

This is a short, practical comparison of the two artifacts a reviewer is most likely to see, and the one property that separates them.

This is an engineering pattern, not legal advice. Article and Annex references are to Regulation (EU) 2024/1689. Confirm scope and sufficiency with qualified counsel and your notified body.

What the Act actually asks for

Strip away the framing and Article 15 wants a stated level of accuracy that is appropriate and, crucially, verifiable. Article 12 wants logs that record the system's functioning across its lifetime. Annex IV section 2 wants the evaluation methodology written down.

The recurring word, implicit in all three, is verifiable. Not "asserted." Not "documented in prose." Checkable by someone who was not in the room.

The default answer: model cards and eval reports

A model card is a good thing. It collects the metric, the dataset, the intended use, known limitations, and the headline numbers in one place. For transparency and for Annex IV section 3 it does real work.

What a model card does not do is establish when its numbers were decided. The accuracy figure, the threshold, the evaluation slice, the random seed: in a model card these are reported after the experiment, in editable prose. Nothing in the artifact distinguishes a number that was committed before the run from one that was selected, after seeing several runs, because it looked best. A model card is a trust-me document. It can be edited after the fact, and nothing in the artifact records that it was.

That is not a knock on the people who write them. It is a structural property: a document written after the result cannot, by itself, prove what was promised before it.

The property a reviewer will eventually ask about

Here is the question that turns documentation into evidence, and it is the one an auditor or a careful customer eventually asks:

"When was this accuracy threshold set, relative to when you saw the result?"

If the honest answer is "we cannot show that," then the number is, in the Act's sense, hard to verify. It is an assertion with a citation, not evidence.

Closing that gap needs one property the model card does not have: pre-commitment that is tamper-evident. A way to show that the claim existed, in its exact form, before the experiment ran, and that it has not been quietly edited since.

Where pre-registration fits

Pre-registration is the discipline of writing down what you will measure, and the bar you will hold it to, before you measure it. It is standard in clinical trials and in parts of empirical science for exactly the reason above: it removes the room to reshape the claim after seeing the data.

A pre-registered ML manifest (PRML) applies this to an evaluation claim. You write the claim as eight fields (metric, comparator, threshold, dataset hash, seed, producer, and so on) and bind it to a SHA-256 digest computed over the canonical bytes before the run. A verifier with the manifest, the dataset, and the model recomputes the digest, executes the claim, and returns a deterministic verdict: PASS, FAIL, or TAMPERED. Editing the claim after the fact changes the bytes, which changes the hash, which the verifier flags. The spec is open (CC BY 4.0); reference implementations are MIT.
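A minimal sketch of that verify step, in Python. It is an illustration under assumptions, not the reference implementation: the names lock and verify are hypothetical, the canonical bytes are taken as given (canonicalization is not shown), and the observed metric is passed in rather than recomputed from a dataset and a model.

```python
import hashlib
import operator

# Comparators a claim may use; this mapping is an assumption for the sketch.
COMPARATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def lock(canonical_bytes: bytes) -> str:
    """Return the SHA-256 hex digest that gets published before the run."""
    return hashlib.sha256(canonical_bytes).hexdigest()


def verify(canonical_bytes: bytes, locked_digest: str,
           comparator: str, threshold: float, observed: float) -> str:
    """Return PASS, FAIL, or TAMPERED for one claim."""
    # Any edit to the manifest changes its bytes, and therefore its digest.
    if lock(canonical_bytes) != locked_digest:
        return "TAMPERED"
    return "PASS" if COMPARATORS[comparator](observed, threshold) else "FAIL"


# Hypothetical three-field manifest, already in canonical form.
manifest = b"comparator: '>='\nmetric: accuracy\nthreshold: 0.9\n"
digest = lock(manifest)  # committed publicly before the experiment

print(verify(manifest, digest, ">=", 0.90, 0.91))  # PASS
print(verify(manifest, digest, ">=", 0.90, 0.88))  # FAIL
edited = manifest.replace(b"0.9", b"0.8")  # bar lowered after the fact
print(verify(edited, digest, ">=", 0.80, 0.88))    # TAMPERED
```

The third call is the interesting one: 0.88 would clear the lowered bar, but the digest check runs first, so the edit is flagged before the comparison is ever made.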

This is not a replacement for the model card. It is the evidence layer underneath it. The model card says "accuracy is 0.91." The manifest says "and here is cryptographic proof that 0.90 was the bar we committed to before we ran, and that nobody moved it." One is the human-readable summary; the other is the part a reviewer can independently re-derive.

The one-line comparison

Question	Model card / eval report	Pre-registered manifest
States the metric and number?	Yes	Yes
Human-readable summary of intended use, limits?	Yes	No (points to the card)
Proves the threshold was set before the run?	No	Yes
Detects a silent post-hoc edit?	No	Yes (hash mismatch)
Independently re-derivable by a third party?	Not by itself	Yes
Required reading for a regulator?	Likely (Annex IV section 3)	Supports Art. 12 / 15 verifiability

What this does and does not buy you

A pre-registered manifest does not make a model accurate, safe, or compliant. It makes one specific claim, the accuracy figure you report, checkable rather than trusted. That is a narrow thing. It is also exactly the thing that is missing from the documentation most teams will bring to an Article 15 conversation.

If you keep model cards, keep them. Add the layer that answers the "when was the threshold set" question before someone asks it.

PRML (Pre-Registered ML Manifest) is an open specification with four byte-equivalent reference implementations and a published conformance suite. Spec: spec.falsify.dev/v0.1. Article-by-article mapping: falsify.dev/compliance.

Lock #2: the first thing PRML falsified was its own distribution hypothesis

sk8ordie84 — Fri, 29 May 2026 13:03:00 +0000

The mechanism worked. The outreach didn't.

On 2026-05-08 I locked a public commitment using PRML, the pre-registration format my own project is built on. The claim: at least 3 independent contributors would file RFC engagement (issues or PRs against the rfc-v0.2 tracks) within a two-week window. Public hash, public target, public resolve date.

Today it resolved at 0 / 3. Zero independent contributors. Not partial, not "close" — zero.

An honest account:

1. The mechanism worked

Lock #2 exists precisely so that a missed commitment can't be quietly re-framed after the fact. The fail is public now, automatically, via the same registry, with no admin intervention from me. If it had soft-failed silently, the whole pre-registration thesis would be theatre.

2. It's a distribution problem, not an interest problem

The v0.2 RFC is technically sound: the JSON Schemas validate, the conformance vectors pass byte-equivalently across four reference implementations. What didn't happen is the RFC reaching rooms where someone is paid to read RFCs in this domain.

3. v0.2 still freezes on schedule, founder-only

The version doesn't get postponed because the engagement target was missed. That would be moving the goalposts.

The second thing it falsified: its own counting bug

Two days before the lock resolved, a registry-side audit found the unique-producer-count routine was over-counting: it showed 8 when the correct number was 2 (a quoting/regex bug that misread other fields as producer IDs). Fixed and deployed; the lock manifest itself was untouched, only the observing code. This is what running the mechanism on yourself looks like in practice: if a critic finds the next bug, that's PRML doing its job; if I find it first, that's also PRML doing its job.
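The class of bug is easy to reproduce. The sketch below is an illustration only, not the registry's code: the sample manifests, the regexes, and the function names are all hypothetical. It shows how a pattern that matches any id: line counts dataset.id values as producers, and how scoping the match to the producer: block gives the right count.

```python
import re

# Two hypothetical manifests from two producers, cut down to the relevant fields.
MANIFESTS = [
    "dataset:\n  id: credit-default-2026\nproducer:\n  id: studio-11.co\n",
    "dataset:\n  id: imagenet-val\nproducer:\n  id: example.org\n",
]


def naive_producer_count(manifests):
    """Over-counts: matches every 'id:' line, including dataset.id."""
    ids = set()
    for text in manifests:
        ids.update(re.findall(r"^\s*id:\s*(\S+)", text, flags=re.MULTILINE))
    return len(ids)


def scoped_producer_count(manifests):
    """Counts only the 'id:' line directly under a top-level 'producer:' key."""
    ids = set()
    for text in manifests:
        match = re.search(r"^producer:\n\s+id:\s*(\S+)", text, flags=re.MULTILINE)
        if match:
            ids.add(match.group(1))
    return len(ids)


print(naive_producer_count(MANIFESTS))   # 4
print(scoped_producer_count(MANIFESTS))  # 2
```

A production routine would more likely parse the YAML and read producer.id from the resulting structure; the regex version is kept here only because it makes the over-count visible.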

PRML (Pre-Registered ML Manifest) is an open spec for committing an ML evaluation claim to a SHA-256 hash before the run. Spec and four reference implementations: https://falsify.dev

I audited our CMS and 86% of our articles were invisible. A Sanity gotcha.

sk8ordie84 — Thu, 21 May 2026 15:30:52 +0000

A week ago I ran a routine count on our Sanity dataset, expecting maybe a 5% gap between drafts and published articles. The result was 33 published, 253 drafts. 86% of the content I thought was on our site wasn't there. The bug had been silently shipping for the entire 9-day life of the project.

This post is the postmortem. It is specifically about Sanity, but the underlying gotcha (a CMS client default that disagrees with what you actually want at read time) applies to any headless setup.

The setup

I run Fax Office 1987, a small daily editorial publication. Next.js 15 App Router, Sanity for the CMS, Inngest for the dispatch pipeline. The editor (me) gets a review email for each AI-assisted draft and approves or rejects via a link. Approval was supposed to make a piece appear at /dispatch/<slug>.

The review route handler looked like this:

const next = action === 'approve' ? 'approved' : 'rejected'
await sanity
  .patch(id)
  .set({ reviewStatus: next, reviewedAt: new Date().toISOString() })
  .commit()

Clean. Set the flag, return the HTML confirmation page. Done.

And separately, the public site read articles like this:

const sanity = createClient({
  projectId,
  dataset,
  apiVersion: '2024-10-01',
  useCdn: false,
  token: process.env.SANITY_WRITE_TOKEN, // we use a token so private-read works
})

export const ARTICLES_QUERY = `
  *[_type == "article"
      && defined(slug.current)
      && (reviewStatus == "approved" || !defined(reviewStatus))
    ]
    | order(publishedAt desc) { ... }
`

For 9 days I thought this worked. Approve emails arrived, I clicked approve, the confirmation page said "the piece is now visible on the site." It wasn't.

The audit

I ran a per-status count:

const r = await client.fetch(`{
  "published":      count(*[_type=="article" && !(_id in path("drafts.**"))]),
  "draft_approved": count(*[_type=="article" && _id in path("drafts.**") && reviewStatus=="approved"]),
  "draft_pending":  count(*[_type=="article" && _id in path("drafts.**") && reviewStatus=="pending"]),
  "draft_rejected": count(*[_type=="article" && _id in path("drafts.**") && reviewStatus=="rejected"])
}`)

Result:

{
  "published":       33,
  "draft_approved": 227,
  "draft_pending":    7,
  "draft_rejected":  19
}

227 drafts marked approved. All of them had their reviewStatus flag set correctly. None of them were visible to readers.

Root cause #1: perspective default

Sanity documents have two layers. A draft sits at drafts.<id>, the published version sits at <id>. Both can coexist for the same logical document. When you fetch with a token on this API version, the default perspective (raw) returns drafts and published documents side by side, as separate results. For an editorial site running with a token (because the dataset is in private-read mode for our use case), this is the wrong default. We always want to read the published version on the public site, never the draft.

Without perspective: 'published' set on the client, a draft document with reviewStatus == "approved" would pass our GROQ filter and get served to readers under the same slug as its published twin. We never noticed because the in-progress drafts were never published to begin with: the bug below kept them stuck.

Fix:

export const sanity = projectId
  ? createClient({
      projectId,
      dataset,
      apiVersion: '2024-10-01',
      useCdn: false,
      token: process.env.SANITY_WRITE_TOKEN,
      perspective: 'published', // <-- the one-line fix
    })
  : null

Belt-and-suspenders, every public GROQ query also gained a filter:

const NO_DRAFTS = `!(_id in path("drafts.**"))`

So even if a future caller built an ad-hoc client without the perspective set, the query itself would still hide drafts.

Root cause #2: the approve handler

After the perspective fix, the published count was still 33. The 227 approved drafts were still drafts, just with a flag set.

Reading the approve handler again with the perspective context in mind:

await sanity.patch(id).set({ reviewStatus: 'approved', ... }).commit()

This patches the draft document. It does not promote it. The published version under the bare id never gets created. From the public site's point of view, nothing changed.

The standard Sanity draft-promotion idiom is:

if (next === 'approved' && id.startsWith('drafts.')) {
  const publishedId = id.replace(/^drafts\./, '')
  const draft = await sanity.getDocument(id)
  if (draft) {
    const { _id, _rev, _createdAt, _updatedAt, ...rest } = draft as any
    await sanity.createOrReplace({
      ...rest,
      _id: publishedId,
      _type: 'article',
      reviewStatus: 'approved',
      reviewedAt: new Date().toISOString(),
    })
    await sanity.delete(id)
  }
}

Fetch the full draft, write it under the published id (strip the drafts. prefix), delete the draft. Sanity treats the result as published. The reject path still patches in place because rejected items stay as drafts on purpose: kept as a record, hidden from readers.

Backfill

That fixes new approvals. The 227 already in the backlog still needed promoting. A one-time script that walks every approved draft and applies the same promotion logic:

const drafts = await client.fetch(
  `*[_type == "article" && _id in path("drafts.**") && reviewStatus == "approved"]
    | order(_createdAt asc)`,
)

for (const draft of drafts) {
  const publishedId = draft._id.replace(/^drafts\./, '')
  // skip if a published twin already exists; don't clobber manual edits
  const existing = await client.fetch(
    `*[_id == $id][0]{ _id }`,
    { id: publishedId },
  )
  if (existing) continue

  const { _id, _rev, _createdAt, _updatedAt, ...rest } = draft
  await client.createOrReplace({ ...rest, _id: publishedId, _type: 'article' })
  await client.delete(draft._id)
}

Ran it against production. 227 promoted, 0 errors. Published count moved from 33 to 260. Sitemap discovered URLs went from a couple dozen to 293.

Followed up with an IndexNow bulk ping so Bing, Yandex, and the consortium would crawl the new URLs without waiting for sitemap re-discovery. Single POST, 289 URLs, accepted in one shot.

The takeaway

The Sanity perspective default is not a bug. The docs are clear. The mistake was a blind spot: when you write code that uses a token for read operations (because your dataset is private-read), you have to actively pick a perspective. Otherwise you get whatever the default for your API version returns, which for a public website is rarely what you want.

The deeper lesson: I had two bugs that combined into invisible content. Either alone would have been visible. Together they hid the site from itself. A monthly audit, even one as simple as the per-status count above, catches this kind of compounding state where every step reports success and nothing actually ships.

Code: the fix landed in two commits on the Fax Office 1987 repo. If you run Sanity and your dataset is private-read, the perspective: 'published' line might be the highest-leverage one-line change you ship this month.

What writing the same spec in four languages taught me about YAML

sk8ordie84 — Mon, 04 May 2026 15:04:37 +0000

A few weeks ago, I shipped the first reference implementation of a small specification I'd been working on. Eight YAML fields, a SHA-256 hash, and the rule that the hash gets computed before the experiment runs. The point was modest: if you're going to publish an ML accuracy claim, you ought to be able to prove that the threshold you wrote down was the threshold you committed to, not the one you settled on after seeing the test set.

The spec is called PRML — Pre-Registered ML Manifest. The Python reference implementation took a weekend.

Then I made the mistake of writing it in JavaScript. Then in Go. Then in Rust.

Each one found a different bug. Not in the implementations. In the spec.

What the spec is supposed to do

A PRML manifest is a tiny YAML document that locks an evaluation claim before the experiment runs:

version: prml/0.1
claim_id: 01900000-0000-7000-8000-000000000099
created_at: '2026-05-01T09:00:00Z'
metric: auroc
comparator: '>='
threshold: 0.85
dataset:
  id: credit-default-2026
  hash: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
seed: 314159
producer:
  id: studio-11.co

You write the manifest. You compute SHA-256 over its canonical bytes. You commit that hash somewhere public. Then you run the experiment. Anyone with the manifest, the dataset, and the model can recompute the hash, recompute the metric, and check that you didn't move the goalposts.

The cryptographic primitive is boring. The hard part, it turns out, is "canonical bytes."

The Python implementation

PyYAML is friendly. yaml.safe_dump(manifest, sort_keys=True, default_flow_style=False) gives you something readable, lexicographically key-sorted, with a stable line layout. You hash the bytes. You're done.

The whole thing fit in 1,287 lines of single-file Python. CLI verbs, conformance suite, manifest loader, hash sidecar generator, signature stub. The spec said "canonical YAML, lexicographic key order, UTF-8" and PyYAML basically did that for free.

Then someone asked the obvious question: would another language produce the same bytes?

JavaScript said no

The first port was JavaScript with js-yaml. Same fields, same input, hash off by everything.

The diff:

# Python (PyYAML, sort_keys=True)
threshold: 0.85

# JavaScript (js-yaml default)
threshold: 0.85

Looks identical. The bytes weren't.

js-yaml was emitting a trailing space after some scalars where PyYAML wasn't. Or rather, both libraries thought they were emitting "compact" YAML and disagreed about what that meant for floats with trailing zeros. The fix in JavaScript was a custom serializer that normalized whitespace by post-processing the output. Ugly, but it worked.

I told myself this was a JavaScript-specific oddity. Two more languages would clarify it.

Go made it worse

gopkg.in/yaml.v3 is a careful library. It also iterates map keys in random order by default, because Go maps are explicitly unordered.

So the canonical form rule "sort keys lexicographically" had been doing all the heavy lifting in Python, where dicts preserve insertion order, and silently in JavaScript, where objects also preserve insertion order in practice. In Go, every output was a different byte sequence until I sorted explicitly into a slice and serialized that.

That was a real spec bug, not an implementation bug. The spec said "lexicographic key order" but didn't say "MUST NOT depend on language map iteration order." The Python implementation had been compliant by accident.

Spec patch: §3.2 now says the algorithm is "extract keys, sort with byte-order comparison, serialize in that order, recurse into nested maps." Not "lexicographic order" — that was ambiguous between byte-order and Unicode collation. Byte-order it is.
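The patched rule can be sketched in a few lines of Python. This illustrates the algorithm as described in this post, not the spec text or the reference implementation; ordered_items is a hypothetical helper name, and the two non-manifest keys are there only to show byte order.

```python
def ordered_items(mapping):
    """Return (key, value) pairs sorted by the UTF-8 bytes of each key,
    recursing into nested maps, so order never depends on map iteration."""
    pairs = []
    for key in sorted(mapping, key=lambda k: k.encode("utf-8")):
        value = mapping[key]
        if isinstance(value, dict):
            value = ordered_items(value)
        pairs.append((key, value))
    return pairs


manifest = {
    "threshold": 0.85,
    "metric": "auroc",
    "dataset": {"id": "credit-default-2026", "hash": "<sha256 hex>"},
    "Zeta": 1,     # hypothetical key: uppercase sorts before lowercase
    "éclair": 2,   # hypothetical key: non-ASCII sorts after ASCII
}

for key, value in ordered_items(manifest):
    print(key, value)
```

In Python the default string sort compares code points, which coincides with UTF-8 byte order, so the explicit encode is there mainly to make the rule visible. JavaScript's default sort compares UTF-16 code units, which can differ from UTF-8 byte order for characters outside the Basic Multilingual Plane, so a port cannot rely on the language default.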

Rust caught the float bug

Rust uses serde_yaml, which has its own opinions about how to render numbers.

The canonical form rule said "render integers as integers, floats with their full decimal representation." Python rendered 0.000001 as 1.0e-06. Rust rendered the same value as 0.000001. JavaScript rendered it as 1e-6. Three different byte sequences, three different hashes, all "valid YAML" by their respective parsers.

This wasn't a sortable thing. The spec just didn't say what canonical YAML float rendering looked like.

I wrote it down: scientific notation for magnitudes outside [1e-4, 1e+15], decimal notation inside, mantissa with explicit .0 for integer-valued floats, exponent zero-padded to 2 digits. Implemented in all four languages. The TV-018 conformance vector tests this specifically. It now passes byte-for-byte everywhere.
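A sketch of that rule in Python, as I read it from the sentence above. The assumptions are mine and are not checked against TV-018: the range bounds are treated as inclusive, positive exponents carry an explicit +, digits are the shortest that round-trip, and non-finite values are rejected. The name render_float is hypothetical.

```python
import math
from decimal import Decimal


def render_float(x: float) -> str:
    """Render a finite float under the rule described above (one reading of it)."""
    x = float(x)
    if not math.isfinite(x):
        raise ValueError("non-finite floats are out of scope for this sketch")
    magnitude = abs(x)
    # Decimal notation inside [1e-4, 1e15]; repr() gives the shortest
    # round-tripping digits and keeps '.0' on integer-valued floats.
    if magnitude == 0 or 1e-4 <= magnitude <= 1e15:
        return repr(x)
    # Scientific notation outside that range, rebuilt from the same digits.
    sign, digits, exponent = Decimal(repr(x)).as_tuple()
    digits = list(digits)
    while len(digits) > 1 and digits[-1] == 0:  # drop trailing zeros
        digits.pop()
        exponent += 1
    sci_exponent = exponent + len(digits) - 1
    fraction = "".join(str(d) for d in digits[1:]) or "0"
    exp_sign = "-" if sci_exponent < 0 else "+"
    prefix = "-" if sign else ""
    return f"{prefix}{digits[0]}.{fraction}e{exp_sign}{abs(sci_exponent):02d}"


for value in (0.85, 1.0, 0.000001, 2e15):
    print(render_float(value))
```

The point of going through the digit tuple rather than a format string is that nothing depends on a library's idea of "compact": the mantissa, the forced .0, and the exponent padding are each written out explicitly.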

What the four implementations actually proved

The spec wasn't precise. It was "precise enough for one library." The minute the second library had a different opinion about whitespace, key order, or float rendering, the SHA-256 hashes diverged. The protocol was correct. The format was underspecified.

I now think this is the actual lesson: the spec is the second implementation.

You don't know what your specification says until at least two independent implementations have to agree on what byte-for-byte equivalence means. Specs written by single-implementation teams are not specs. They're "PyYAML output, plus some prose."

The honest caveat

PRML v0.1.3 is not done. Even with byte-equivalent canonical output across four implementations, there's a structural gap I documented in §8.1 of the spec rather than hide.

A producer can run an evaluation, get a result they don't like, and just not publish the manifest. Then they can re-run with different parameters, lock a different manifest, and publish only the favorable one. The cryptographic protocol is satisfied. The regulatory purpose — "you committed before you knew" — is not.

The format itself can't fix this. It's closed only at the deployment layer: publish-before-run timestamps, sequential claim_id allocation by an external registrar, or external pre-registration anchoring (OSF, blockchain, whatever).

v0.2 normatively adopts the third option for the high-risk producer tier. v0.1 ships with the gap documented and three named mitigations.

I think open documentation of a real failure mode is worth more than papered-over silence. Specs that pretend to be airtight invite worse trust than specs that say "here's what we don't yet do."

Where it is

Spec (CC BY 4.0): https://spec.falsify.dev/v0.1
Repo (MIT): https://github.com/studio-11-co/falsify
v0.1.3 release notes: https://github.com/studio-11-co/falsify/releases/tag/v0.1.3
v0.2 RFC roadmap (freeze 2026-05-22): in the repo's spec/v0.2/ folder
All four reference implementations in impl/

If you've ever published an ML benchmark and wished there were a way to prove later you didn't tune the threshold post-hoc, this is the substrate. If you've ever written a spec and discovered it was secretly "your library's output," I would genuinely value notes on what you did about it.

Working draft, v0.1.3. The format will probably change in v0.2; the review window is open.

I implemented PRML in two languages. Three things broke that the spec didn't warn about.

sk8ordie84 — Fri, 01 May 2026 19:50:42 +0000

PRML v0.1 is a small specification I drafted three weeks ago. It binds an ML evaluation claim — (metric, comparator, threshold, dataset hash, random seed, producer) — to a SHA-256 digest computed over canonical YAML bytes, before the experiment runs. The spec is at spec.falsify.dev/v0.1. The Python reference implementation is on GitHub. v0.2 freezes 2026-05-22.

A specification with one implementation is indistinguishable from that implementation's bugs. So this past weekend I sat down and built a second reference implementation, in Node.js, from scratch. The goal: take the prose spec, ignore the Python source, and produce byte-identical canonical bytes for all twelve v0.1 conformance vectors.

It worked. 12/12 vectors pass byte-for-byte. The implementation is 404 lines of JavaScript with zero runtime dependencies beyond the Node.js standard library. You can run it from impl/js/falsify.js.

What's interesting is what didn't work the first time. The exercise surfaced three quiet portability gotchas — places where the spec's prose and the spec's twelve vectors silently disagreed about what the bytes should be. Each of them is a real defect in the v0.1 specification, and each is now an action item for v0.2.

This post is the three findings.

Finding 1 — Sixty-four-bit integer precision

The first failing vector was TV-006: seed: 18446744073709551615. That's $2^{64} - 1$, the largest unsigned 64-bit integer the v0.1 spec allows for the seed field.

Naive Node.js parses this through JSON.parse into a Number. JavaScript's Number is IEEE-754 binary64. The largest integer you can safely represent in binary64 is $2^{53} - 1$, which is about $9 \times 10^{15}$. Above that, integers round to the nearest representable float.

So when Node.js read the test vector input file, the seed 18446744073709551615 quietly became 18446744073709552000 on output. The stored double is exactly $2^{64}$, and JavaScript prints the shortest decimal that round-trips to it, which is 385 larger than what the test vector said. The canonicalizer then dumped that wrong number, and the hash didn't match.

The same problem hits Go (int64, $2^{63} - 1$ ceiling), Java (same), and any other language whose default integer type cannot hold $2^{64} - 1$.

Language	Native integer ceiling	TV-006 round-trips?
Python 3	unbounded	yes
JavaScript Number	$2^{53} - 1$	no
Go `int64`	$2^{63} - 1$	no
Java `long`	$2^{63} - 1$	no
Rust `u64`	$2^{64} - 1$	yes

The PyYAML-based Python reference implementation works only because Python's int is arbitrary-precision. The spec did not mention this, anywhere.

The fix in the Node.js implementation: parse the JSON text with a regex that wraps any 16-or-more-digit integer in a sentinel string before JSON.parse sees it, then unwrap to BigInt after parse. Twenty lines of JavaScript that no spec reader could have predicted from the prose.

The fix for v0.2: make seed a quoted decimal string in the canonical form: seed: '18446744073709551615'. Languages with weak integer types now get a string and can opt into BigInt themselves. The format is unambiguous from the bytes alone.
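The failure and the proposed fix can both be shown in Python by forcing the JSON parser to behave like a binary64-only runtime. This is a sketch under that assumption: parse_int=float stands in for JavaScript's Number, and the two JSON documents are hypothetical inputs, not the TV-006 file.

```python
import json

SEED = 2**64 - 1  # 18446744073709551615

# Simulate a parser whose only number type is IEEE-754 binary64.
lossy = json.loads('{"seed": 18446744073709551615}', parse_int=float)
print(int(lossy["seed"]) == SEED)  # False: the value rounded up to 2**64

# v0.2-style encoding: the seed travels as a decimal string, so the
# number parser never touches it and the consumer picks the integer type.
safe = json.loads('{"seed": "18446744073709551615"}', parse_int=float)
print(int(safe["seed"]) == SEED)   # True
```

The string form costs two quote characters per manifest and removes the dependency on each language's integer width.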

Finding 2 — Integer-valued floats lose their type

The next failing vector was TV-008: a manifest with threshold: 1.0.

The expected canonical bytes contain threshold: 1.0. The actual produced bytes contain threshold: 1. The hash differed. This bothered me for ten minutes.

It turns out: when JSON parsers encounter 1.0 in a JSON document, many of them lose the float-ness. JavaScript's JSON.parse returns Number(1), indistinguishable at runtime from the integer 1. When a YAML emitter then takes that number and serialises it, it has no signal that the producer wrote 1.0 rather than 1. So it emits 1. The hash drifts.

PyYAML doesn't have this problem because PyYAML's load-and-dump cycle uses Python's native float type, which round-trips through 1.0 cleanly. JavaScript's Number cannot.

This follows from JSON's data model, which has a single number type and does not distinguish integer-valued floats from integers. In a runtime that also has one numeric type, the distinction is destroyed at parse time, before any canonicalizer runs.

The fix in the Node.js implementation: a small "this field should always render as a float" set, currently containing one element: {'threshold'}. The canonicalizer checks the field name and forces .0 when the value is integer-valued. A field-specific hack.

The fix for v0.2: specify that threshold always renders with at least one decimal place in the canonical form. Two lines in the spec close it. No field-aware emitter logic required.

Finding 3 — "Plain scalar" disagreements

The third failing case was the same vector, TV-008: comparator: ==.

The expected canonical bytes have comparator: ==. JavaScript's js-yaml library produced comparator: '==', single-quoted. SHA-256 is unforgiving; this difference produces a different hash.

YAML 1.1 and 1.2 both have a notion of "plain scalars": strings that don't need quotes because they contain no characters or patterns that would confuse the parser. A long list of rules governs whether a particular string can be plain: must not start with an indicator character (-, ?, :, ,, [, ], {, }, #, &, *, !, |, >, ', ", %, @, `), must not contain colon-space, must not look like a number/boolean/null/timestamp, must not have leading/trailing whitespace, etc.

PyYAML and js-yaml implement this predicate with subtly different conservatism. PyYAML accepts == as a plain scalar because none of the rules fire — there is no indicator character, no number resolution, no timestamp pattern. js-yaml is more defensive: it sees a string that could be confusing and quotes it.

For >=, <=, >, <, both libraries quote — the leading character is in the indicator set. So those work. Only == is special, and only == differs.

The fix in the Node.js implementation: I rewrote the plain-scalar predicate from scratch, in about fifty lines, matching PyYAML's behaviour. It checks for indicator-prefix, leading/trailing whitespace, colon-space and hash-space, number-resolution regex, boolean/null set, timestamp regex, and control-character escape. With this hand-rolled predicate, TV-008 reproduces.

The fix for v0.2: publish a formal canonicalization grammar. Or, simpler and aggressive: drop the plain-scalar concept entirely. Always single-quote every string scalar in the canonical form. The output is ~10% larger; the ambiguity surface is zero. No predicate needed; no second implementation reverse-engineering an emitter.
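The always-quote option is small enough to sketch in full. This is an illustration, not proposed spec text: quote_scalar is a hypothetical name, and rejecting control characters (including tab and newline) is a conservative choice made for the sketch.

```python
def quote_scalar(text: str) -> str:
    """Emit a YAML single-quoted scalar; the only escape is doubling the quote."""
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in text):
        # Single-quoted style has no escape sequences for control characters
        # and folds line breaks, so this sketch rejects them rather than guess.
        raise ValueError("control characters need a different rule")
    return "'" + text.replace("'", "''") + "'"


for comparator in ("==", ">=", "<=", ">", "<"):
    print("comparator:", quote_scalar(comparator))

print(quote_scalar("it's"))  # 'it''s'
```

Every comparator now takes the same path through the emitter, so no implementation has to decide whether == looks "confusing".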

What this exercise really proves

It does not prove that PRML is bulletproof. It proves that PRML is implementable in a second language — which, at the v0.1 stage, was not yet established. A specification existing in only one implementation is indistinguishable from that implementation's bugs. PRML is now demonstrably more than that.

It also does not prove that all PyYAML edge cases are covered. The Node.js implementation matches the twelve current vectors, which exercise specific cases. Adding new vectors (Unicode normalisation, control characters, very long strings, unusual line-folding) might reveal further divergences.

The general lesson: a content-addressed format has to be specified in terms of the bytes it produces, not in terms of the emitter that produces them. PyYAML's safe_dump is a stable, careful, twenty-year-old emitter. It is not a specification. The next time someone wants to write a content-addressed YAML format — for SBOMs, for build provenance, for AI evaluation claims, anything — write the canonicalization grammar first, and then implement it. Don't describe an emitter; describe bytes.

v0.2 action items, summarised

The findings translate to three concrete v0.2 specification changes:

seed is a quoted decimal string. Closes 64-bit integer precision portability.
threshold always renders with at least one decimal place. Closes integer-valued float type loss.
Always-quoted string scalars. Eliminates the plain-scalar predicate ambiguity entirely.

Plus a fourth, broader change:

Publish a formal canonicalization grammar in ABNF. With the always-quoted rule, the grammar is short — about forty production rules. It becomes the source of truth for conformance, replacing the implicit "PyYAML's behaviour" reference.

The full v0.2 roadmap, including six other extension fields (algorithm agility, tolerance, multi-claim manifests, mandatory signatures for high-risk Annex III, twelve new conformance vectors, sidecar format extension), is at spec/v0.2/ROADMAP.md. The freeze is targeted 2026-05-22 — three weeks from this writing — and the five open RFC questions in the roadmap are the parts where outside opinion would carry the most weight.

How to read along

If you want to see the artefacts directly:

The Node.js implementation: impl/js/falsify.js — 404 LOC, MIT.
The portability findings document: spec/analysis/canonicalization-portability-v0.1.md.
The conformance suite: spec/test-vectors/v0.1/ — JSON, twelve entries with locked digests.
The v0.1 spec: spec.falsify.dev/v0.1.
The arXiv preprint (working draft): spec/paper/ — 14-page LaTeX, CC BY 4.0.
Public review thread: GitHub Discussion #6.

If you want to add a third implementation in a third language — Rust, Go, Java, Swift, OCaml — the test vectors are the contract. If your canonicalizer reproduces all twelve byte-for-byte, your implementation is conformant. Open a PR; I'll add it.

— Studio-11 (independent), hello@studio-11.co

Why ML accuracy numbers are unfalsifiable, and what a 1287-line Python tool does about it

sk8ordie84 — Fri, 01 May 2026 13:46:43 +0000

A few weeks ago I was reading a model card for an open-weight code model. It claimed pass@1 = 67% on HumanEval. I tried to reproduce it. I got 54%.

I went back to the model card. The metric was named, the dataset was named, the model checkpoint hash was published. Everything looked reproducible.

Except: which version of HumanEval? The original 164 problems, or the de-contaminated 161? What temperature? What seed for nucleus sampling? What was the threshold the team committed to before they ran the eval, and how do I know the published 67% is not the best of three runs at three temperatures?

I read the paper. I read the README. I read the eval harness source. I could not answer any of those questions from the published artifacts. I could only ask the authors, and they could only tell me what they remembered. And I had no way to distinguish what they remembered from what they wished they had done.

This is not a problem about that specific model card or those specific authors. It is a problem about every published ML accuracy number I have ever read.

Five failure modes that current reporting practices cannot detect

A claim like "our model achieves 91.3% accuracy on benchmark X" can be wrong, in published form, in at least these five ways, none of which leave a forensic trace:

Threshold drift. The team picked the threshold after running the experiment, by looking at where their model happened to land, and reported it as if it were the original target.
Slice selection. The evaluation set was filtered after results were observed (e.g., dropping the 12 hardest examples because "they were mislabeled").
Silent re-runs. Five seeds were tried; only the seed that passed was reported.
Metric ambiguity. "F1" without specifying micro vs macro. "Accuracy" without specifying top-k. "Pass@1" without specifying temperature.
Dataset drift. The benchmark hosted at the canonical URL changed between the experiment date and the publication date, and the team did not pin the bytes.

Each of these is consistent with current best-practice reporting. Each leaves the published number unfalsifiable: a reader cannot, even in principle, distinguish honest reporting from any of the above.
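To make the metric-ambiguity item concrete, here is a small standard-library sketch in which the same predictions produce two different numbers that could both be reported as "F1". The labels and predictions are invented for illustration, and `f1_scores` is a hypothetical helper, not part of any tool discussed here.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Invented, imbalanced toy data: 8 examples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")  # micro-F1=0.900 macro-F1=0.804
```

A report that says only "F1 = 0.90" is compatible with either reading, and a reader has no way to tell which one was meant.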

Why no infrastructure exists

Pre-registration solved this exact problem in adjacent fields:

Clinical trials, in 2007, when US law (FDAAA) made registration on ClinicalTrials.gov mandatory for applicable trials.
Psychology, in 2013, with Open Science Framework.
Economics, the same year, with the AEA registry.

ML never got the equivalent. The closest thing — the ML Reproducibility Challenge — is an annual peer-driven effort to re-run published experiments. It produces excellent post-hoc analysis but does not change the publication-time commitment surface.

The 2026 regulatory window is the part that matters most for builders. Article 12 of the EU AI Act requires high-risk systems to record events automatically (logs) over their lifetime. Article 18 requires providers to keep the documentation for 10 years. Both obligations apply from August 2, 2026. NIST AI RMF references content-addressed audit trails as a recommended control. ISO/IEC 42001:2023 requires documented information practices, which PRML is designed to support.

In other words: there is now a regulatory deadline by which "we have a tradition of reporting these numbers honestly" stops being a sufficient answer.

PRML in plain English

I drafted a small format, working draft v0.1, currently under public review. It is called PRML — Pre-Registered ML Manifest. The whole spec fits in a single YAML schema:

version: "prml/0.1"
claim_id: "01900000-0000-7000-8000-000000000000"
created_at: "2026-05-01T12:00:00Z"
metric: "accuracy"
comparator: ">="
threshold: 0.85
dataset:
  id: "imagenet-val-2012"
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
seed: 42
producer:
  id: "studio-11.co"

That is the entire required surface. Eight fields. Plain text. UTF-8. YAML 1.2 strict subset (block style only, lexicographic key ordering, no comments, no flow collections).

The format defines a deterministic canonicalization. Given any logical YAML mapping with these fields, there is exactly one canonical UTF-8 byte sequence. The SHA-256 of those bytes is the manifest hash.
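A minimal sketch of the canonicalize-then-hash idea, using only the standard library. The emission rules here (strings double-quoted via JSON escaping, `repr` for numbers, two-space indentation, one trailing newline) are assumptions made for illustration, not the serialization defined in the spec, so the digests it produces will not match the published ones. `manifest_hash` is a hypothetical name.

```python
import hashlib
import json

def _emit(mapping, indent):
    """Yield block-style lines with keys in lexicographic order."""
    pad = " " * indent
    for key in sorted(mapping):
        value = mapping[key]
        if isinstance(value, dict):
            yield f"{pad}{key}:"
            yield from _emit(value, indent + 2)
        elif isinstance(value, str):
            # Assumed quoting rule: always double-quote strings.
            yield f"{pad}{key}: {json.dumps(value, ensure_ascii=False)}"
        else:
            yield f"{pad}{key}: {value!r}"

def manifest_hash(manifest):
    """SHA-256 over one deterministic UTF-8 rendering of the manifest."""
    canonical = ("\n".join(_emit(manifest, 0)) + "\n").encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

a = {"metric": "accuracy", "comparator": ">=", "threshold": 0.85,
     "dataset": {"id": "imagenet-val-2012", "hash": "e3b0c442"}}
# Same logical content, different insertion order.
b = {"dataset": {"hash": "e3b0c442", "id": "imagenet-val-2012"},
     "threshold": 0.85, "comparator": ">=", "metric": "accuracy"}

print(manifest_hash(a) == manifest_hash(b))                         # True
print(manifest_hash(a) == manifest_hash({**a, "threshold": 0.86}))  # False
```

Key order does not affect the digest, while any change to a value does. That is the property the real canonicalization has to guarantee across implementations.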

The hash is published before the experiment runs. After the experiment, an independent verifier can:

Re-canonicalize the manifest.
Recompute SHA-256.
Compare against the published sidecar hash. If they differ, the manifest has been edited post-lock — exit code 3 (TAMPERED).
Load the dataset by its content hash. Verify byte integrity.
Run the metric computation under the seed. Compare against threshold.
Emit 0 (PASS), 10 (FAIL), or one of the diagnostic codes.

There is no trust in the producer required at verification time. Anyone with the manifest, the dataset, and the model can reproduce the verdict offline.
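The decision at the end of those steps can be sketched in a few lines. This is an illustration under assumptions, not the reference implementation: the exit codes 0, 3, and 10 come from the text, while the function name `verdict`, its arguments, and the exact set of comparator strings are hypothetical.

```python
import operator

PASS, TAMPERED, FAIL = 0, 3, 10  # exit codes named in the text

# Assumed comparator vocabulary; the spec may define a different set.
COMPARATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}

def verdict(recorded_hash, current_hash, observed, comparator, threshold):
    """Return an exit code: the tamper check always comes before the metric check."""
    if current_hash != recorded_hash:
        return TAMPERED  # manifest edited after lock: no verdict on the metric
    return PASS if COMPARATORS[comparator](observed, threshold) else FAIL

print(verdict("abc123", "abc123", 0.876, ">=", 0.87))  # 0
print(verdict("abc123", "abc123", 0.860, ">=", 0.87))  # 10
print(verdict("abc123", "ffff00", 0.876, ">=", 0.87))  # 3
```

The ordering matters: a mismatched hash short-circuits before the observed value is even looked at, so a tampered manifest can never produce a PASS.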

Honest amendments — "we found 12 mislabeled examples and re-ran" — do not overwrite. They append. Each new manifest carries a prior_hash field pointing to the manifest it amends. The chain is the audit log. When a regulator or reviewer asks "what was committed when?", the answer is one hash, and from that hash the entire history is recoverable.
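Recovering the history from one hash amounts to following `prior_hash` links until they run out. The sketch below assumes manifests are available in a dictionary keyed by their hash; that storage layout and the function name `history` are hypothetical, and only the `prior_hash` field comes from the text.

```python
def history(head_hash, store):
    """Return manifest hashes from the newest amendment back to the original."""
    chain, seen = [], set()
    current = head_hash
    while current is not None:
        if current in seen:
            raise ValueError("cycle in prior_hash chain")
        if current not in store:
            raise KeyError(f"missing manifest {current}")
        seen.add(current)
        chain.append(current)
        # The original manifest has no prior_hash, which ends the walk.
        current = store[current].get("prior_hash")
    return chain

# Toy store with shortened stand-in hashes.
store = {
    "h1": {"threshold": 0.85},
    "h2": {"threshold": 0.85, "prior_hash": "h1"},  # amendment after relabeling
    "h3": {"threshold": 0.85, "prior_hash": "h2"},
}
print(history("h3", store))  # ['h3', 'h2', 'h1']
```

A missing link raises instead of silently truncating the history, since a gap in the chain is exactly what a reviewer would want to know about.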

A worked example with the reference implementation

The reference implementation is a single-file Python CLI called falsify, MIT-licensed, 1287 lines. Install it the usual way:

pip install falsify

Initialize a claim:

falsify init imagenet-87

This writes .falsify/imagenet-87/spec.yaml with the required PRML fields as placeholders. Edit the file with your real values:

version: "prml/0.1"
claim_id: "01900000-0000-7000-8000-000000000010"
created_at: "2026-05-01T14:00:00Z"
metric: "accuracy"
comparator: ">="
threshold: 0.87
dataset:
  id: "imagenet-val-2012"
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
seed: 42
producer:
  id: "your-org.example"

Lock it:

$ falsify lock imagenet-87
locked: yes (sha256:1a3466cc08ee, locked_at 2026-05-01T14:00:00Z)

Now the spec is hash-bound. If anyone — including you — edits the YAML, the next falsify verify exits 3 and refuses to produce a verdict.

Run the experiment, capture the metric value (let us say 0.876), and verify:

$ falsify verify imagenet-87 --observed 0.876
PASS  metric=accuracy observed=0.876 >= threshold=0.87
exit 0

If the team had silently raised the threshold to 0.88 after seeing the result:

$ falsify verify imagenet-87 --observed 0.876
TAMPERED  spec hash drift detected
recorded: 1a3466cc08ee...
current:  7b2c9a5d1e4f...
exit 3

The CI pipeline halts. The deploy does not happen. There is no judgment call.

How do you know the canonicalization actually works?

The most reasonable skeptical question about a content-addressed format is: what guarantees that two implementations produce the same canonical bytes for the same input?

For v0.1 we publish 12 conformance test vectors. Each vector defines:

An input manifest (logical YAML, key order irrelevant).
The exact UTF-8 byte sequence the canonicalizer must produce.
The exact lowercase-hex SHA-256 of those bytes.

The vectors exercise:

Test	Property
TV-001	Minimal valid manifest
TV-002	Key-ordering invariance — random insertion order produces same hash
TV-003	Single-bit-of-content sensitivity — `0.85` vs `0.86` produces different hash
TV-004	Optional fields populated (`model.id`, `model.hash`, `dataset.uri`)
TV-005	Unicode handling in `producer.id`
TV-006	Maximum seed value (`2⁶⁴ − 1`)
TV-007	Minimum seed (`0`)
TV-008	Equality comparator with integer-valued threshold
TV-009	Amendment with `prior_hash` linkage
TV-010	`pass@k` metric for code generation
TV-011	AUROC with strict comparator
TV-012	Regression metric with `<=` comparator

A new implementation in Rust, Go, or TypeScript is conformant only if it reproduces all 12 vectors exactly. The reference implementation has 28 unittest assertions in CI that lock in the v0.1 hash contract; any code change that breaks a vector forces a v0.2 spec bump.
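A conformance check of this kind can be sketched as a small runner. The vector layout used here (keys `id`, `input`, `canonical_bytes`, `sha256`) is an assumption for illustration and may not match the published JSON files; the toy canonicalizer and the toy vector are also invented, so the digest below is not one of the twelve locked digests.

```python
import hashlib

def run_vectors(canonicalize, vectors):
    """Return the ids of vectors the canonicalizer fails to reproduce byte-for-byte."""
    failures = []
    for vec in vectors:
        produced = canonicalize(vec["input"])
        digest = hashlib.sha256(produced).hexdigest()
        if produced != vec["canonical_bytes"] or digest != vec["sha256"]:
            failures.append(vec["id"])
    return failures

def toy_canonicalize(mapping):
    """Stand-in canonicalizer for a flat mapping: sorted 'key: value' lines."""
    return "".join(f"{k}: {mapping[k]}\n" for k in sorted(mapping)).encode("utf-8")

expected = b"metric: accuracy\nthreshold: 0.85\n"
vectors = [{
    "id": "toy-001",
    "input": {"threshold": 0.85, "metric": "accuracy"},
    "canonical_bytes": expected,
    "sha256": hashlib.sha256(expected).hexdigest(),
}]

print(run_vectors(toy_canonicalize, vectors))  # [] means every vector was reproduced
```

Comparing the bytes as well as the digest makes failures easier to debug: a byte diff shows where two implementations diverge, while a digest mismatch only shows that they do.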

What it is not

PRML does not establish whether a claimed metric is correct, fair, or sufficient. It establishes only that the claim was committed before it was tested. Specifically:

Not a model card replacement. PRML manifests sit underneath model cards as the cryptographic floor.
Not a benchmark. PRML does not pick metrics for you.
Not a reproducibility framework. PRML does not ship code or data.
Not a tool. PRML is a format. falsify is one implementation. A second implementation in any language passes if it reproduces the test vectors.
Not a compliance product. It is a primitive that makes named regulatory obligations satisfiable with arithmetic verification rather than process attestation.

What it costs

The cost of adopting PRML at the experiment level is one hash function call. SHA-256 is FIPS 180-4, available in every standard library written since 2002. The format is UTF-8 plain text, readable in 2046 by any tool that can read text.

The cost of not adopting it scales with deployment scope. For a personal project, zero. For a research paper, growing pressure as reviewers begin to ask. For a product subject to EU AI Act Annex III obligations, measurable in regulatory exposure plus legal review hours. For a foundation model that will be cited in safety cases for a decade, the cost is roughly the credibility of every accuracy claim you have ever shipped.

What I am asking for

This is a working draft. v0.2 freeze is targeted 2026-05-22. Three concrete asks:

Format review. Is the canonical serialization in §3 of the spec unambiguous? Are there YAML 1.2 edge cases the spec misses?
Threat-model gaps. §6 of the spec enumerates six adversaries. What is missing?
Compliance correctness. The AI Act mapping binds PRML fields to Articles 12, 17, 18, 50, 72, and 73. Compliance lawyers and engineers in EU AI Act-adjacent roles: are the bindings defensible?

Discussion thread: github.com/sk8ordie84/falsify/discussions/6.

Tl;dr

Most published ML accuracy numbers are unfalsifiable in practice.
A small spec — eight fields, one hash function, one canonical serialization — gives published claims a cryptographic floor.
Reference implementation in Python, MIT, single file. Spec under CC BY 4.0.
v0.2 freeze in 3 weeks. Reviews, ambiguity reports, threat-model critiques are wanted.

Spec: spec.falsify.dev/v0.1
Code: github.com/sk8ordie84/falsify
Discussion: github.com/sk8ordie84/falsify/discussions/6

I built a CLI that hashes your ML accuracy claims before the experiment runs

sk8ordie84 — Wed, 29 Apr 2026 07:33:37 +0000

Last month, a customer told me our model's accuracy on their data was 71%, not the 94% we had shipped on the landing page.

I went back to the eval notebook. The threshold was still 0.94. The test set was named the same thing. But somewhere in the last three weeks, somebody had "refreshed" the test set, somebody else had tightened the metric definition, and the original 94% was now unreproducible. Not anybody's fault, exactly — just nobody had written down the contract before running the experiment.

That night I started building falsify. Three days later I shipped it.

This post is what I built, why I built it that small, and the one Python function that does most of the work.

The problem in one sentence

If you can change the spec after seeing the result, your accuracy claim is not falsifiable. And if it is not falsifiable, it is not really a claim — it is marketing.

Psychology and medicine figured this out the hard way and invented pre-registration. You write down the prediction, the threshold, and the analysis plan, hash it, timestamp it, and you cannot move it later without everyone knowing.

ML never adopted any of this. A git commit is the closest thing most teams have, and git commit --amend followed by a force-push will quietly erase the receipt.

So I wrote a CLI that does the smallest possible version of pre-registration: canonicalize a YAML spec, SHA-256 it, lock the hash, and refuse to let it move.

What "the smallest possible version" actually looks like

# falsify.yaml
claim:
  metric: accuracy
  threshold: 0.94
  dataset: customer_eval_v3
  dataset_sha256: 4f1a8b2c...
  model: ranker-7b-2026q1
  test_n: 1200
created_at: 2026-04-28T19:45:00Z

That is the contract. The CLI workflow is three commands:

pip install falsify
falsify lock falsify.yaml      # writes a .lock file with the hash
falsify check falsify.yaml --result actual_accuracy=0.91

Exit codes are the API:

0 — claim verified
10 — claim falsified (you missed the threshold, but cleanly)
3 — tamper detected (someone edited the spec after lock)
11 — spec invalid

10 and 3 being different exit codes is the whole point. "We didn't hit the number" is a different thing from "we moved the number."
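One way a pipeline might consume those codes is sketched below. The four numeric codes come from the list above; the enum, the function name `ci_action`, and the wording of each decision are hypothetical. In a real job the return code would come from running the CLI, for example via `subprocess.run([...]).returncode`.

```python
from enum import IntEnum

class Exit(IntEnum):
    VERIFIED = 0
    TAMPERED = 3
    FALSIFIED = 10
    SPEC_INVALID = 11

def ci_action(returncode):
    """Map an exit code to a pipeline decision (wording is illustrative)."""
    try:
        code = Exit(returncode)
    except ValueError:
        # Unknown code: fail closed rather than guess.
        return "error: unexpected exit code, fail closed"
    return {
        Exit.VERIFIED: "proceed: claim verified",
        Exit.FALSIFIED: "stop: threshold missed, spec untouched",
        Exit.TAMPERED: "stop and escalate: spec edited after lock",
        Exit.SPEC_INVALID: "stop: fix the spec before re-running",
    }[code]

for rc in (0, 10, 3, 11, 42):
    print(rc, "->", ci_action(rc))
```

Collapsing 10 and 3 into a generic nonzero failure would throw away the one distinction the tool exists to make.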

The one function that matters

The reason this works at all is YAML canonicalization. JSON looks canonical but isn't — key order, whitespace, and unicode forms can all drift while the document stays "the same." YAML is worse by default, but easy to canonicalize once you commit to a few rules.

Here is the actual hashing function from the source. It is small on purpose:

import hashlib
import unicodedata
import yaml  # PyYAML

def canonical_sha256(spec_path: str) -> str:
    """Return SHA-256 of a canonicalized YAML spec.

    Canonicalization rules:
      - Parse the document, drop comments and anchors
      - Recursively sort all mapping keys
      - Normalize all strings to NFC unicode
      - Re-emit as UTF-8 with LF line endings, no trailing whitespace
      - Hash the bytes
    """
    with open(spec_path, "rb") as f:
        data = yaml.safe_load(f)

    def normalize(node):
        if isinstance(node, dict):
            return {
                unicodedata.normalize("NFC", k): normalize(v)
                for k, v in sorted(node.items())
            }
        if isinstance(node, list):
            return [normalize(x) for x in node]
        if isinstance(node, str):
            return unicodedata.normalize("NFC", node)
        return node

    canonical = yaml.safe_dump(
        normalize(data),
        sort_keys=True,
        allow_unicode=True,
        default_flow_style=False,
        line_break="\n",
    ).encode("utf-8")

    return hashlib.sha256(canonical).hexdigest()

That is the entire trust primitive. Everything else in the 3925-line file — the lock file format, the CI integration, the tamper detection, the schema validation — is plumbing around this one function.

The reason it has to be exactly this strict: any wiggle room (key order, trailing whitespace, BOM, unicode form) is a place where someone can quietly change the spec and produce a "matching" hash. Canonicalize once, hash once, never look back.
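The unicode-form case is easy to demonstrate with the standard library alone. In this small sketch, two strings that render identically hash differently until both are normalized to NFC; the example word and the helper name `sha` are chosen for illustration.

```python
import hashlib
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

def sha(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Different code points, so different bytes and different digests.
print(sha(composed) == sha(decomposed))  # False

# After NFC normalization both collapse to the same byte sequence.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(sha(nfc_a) == sha(nfc_b))          # True
```

Without the normalization step, a spec copied through an editor that silently changes the unicode form would look untouched on screen and still fail the hash check.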

The CI moment

The point of all of this is the moment a teammate edits the spec after lock. Maybe they have a good reason. Maybe they don't. Either way, you want the system to notice.

# .github/workflows/eval.yml
- name: verify accuracy claim
  run: |
    falsify check falsify.yaml --result-file results.json

If anyone touches falsify.yaml after the lock, the action exits with code 3 and the PR cannot merge. The lie is blocked at the filesystem level, not by trust.

What I learned in three days

A few things surprised me while building this:

YAML canonicalization is most of the value. I spent way more time on the canonicalizer than on anything else. Every "clever" optimization I tried later turned out to be a place where two YAMLs with different meanings produced the same hash. Boring is correct.
Exit codes are an API. I almost shipped with just 0 and 1. Splitting "falsified" from "tampered" was the single biggest jump in how teams reacted to it. People immediately understood the difference.
One file is a feature. I kept resisting the urge to split it into a package. Auditors and skeptical SREs read single-file Python CLIs in one sitting. They do not read packages.
Dogfooding is non-negotiable. falsify locks its own test claims with falsify. The honesty badge on the README is generated by the tool itself, on its own metrics. If a tool that locks claims cannot lock its own, why would you trust it.
Agents change what one person can ship in a weekend. I built this solo in three days with Claude Opus 4.7 in the loop — pair programming, eval generation, doc drafting, the whole pipeline. The 518 tests and the YAML canonicalizer corner cases would have been a two-week solo grind without it. The actual design decisions were still mine; the agent just made the cost of being thorough a lot lower.

Try it

pip install falsify
falsify init

Repo: https://github.com/sk8ordie84/falsify
90-second demo: https://youtu.be/vVZTNeak5PA
Site: https://falsify.dev
PyPI: https://pypi.org/project/falsify/

Single file, MIT, Python 3.11+, stdlib plus pyyaml. If you ship any number followed by a percent sign, lock it before the experiment runs. It costs 30 seconds and saves the meeting where someone has to explain why the number changed.

I built a film camera simulator in a single HTML file: here's how

sk8ordie84 — Mon, 20 Apr 2026 18:51:08 +0000

Launched today: faxoffice1987.com — 8 film cameras simulated in Canvas 2D.

The constraints I set myself:

One HTML file
No build step, no dependencies, no npm install
Runs offline from a USB drive
No backend, no account, no uploads

The hard part: per-pixel color science. Each film stock (Tri-X, Portra, Velvia, Neopan Acros) has its own render path. Not a filter on top, but a decision at the pixel level.

Stack:

Vanilla JS, Canvas 2D
Cloudflare Pages + Functions (share links, license validation)
Polar.sh for checkout
localStorage for state

Pricing experiment: $29 one-time. No subscription. 1 camera free forever.

Would love architecture feedback, especially on the color science approach.

Link: https://faxoffice1987.com