<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sk8ordie84</title>
    <description>The latest articles on DEV Community by sk8ordie84 (@sk8ordie84).</description>
    <link>https://dev.to/sk8ordie84</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3889537%2Fa6734c67-3985-462e-a16e-4a5dc086772f.png</url>
      <title>DEV Community: sk8ordie84</title>
      <link>https://dev.to/sk8ordie84</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sk8ordie84"/>
    <language>en</language>
    <item>
      <title>What writing the same spec in four languages taught me about YAML</title>
      <dc:creator>sk8ordie84</dc:creator>
      <pubDate>Mon, 04 May 2026 15:04:37 +0000</pubDate>
      <link>https://dev.to/sk8ordie84/what-writing-the-same-spec-in-four-languages-taught-me-about-yaml-32ao</link>
      <guid>https://dev.to/sk8ordie84/what-writing-the-same-spec-in-four-languages-taught-me-about-yaml-32ao</guid>
      <description>&lt;p&gt;A few weeks ago, I shipped the first reference implementation of a small specification I'd been working on. Eight YAML fields, a SHA-256 hash, and the rule that the hash gets computed before the experiment runs. The point was modest: if you're going to publish an ML accuracy claim, you ought to be able to prove that the threshold you wrote down was the threshold you committed to, not the one you settled on after seeing the test set.&lt;/p&gt;

&lt;p&gt;The spec is called PRML — Pre-Registered ML Manifest. The Python reference implementation took a weekend.&lt;/p&gt;

&lt;p&gt;Then I made the mistake of writing it in JavaScript. Then in Go. Then in Rust.&lt;/p&gt;

&lt;p&gt;Each one found a different bug. Not in the implementations. In the spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the spec is supposed to do
&lt;/h2&gt;

&lt;p&gt;A PRML manifest is a tiny YAML document that locks an evaluation claim before the experiment runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prml/0.1&lt;/span&gt;
&lt;span class="na"&gt;claim_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;01900000-0000-7000-8000-000000000099&lt;/span&gt;
&lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-05-01T09:00:00Z'&lt;/span&gt;
&lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auroc&lt;/span&gt;
&lt;span class="na"&gt;comparator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;='&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;
&lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;credit-default-2026&lt;/span&gt;
  &lt;span class="na"&gt;hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855&lt;/span&gt;
&lt;span class="na"&gt;seed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;314159&lt;/span&gt;
&lt;span class="na"&gt;producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;studio-11.co&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You write the manifest. You compute SHA-256 over its canonical bytes. You commit that hash somewhere public. Then you run the experiment. Anyone with the manifest, the dataset, and the model can recompute the hash, recompute the metric, and check that you didn't move the goalposts.&lt;/p&gt;
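&lt;p&gt;The commit-then-verify step above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function names &lt;code&gt;manifest_digest&lt;/code&gt; and &lt;code&gt;matches_commitment&lt;/code&gt; and the sample bytes are hypothetical, and it assumes the canonical bytes have already been produced by a conforming canonicalizer.&lt;/p&gt;

```python
import hashlib
import hmac


def manifest_digest(canonical_bytes):
    """Return the lowercase hex SHA-256 digest of the canonical manifest bytes."""
    return hashlib.sha256(canonical_bytes).hexdigest()


def matches_commitment(canonical_bytes, committed_hex):
    """Recompute the digest and compare it with the published commitment."""
    recomputed = manifest_digest(canonical_bytes)
    # compare_digest does a constant-time comparison of the two ASCII strings.
    return hmac.compare_digest(recomputed, committed_hex.lower())


if __name__ == "__main__":
    # Hypothetical canonical bytes; a real manifest has all eight fields.
    canonical = b"metric: auroc\nthreshold: 0.85\n"
    committed = manifest_digest(canonical)  # published before the run

    # The untouched manifest verifies; one with a moved threshold does not.
    print(matches_commitment(canonical, committed))
    print(matches_commitment(canonical.replace(b"0.85", b"0.80"), committed))
```

&lt;p&gt;Everything interesting happens before this function is called: two parties only get the same digest if they first agree on the same bytes.&lt;/p&gt;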

&lt;p&gt;The cryptographic primitive is boring. The hard part, it turns out, is "canonical bytes."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Python implementation
&lt;/h2&gt;

&lt;p&gt;PyYAML is friendly. &lt;code&gt;yaml.safe_dump(manifest, sort_keys=True, default_flow_style=False)&lt;/code&gt; gives you something readable, lexicographically key-sorted, with a stable line layout. You hash the bytes. You're done.&lt;/p&gt;

&lt;p&gt;The whole thing fit in 1,287 lines of single-file Python. CLI verbs, conformance suite, manifest loader, hash sidecar generator, signature stub. The spec said "canonical YAML, lexicographic key order, UTF-8" and PyYAML basically did that for free.&lt;/p&gt;

&lt;p&gt;Then someone asked the obvious question: would another language produce the same bytes?&lt;/p&gt;

&lt;h2&gt;
  
  
  JavaScript said no
&lt;/h2&gt;

&lt;p&gt;The first port was JavaScript with &lt;code&gt;js-yaml&lt;/code&gt;. Same fields, same input, hash off by everything.&lt;/p&gt;

&lt;p&gt;The diff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python (PyYAML, sort_keys=True)&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;

&lt;span class="c1"&gt;# JavaScript (js-yaml default)&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks identical. The bytes weren't.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;js-yaml&lt;/code&gt; was emitting a trailing space after some scalars where PyYAML wasn't. Or rather, both libraries thought they were emitting "compact" YAML and disagreed about what that meant for floats with trailing zeros. The fix in JavaScript was a custom serializer that normalized whitespace by post-processing the output. Ugly, but it worked.&lt;/p&gt;

&lt;p&gt;I told myself this was a JavaScript-specific oddity. Three more languages would clarify it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go made it worse
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;gopkg.in/yaml.v3&lt;/code&gt; is a careful library. Go maps, however, are explicitly unordered: ranging over one yields keys in a randomized order, so any canonicalizer that walks a map directly sees a different key sequence on each run.&lt;/p&gt;

&lt;p&gt;So the canonical form rule "sort keys lexicographically" had been doing all the heavy lifting in Python, where dicts preserve insertion order, and silently in JavaScript, where objects also preserve insertion order in practice. In Go, every output was a different byte sequence until I sorted explicitly into a slice and serialized that.&lt;/p&gt;

&lt;p&gt;That was a real spec bug, not an implementation bug. The spec said "lexicographic key order" but didn't say "MUST NOT depend on language map iteration order." The Python implementation had been compliant by accident.&lt;/p&gt;

&lt;p&gt;Spec patch: §3.2 now says the algorithm is "extract keys, sort with byte-order comparison, serialize in that order, recurse into nested maps." Not "lexicographic order" — that was ambiguous between byte-order and Unicode collation. Byte-order it is.&lt;/p&gt;
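&lt;p&gt;As a rough sketch of that algorithm (extract keys, sort by byte order, serialize, recurse), here is one way it might look in Python. The helper name &lt;code&gt;canonical_lines&lt;/code&gt;, the two-space indent, and the use of &lt;code&gt;str()&lt;/code&gt; for scalars are assumptions made for illustration; the real scalar rules are a separate problem.&lt;/p&gt;

```python
def canonical_lines(mapping, indent=0):
    """Yield 'key: value' lines, keys sorted by UTF-8 bytes, nested maps recursed."""
    pad = " " * indent
    # Sort on the encoded bytes, never on locale-aware collation.
    for key in sorted(mapping, key=lambda k: k.encode("utf-8")):
        value = mapping[key]
        if isinstance(value, dict):
            yield f"{pad}{key}:"
            yield from canonical_lines(value, indent + 2)
        else:
            # Scalar rendering is out of scope here; str() is a placeholder.
            yield f"{pad}{key}: {value}"


if __name__ == "__main__":
    # Hypothetical manifest fragment; the capitalised key shows byte order at work.
    manifest = {
        "seed": 314159,
        "dataset": {"id": "credit-default-2026", "hash": "e3b0"},
        "Metric": "auroc",
    }
    print("\n".join(canonical_lines(manifest)))
```

&lt;p&gt;Byte order puts uppercase ASCII before lowercase, so &lt;code&gt;Metric&lt;/code&gt; sorts ahead of &lt;code&gt;dataset&lt;/code&gt;. Sorting on encoded bytes also sidesteps a portability trap: JavaScript's default string sort compares UTF-16 code units, which can disagree with UTF-8 byte order for characters outside the Basic Multilingual Plane.&lt;/p&gt;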

&lt;h2&gt;
  
  
  Rust caught the float bug
&lt;/h2&gt;

&lt;p&gt;Rust uses &lt;code&gt;serde_yaml&lt;/code&gt;, which has its own opinions about how to render numbers.&lt;/p&gt;

&lt;p&gt;The canonical form rule said "render integers as integers, floats with their full decimal representation." Python rendered &lt;code&gt;0.000001&lt;/code&gt; as &lt;code&gt;1.0e-06&lt;/code&gt;. Rust rendered the same value as &lt;code&gt;0.000001&lt;/code&gt;. JavaScript rendered it as &lt;code&gt;1e-6&lt;/code&gt;. Three different byte sequences, three different hashes, all "valid YAML" by their respective parsers.&lt;/p&gt;

&lt;p&gt;This wasn't a sortable thing. The spec just didn't say what canonical YAML float rendering looked like.&lt;/p&gt;

&lt;p&gt;I wrote it down: scientific notation for magnitudes outside [1e-4, 1e+15], decimal notation inside, mantissa with explicit &lt;code&gt;.0&lt;/code&gt; for integer-valued floats, exponent zero-padded to 2 digits. Implemented in all four languages. The TV-018 conformance vector tests this specifically. It now passes byte-for-byte everywhere.&lt;/p&gt;
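&lt;p&gt;A minimal Python sketch of that float rule follows. It is an illustration under stated assumptions rather than the reference code: the name &lt;code&gt;render_float&lt;/code&gt; is hypothetical, both interval bounds are treated as inclusive, positive exponents get an explicit &lt;code&gt;+&lt;/code&gt;, and NaN, infinities, and negative zero are not handled.&lt;/p&gt;

```python
from decimal import Decimal


def render_float(x):
    """Render a finite float: decimal inside [1e-4, 1e15], scientific outside."""
    x = float(x)
    if x == 0:
        return "0.0"  # the sign of zero is ignored in this sketch
    magnitude = abs(x)
    if magnitude >= 1e-4 and 1e15 >= magnitude:
        # In this range Python's repr() is plain decimal and always has a '.'.
        return repr(x)
    # Take the shortest round-trip digits and strip trailing zeros.
    sign, digits, exponent = Decimal(repr(x)).normalize().as_tuple()
    exp10 = exponent + len(digits) - 1
    # Integer-valued mantissas get an explicit '.0'.
    mantissa = str(digits[0]) + "." + ("".join(map(str, digits[1:])) or "0")
    exp_sign = "-" if 0 > exp10 else "+"
    # The exponent is zero-padded to at least two digits.
    return f"{'-' if sign else ''}{mantissa}e{exp_sign}{abs(exp10):02d}"


if __name__ == "__main__":
    for value in (0.85, 1.0, 0.000001, 1.5e-7, 2e15):
        print(value, render_float(value))
```

&lt;p&gt;The sketch leans on &lt;code&gt;repr()&lt;/code&gt; for shortest round-trip digits, which is a Python convenience; another language would need its own shortest-digits routine to match byte-for-byte.&lt;/p&gt;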

&lt;h2&gt;
  
  
  What the four implementations actually proved
&lt;/h2&gt;

&lt;p&gt;The spec wasn't precise. It was "precise enough for one library." The minute the second library had a different opinion about whitespace, key order, or float rendering, the SHA-256 hashes diverged. The protocol was correct. The format was underspecified.&lt;/p&gt;

&lt;p&gt;I now think this is the actual lesson: &lt;strong&gt;the spec is the second implementation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You don't know what your specification says until at least two independent implementations have to agree on what byte-for-byte equivalence means. Specs written by single-implementation teams are not specs. They're "PyYAML output, plus some prose."&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest caveat
&lt;/h2&gt;

&lt;p&gt;PRML v0.1.3 is not done. Even with byte-equivalent canonical output across four implementations, there's a structural gap I documented in §8.1 of the spec rather than hide.&lt;/p&gt;

&lt;p&gt;A producer can run an evaluation, get a result they don't like, and just not publish the manifest. Then they can re-run with different parameters, lock a different manifest, and publish only the favorable one. The cryptographic protocol is satisfied. The regulatory purpose — "you committed before you knew" — is not.&lt;/p&gt;

&lt;p&gt;The format itself can't fix this. It's closed only at the deployment layer: publish-before-run timestamps, sequential &lt;code&gt;claim_id&lt;/code&gt; allocation by an external registrar, or external pre-registration anchoring (OSF, blockchain, whatever).&lt;/p&gt;

&lt;p&gt;v0.2 normatively adopts the third option for the high-risk producer tier. v0.1 ships with the gap documented and three named mitigations.&lt;/p&gt;

&lt;p&gt;I think open documentation of a real failure mode is worth more than papered-over silence. Specs that pretend to be airtight earn less trust than specs that say "here's what we don't yet do."&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it is
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Spec (CC BY 4.0): &lt;a href="https://spec.falsify.dev/v0.1" rel="noopener noreferrer"&gt;https://spec.falsify.dev/v0.1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Repo (MIT): &lt;a href="https://github.com/studio-11-co/falsify" rel="noopener noreferrer"&gt;https://github.com/studio-11-co/falsify&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;v0.1.3 release notes: &lt;a href="https://github.com/studio-11-co/falsify/releases/tag/v0.1.3" rel="noopener noreferrer"&gt;https://github.com/studio-11-co/falsify/releases/tag/v0.1.3&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;v0.2 RFC roadmap (freeze 2026-05-22): in the repo's &lt;code&gt;spec/v0.2/&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;All four reference implementations in &lt;code&gt;impl/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've ever published an ML benchmark and wished there were a way to prove later you didn't tune the threshold post-hoc, this is the substrate. If you've ever written a spec and discovered it was secretly "your library's output," I would genuinely value notes on what you did about it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Working draft, v0.1.3. The format will probably change in v0.2 — review window is open.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>showdev</category>
      <category>rust</category>
    </item>
    <item>
      <title>I implemented PRML in two languages. Three things broke that the spec didn't warn about.</title>
      <dc:creator>sk8ordie84</dc:creator>
      <pubDate>Fri, 01 May 2026 19:50:42 +0000</pubDate>
      <link>https://dev.to/sk8ordie84/i-implemented-prml-in-two-languages-three-things-broke-that-the-spec-didnt-warn-about-2j62</link>
      <guid>https://dev.to/sk8ordie84/i-implemented-prml-in-two-languages-three-things-broke-that-the-spec-didnt-warn-about-2j62</guid>
      <description>&lt;p&gt;PRML v0.1 is a small specification I drafted three weeks ago. It binds an ML evaluation claim — &lt;em&gt;(metric, comparator, threshold, dataset hash, random seed, producer)&lt;/em&gt; — to a SHA-256 digest computed over canonical YAML bytes, &lt;em&gt;before&lt;/em&gt; the experiment runs. The spec is at &lt;a href="https://spec.falsify.dev/v0.1" rel="noopener noreferrer"&gt;spec.falsify.dev/v0.1&lt;/a&gt;. The Python reference implementation is on GitHub. v0.2 freezes 2026-05-22.&lt;/p&gt;

&lt;p&gt;A specification with one implementation is indistinguishable from that implementation's bugs. So this past weekend I sat down and built a second reference implementation, in Node.js, from scratch. The goal: take the prose spec, ignore the Python source, and produce byte-identical canonical bytes for all twelve v0.1 conformance vectors.&lt;/p&gt;

&lt;p&gt;It worked. 12/12 vectors pass byte-for-byte. The implementation is 404 lines of JavaScript with zero runtime dependencies beyond the Node.js standard library. You can run it from &lt;a href="https://github.com/sk8ordie84/falsify/tree/main/impl/js" rel="noopener noreferrer"&gt;&lt;code&gt;impl/js/falsify.js&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What's interesting is what &lt;em&gt;didn't&lt;/em&gt; work the first time. The exercise surfaced three quiet portability gotchas — places where the spec's prose and the spec's twelve vectors silently disagreed about what the bytes should be. Each of them is a real defect in the v0.1 specification, and each is now an action item for v0.2.&lt;/p&gt;

&lt;p&gt;This post is the three findings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 1 — Sixty-four-bit integer precision
&lt;/h2&gt;

&lt;p&gt;The first failing vector was &lt;strong&gt;TV-006&lt;/strong&gt;: &lt;code&gt;seed: 18446744073709551615&lt;/code&gt;. That's $2^{64} - 1$, the largest unsigned 64-bit integer the v0.1 spec allows for the seed field.&lt;/p&gt;

&lt;p&gt;Naive Node.js parses this through &lt;code&gt;JSON.parse&lt;/code&gt; into a &lt;code&gt;Number&lt;/code&gt;. JavaScript's &lt;code&gt;Number&lt;/code&gt; is IEEE-754 binary64. The largest &lt;em&gt;integer&lt;/em&gt; you can safely represent in binary64 is $2^{53} - 1$, which is about $9 \times 10^{15}$. Above that, integers round to the nearest representable float.&lt;/p&gt;

&lt;p&gt;So when Node.js read the test vector input file, the seed &lt;code&gt;18446744073709551615&lt;/code&gt; quietly became $2^{64}$, the nearest representable binary64 value, which JavaScript prints as &lt;code&gt;18446744073709552000&lt;/code&gt;: a decimal string $385$ larger than what the test vector said. The canonicalizer then dumped that wrong number, and the hash didn't match.&lt;/p&gt;
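&lt;p&gt;The rounding can be reproduced without Node.js, because a Python &lt;code&gt;float&lt;/code&gt; is the same IEEE-754 binary64 type as a JavaScript &lt;code&gt;Number&lt;/code&gt;. The snippet below is only a demonstration of the arithmetic; the variable names are mine.&lt;/p&gt;

```python
SEED = 18446744073709551615  # 2**64 - 1, the TV-006 seed

# Forcing the seed through binary64 mimics what a JavaScript Number holds.
as_binary64 = float(SEED)

# Exact integer value that is actually stored after rounding.
print(int(as_binary64))

# Rounding error of the stored value.
print(int(as_binary64) - SEED)

# Error of the decimal string that JavaScript prints for that value.
print(int("18446744073709552000") - SEED)

# Integers up to 2**53 - 1 survive the round trip unchanged.
print(int(float(2**53 - 1)) == 2**53 - 1)
```

&lt;p&gt;Python only gets TV-006 right because the seed never passes through a float: its &lt;code&gt;int&lt;/code&gt; stays an exact integer from parse to dump.&lt;/p&gt;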

&lt;p&gt;The same problem hits Go (&lt;code&gt;int64&lt;/code&gt;, $2^{63} - 1$ ceiling), Java (same), and any other language whose default integer type tops out below $2^{64} - 1$.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Native integer ceiling&lt;/th&gt;
&lt;th&gt;TV-006 round-trips?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python 3&lt;/td&gt;
&lt;td&gt;unbounded&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JavaScript Number&lt;/td&gt;
&lt;td&gt;$2^{53} - 1$&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;no&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go &lt;code&gt;int64&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;$2^{63} - 1$&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;no&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Java &lt;code&gt;long&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;$2^{63} - 1$&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;no&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust &lt;code&gt;u64&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;$2^{64} - 1$&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The PyYAML-based Python reference implementation works only because Python's &lt;code&gt;int&lt;/code&gt; is arbitrary-precision. The spec did not mention this, anywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix in the Node.js implementation:&lt;/strong&gt; parse the JSON text with a regex that wraps any 16-or-more-digit integer in a sentinel string before &lt;code&gt;JSON.parse&lt;/code&gt; sees it, then unwrap to &lt;code&gt;BigInt&lt;/code&gt; after parse. Twenty lines of JavaScript that no spec reader could have predicted from the prose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix for v0.2:&lt;/strong&gt; make &lt;code&gt;seed&lt;/code&gt; a quoted decimal string in the canonical form: &lt;code&gt;seed: '18446744073709551615'&lt;/code&gt;. Languages whose native integers are too narrow now get a string and can opt into BigInt themselves. The format is unambiguous from the bytes alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 2 — Integer-valued floats lose their type
&lt;/h2&gt;

&lt;p&gt;The next failing vector was &lt;strong&gt;TV-008&lt;/strong&gt;: a manifest with &lt;code&gt;threshold: 1.0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The expected canonical bytes contain &lt;code&gt;threshold: 1.0&lt;/code&gt;. The actual produced bytes contain &lt;code&gt;threshold: 1&lt;/code&gt;. The hash differed. This bothered me for ten minutes.&lt;/p&gt;

&lt;p&gt;It turns out: when JSON parsers encounter &lt;code&gt;1.0&lt;/code&gt; in a JSON document, many of them lose the float-ness. JavaScript's &lt;code&gt;JSON.parse&lt;/code&gt; returns &lt;code&gt;Number(1)&lt;/code&gt;, indistinguishable at runtime from the integer &lt;code&gt;1&lt;/code&gt;. When a YAML emitter then takes that number and serialises it, it has no signal that the producer wrote &lt;code&gt;1.0&lt;/code&gt; rather than &lt;code&gt;1&lt;/code&gt;. So it emits &lt;code&gt;1&lt;/code&gt;. The hash drifts.&lt;/p&gt;

&lt;p&gt;PyYAML doesn't have this problem because PyYAML's load-and-dump cycle uses Python's native &lt;code&gt;float&lt;/code&gt; type, which round-trips through &lt;code&gt;1.0&lt;/code&gt; cleanly. JavaScript's &lt;code&gt;Number&lt;/code&gt; cannot.&lt;/p&gt;

&lt;p&gt;This is a property of the JSON format itself. JSON does not distinguish integer-valued floats from integers. The information is destroyed at parse time, before any canonicalizer runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix in the Node.js implementation:&lt;/strong&gt; a small "this field should always render as a float" set, currently containing one element: &lt;code&gt;{'threshold'}&lt;/code&gt;. The canonicalizer checks the field name and forces &lt;code&gt;.0&lt;/code&gt; when the value is integer-valued. A field-specific hack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix for v0.2:&lt;/strong&gt; specify that &lt;code&gt;threshold&lt;/code&gt; always renders with at least one decimal place in the canonical form. Two lines in the spec close it. No field-aware emitter logic required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 3 — "Plain scalar" disagreements
&lt;/h2&gt;

&lt;p&gt;The third failing case was the &lt;em&gt;same&lt;/em&gt; vector, &lt;strong&gt;TV-008&lt;/strong&gt;: &lt;code&gt;comparator: ==&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The expected canonical bytes have &lt;code&gt;comparator: ==&lt;/code&gt;. JavaScript's &lt;code&gt;js-yaml&lt;/code&gt; library produced &lt;code&gt;comparator: '=='&lt;/code&gt;, single-quoted. SHA-256 is unforgiving; a two-byte difference produces an entirely different hash.&lt;/p&gt;

&lt;p&gt;YAML 1.1 and 1.2 both have a notion of "plain scalars": strings that don't need quotes because they contain no characters or patterns that would confuse the parser. A long list of rules governs whether a particular string can be plain: must not start with an indicator character (&lt;code&gt;-&lt;/code&gt;, &lt;code&gt;?&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;,&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, &lt;code&gt;]&lt;/code&gt;, &lt;code&gt;{&lt;/code&gt;, &lt;code&gt;}&lt;/code&gt;, &lt;code&gt;#&lt;/code&gt;, &lt;code&gt;&amp;amp;&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, &lt;code&gt;!&lt;/code&gt;, &lt;code&gt;|&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;'&lt;/code&gt;, &lt;code&gt;"&lt;/code&gt;, &lt;code&gt;%&lt;/code&gt;, &lt;code&gt;@&lt;/code&gt;, &lt;code&gt;`&lt;/code&gt;), must not contain colon-space, must not look like a number/boolean/null/timestamp, must not have leading/trailing whitespace, etc.&lt;/p&gt;

&lt;p&gt;PyYAML and &lt;code&gt;js-yaml&lt;/code&gt; implement this predicate with subtly different conservatism. PyYAML accepts &lt;code&gt;==&lt;/code&gt; as a plain scalar because none of the rules fire — there is no indicator character, no number resolution, no timestamp pattern. &lt;code&gt;js-yaml&lt;/code&gt; is more defensive: it sees a string that &lt;em&gt;could&lt;/em&gt; be confusing and quotes it.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;&amp;gt;=&lt;/code&gt;, &lt;code&gt;&amp;lt;=&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;&lt;/code&gt;, both libraries quote — the leading character is in the indicator set. So those work. Only &lt;code&gt;==&lt;/code&gt; is special, and only &lt;code&gt;==&lt;/code&gt; differs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix in the Node.js implementation:&lt;/strong&gt; I rewrote the plain-scalar predicate from scratch, in about fifty lines, matching PyYAML's behaviour. It checks for indicator-prefix, leading/trailing whitespace, colon-space and hash-space, number-resolution regex, boolean/null set, timestamp regex, and control-character escape. With this hand-rolled predicate, TV-008 reproduces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix for v0.2:&lt;/strong&gt; publish a formal canonicalization grammar. Or, simpler and aggressive: drop the plain-scalar concept entirely. Always single-quote every string scalar in the canonical form. The output is ~10% larger; the ambiguity surface is zero. No predicate needed; no second implementation reverse-engineering an emitter.&lt;/p&gt;
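&lt;p&gt;To show how small the always-quote rule is compared with a plain-scalar predicate, here is a hedged Python sketch. The name &lt;code&gt;quote_scalar&lt;/code&gt; is hypothetical, and rejecting control characters is my assumption; v0.2 would have to say what happens to them, since YAML single-quoted scalars have no escape sequences beyond the doubled quote.&lt;/p&gt;

```python
def quote_scalar(text):
    """Single-quote a string scalar, doubling any embedded single quotes."""
    if any(0x20 > ord(ch) for ch in text):
        # Single-quoted YAML cannot escape control characters; left unspecified here.
        raise ValueError("control characters are not handled in this sketch")
    return "'" + text.replace("'", "''") + "'"


if __name__ == "__main__":
    # Every comparator gets the same treatment, so none of them is special.
    for comparator in ("==", ">=", "it's"):
        print("comparator: " + quote_scalar(comparator))
```

&lt;p&gt;There is no branch on the content of the string, which is the point: nothing for two emitters to disagree about.&lt;/p&gt;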

&lt;h2&gt;
  
  
  What this exercise really proves
&lt;/h2&gt;

&lt;p&gt;It does not prove that PRML is bulletproof. It proves that PRML is &lt;em&gt;implementable in a second language&lt;/em&gt; — which, at the v0.1 stage, was not yet established. A specification existing in only one implementation is indistinguishable from that implementation's bugs. PRML is now demonstrably more than that.&lt;/p&gt;

&lt;p&gt;It also does not prove that &lt;em&gt;all&lt;/em&gt; PyYAML edge cases are covered. The Node.js implementation matches the twelve current vectors, which exercise specific cases. Adding new vectors (Unicode normalisation, control characters, very long strings, unusual line-folding) might reveal further divergences.&lt;/p&gt;

&lt;p&gt;The general lesson: &lt;strong&gt;a content-addressed format has to be specified in terms of the bytes it produces, not in terms of the emitter that produces them&lt;/strong&gt;. PyYAML's &lt;code&gt;safe_dump&lt;/code&gt; is a stable, careful, twenty-year-old emitter. It is not a specification. The next time someone wants to write a content-addressed YAML format — for SBOMs, for build provenance, for AI evaluation claims, anything — write the canonicalization grammar first, and &lt;em&gt;then&lt;/em&gt; implement it. Don't describe an emitter; describe bytes.&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.2 action items, summarised
&lt;/h2&gt;

&lt;p&gt;The findings translate to three concrete v0.2 specification changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;seed&lt;/code&gt; is a quoted decimal string.&lt;/strong&gt; Closes 64-bit integer precision portability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;threshold&lt;/code&gt; always renders with at least one decimal place.&lt;/strong&gt; Closes integer-valued float type loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always-quoted string scalars.&lt;/strong&gt; Eliminates the plain-scalar predicate ambiguity entirely.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Plus a fourth, broader change:&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Publish a formal canonicalization grammar in ABNF.&lt;/strong&gt; With the always-quoted rule, the grammar is short — about forty production rules. It becomes the source of truth for conformance, replacing the implicit "PyYAML's behaviour" reference.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full v0.2 roadmap, including six other items (algorithm agility, tolerance, multi-claim manifests, mandatory signatures for high-risk Annex III, twelve new conformance vectors, sidecar format extension), is at &lt;a href="https://github.com/sk8ordie84/falsify/blob/main/spec/v0.2/ROADMAP.md" rel="noopener noreferrer"&gt;&lt;code&gt;spec/v0.2/ROADMAP.md&lt;/code&gt;&lt;/a&gt;. The freeze is targeted for 2026-05-22, three weeks from this writing, and the five open RFC questions in the roadmap are the parts where outside opinion would carry the most weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to read along
&lt;/h2&gt;

&lt;p&gt;If you want to see the artefacts directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Node.js implementation:&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/tree/main/impl/js" rel="noopener noreferrer"&gt;&lt;code&gt;impl/js/falsify.js&lt;/code&gt;&lt;/a&gt; — 404 LOC, MIT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The portability findings document:&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/blob/main/spec/analysis/canonicalization-portability-v0.1.md" rel="noopener noreferrer"&gt;&lt;code&gt;spec/analysis/canonicalization-portability-v0.1.md&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The conformance suite:&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/tree/main/spec/test-vectors/v0.1" rel="noopener noreferrer"&gt;&lt;code&gt;spec/test-vectors/v0.1/&lt;/code&gt;&lt;/a&gt; — JSON, twelve entries with locked digests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The v0.1 spec:&lt;/strong&gt; &lt;a href="https://spec.falsify.dev/v0.1" rel="noopener noreferrer"&gt;&lt;code&gt;spec.falsify.dev/v0.1&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The arXiv preprint (working draft):&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/tree/main/spec/paper" rel="noopener noreferrer"&gt;&lt;code&gt;spec/paper/&lt;/code&gt;&lt;/a&gt; — 14-page LaTeX, CC BY 4.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public review thread:&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/discussions/6" rel="noopener noreferrer"&gt;GitHub Discussion #6&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to add a third implementation in a third language — Rust, Go, Java, Swift, OCaml — the test vectors are the contract. If your canonicalizer reproduces all twelve byte-for-byte, your implementation is conformant. Open a PR; I'll add it.&lt;/p&gt;

&lt;p&gt;— Studio-11 (independent), &lt;code&gt;hello@studio-11.co&lt;/code&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>opensource</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why ML accuracy numbers are unfalsifiable, and what a 1287-line Python tool does about it</title>
      <dc:creator>sk8ordie84</dc:creator>
      <pubDate>Fri, 01 May 2026 13:46:43 +0000</pubDate>
      <link>https://dev.to/sk8ordie84/why-ml-accuracy-numbers-are-unfalsifiable-and-what-a-1287-line-python-tool-does-about-it-40e1</link>
      <guid>https://dev.to/sk8ordie84/why-ml-accuracy-numbers-are-unfalsifiable-and-what-a-1287-line-python-tool-does-about-it-40e1</guid>
      <description>&lt;p&gt;A few weeks ago I was reading a model card for an open-weight code model. It claimed &lt;code&gt;pass@1 = 67%&lt;/code&gt; on HumanEval. I tried to reproduce it. I got &lt;code&gt;54%&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I went back to the model card. The metric was named, the dataset was named, the model checkpoint hash was published. Everything looked reproducible.&lt;/p&gt;

&lt;p&gt;Except: which version of HumanEval? The original 164 problems, or the de-contaminated 161? What temperature? What seed for nucleus sampling? What was the threshold the team committed to &lt;em&gt;before&lt;/em&gt; they ran the eval, and how do I know the published &lt;code&gt;67%&lt;/code&gt; is not the best of three runs at three temperatures?&lt;/p&gt;

&lt;p&gt;I read the paper. I read the README. I read the eval harness source. I could not answer any of those questions from the published artifacts. I could only ask the authors, and they could only tell me what they remembered. And I had no way to distinguish what they remembered from what they wished they had done.&lt;/p&gt;

&lt;p&gt;This is not a problem about that specific model card or those specific authors. It is a problem about every published ML accuracy number I have ever read.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five failure modes that current reporting practices cannot detect
&lt;/h2&gt;

&lt;p&gt;A claim like &lt;em&gt;"our model achieves 91.3% accuracy on benchmark X"&lt;/em&gt; can be wrong, in published form, in at least these five ways, none of which leave a forensic trace:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Threshold drift.&lt;/strong&gt; The team picked the threshold &lt;em&gt;after&lt;/em&gt; running the experiment, by looking at where their model happened to land, and reported that as if it were the original target.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slice selection.&lt;/strong&gt; The evaluation set was filtered after results were observed (e.g., dropping the 12 hardest examples because "they were mislabeled").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent re-runs.&lt;/strong&gt; Five seeds were tried; only the seed that passed was reported.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric ambiguity.&lt;/strong&gt; "F1" without specifying micro vs macro. "Accuracy" without specifying top-k. "Pass@1" without specifying temperature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset drift.&lt;/strong&gt; The benchmark hosted at the canonical URL changed between the experiment date and the publication date, and the team did not pin the bytes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these is consistent with current best-practice reporting. Each leaves the published number unfalsifiable: a reader cannot, even in principle, distinguish honest reporting from any of the above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why no infrastructure exists
&lt;/h2&gt;

&lt;p&gt;Pre-registration solved this exact problem in adjacent fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clinical trials, in 2007, with &lt;a href="https://clinicaltrials.gov" rel="noopener noreferrer"&gt;ClinicalTrials.gov&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Psychology, in 2013, with &lt;a href="https://osf.io/" rel="noopener noreferrer"&gt;Open Science Framework&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Economics, the same year, with the &lt;a href="https://www.aeaweb.org/journals/policies/rcts" rel="noopener noreferrer"&gt;AEA registry&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ML never got the equivalent. The closest thing — the &lt;a href="https://reproml.org/" rel="noopener noreferrer"&gt;ML Reproducibility Challenge&lt;/a&gt; — is an annual peer-driven effort to re-run published experiments. It produces excellent post-hoc analysis but does not change the publication-time commitment surface.&lt;/p&gt;

&lt;p&gt;The 2026 regulatory window is the part that matters most for builders. The &lt;a href="https://artificialintelligenceact.eu/article/12/" rel="noopener noreferrer"&gt;EU AI Act Article 12&lt;/a&gt; requires automatic logging of evaluation events for high-risk systems. &lt;a href="https://artificialintelligenceact.eu/article/18/" rel="noopener noreferrer"&gt;Article 18&lt;/a&gt; requires 10-year retention. Both take effect on August 2, 2026. NIST AI RMF references content-addressed audit trails as a recommended control. ISO/IEC 42001:2023 mandates documented information practices that PRML directly satisfies.&lt;/p&gt;

&lt;p&gt;In other words: there is now a regulatory deadline by which "we have a tradition of reporting these numbers honestly" stops being a sufficient answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  PRML in plain English
&lt;/h2&gt;

&lt;p&gt;I drafted a small format, working draft v0.1, currently under public review. It is called &lt;strong&gt;PRML — Pre-Registered ML Manifest&lt;/strong&gt;. The whole spec fits in a single YAML schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prml/0.1"&lt;/span&gt;
&lt;span class="na"&gt;claim_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;01900000-0000-7000-8000-000000000000"&lt;/span&gt;
&lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-01T12:00:00Z"&lt;/span&gt;
&lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy"&lt;/span&gt;
&lt;span class="na"&gt;comparator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;="&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;
&lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imagenet-val-2012"&lt;/span&gt;
  &lt;span class="na"&gt;hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"&lt;/span&gt;
&lt;span class="na"&gt;seed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;42&lt;/span&gt;
&lt;span class="na"&gt;producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;studio-11.co"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire required surface. Eight fields. Plain text. UTF-8. YAML 1.2 strict subset (block style only, lexicographic key ordering, no comments, no flow collections).&lt;/p&gt;

&lt;p&gt;The format defines a deterministic canonicalization. Given any logical YAML mapping with these fields, there is exactly one canonical UTF-8 byte sequence. The SHA-256 of those bytes is the manifest hash.&lt;/p&gt;

&lt;p&gt;The hash is published &lt;em&gt;before&lt;/em&gt; the experiment runs. After the experiment, an independent verifier can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Re-canonicalize the manifest.&lt;/li&gt;
&lt;li&gt;Recompute SHA-256.&lt;/li&gt;
&lt;li&gt;Compare against the published sidecar hash. If they differ, the manifest has been edited post-lock — exit code &lt;code&gt;3&lt;/code&gt; (TAMPERED).&lt;/li&gt;
&lt;li&gt;Load the dataset by its content hash. Verify byte integrity.&lt;/li&gt;
&lt;li&gt;Run the metric computation under the seed. Compare against threshold.&lt;/li&gt;
&lt;li&gt;Emit &lt;code&gt;0&lt;/code&gt; (PASS), &lt;code&gt;10&lt;/code&gt; (FAIL), or one of the diagnostic codes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is no trust in the producer required at verification time. Anyone with the manifest, the dataset, and the model can reproduce the verdict offline.&lt;/p&gt;
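&lt;p&gt;The verification steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the &lt;code&gt;falsify&lt;/code&gt; source: the function names are hypothetical, sorted-key JSON stands in for the PRML canonical YAML bytes, only the greater-or-equal comparator is wired up, and the dataset and metric steps are skipped (the observed value is passed in directly).&lt;/p&gt;

```python
import hashlib
import json
import operator

# Exit codes as described in the text.
EXIT_PASS = 0
EXIT_TAMPERED = 3
EXIT_FAIL = 10

# Only one comparator is wired up in this sketch.
COMPARATORS = {">=": operator.ge}


def manifest_hash(manifest):
    """Hash a stand-in canonical form (sorted-key JSON, not PRML canonical YAML)."""
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(manifest, recorded_hash, observed):
    """Return an exit code for a manifest, its sidecar hash, and an observed metric."""
    # Steps 1-3: re-canonicalize, re-hash, compare against the published hash.
    if manifest_hash(manifest) != recorded_hash:
        return EXIT_TAMPERED
    # Steps 4-5 (dataset integrity, metric computation) are out of scope here;
    # the caller supplies the observed metric value directly.
    compare = COMPARATORS[manifest["comparator"]]
    if compare(observed, manifest["threshold"]):
        return EXIT_PASS
    return EXIT_FAIL


manifest = {"metric": "accuracy", "comparator": ">=", "threshold": 0.87, "seed": 42}
locked = manifest_hash(manifest)

print(verify(manifest, locked, 0.876))                          # 0
print(verify(manifest, locked, 0.80))                           # 10
print(verify({**manifest, "threshold": 0.88}, locked, 0.876))   # 3
```

&lt;p&gt;The tamper check runs before the threshold comparison, so an edited manifest never yields a PASS or FAIL verdict.&lt;/p&gt;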

&lt;p&gt;Honest amendments — "we found 12 mislabeled examples and re-ran" — do not overwrite. They append. Each new manifest carries a &lt;code&gt;prior_hash&lt;/code&gt; field pointing to the manifest it amends. The chain is the audit log. When a regulator or reviewer asks &lt;em&gt;"what was committed when?"&lt;/em&gt;, the answer is one hash, and from that hash the entire history is recoverable.&lt;/p&gt;
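&lt;p&gt;A minimal sketch of how a reader might walk such a chain, assuming a hypothetical in-memory &lt;code&gt;store&lt;/code&gt; that maps each manifest hash to its parsed manifest. The function name and the short placeholder hashes are illustrative only; a real verifier would also recompute every hash along the way.&lt;/p&gt;

```python
def history(head_hash, store):
    """Walk prior_hash links from the newest manifest back to the original."""
    chain = []
    seen = set()
    current = head_hash
    while current is not None:
        if current in seen:
            raise ValueError("cycle in amendment chain")
        if current not in store:
            raise KeyError("missing manifest for hash " + current)
        seen.add(current)
        chain.append(current)
        # The original manifest has no prior_hash, which ends the walk.
        current = store[current].get("prior_hash")
    return chain


# Placeholder hashes; real ones would be 64-character SHA-256 hex digests.
store = {
    "h1": {"threshold": 0.87},
    "h2": {"threshold": 0.87, "prior_hash": "h1"},
    "h3": {"threshold": 0.87, "prior_hash": "h2"},
}

print(history("h3", store))  # ['h3', 'h2', 'h1']
```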

&lt;h2&gt;
  
  
  A worked example with the reference implementation
&lt;/h2&gt;

&lt;p&gt;The reference implementation is a single-file Python CLI called &lt;code&gt;falsify&lt;/code&gt;, MIT-licensed, 1287 lines. Install it the usual way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;falsify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize a claim:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;falsify init imagenet-87
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This writes &lt;code&gt;.falsify/imagenet-87/spec.yaml&lt;/code&gt; with the required PRML fields as placeholders. Edit the file with your real values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prml/0.1"&lt;/span&gt;
&lt;span class="na"&gt;claim_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;01900000-0000-7000-8000-000000000010"&lt;/span&gt;
&lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-01T14:00:00Z"&lt;/span&gt;
&lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy"&lt;/span&gt;
&lt;span class="na"&gt;comparator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;="&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.87&lt;/span&gt;
&lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imagenet-val-2012"&lt;/span&gt;
  &lt;span class="na"&gt;hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"&lt;/span&gt;
&lt;span class="na"&gt;seed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;42&lt;/span&gt;
&lt;span class="na"&gt;producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-org.example"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lock it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;falsify lock imagenet-87
locked: &lt;span class="nb"&gt;yes&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;sha256:1a3466cc08ee, locked_at 2026-05-01T14:00:00Z&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the spec is hash-bound. If anyone — including you — edits the YAML, the next &lt;code&gt;falsify verify&lt;/code&gt; exits 3 and refuses to produce a verdict.&lt;/p&gt;

&lt;p&gt;Run the experiment, capture the metric value (let us say 0.876), and verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;falsify verify imagenet-87 &lt;span class="nt"&gt;--observed&lt;/span&gt; 0.876
PASS  &lt;span class="nv"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;accuracy &lt;span class="nv"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.876 &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nv"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.87
&lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the team had silently raised the threshold to 0.88 after seeing the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;falsify verify imagenet-87 &lt;span class="nt"&gt;--observed&lt;/span&gt; 0.876
TAMPERED  spec &lt;span class="nb"&gt;hash &lt;/span&gt;drift detected
recorded: 1a3466cc08ee...
current:  7b2c9a5d1e4f...
&lt;span class="nb"&gt;exit &lt;/span&gt;3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CI pipeline halts. The deploy does not happen. There is no judgment call.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you know the canonicalization actually works?
&lt;/h2&gt;

&lt;p&gt;The most reasonable skeptical question about a content-addressed format is: &lt;em&gt;what guarantees that two implementations produce the same canonical bytes for the same input?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For v0.1 we publish &lt;a href="https://spec.falsify.dev/test-vectors/v0.1/test-vectors.md" rel="noopener noreferrer"&gt;12 conformance test vectors&lt;/a&gt;. Each vector defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An input manifest (logical YAML, key order irrelevant).&lt;/li&gt;
&lt;li&gt;The exact UTF-8 byte sequence the canonicalizer must produce.&lt;/li&gt;
&lt;li&gt;The exact lowercase-hex SHA-256 of those bytes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The vectors exercise:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TV-001&lt;/td&gt;
&lt;td&gt;Minimal valid manifest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-002&lt;/td&gt;
&lt;td&gt;Key-ordering invariance — random insertion order produces same hash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-003&lt;/td&gt;
&lt;td&gt;Content sensitivity: a one-character change (&lt;code&gt;0.85&lt;/code&gt; vs &lt;code&gt;0.86&lt;/code&gt;) produces a different hash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-004&lt;/td&gt;
&lt;td&gt;Optional fields populated (&lt;code&gt;model.id&lt;/code&gt;, &lt;code&gt;model.hash&lt;/code&gt;, &lt;code&gt;dataset.uri&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-005&lt;/td&gt;
&lt;td&gt;Unicode handling in &lt;code&gt;producer.id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-006&lt;/td&gt;
&lt;td&gt;Maximum seed value (&lt;code&gt;2⁶⁴ − 1&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-007&lt;/td&gt;
&lt;td&gt;Minimum seed (&lt;code&gt;0&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-008&lt;/td&gt;
&lt;td&gt;Equality comparator with integer-valued threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-009&lt;/td&gt;
&lt;td&gt;Amendment with &lt;code&gt;prior_hash&lt;/code&gt; linkage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-010&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pass@k&lt;/code&gt; metric for code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-011&lt;/td&gt;
&lt;td&gt;AUROC with strict comparator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-012&lt;/td&gt;
&lt;td&gt;Regression metric with &lt;code&gt;&amp;lt;=&lt;/code&gt; comparator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A new implementation in Rust, Go, or TypeScript is conformant only if it reproduces all 12 vectors exactly. The reference implementation has 28 unittest assertions in CI that lock in the v0.1 hash contract; any code change that breaks a vector forces a v0.2 spec bump.&lt;/p&gt;
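&lt;p&gt;A conformance check of this kind can be sketched as below. The vector layout (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;canonical_bytes&lt;/code&gt;, &lt;code&gt;sha256&lt;/code&gt;) and all function names are assumptions made for illustration, and the demo vector is built from a toy JSON canonicalizer rather than taken from the published PRML vectors.&lt;/p&gt;

```python
import hashlib
import json


def check_vectors(canonicalize, vectors):
    """Return the ids of vectors the canonicalizer fails to reproduce exactly."""
    failures = []
    for vec in vectors:
        produced = canonicalize(vec["input"])
        digest = hashlib.sha256(produced).hexdigest()
        if produced != vec["canonical_bytes"] or digest != vec["sha256"]:
            failures.append(vec["id"])
    return failures


def sorted_json(manifest):
    # Toy canonicalizer for the demo only; PRML canonical bytes are YAML.
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def unsorted_json(manifest):
    # Deliberately broken: output depends on key insertion order.
    return json.dumps(manifest, separators=(",", ":")).encode("utf-8")


expected = sorted_json({"seed": 42, "metric": "accuracy"})
vectors = [
    {
        "id": "demo-001",
        "input": {"seed": 42, "metric": "accuracy"},
        "canonical_bytes": expected,
        "sha256": hashlib.sha256(expected).hexdigest(),
    }
]

print(check_vectors(sorted_json, vectors))    # []
print(check_vectors(unsorted_json, vectors))  # ['demo-001']
```

&lt;p&gt;Comparing the bytes as well as the digest makes a failure easier to debug: a byte diff shows where two canonicalizers disagree, while a hash mismatch only shows that they do.&lt;/p&gt;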

&lt;h2&gt;
  
  
  What it is not
&lt;/h2&gt;

&lt;p&gt;PRML does not establish &lt;em&gt;whether&lt;/em&gt; a claimed metric is correct, fair, or sufficient. It establishes only &lt;em&gt;that&lt;/em&gt; the claim was committed before it was tested. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not a model card replacement. PRML manifests sit &lt;em&gt;underneath&lt;/em&gt; model cards as the cryptographic floor.&lt;/li&gt;
&lt;li&gt;Not a benchmark. PRML does not pick metrics for you.&lt;/li&gt;
&lt;li&gt;Not a reproducibility framework. PRML does not ship code or data.&lt;/li&gt;
&lt;li&gt;Not a tool. PRML is a format. &lt;code&gt;falsify&lt;/code&gt; is one implementation. A second implementation in any language passes if it reproduces the test vectors.&lt;/li&gt;
&lt;li&gt;Not a compliance product. It is a primitive that makes named regulatory obligations satisfiable with arithmetic verification rather than process attestation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What it costs
&lt;/h2&gt;

&lt;p&gt;The cost of adopting PRML at the experiment level is one hash function call. SHA-256 is specified in FIPS 180-4, has been standardized since 2002, and ships in the standard library or core tooling of most mainstream languages. The format is UTF-8 plain text, readable in 2046 by any tool that can read text.&lt;/p&gt;

&lt;p&gt;The cost of &lt;em&gt;not&lt;/em&gt; adopting it scales with deployment scope. For a personal project, zero. For a research paper, growing pressure as reviewers begin to ask. For a product subject to EU AI Act Annex III obligations, measurable in regulatory exposure plus legal review hours. For a foundation model that will be cited in safety cases for a decade, the cost is roughly &lt;em&gt;the credibility of every accuracy claim you have ever shipped&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am asking for
&lt;/h2&gt;

&lt;p&gt;This is a working draft. The v0.2 freeze is targeted for &lt;strong&gt;2026-05-22&lt;/strong&gt;. Three concrete asks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Format review.&lt;/strong&gt; Is the canonical serialization in §3 of &lt;a href="https://spec.falsify.dev/v0.1" rel="noopener noreferrer"&gt;the spec&lt;/a&gt; unambiguous? Are there YAML 1.2 edge cases the spec misses?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threat-model gaps.&lt;/strong&gt; §6 of the spec enumerates six adversaries. What is missing?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance correctness.&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/blob/main/spec/compliance/AI-Act-mapping-v0.1.md" rel="noopener noreferrer"&gt;The AI Act mapping&lt;/a&gt; ties PRML fields to Articles 12, 17, 18, 50, 72, and 73. Compliance lawyers and engineers in EU AI Act-adjacent roles: are the bindings defensible?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Discussion thread: &lt;a href="https://github.com/sk8ordie84/falsify/discussions/6" rel="noopener noreferrer"&gt;github.com/sk8ordie84/falsify/discussions/6&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Most published ML accuracy numbers are unfalsifiable in practice.&lt;/li&gt;
&lt;li&gt;A small spec — eight fields, one hash function, one canonical serialization — gives published claims a cryptographic floor.&lt;/li&gt;
&lt;li&gt;Reference implementation in Python, MIT, single file. Spec under CC BY 4.0.&lt;/li&gt;
&lt;li&gt;The v0.2 freeze is in about three weeks. Reviews, ambiguity reports, and threat-model critiques are welcome.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spec: &lt;a href="https://spec.falsify.dev/v0.1" rel="noopener noreferrer"&gt;spec.falsify.dev/v0.1&lt;/a&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/sk8ordie84/falsify" rel="noopener noreferrer"&gt;github.com/sk8ordie84/falsify&lt;/a&gt;&lt;br&gt;
Discussion: &lt;a href="https://github.com/sk8ordie84/falsify/discussions/6" rel="noopener noreferrer"&gt;github.com/sk8ordie84/falsify/discussions/6&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I built a CLI that hashes your ML accuracy claims before the experiment runs</title>
      <dc:creator>sk8ordie84</dc:creator>
      <pubDate>Wed, 29 Apr 2026 07:33:37 +0000</pubDate>
      <link>https://dev.to/sk8ordie84/i-built-a-cli-that-hashes-your-ml-accuracy-claims-before-the-experiment-runs-ick</link>
      <guid>https://dev.to/sk8ordie84/i-built-a-cli-that-hashes-your-ml-accuracy-claims-before-the-experiment-runs-ick</guid>
      <description>&lt;h1&gt;
  
  
  I built a CLI that hashes your ML accuracy claims before the experiment runs
&lt;/h1&gt;

&lt;p&gt;Last month, a customer told me our model's accuracy on their data was 71%, not the 94% we had shipped on the landing page.&lt;/p&gt;

&lt;p&gt;I went back to the eval notebook. The threshold was still 0.94. The test set was named the same thing. But somewhere in the last three weeks, somebody had "refreshed" the test set, somebody else had tightened the metric definition, and the original 94% was now unreproducible. Not anybody's fault, exactly — just nobody had written down the contract before running the experiment.&lt;/p&gt;

&lt;p&gt;That night I started building falsify. Three days later I shipped it.&lt;/p&gt;

&lt;p&gt;This post is what I built, why I built it that small, and the one Python function that does most of the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem in one sentence
&lt;/h2&gt;

&lt;p&gt;If you can change the spec after seeing the result, your accuracy claim is not falsifiable. And if it is not falsifiable, it is not really a claim — it is marketing.&lt;/p&gt;

&lt;p&gt;Psychology and medicine figured this out the hard way and invented pre-registration. You write down the prediction, the threshold, and the analysis plan, hash it, timestamp it, and you cannot move it later without everyone knowing.&lt;/p&gt;

&lt;p&gt;ML never adopted any of this. A &lt;code&gt;git commit&lt;/code&gt; is the closest thing most teams have, and &lt;code&gt;git commit --amend&lt;/code&gt; followed by a force-push will quietly erase the receipt.&lt;/p&gt;

&lt;p&gt;So I wrote a CLI that does the smallest possible version of pre-registration: canonicalize a YAML spec, SHA-256 it, lock the hash, and refuse to let it move.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "the smallest possible version" actually looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# falsify.yaml&lt;/span&gt;
&lt;span class="na"&gt;claim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;accuracy&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.94&lt;/span&gt;
  &lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_eval_v3&lt;/span&gt;
  &lt;span class="na"&gt;dataset_sha256&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4f1a8b2c...&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ranker-7b-2026q1&lt;/span&gt;
  &lt;span class="na"&gt;test_n&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1200&lt;/span&gt;
&lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2026-04-28T19:45:00Z&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the contract. The CLI workflow is three commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;falsify
falsify lock falsify.yaml      &lt;span class="c"&gt;# writes a .lock file with the hash&lt;/span&gt;
falsify check falsify.yaml &lt;span class="nt"&gt;--result&lt;/span&gt; &lt;span class="nv"&gt;actual_accuracy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.91
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit codes are the API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;0&lt;/code&gt; — claim verified&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;10&lt;/code&gt; — claim falsified (you missed the threshold, but cleanly)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;3&lt;/code&gt; — tamper detected (someone edited the spec after lock)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;11&lt;/code&gt; — spec invalid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;10&lt;/code&gt; and &lt;code&gt;3&lt;/code&gt; being different exit codes is the whole point. "We didn't hit the number" is a different thing from "we moved the number."&lt;/p&gt;
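&lt;p&gt;One way a wrapper script might consume these codes, sketched in Python. The enum and function names are hypothetical and not part of the &lt;code&gt;falsify&lt;/code&gt; CLI; only the four numeric values come from the list above.&lt;/p&gt;

```python
from enum import IntEnum


class ExitCode(IntEnum):
    VERIFIED = 0
    TAMPERED = 3
    FALSIFIED = 10
    SPEC_INVALID = 11


MESSAGES = {
    ExitCode.VERIFIED: "claim verified",
    ExitCode.TAMPERED: "spec edited after lock",
    ExitCode.FALSIFIED: "threshold missed, spec intact",
    ExitCode.SPEC_INVALID: "spec invalid",
}


def triage(returncode):
    """Map a process return code to a short message for a CI log."""
    try:
        code = ExitCode(returncode)
    except ValueError:
        return "unknown exit code " + str(returncode)
    return MESSAGES[code]


print(triage(10))  # threshold missed, spec intact
print(triage(3))   # spec edited after lock
print(triage(1))   # unknown exit code 1
```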

&lt;h2&gt;
  
  
  The one function that matters
&lt;/h2&gt;

&lt;p&gt;The reason this works at all is YAML canonicalization. JSON looks canonical but isn't: key order, whitespace, and Unicode normalization forms can all drift while the document stays "the same." YAML is worse by default, but easy to canonicalize once you commit to a few rules.&lt;/p&gt;

&lt;p&gt;Here is the actual hashing function from the source. It is small on purpose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;  &lt;span class="c1"&gt;# PyYAML
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;canonical_sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return SHA-256 of a canonicalized YAML spec.

    Canonicalization rules:
      - Parse the document, drop comments and anchors
      - Recursively sort all mapping keys
      - Normalize all strings to NFC unicode
      - Re-emit as UTF-8 with LF line endings, no trailing whitespace
      - Hash the bytes
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;unicodedata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NFC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NFC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;

    &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;allow_unicode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default_flow_style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;line_break&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire trust primitive. Everything else in the 3925-line file — the lock file format, the CI integration, the tamper detection, the schema validation — is plumbing around this one function.&lt;/p&gt;

&lt;p&gt;The reason it has to be exactly this strict: any wiggle room (key order, trailing whitespace, BOM, unicode form) is a place where someone can quietly change the spec and produce a "matching" hash. Canonicalize once, hash once, never look back.&lt;/p&gt;
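&lt;p&gt;The Unicode-form case is easy to reproduce with the standard library alone. The short demo below is separate from the function above; the helper names are made up for illustration. Two strings that render identically hash differently until both are normalized to NFC.&lt;/p&gt;

```python
import hashlib
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent


def sha256_text(text):
    """Hash the UTF-8 bytes of a string as-is."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sha256_nfc(text):
    """Normalize to NFC first, then hash."""
    return sha256_text(unicodedata.normalize("NFC", text))


print(sha256_text(composed) == sha256_text(decomposed))  # False
print(sha256_nfc(composed) == sha256_nfc(decomposed))    # True
```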

&lt;h2&gt;
  
  
  The CI moment
&lt;/h2&gt;

&lt;p&gt;The point of all of this is the moment a teammate edits the spec after lock. Maybe they have a good reason. Maybe they don't. Either way, you want the system to notice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/eval.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify accuracy claim&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;falsify check falsify.yaml --result-file results.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If anyone touches &lt;code&gt;falsify.yaml&lt;/code&gt; after the lock, the action exits with code 3 and the PR cannot merge. The lie is blocked at the filesystem level, not by trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned in three days
&lt;/h2&gt;

&lt;p&gt;A few things surprised me while building this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YAML canonicalization is most of the value.&lt;/strong&gt; I spent way more time on the canonicalizer than on anything else. Every "clever" optimization I tried later turned out to be a place where two specs that meant different things produced the same hash. Boring is correct.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exit codes are an API.&lt;/strong&gt; I almost shipped with just &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;. Splitting "falsified" from "tampered" was the single biggest jump in how teams reacted to it. People immediately understood the difference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One file is a feature.&lt;/strong&gt; I kept resisting the urge to split it into a package. Auditors and skeptical SREs read single-file Python CLIs in one sitting. They do not read packages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dogfooding is non-negotiable.&lt;/strong&gt; falsify locks its own test claims with falsify. The honesty badge on the README is generated by the tool itself, on its own metrics. If a tool that locks claims cannot lock its own, why would you trust it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agents change what one person can ship in a weekend.&lt;/strong&gt; I built this solo in three days with Claude Opus 4.7 in the loop — pair programming, eval generation, doc drafting, the whole pipeline. The 518 tests and the YAML canonicalizer corner cases would have been a two-week solo grind without it. The actual design decisions were still mine; the agent just made the cost of being thorough a lot lower.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;falsify
falsify init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/sk8ordie84/falsify" rel="noopener noreferrer"&gt;https://github.com/sk8ordie84/falsify&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;90-second demo: &lt;a href="https://youtu.be/vVZTNeak5PA" rel="noopener noreferrer"&gt;https://youtu.be/vVZTNeak5PA&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://falsify.dev" rel="noopener noreferrer"&gt;https://falsify.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyPI: &lt;a href="https://pypi.org/project/falsify/" rel="noopener noreferrer"&gt;https://pypi.org/project/falsify/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Single file, MIT, Python 3.11+, stdlib plus pyyaml. If you ship any number followed by a percent sign, lock it before the experiment runs. It costs 30 seconds and saves the meeting where someone has to explain why the number changed.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I built a film camera simulator in a single HTML file: here's how</title>
      <dc:creator>sk8ordie84</dc:creator>
      <pubDate>Mon, 20 Apr 2026 18:51:08 +0000</pubDate>
      <link>https://dev.to/sk8ordie84/i-built-a-film-camera-simulator-in-a-single-html-file-heres-how-403b</link>
      <guid>https://dev.to/sk8ordie84/i-built-a-film-camera-simulator-in-a-single-html-file-heres-how-403b</guid>
      <description>&lt;p&gt;Launched today: faxoffice1987.com — 8 film cameras simulated in Canvas 2D.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The constraints I set myself:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One HTML file&lt;/li&gt;
&lt;li&gt;No build step, no dependencies, no npm install&lt;/li&gt;
&lt;li&gt;Runs offline from a USB drive&lt;/li&gt;
&lt;li&gt;No backend, no account, no uploads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The hard part:&lt;/strong&gt; per-pixel color science. Each film stock (Tri-X, Portra, Velvia, Neopan Acros) has its own render path. It is not a filter layered on top but a decision made at the pixel level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vanilla JS, Canvas 2D&lt;/li&gt;
&lt;li&gt;Cloudflare Pages + Functions (share links, license validation)&lt;/li&gt;
&lt;li&gt;Polar.sh for checkout&lt;/li&gt;
&lt;li&gt;localStorage for state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing experiment:&lt;/strong&gt; $29 one-time. No subscription. One camera is free forever.&lt;/p&gt;

&lt;p&gt;I would love architecture feedback, especially on the color science approach.&lt;/p&gt;

&lt;p&gt;Link: &lt;a href="https://faxoffice1987.com" rel="noopener noreferrer"&gt;https://faxoffice1987.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>canvas</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
