TL;DR — I built a differential tester for TOON: it runs data through one implementation's encoder and a different implementation's decoder, and checks the round-trip survived. On its first run it found two silent-corruption bugs (a rounded 64-bit integer and an empty array that decoded to a corrupted string), both filed upstream. The hard part wasn't finding bugs — it was building a comparison oracle honest enough that its FAIL means something. Repo: github.com/antrixy/toon-diff
I built a small tool that checks whether independent TOON implementations actually agree with each other. On its first real run — across the TypeScript reference and the Python port — it found two silent-corruption bugs. Both are now filed upstream.
This post is about why the approach finds bugs that ordinary conformance suites miss, and about the one genuinely tricky part: building a comparison oracle that doesn't corrupt the data while it's checking it.
The setup: TOON, and the promise of lossless round-trips
TOON (Token-Oriented Object Notation) is a compact, line-oriented encoding of the JSON data model, designed for things like trimming token counts in LLM prompts. The whole value proposition rests on one property:
JSON → TOON → JSON should give you back what you started with.
There are independent TOON implementations in 25+ languages. Each ships its own conformance tests. Each is green. And yet the moment you have more than one implementation, a new failure mode appears that none of those green test suites can see.
Why conformance suites miss the interesting bugs
A conformance suite checks one implementation against a set of blessed expected outputs: given input X, the encoder must produce exactly Y. That's useful, but it has a structural blind spot.
Every implementation round-trips its own output just fine. The TS encoder produces something the TS decoder reads back perfectly. The Python encoder produces something the Python decoder reads back perfectly. Both suites pass. The bug lives in the gap between them — when TS encodes something and Python has to decode it, or vice versa.
That's what differential testing targets directly. Instead of checking against expected outputs, you check implementations against each other:
decode_Y( encode_X( value ) ) == value for every ordered pair (X, Y)
With N implementations you run N×N ordered pairs (including each against itself, which is your control). Two implementations is a 4-cell matrix; three is 9. Any cell that fails is a place where two implementations disagree about what a given value means — and because both sides individually pass their own tests, nobody had noticed.
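As a minimal sketch of that enumeration (the `orderedPairs` helper and the implementation labels are illustrative names, not taken from the repo), the N×N matrix, self-pair controls included, can be generated like this:

```typescript
// Hypothetical helper: enumerate every (encoder, decoder) ordered pair,
// including the self-pairs on the diagonal that act as controls.
function orderedPairs<T>(impls: readonly T[]): Array<[T, T]> {
  const pairs: Array<[T, T]> = [];
  for (const encoder of impls) {
    for (const decoder of impls) {
      pairs.push([encoder, decoder]);
    }
  }
  return pairs;
}

const twoImpls = orderedPairs(["ts", "py"]);           // 2 x 2 = 4 cells
const threeImpls = orderedPairs(["ts", "py", "rust"]); // 3 x 3 = 9 cells
console.log(twoImpls.length, threeImpls.length);
```

Keeping the diagonal in the list is deliberate: a self-pair that fails says something different from a crossed pair that fails.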
The harness for this is almost trivial. Each implementation gets wrapped in a tiny adapter with a uniform, text-in/text-out contract:
export interface Adapter {
  name: string;
  encode(jsonText: string): Promise<string>; // JSON text -> TOON text
  decode(toonText: string): Promise<string>; // TOON text -> JSON text
}
Working on text means the harness never has to hold any language's native value model. Adding a new language is one adapter. The driver loop is just:
for (const X of adapters)
  for (const Y of adapters)
    for (const c of cases) {
      const toon = await X.encode(c.jsonText);
      const back = await Y.decode(toon);
      if (!oracle.equal(c.value, ingest(back))) report(X, Y, c);
    }
All the difficulty is hiding in one word: equal.
The hard part: an oracle that doesn't corrupt the data it's judging
Here's the trap. You want to compare the original value against the round-tripped value. The obvious way is to parse both back into native objects and compare. In JavaScript that means JSON.parse. And JSON.parse will quietly destroy the exact cases you most want to test.
Consider the integer 9007199254740993, which is 2⁵³ + 1:
JSON.parse("9007199254740993") // -> 9007199254740992 (!!)
It comes back as 2⁵³, off by one, because it rounded through an IEEE-754 double. If your oracle parses values this way, then when one implementation preserves the integer and another rounds it, your oracle rounds both and reports a false PASS — on the single most important case in the suite. The comparator silently corrupts the evidence.
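The false PASS is easy to reproduce in isolation. The sketch below uses a deliberately naive comparator (`naiveEqual` is an illustration of the trap, not the tool's oracle) and feeds it the original lexeme and the rounded one:

```typescript
// Naive oracle: parse both sides with plain JSON.parse and compare the
// resulting numbers. This is the comparator the post warns against.
function naiveEqual(aJson: string, bJson: string): boolean {
  return JSON.parse(aJson) === JSON.parse(bJson);
}

const originalLexeme: string = "9007199254740993"; // 2^53 + 1, as written in the input
const roundedLexeme: string = "9007199254740992";  // 2^53, what a lossy path hands back

// Both lexemes collapse to the same double, so the divergence is hidden.
const falsePass = naiveEqual(originalLexeme, roundedLexeme);
// Comparing the source text keeps the evidence intact.
const textDiffers = originalLexeme !== roundedLexeme;
console.log({ falsePass, textDiffers }); // { falsePass: true, textDiffers: true }
```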
My first version "solved" this by quarantining such cases — detecting numbers that couldn't survive a float and benching them. But that benches exactly the inputs where implementations with different number models (JS f64, Python arbitrary-precision int, Rust i64/u64/f64) are guaranteed to diverge. You're throwing away your strongest evidence to protect a broken comparator.
The fix is to never let a number touch a float. The TC39 "JSON.parse source text access" proposal, which recent engines already ship, adds a context argument to the JSON.parse reviver that hands you the exact source lexeme of each primitive value, before any numeric conversion:
const NUM = Symbol("num"); // collision-proof: JSON.parse can't produce a Symbol key
export function ingest(rawText: string): Node {
  return JSON.parse(rawText, (_key, value, ctx?: { source?: string }) => {
    if (typeof value === "number") {
      // ctx.source is the raw digits as written: "9007199254740993",
      // captured BEFORE the lossy f64 conversion that produced `value`.
      if (ctx?.source === undefined) {
        // Runtime without source text access: fail loudly rather than judge a rounded double.
        throw new Error("JSON.parse reviver context is unavailable in this runtime");
      }
      return { [NUM]: canonicalNumber(ctx.source) };
    }
    return value;
  });
}
ctx.source is the string "9007199254740993" — the actual characters from the input — even though value is already the rounded double. We ignore value entirely and keep the digits. Numbers are stored as a Symbol-tagged node so they can never collide with a real object that happens to have a "__num" key.
canonicalNumber then reduces the lexeme to a canonical value form using arbitrary-precision string arithmetic — never an f64 — so 2⁵³+1 stays itself all the way through the comparison.
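Because the reviver context is a recent addition, an oracle built on it is only as trustworthy as the runtime it runs on. A minimal feature check, assuming nothing beyond the standard `JSON.parse` signature (`supportsSourceAccess` is a hypothetical helper name), might look like this:

```typescript
// Hypothetical guard: detect whether this runtime passes the third
// `context` argument (carrying a `source` string) to JSON.parse revivers.
function supportsSourceAccess(): boolean {
  let seen = false;
  JSON.parse("1", function (_key: string, value: unknown, ctx?: { source?: string }) {
    if (typeof ctx?.source === "string") seen = true;
    return value;
  });
  return seen;
}

const sourceAccess = supportsSourceAccess();
if (!sourceAccess) {
  // Without the context argument, big integers would be judged on rounded doubles.
  console.warn("JSON.parse source text access is unavailable in this runtime");
}
```

Running a check like this before the matrix avoids the worst outcome: an oracle that silently falls back to doubles and starts issuing false PASSes again.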
Inside canonicalNumber: value identity without a float
The reviver gets us the raw digits; the remaining job is to map two different lexemes that denote the same number to the same string, without ever evaluating them numerically. "1.0", "1.00", and "1e0" must all become "1"; "1e-2" must become "0.01"; and "9007199254740993" must stay exactly itself. The whole thing is regex + string shifts:
export function canonicalNumber(lex: string): string {
  const m = /^([+-]?)(\d+)(?:\.(\d*))?(?:[eE]([+-]?\d+))?$/.exec(lex.trim());
  if (!m) return lex; // not a well-formed JSON number; fail safe, never throw
  const sign = m[1] === "-" ? "-" : "";
  const digits = m[2] + (m[3] ?? ""); // all significant digits, point removed
  const pointPos = m[2].length + (m[4] ? parseInt(m[4], 10) : 0); // where the point lands
  // Shift the decimal point by `pointPos`, padding with zeros when it falls
  // outside the digit run: pure string surgery, no parseFloat anywhere.
  let intStr: string, fracStr: string;
  if (pointPos <= 0) { intStr = "0"; fracStr = "0".repeat(-pointPos) + digits; }
  else if (pointPos >= digits.length) { intStr = digits + "0".repeat(pointPos - digits.length); fracStr = ""; }
  else { intStr = digits.slice(0, pointPos); fracStr = digits.slice(pointPos); }
  intStr = intStr.replace(/^0+(?=\d)/, ""); // strip leading zeros, keep one
  fracStr = fracStr.replace(/0+$/, ""); // strip trailing zeros
  if (intStr === "0" && fracStr === "") return "0"; // canonical zero, sign dropped
  return fracStr ? `${sign}${intStr}.${fracStr}` : `${sign}${intStr}`;
}
Two things worth calling out. First, the if (!m) return lex line: a lexeme that doesn't match the JSON number grammar is returned untouched rather than throwing — the oracle should never crash on input, it should compare faithfully and let the result be the signal. Second, this is the exact spot where the value-vs-representation policy lives. Returning a value-normalized form here is what makes 1.0 == 1 and -0 == 0. If you instead wanted to flag representational drift — say, to surface that one implementation preserves -0 while another normalizes it — you'd return a representation-preserving form here (keep the trailing .0, keep the leading - on zero) and integers would still compare exactly. One function, one deliberate stance, documented in place.
The one real judgment call: compare by value, with exact integers
Once numbers survive ingestion intact, you have to decide what "equal" means. This is the only genuinely opinionated part of the oracle, and it's worth stating explicitly:
1.0 == 1          value-equal (RFC 8259: these denote the same number)
-0 == 0           value-equal (JSON's value model has no signed zero)
2^53+1 != 2^53    DIFFERENT integers: precision loss is a real divergence
In other words: compare by mathematical value (so representational noise like 1.0 vs 1 doesn't generate false positives), but preserve exact integer identity (so genuine precision loss is caught). That's the correct default for a "did the round-trip stay lossless?" question. Everything else in the oracle is strict: types don't coerce ("123" ≠ 123), array order matters, missing keys differ from explicit nulls, and there's no Unicode normalization (e + combining acute ≠ precomposed é).
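Two of those strict rules are easy to see in plain TypeScript, independent of the oracle. The sketch below (variable names are illustrative) shows what a normalizing or key-dropping comparator would hide:

```typescript
// Unicode: the same visible glyph, two different code point sequences.
const decomposed: string = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT
const precomposed: string = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE

const strictSame = decomposed === precomposed; // false
// A comparator that normalizes first would call them equal and hide the rewrite.
const nfcSame = decomposed.normalize("NFC") === precomposed.normalize("NFC"); // true

// Missing key versus explicit null: different documents, not the same one.
const withNull = JSON.parse('{"a": null}') as Record<string, unknown>;
const withoutKey = JSON.parse("{}") as Record<string, unknown>;
const hasKeyWithNull = "a" in withNull;      // true
const hasKeyWithoutKey = "a" in withoutKey;  // false
console.log({ strictSame, nfcSame, hasKeyWithNull, hasKeyWithoutKey });
```

If a decoder silently swapped one form for the other, only the strict comparison would report it.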
How equality is actually computed
Rather than a recursive deep-equal with a pile of type checks, the oracle serializes each value tree to a single canonical string and compares the strings. The serialization is where the strictness is enforced structurally, so it can't be forgotten case by case:
export function canonical(node: Node): string {
  if (node === null) return "null";
  if (typeof node === "boolean") return node ? "true" : "false";
  if (typeof node === "string") return JSON.stringify(node); // quoted form
  if (isNum(node)) return "#" + node[NUM]; // unquoted #-token
  if (Array.isArray(node)) return "[" + node.map(canonical).join(",") + "]";
  const keys = Object.keys(node).sort(); // key order normalized
  return "{" + keys.map(k => JSON.stringify(k) + ":" + canonical(node[k])).join(",") + "}";
}
export const equal = (a: Node, b: Node) => canonical(a) === canonical(b);
The detail that does the heavy lifting: a string serializes to its JSON-quoted form (the string 123 becomes "123", quotes included) while a number serializes to a #-prefixed token (the number 123 becomes #123). One always starts with a quote and the other with #, so they can never collide, and the string "123" and the number 123 are structurally incapable of comparing equal. Type-strictness falls out of the representation instead of being a check you have to remember to write. Object keys are sorted so key order doesn't matter, but arrays aren't, so element order does. And because the leaves are the Symbol-tagged value-form numbers from canonicalNumber, exact-integer identity is already baked in by the time we get here.
This is also what makes the oracle cheaply self-provable: canonical has no dependency on TOON at all, so its self-test just asserts pairs of values that must (or must not) share a canonical string — 1.0/1 equal, 2⁵³+1/2⁵³ not, "123"/123 not — and runs before any adapter is touched.
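The quoted-versus-#-token separation can be checked on its own, in the spirit of that self-test. This is a simplified stand-in for the oracle's leaves (the `Leaf` type and `leafToken` function are hypothetical, with numbers held as plain digit strings rather than Symbol-tagged nodes):

```typescript
// Simplified leaves: a string, or a number kept as its canonical digits.
type Leaf = string | { num: string };

function leafToken(leaf: Leaf): string {
  // Strings keep their JSON quotes; numbers get an unquoted '#' prefix.
  return typeof leaf === "string" ? JSON.stringify(leaf) : "#" + leaf.num;
}

const asString = leafToken("123");          // the 5 characters "123", quotes included
const asNumber = leafToken({ num: "123" }); // #123
const trickyString = leafToken("#123");     // "#123" with quotes, still distinct from #123
console.log(asString, asNumber, trickyString);
```

Even a string that imitates the number token stays distinguishable, because its serialized form always begins with a quote character.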
Reading the matrix: the shape of the failures diagnoses the bug
With two adapters (TS, Python) and 13 cases, a run is 2×2×13 = 52 pair-checks. Most cells pass. But the interesting thing isn't that cells fail — it's which cells fail, because the pattern tells you what kind of bug you're looking at before you read a single value.
Here are the two cases that diverged, as 2×2 grids. Rows are the encoder, columns are the decoder; ✓ means the round-trip survived, ✗ means the oracle caught a divergence:
Case 013 — integer 9007199254740993 (2⁵³+1):
            decode_TS   decode_Py
encode_TS       ✗           ✗
encode_Py       ✗           ✓
Three of four pairs fail — including TS→TS, the self-pair that's supposed to be your control. That shape is the diagnosis: when even an implementation's round-trip with itself fails, the problem isn't a handoff between languages, it's that implementation's number model. TS holds the integer as an f64 the instant it touches it, so encode_TS has already lost the digit before any decoder runs; and decode_TS re-loses it even when Python encoded it faithfully. The only surviving cell is Py→Py, because Python's arbitrary-precision integers never round. A failure that includes the diagonal = an encode/decode-side capability limit, not a protocol disagreement.
Case 002 — empty array []:
            decode_TS   decode_Py
encode_TS       ✓           ✗
encode_Py       ✓           ✓
Exactly one cell fails: encode_TS → decode_Py. That's the signature of a true cross-implementation handoff bug. TS encodes the empty array as the bare [] — a form TS's own decoder happily reads back (so TS→TS passes and its conformance suite stays green), but which Python's decoder chokes on, returning the corrupted '['. Python's own output ([0]:, the spec-canonical form) is read correctly by everyone, so its whole row passes. A single off-diagonal failure points straight at "implementation A emits something only A can read."
This is the entire argument for differential testing in one picture. The 013 pattern (fails on the diagonal) and the 002 pattern (fails on exactly one off-diagonal cell) are different bug classes, and you can tell them apart by the geometry of the matrix without even inspecting the payloads. A single-implementation conformance suite only ever runs the diagonal — so it can see 013 (sort of, if it tests the boundary) but is structurally blind to 002, whose only failing cell is off-diagonal by definition.
Scale this to three adapters and it's a 3×3 grid per case; the Rust adapter (different number model again) turns each case into 9 pair-checks and adds a whole new row and column of handoffs where divergences can hide.
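The geometric reading can be mechanized. Below is a rough sketch, assuming a per-case grid of booleans where `grid[i][j]` records whether encoder i to decoder j survived; the `classify` function and its labels are my own naming for the two patterns described above, not part of the tool:

```typescript
type Diagnosis = "all-pass" | "diagonal-failure" | "off-diagonal-only";

// grid[i][j] is true when encoder i -> decoder j round-tripped losslessly.
function classify(grid: boolean[][]): Diagnosis {
  let diagonalFailed = false;
  let offDiagonalFailed = false;
  for (let i = 0; i < grid.length; i++) {
    for (let j = 0; j < grid[i].length; j++) {
      if (grid[i][j]) continue;
      if (i === j) diagonalFailed = true;
      else offDiagonalFailed = true;
    }
  }
  // A failing self-pair points at one implementation's own limits;
  // failures confined to crossed pairs point at a handoff disagreement.
  if (diagonalFailed) return "diagonal-failure";
  return offDiagonalFailed ? "off-diagonal-only" : "all-pass";
}

// The two grids from this post, rows = encoder (TS, Py), columns = decoder (TS, Py).
const case013 = classify([[false, false], [false, true]]); // "diagonal-failure"
const case002 = classify([[true, false], [true, true]]);   // "off-diagonal-only"
console.log({ case013, case002 });
```

The same function works unchanged on a 3×3 grid once a third adapter is added.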
Thirteen hand-designed probe cases, aimed at known fault lines: empty containers, almost-uniform tables, string-lookalikes, Unicode edge cases, whitespace, and the number boundaries around 2⁵³. First run, two real divergences.
1. Integer 2⁵³+1 — silent precision loss across the boundary
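Input {"unsafe": 9007199254740993}. The TS path rounds it to ...992; the Python path preserves it exactly. Under each implementation's own native equality both self round-trips look fine, because TS ends up comparing two identically rounded doubles. The loss only becomes visible when the result is judged against the original digits, which is what the exact-integer oracle and the crossed pairs in the matrix do.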
The honest framing here matters: TOON's spec permits precision loss for numbers outside a host's safe range if the implementation documents it. So this isn't a flat bug — it's a documented-divergence boundary, and the right place to surface it is the tool's README, not an accusatory issue. Differential testing is what makes the boundary visible instead of theoretical.
2. Empty array — a genuine cross-implementation bug, both halves
This one's a real bug, and it has two sides.
The spec is explicit (§9.1): an empty array encodes as [0]:. But:
- Encoder side (TS): encode([]) emits the bare [], with no length header, which is non-conformant to §9.1.
- Decoder side (Python): decode("[]") returns the string '[', a single character, with the ] silently dropped. Not the empty array, not an error, not even the literal string "[]". Just corrupted output flowing downstream with no signal.
decode("[]") # -> '[' corrupted
decode("[0]:") # -> [] correct (canonical form)
Both halves filed: toon-format/toon#322 (encoder) and toon-format/toon-python#61 (decoder). Each side individually "worked" — the bug only existed in the handoff.
Honest status: a differential probe, not yet a fuzzer
I want to be precise about what this is. Right now it's 13 curated inputs, not a generative fuzzer. The next step is a mutation-based generator that takes those seeds and pushes along the same fault lines — boundary integers, delimiter-adjacent strings, near-uniform tables, empty containers — then shrinks any failure to a minimal reproducer. That's what turns "I picked inputs that break things" into "the tool finds inputs nobody wrote." A third adapter (Rust, with its i64/u64/f64 split) widens the number-model coverage where the next divergences most likely hide.
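To give a flavor of what one mutation operator along the boundary-integer fault line could look like, here is a small sketch. It is an assumption about a possible design, not the planned generator or its shrinker, and `addOne` and `successors` are hypothetical names. It walks upward from a seed using digit-string arithmetic, so the full number never passes through a float:

```typescript
// Add one to a non-negative decimal integer written as a digit string,
// using schoolbook carry on individual digits.
function addOne(digits: string): string {
  const out = digits.split("");
  let i = out.length - 1;
  while (i >= 0 && out[i] === "9") {
    out[i] = "0"; // 9 + 1 carries into the next column
    i--;
  }
  if (i < 0) return "1" + out.join(""); // all nines: grow by one digit
  out[i] = String(Number(out[i]) + 1);  // single digit only, safe to convert
  return out.join("");
}

// Hypothetical mutator: the seed followed by its next `count` successors.
function successors(seed: string, count: number): string[] {
  const result = [seed];
  for (let k = 0; k < count; k++) {
    result.push(addOne(result[result.length - 1]));
  }
  return result;
}

// Start at 2^53 - 1 (the largest safe integer in JS) and step across the boundary.
const aroundLimit = successors("9007199254740991", 3);
console.log(aroundLimit);
```

Each generated lexeme could be wrapped in a JSON document and fed to the same matrix as the hand-written cases.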
The transferable idea
None of this is TOON-specific. If you maintain any format with multiple independent implementations — a serializer, a parser, a protocol codec — the same shape applies:
- Cross, don't self-check. Run decode_B(encode_A(x)), not decode_A(encode_A(x)). Same-implementation round-trips hide boundary bugs by construction.
- Don't let your oracle corrupt the evidence. If your comparison path rounds, normalizes, or coerces, it will mask the exact divergences you're hunting. Capture values losslessly (the ctx.source trick is a clean way to do it in JS) and decide your equality semantics deliberately.
- Prove the judge independently. The oracle must pass its own self-test with zero dependency on the things it judges.
Repo: https://github.com/antrixy/toon-diff
Two real bugs, no fuzzer yet, on inputs a person hand-wrote. The interesting part wasn't the bugs — it was building a comparator honest enough to believe when it said FAIL.