DEV Community: Maverick Y

How LLMLingua compresses prompts: perplexity, deletion, and trust

Maverick Y — Sun, 19 Jul 2026 13:57:49 +0000

I build ctxfold, a lossless prompt compressor, so I spend a lot of time thinking about the other school of prompt compression — the one that deletes things on purpose. The most important tool in that school is LLMLingua, from Microsoft Research. Before I publish a measured comparison (that post is coming), I want to explain fairly how LLMLingua actually works, because it's clever, it's genuinely effective at what it targets, and the design choice at its core is the exact opposite of mine.

The one-sentence version

LLMLingua uses a small language model to figure out which tokens in your prompt the big model could have predicted anyway — and deletes them.

That's the whole philosophical move. If a token is highly predictable from its context, removing it costs the downstream LLM very little, because the information it carried was mostly redundant. The measure of "predictable" is perplexity: tokens the small model assigns low perplexity are candidates for deletion.
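Here is a minimal sketch of that move, assuming the per-token probabilities already exist. The numbers are invented for illustration, and the function names are mine. The real tool scores tokens with a small causal LM and works over segments and budgets, so treat this as the idea rather than the implementation. Per-token surprisal (negative log probability) is the quantity that perplexity averages.

```python
import math

# Hypothetical (token, probability-given-context) pairs. In a real system the
# probabilities would come from a small causal LM; these values are invented.
scored_tokens = [
    ("Sam", 0.02), ("bought", 0.10), ("a", 0.60), ("dozen", 0.05),
    ("boxes", 0.08), (",", 0.70), ("each", 0.20), ("with", 0.55),
    ("30", 0.01), ("highlighter", 0.01), ("pens", 0.40), ("inside", 0.50),
]

def surprisal(p: float) -> float:
    """Information content in bits: predictable tokens score low."""
    return -math.log2(p)

def compress(tokens, keep_ratio: float):
    """Keep the most surprising tokens, preserving their original order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: surprisal(tokens[i][1]),
        reverse=True,
    )
    kept = sorted(ranked[:n_keep])  # restore reading order
    return [tokens[i][0] for i in kept]

compressed = compress(scored_tokens, keep_ratio=0.5)
print(" ".join(compressed))
```

The function words ("a", "with", the comma) are the first to go, which is exactly the telegraphic look these compressors are known for.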

The original LLMLingua (2023)

The first paper is a coarse-to-fine pipeline with three pieces:

Budget controller. Not all parts of a prompt tolerate compression equally. Instructions and the question tend to be information-dense; few-shot demonstrations are usually the flabby part. The budget controller assigns different compression ratios to different sections — it might keep your instruction nearly intact while cutting demonstrations hard, even dropping whole examples.
Iterative token-level compression. The surviving text is split into segments and compressed token by token, where each token's perplexity is computed against the already-compressed preceding context. The iteration matters: deleting tokens changes the perplexity of what follows, so a single pass would misjudge dependencies between tokens.
Distribution alignment. The small model (at GPT-2 small or LLaMA-7B scale) is instruction-tuned on data generated by the target LLM, so its sense of "predictable" tracks the big model's, not just its own.
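The budget controller is the easiest of the three to picture in code. The sketch below is my own simplification, not the paper's algorithm: it assumes a fixed keep rate for the protected sections and lets the demonstrations absorb whatever is left of the budget. The section names, the 0.9 rate, and the token counts are all hypothetical.

```python
def allocate_budget(sections: dict, target_total: int, protected_rate: float = 0.9) -> dict:
    """Give protected sections a gentle cut; demonstrations absorb the rest.

    sections maps a section name to its token count. Everything except
    "demonstrations" is treated as information-dense and kept at protected_rate.
    """
    budget = {}
    for name, n_tokens in sections.items():
        if name != "demonstrations":
            budget[name] = int(n_tokens * protected_rate)
    remaining = target_total - sum(budget.values())
    # Demonstrations get what is left, never more than they contain, never negative.
    budget["demonstrations"] = max(0, min(sections["demonstrations"], remaining))
    return budget

sections = {"instruction": 100, "demonstrations": 2000, "question": 50}
budget = allocate_budget(sections, target_total=400)
print(budget)
```

With these numbers the instruction and question keep 90% of their tokens while the demonstrations are cut to about 13%, which is the asymmetry the paper's design is after.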

The headline result was up to 20x compression on benchmarks like GSM8K and BBH with little loss in task performance — particularly for in-context learning and reasoning, which makes sense: few-shot demos are exactly where prompts carry the most redundancy per token.

What the output looks like

Compressed prompts from this family are not pretty. Deletion doesn't respect grammar. You get telegraphic text — function words gone, sentences collapsed into keyword runs. Something shaped like this (illustrative, not tool output):

Original:   Sam bought a dozen boxes, each with 30 highlighter pens
            inside, for $10 each box. He rearranged five of these...
Compressed: Sam bought dozen boxes each 30 highlighter pens $10 each.
            rearranged five...

The striking empirical finding — and the reason this line of work exists — is that LLMs read this stuff far better than humans do. The big model reconstructs the missing connective tissue without being asked. Microsoft's own framing is that compressed prompts may be difficult for humans while staying highly effective for LLMs.

LLMLingua-2 (2024): from perplexity to classification

The second generation changes the mechanism entirely. Instead of measuring perplexity with a small causal LM, the team had GPT-4 compress a corpus of prompts, then trained a small bidirectional encoder (XLM-RoBERTa-large) to classify each token: keep or discard, imitating GPT-4's choices. This data-distillation approach is faster (the encoder sees the whole sequence at once, no iterative left-to-right passes) and it fixed a real weakness: causal perplexity only looks backward, but whether a token matters often depends on what comes after it.

There's also LongLLMLingua in between, specialized for long-context and RAG prompts — question-aware compression, document reordering to fight "lost in the middle," and subsequence recovery.

Trying it

The library installs with pip (the package name is llmlingua), and basic usage is a few lines:

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
result = compressor.compress_prompt(
    context, rate=0.33, force_tokens=["\n", "?"]
)
print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])

The rate is your target: keep a third of the tokens, delete the rest. The force_tokens escape hatch — a list of tokens that must never be deleted — is worth noticing. It exists because sometimes the classifier deletes things you needed. Which brings me to the interesting part.
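To make the interaction between rate and force_tokens concrete, here is a toy selection step. It is a sketch under assumptions: the keep probabilities are invented, the helper name is mine, and I am assuming forced tokens count against the budget, which may not match the library's actual accounting.

```python
def select_tokens(tokens, keep_probs, rate, force_tokens=()):
    """Keep forced tokens plus the highest-scoring others, in original order."""
    n_keep = max(1, round(len(tokens) * rate))
    forced = {i for i, tok in enumerate(tokens) if tok in force_tokens}
    others = sorted(
        (i for i in range(len(tokens)) if i not in forced),
        key=lambda i: keep_probs[i],
        reverse=True,
    )
    # Assumption: forced tokens use up part of the budget.
    chosen = forced | set(others[: max(0, n_keep - len(forced))])
    return [tokens[i] for i in sorted(chosen)]

tokens = ["What", "is", "the", "price", "of", "SKU", "NW-1258", "?"]
keep_probs = [0.30, 0.10, 0.05, 0.90, 0.08, 0.70, 0.95, 0.20]  # invented scores

without_force = select_tokens(tokens, keep_probs, rate=0.5)
with_force = select_tokens(tokens, keep_probs, rate=0.5, force_tokens=("?",))
print(without_force)
print(with_force)
```

Without the escape hatch the question mark loses to "What"; with it, the question mark survives and "What" is the token that pays for it.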

The trust question

LLMLingua is lossy by design. Once compressed, the original text is unrecoverable — there is no decompressor. You are trusting that what got deleted didn't matter, and that trust is statistical: the classifier learned what GPT-4 usually considered droppable, on the corpora it was trained on.

For prose, summaries, and few-shot demos, that bet pays off, and the benchmark numbers back it up. The open question — the one I care about, because ctxfold lives on the other side of it — is what happens to exact values. SKU codes, prices, quantities, the one log line out of 300 that mattered. A rare identifier is, almost by definition, high-perplexity, which should protect it. But "should" is doing work in that sentence, and the failure mode is silent: nothing tells you a number your task depended on was deleted.
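One cheap way to make that failure mode loud is to check, after compression, that every exact value in the original is still present. The sketch below is a hypothetical guard of my own, not part of LLMLingua or ctxfold. It treats any whitespace-separated token containing a digit as an exact value, which is a crude heuristic.

```python
import string

# Edge punctuation to strip, but keep "$" and "-" since they are part of values.
_EDGE = string.punctuation.replace("$", "").replace("-", "")

def exact_values(text: str) -> set:
    """Whitespace-separated tokens that contain a digit, minus edge punctuation."""
    values = set()
    for raw in text.split():
        token = raw.strip(_EDGE)
        if any(ch.isdigit() for ch in token):
            values.add(token)
    return values

def missing_values(original: str, compressed: str) -> set:
    """Exact values present in the original but absent after compression."""
    return exact_values(original) - exact_values(compressed)

original = "Ship 40 units of NW-1258 from WH-2 at $10.50 each, ref 77812."
compressed = "Ship 40 units NW-1258 WH-2 each ref 77812"
lost = missing_values(original, compressed)
print(lost)
```

Comparing whole tokens rather than substrings matters here: a substring test would happily find "10" inside "100" and report nothing missing.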

My own tool makes the opposite trade: byte-perfect losslessness, enforced by round-trip tests, at the cost of far smaller ratios (~39% on the formats it targets, vs. up-to-20x). And I recently learned lossless has its own readability limits — my CSV folding is byte-perfect and models still can't read it directly. So neither school gets to claim purity; the real axis is where the information-recovery work happens and who pays for it.

That's the comparison I want to measure rather than argue: same datasets, same exact-match questions, both tools, results whatever they are. Next post.

I write up negative results too — the CSV one is here. ctxfold is on npm.

I built a readability test for my own compression format. It scored 0/24.

Maverick Y — Sat, 18 Jul 2026 14:34:09 +0000

A negative result, its root cause, and the feature it produced.

My last post introduced ctxfold — lossless, structure-aware compression for the bulky stuff we put in LLM prompts. Its benchmark table had one cell I wasn't proud of:

| CSV / TSV | char reduction | ~30–45%* |

*CSV readability not yet validated against a model

JSON and logs had real numbers: a model answering lookup questions off the folded form, scored against exact ground truth, matching raw field-for-field. CSV had a character count and an asterisk.

So I wrote the same harness for CSV. Generate 400 records with realistic redundancy, fold them, ask GPT-4o-mini to look up specific records in both forms, score against ground truth.

Raw CSV: 24/24. Folded CSV: 0/24.

Not "slightly worse." Zero.

Maybe the model is too small?

First hypothesis: capability threshold. Re-ran on GPT-4o.

Raw: 24/24. Folded: 6/24. A later run: 9/24. So no — a stronger model helps a little and inconsistently. The format is the problem.

The failure had structure, which is what made it diagnosable. Asked about record NW-1258, the model reported a warehouse of WH-2 (a half-applied prefix), a supplier of DAL-2 (a value from the warehouse column), and a price from some other row entirely. In another run it answered qty 1238 for sku NW-1238 — it read the record's own identifier back as data. Two distinct failures, every time: it couldn't find the right row, and it couldn't reconstruct the values in it.

The root cause is embarrassing in hindsight

ctxfold's JSON and logs encoders fold syntax. A JSON array of objects repeats every key on every record; the encoder lifts keys, braces, and quotes into a one-time header, and every value stays verbatim in its row. Logs are the same story with templates: timestamps, levels, reqId= prefixes get lifted; the payload stays put. The model reads a plain labeled table containing exactly the values it needs.
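The key-lifting idea fits in a few lines. This is a sketch of the general technique, not ctxfold's actual wire format: the header/rows layout and the function names are my own choices for illustration.

```python
import json

def fold(records):
    """Lift the shared keys into a one-time header; rows keep values verbatim."""
    keys = list(records[0].keys())
    if any(list(r.keys()) != keys for r in records):
        return None  # not uniform: refuse to fold rather than risk losing anything
    return {"header": keys, "rows": [[r[k] for k in keys] for r in records]}

def unfold(folded):
    """Rebuild the original list of objects from header + rows."""
    return [dict(zip(folded["header"], row)) for row in folded["rows"]]

records = [
    {"sku": "NW-1258", "warehouse": "WH-2", "qty": 40},
    {"sku": "NW-1259", "warehouse": "WH-7", "qty": 12},
    {"sku": "NW-1260", "warehouse": "WH-2", "qty": 5},
]
folded = fold(records)
raw_size = len(json.dumps(records))
folded_size = len(json.dumps(folded))
print(raw_size, "->", folded_size)
assert unfold(folded) == records  # the round-trip is the contract
```

Every value in the rows is still the literal value a model would be asked about, which is the property the CSV encoder could not keep.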

CSV has no syntax to remove. No keys, no braces — it's already a compact table. The only redundancy left is inside the values: shared prefixes like NW- on every sku, WH- on every warehouse, a constant USD column. So the CSV encoder factors those out, and each row keeps only the varying middle. Lossless, byte-exact, verified on every encode.

And unreadable. To answer a question, the model has to find a row by a partial key (the sku column only contains 258, because NW-1 was factored out — all 400 skus shared it) and then reconstruct every value through the header's prefix + middle rules. It can't do either reliably. On GPT-4o-mini it essentially can't do it at all.
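The factoring step looks harmless in code, which is part of why the failure surprised me. A minimal sketch, assuming skus numbered NW-1000 through NW-1399 (my stand-in for the 400-record test set; the helper names are illustrative, not ctxfold's API):

```python
import os

def factor_column(values):
    """Split a column into (shared prefix, per-row remainders)."""
    prefix = os.path.commonprefix(values)  # character-wise common prefix
    return prefix, [v[len(prefix):] for v in values]

def restore_column(prefix, remainders):
    """Inverse of factor_column: re-attach the prefix to every row."""
    return [prefix + r for r in remainders]

skus = [f"NW-{n}" for n in range(1000, 1400)]  # 400 skus sharing "NW-1"
prefix, remainders = factor_column(skus)

print(prefix)           # the part lifted into the header
print(remainders[258])  # all that is left of NW-1258 in its row
assert restore_column(prefix, remainders) == skus
```

The round-trip assertion passes, so the encoding is lossless. But the row for NW-1258 now holds only "258", and a reader has to apply the header rule to recover the identifier it was asked about.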

I'd actually seen a milder version of this before. ctxfold's opt-in dictionary coding (low-cardinality values become small integers, mapped once in the header) pushes JSON savings from ~39% to ~46% — and in testing, models resolved the coded values slightly less reliably than plain ones. That's why it ships off by default. The CSV result is the same phenomenon at full strength:

Models read values, not reconstruction rules. Indirection through a header costs readability — and in CSV's case, the savings were the indirection.
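Dictionary coding is the same indirection in miniature. The sketch below shows the general technique only: the cardinality cutoff and the function names are hypothetical, not ctxfold's real parameters.

```python
def dictionary_encode(values, max_cardinality=8):
    """Replace low-cardinality values with small integer codes plus a header map."""
    distinct = list(dict.fromkeys(values))  # first-seen order, duplicates removed
    if len(distinct) > max_cardinality:
        return None, values  # too many distinct values: leave the column alone
    codes = {value: i for i, value in enumerate(distinct)}
    return distinct, [codes[v] for v in values]

def dictionary_decode(header, coded):
    """Map codes back to values; a None header means the column was not coded."""
    return coded if header is None else [header[i] for i in coded]

warehouses = ["WH-2", "WH-7", "WH-2", "WH-2", "WH-7", "WH-9"]
header, coded = dictionary_encode(warehouses)
print(header, coded)
assert dictionary_decode(header, coded) == warehouses
```

A row now says 0 where it used to say WH-2. The information is all there, but answering a question means a lookup through the header first.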

What I didn't do

I considered "fixing" the format — don't factor unique columns, keep identifiers whole. But that only patches row-finding; the prefix-reconstruction failure remains, and the savings shrink toward nothing. CSV is already near its readable minimum. That's why there was nothing safe to fold.

So the fix was documentation, not code. As of v0.1.4, CSV folding is pipeline-mode: fold it for lossless transit or storage, call decompress() before the prompt is built, and the model never sees the folded form. For direct model reading, send CSV raw. The benchmark table now says so, with the measured numbers where the asterisk used to be. JSON and logs remain validated direct-readable — their folded output keeps every value intact, and the scores show it.

The rule I keep coming back to: ctxfold's core contract is lossless or no-op — never lossy. It turns out the same discipline applies to claims. Either a readability claim has a measurement behind it, or it gets the asterisk. And when the measurement comes in against you, the asterisk becomes documentation.

The feature the failure produced

Sitting with those numbers, the useful question turned out to be: before anyone folds anything — where do a prompt's tokens actually go, and what's safely foldable?

v0.2.0 ships that as ctxfold --profile:

$ ctxfold --profile users.json

[ctxfold profile]
format      JSON array — 300 records × 7 fields
size        65,614 chars ≈ 16,404 tokens (estimated; pass a tokenizer for exact)

where the characters go
  keys         27%   repeated field names (with quotes)
  syntax        7%   braces, brackets, commas, colons
  values       36%   the data itself (with string quotes)
  whitespace   30%   indentation and spacing

foldable (lossless, verified by round-trip)
  fold             -66%   direct-readable — validated 24/24 vs raw
  + --dictionary   -73%   readability tradeoff — off by default, see README

verdict: fold it — 16,404 → ~5,595 tokens (~66% fewer)

That's a real API response shape — wrapped array, pretty-printed. 64% of the file is structure and formatting, which is why the fold is fat. On a CSV file, the same command reports the fold as pipeline-only and tells you to send it raw or decompress first — the negative result is baked into the tool's own advice.

The profiler follows the rules the failure taught:

Composition is measured in characters and attributed exactly — the categories sum to the input size. Attributing individual tokens to categories would be false precision, so token figures are totals, marked estimated unless you pass a real tokenizer.
The "foldable" numbers come from actually running the encoder on your input. The profiler cannot promise more than compress() delivers — that's an invariant with a test on it, not a policy.
If nothing folds, it says why — quoted CSV, nested JSON, too few records, prose — instead of silently no-opping.
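The first rule, exact character attribution, can be sketched as a single pass over the text. This is an illustration of the invariant rather than the profiler's real code, and it assumes the input is valid JSON. The four category names follow the output above; everything else is my own naming.

```python
def profile_json(text: str) -> dict:
    """Attribute every character of a JSON document to exactly one category."""
    counts = {"keys": 0, "syntax": 0, "values": 0, "whitespace": 0}
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in " \t\r\n":
            counts["whitespace"] += 1
            i += 1
        elif ch in "{}[],:":
            counts["syntax"] += 1
            i += 1
        elif ch == '"':
            # Scan to the closing quote, skipping backslash escapes.
            j = i + 1
            while text[j] != '"':
                j += 2 if text[j] == "\\" else 1
            j += 1  # include the closing quote
            # A string is a key when the next non-whitespace character is a colon.
            k = j
            while k < n and text[k] in " \t\r\n":
                k += 1
            is_key = k < n and text[k] == ":"
            counts["keys" if is_key else "values"] += j - i
            i = j
        else:
            # Bare literal: a number, true, false, or null.
            j = i
            while j < n and text[j] not in " \t\r\n{}[],:\"":
                j += 1
            counts["values"] += j - i
            i = j
    return counts

sample = '{\n  "sku": "NW-1",\n  "qty": 40\n}'
counts = profile_json(sample)
print(counts)
assert sum(counts.values()) == len(sample)  # categories sum to the input size
```

The closing assertion is the point: because each character lands in exactly one bucket, the percentages cannot drift away from the file they describe.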

Try it with zero setup (no API key, deterministic output):

git clone https://github.com/antrixy/ctxfold && cd ctxfold
node examples/profile-demo.js

Or on your own data:

npm install -g ctxfold
ctxfold --profile your-file.json

The part I'd generalize

If you maintain a tool that makes claims — savings, accuracy, compatibility — the cheapest test you'll ever write is the one that checks your own table. Mine took an afternoon, cost a few cents of API calls, and found that a third of my format list didn't do what a reader would assume. The fix cost nothing but honesty, and the tool that fell out of it is the most useful thing in the package.

The full harness is in the repo (examples/gpt-csv-equivalence.js), along with the v0.1.4 and v0.2.0 release notes. If you push structured data into prompts and your payloads break my assumptions, I want to hear about it.

Repo & docs: https://github.com/antrixy/ctxfold · npm: npm install ctxfold · MIT licensed.

Differential testing found bugs in all three official TOON implementations — and outlived a moving spec

Maverick Y — Fri, 17 Jul 2026 00:58:22 +0000

In late June I filed toon#322 against the reference TypeScript implementation
of TOON, a serialization format that promises lossless round-trips. The bug:
the encoder emitted a bare [] for empty arrays, where spec v3.0 §9.1 required
an explicit zero-length header, [0]:. Clean report: spec quote, failing test
case, versions.

Two weeks later the issue closed. Not because the encoder changed. Because the
spec did: v3.1 introduced [] as a canonical form, v3.3 blessed it as the
SHOULD form. My clause citation was mooted by two spec revisions filed after
the report.

Here's the part that makes this a story worth telling instead of a complaint:
the finding survived anyway — because the report didn't rest on the clause.

Differential evidence outlives clause readings

The bug came out of toon-diff, a differential conformance fuzzer I've been
building. It has no opinion about what's correct. It runs data through one
implementation's encoder and a different implementation's decoder, for every
ordered pair, and checks the round-trip survived:

decode_Y( encode_X( value ) )  ==  value

So the #322 report carried, alongside the clause citation, a
cross-implementation table: the same input [] through all three official
implementations. TS→TS round-trips. TS→Python silently returns the corrupted
string "[]". TS→Rust throws a parse error. One input, three official
implementations, three different outcomes.

When the spec moved, the clause argument evaporated — but that table didn't.
The closure itself says so: the encoder is now compliant, and the cross-impl
concern "stands until the ports catch up." The spec legitimized one side of
the wire and hardened the obligation on the other (decoders MUST now accept
both forms). The evidence determined which side got fixed; a clause reading
alone would have just been wrong twice.

For a fast-moving format — TOON went v1.0 → v3.0 in about a month — that's
the durable way to file bugs: anchor to observed round-trip corruption, cite
clauses as supporting context.

What the matrix actually found

Thirteen hand-designed seed cases, nine ordered pairs, 117 checks, seven
divergences. The whole disagreement surface fits on one screen:

GRID (encoder row → decoder col): divergent cases per pair, of 13
  enc\dec  ts  python  rust
  ts        1       2     2
  python    1       ·     ·
  rust      1       ·     ·

That cross through the TS row and column is 2^53+1: JavaScript's f64 rounds
9007199254740993 to …992 at JSON.parse, so every pair touching TS
silently loses the integer — while Python↔Rust round-trips it perfectly. Each
implementation round-trips its own output fine. That's why
single-implementation test suites stay green while this class of bug ships:
the failures live at the boundary between implementations, and they're
silent. A wrong value with no error is worse than a crash, and it's exactly
what crossing implementations surfaces. (That finding is now filed as
toon#329. A spec deep-dive showed the sharpest form is decoder-side: decode()
silently approximates valid wire tokens with no documented out-of-range
policy, which the spec makes a MUST.)
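The shape of that grid can be reproduced with two toy "implementations" built on Python's json module. This is a sketch, not toon-diff: the codecs below speak JSON rather than TOON, and "f64" is a stand-in I made up for any implementation whose numbers are doubles.

```python
import json

BIG = 2**53 + 1  # 9007199254740993

# Each entry is (encode, decode). "exact" keeps integers arbitrary-precision;
# "f64" forces every integer through a double on the way in and out.
implementations = {
    "exact": (json.dumps, json.loads),
    "f64": (
        lambda value: json.dumps(json.loads(json.dumps(value), parse_int=float)),
        lambda text: json.loads(text, parse_int=float),
    ),
}

cases = [7, "hello", [], {"n": BIG}]

# Divergent cases per ordered (encoder, decoder) pair.
grid = {}
for enc_name, (encode, _) in implementations.items():
    for dec_name, (_, decode) in implementations.items():
        grid[(enc_name, dec_name)] = sum(
            1 for value in cases if decode(encode(value)) != value
        )

for pair, divergences in grid.items():
    print(pair, divergences)
```

Only the exact-to-exact cell is clean. Every pair that touches the double-based side loses the big integer, with no error raised anywhere.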

Other findings from the same loop, all filed upstream with minimized repros:
a Rust encoder emitting grammar its own decoder rejects (toon-rust#74), a
Python encoder silently dropping a key (toon-python#64), and a TS decoder
that parses quoted string content as structure when a fake array header's
item count happens to match, which is silent corruption with no error raised
(toon#324). That last escalation argued from the spec's own security model,
and the maintainers' triage adopted the framing: quote-aware opacity as the
fix direction, with the count-match variant getting its own regression-test
acceptance criterion.

A divergence is evidence, not a verdict

The subtle discipline: two implementations disagreeing doesn't tell you who's
wrong, or wrong by whose standard. Rust's decoder rejects [] — but Rust
pins spec v3.0, which predates the [] rule. It isn't violating its claim;
it's behind it. Python claims no version at all, so it's measured against the
current spec — and violates it. Same divergence, different verdicts.

toon-diff encodes that distinction as machinery: implementations carry their
claimed spec versions with evidence and verification dates; spec rules carry
the sections and changelog entries that introduced them; verdicts —
behind, violates-claimed, violates-current — are computed per
constrained side, and a decoder rule never indicts the encoder. Rules whose
sections I haven't verified against the live spec render as
citation-PENDING rather than pretending. The tool doesn't just find
disagreements; it says what they mean, and shows its work.
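The core of the verdict logic is small enough to sketch. This is my simplified reading of the rules described above, not toon-diff's actual code: versions are (major, minor) tuples, and the function only covers the case where an implementation has already failed a rule.

```python
from typing import Optional, Tuple

Version = Tuple[int, int]

def verdict(claimed: Optional[Version], rule_introduced: Version) -> str:
    """Classify a failed rule check relative to what the implementation claims."""
    if claimed is None:
        return "violates-current"  # no claim: measured against the current spec
    if claimed < rule_introduced:
        return "behind"            # the rule postdates the pinned version
    return "violates-claimed"      # the pinned version already contains the rule

# The [] rule arrived in v3.1. Rust pins v3.0; Python claims no version.
print(verdict((3, 0), (3, 1)))
print(verdict(None, (3, 1)))
```

Same failed check, two different outputs, which is the distinction between "behind its claim" and "violating the spec it is measured against".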

Two bugs in my own harness

Honesty section. The v0.1 comparator corrupted 2^53+1 the same way the TS
implementation does — native JSON.parse — so the most important case in the
corpus would have false-PASSed. The fix was a lossless oracle that ingests
numbers at their exact source lexeme, proven by a selftest before any
cross-implementation claim runs. Your test harness is an implementation too,
and it lies the same way until you prove otherwise.

And twice I recorded upstream facts wrong from early recon — a claimed spec
version, an issue reference. Both were caught by browser verification, and
both fixes added tripwires: every upstream claim in the codebase now carries
its evidence and verification date, and selftests pin every reference so a
correction can't land halfway. Verification is a workflow, not a virtue.

The through-line

Every serious finding here is silent: an integer that rounds without an
error, a string that comes back subtly different, a key that vanishes, quoted
data that parses as structure when the counts happen to agree. Loud failures
get fixed because they announce themselves. Silent ones get fixed when
something crosses two implementations and checks.

The full retrospective, with every finding's upstream trail, the verdict
machinery, and the fuzzing operators that rediscover known bug classes from
scratch, is in the repo: toon-diff/RETROSPECTIVE.md.

Block Google's AI Overviews at the Network Layer, Not the DOM

Maverick Y — Thu, 02 Jul 2026 18:40:26 +0000

TL;DR: Most extensions block Google's AI Overviews by hiding the panel with a content script after it renders — fragile, flickery, and always a step behind Google's markup changes. A better approach: force udm=14 at the network layer with declarativeNetRequest, so the AI Overview never loads. The content script becomes a backstop, not the main mechanism. One Chrome API mystery — AI Mode being invisible to four different extension APIs — shows why the DOM was never the right layer.

Google puts an AI Overview at the top of most search results now, and a lot of people would rather it didn't. So there's a whole shelf of Chrome extensions that remove it. Almost all of them work the same way, and I think that way is a mistake.

The obvious approach, and why it's a trap

The default move is DOM-hiding: inject a content script, wait for the AI Overview panel to render, find it by class name or attribute, and set display: none. It's the first thing anyone reaches for, and it works — until it doesn't.

The problems are all baked into the approach. You're reacting after the render, so there's a flash of AI content before your script catches it. You're matching against Google's markup, which is obfuscated and reshuffled constantly, so every layout change is a silent breakage. And you're paying for DOM churn on a page you don't control. You end up in a permanent game of catch-up against a page that changes whenever Google feels like it.

The deeper issue is that you're operating one layer too high. The panel is a symptom. By the time it's in the DOM, the work is already done — the server decided to send it, the page rendered it, and now you're scrambling to un-render it. If you can move the decision earlier, none of that scramble has to happen.

The thesis: prevent it at the network layer

Google Search takes a parameter, udm, that selects which result vertical you get. udm=14 is the plain "Web" results view — the classic list of links, no AI Overview, no AI Mode. It's Google's own filter; we're just always asking for it.

So instead of hiding the panel after it loads, force udm=14 onto the search request before it loads. The AI Overview is never generated, never sent, never rendered. Nothing to hide because nothing arrived.

Manifest V3's declarativeNetRequest does exactly this — it rewrites requests by rule, without the extension ever reading the traffic. Here's the core rule:

{
  id: 1,
  priority: 1,
  action: {
    type: "redirect",
    redirect: {
      transform: {
        queryTransform: {
          addOrReplaceParams: [{ key: "udm", value: "14" }]
        }
      }
    }
  },
  condition: {
    requestDomains: ["www.google.com"],
    urlFilter: "/search?",
    requestMethods: ["get"],
    resourceTypes: ["main_frame"]
  }
}

That's the whole mechanism. Every top-level search navigation gets udm=14 stamped onto it and comes back as clean Web results. No content script races the render, because there's no AI render to race. It's language-independent too — udm is a server-side parameter, so it doesn't care whether Google is serving you English or German, which is a nice bonus over text-matching a panel heading.

Note resourceTypes: ["main_frame"]. The rule only touches the top-level page load, not the page's own internal requests, so it changes which results view you get without breaking how the page works.
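If it helps to see the effect on an actual URL, here is a small Python model of what an add-or-replace on udm does. It is not extension code, and it approximates the browser's behavior: this version always moves udm to the end of the query string, which may differ from how Chrome orders the parameters. The function name is mine.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def force_web_view(url: str) -> str:
    """Add udm=14, or replace an existing udm value, leaving other params alone."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "udm"
    ]
    params.append(("udm", "14"))
    return urlunsplit(parts._replace(query=urlencode(params)))

print(force_web_view("https://www.google.com/search?q=toon+format"))
print(force_web_view("https://www.google.com/search?q=toon+format&udm=2"))
```

Both calls end up at the same Web-results URL, which is the whole trick: the query is untouched and only the view selector changes.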

The proof that the DOM was never enough: AI Mode

Here's the part that turned me from "the network layer is nicer" to "the DOM layer is genuinely the wrong place."

Alongside AI Overviews, Google has AI Mode — a separate conversational surface you reach from a tab in the results. I wanted to suppress its page too, and I assumed the content-script backstop would catch it like anything else. It didn't. So I tried to observe that page from every angle a Chrome extension has:

declarativeNetRequest — no rule matched the navigation.
A MutationObserver on the document — fired for nothing.
chrome.tabs.onUpdated — never reported the change.
chrome.storage.onChanged as a last resort signal — silent.

Four unrelated APIs, and every one came back blank — even though the address bar plainly showed udm=50 for AI Mode. Not "my selector missed it." Zero signal of any kind reached the extension.

The consistent silence across four different mechanisms is the tell: that surface isn't part of the same page a content script attaches to. Whatever it is, it's stitched into the tab in a way that never injects a content script or surfaces a navigation the extension can see. Which means the DOM layer can't touch it even in principle. If your entire strategy is "hide it in the DOM," you have no move here at all.

The prevention-first approach at least has an answer: since I can't catch that page, I prevent the click that leads to it. The extension hides the AI Mode tab in the results before it's ever pressed. That hiding is itself a DOM change, but it happens on the ordinary results page, where the content script does run. Prevention again, one step earlier than the thing I couldn't reach.

What the network layer can't do either

I'd be selling you something if I stopped there, so here are the honest limits.

Forcing udm=14 catches the searches that go through the search box. It does not catch someone who navigates directly to a udm=50 URL — a typed address, an old bookmark, an external link — because there's no clean search request to rewrite. And it does nothing about Chrome's own omnibox AI button, which is native browser UI, not part of any page; no extension API can remove it.

Those aren't bugs to fix later; they're the edge of what this layer owns. Naming them is the difference between an approach you trust and one you find out about the hard way.

Where the content script goes: the backstop, not the star

I didn't delete the content script — I demoted it. It still runs, but as a labeled safety net for one specific case: Google quietly changing how udm=14 behaves. If that ever happens, the backstop hides whatever slipped through and the extension tells me about it, instead of silently degrading.

That's the content script in its right role — a fallback for the thing the primary mechanism can't guarantee, not the primary mechanism itself. The same code that's fragile as a front line is perfectly reasonable as a backstop, because it only has to fire when something has already gone wrong.

There's a small elaboration worth mentioning: I wanted users to be able to summon the AI Overview for a single search on demand. That's a second declarativeNetRequest rule at higher priority that allows a specifically-marked request through unmodified — the redirect and the exception live at the same layer, which keeps the whole thing coherent.

The transferable part

Forget Google for a second. The lesson generalizes: when you're fighting a web page's behavior, find the layer that actually owns the decision before you reach for the DOM. The DOM is where behavior becomes visible, which makes it the tempting place to intervene — and often the wrong one, because by then the decision is already made. Ask what produced the thing you're trying to stop, and see if you can get upstream of it. Sometimes you can't, and the DOM really is your only handle. But it's worth checking, because when you can move upstream, the fragile parts of the problem tend to disappear rather than get patched.

For this problem, that meant a request-rewrite rule instead of a render-and-hide loop — and an AI Mode mystery that made the point better than I could have on purpose.

I built this out as a small extension, No AI Search, if you want to see the whole thing shipped. Source is on GitHub.

I built a differential tester for TOON, and it found two silent-corruption bugs on the first run

Maverick Y — Thu, 02 Jul 2026 16:18:56 +0000

TL;DR — I built a differential tester for TOON: it runs data through one implementation's encoder and a different implementation's decoder, and checks the round-trip survived. On its first run it found two silent-corruption bugs (a rounded 64-bit integer and an empty array that decoded to a corrupted string), both filed upstream. The hard part wasn't finding bugs — it was building a comparison oracle honest enough that its FAIL means something. Repo: github.com/antrixy/toon-diff

I built a small tool that checks whether independent TOON implementations actually agree with each other. On its first real run — across the TypeScript reference and the Python port — it found two silent-corruption bugs. Both are now filed upstream.

This post is about why the approach finds bugs that ordinary conformance suites miss, and about the one genuinely tricky part: building a comparison oracle that doesn't corrupt the data while it's checking it.

The setup: TOON, and the promise of lossless round-trips

TOON (Token-Oriented Object Notation) is a compact, line-oriented encoding of the JSON data model, designed for things like trimming token counts in LLM prompts. The whole value proposition rests on one property:

JSON  →  TOON  →  JSON     should give you back what you started with.

There are independent TOON implementations in 25+ languages. Each ships its own conformance tests. Each is green. And yet the moment you have more than one implementation, a new failure mode appears that none of those green test suites can see.

Why conformance suites miss the interesting bugs

A conformance suite checks one implementation against a set of blessed expected outputs: given input X, the encoder must produce exactly Y. That's useful, but it has a structural blind spot.

Every implementation round-trips its own output just fine. The TS encoder produces something the TS decoder reads back perfectly. The Python encoder produces something the Python decoder reads back perfectly. Both suites pass. The bug lives in the gap between them — when TS encodes something and Python has to decode it, or vice versa.

That's what differential testing targets directly. Instead of checking against expected outputs, you check implementations against each other:

decode_Y( encode_X( value ) )  ==  value     for every ordered pair (X, Y)

With N implementations you run N×N ordered pairs (including each against itself, which is your control). Two implementations is a 4-cell matrix; three is 9. Any cell that fails is a place where two implementations disagree about what a given value means — and because both sides individually pass their own tests, nobody had noticed.

The harness for this is almost trivial. Each implementation gets wrapped in a tiny adapter with a uniform, text-in/text-out contract:

export interface Adapter {
  name: string;
  encode(jsonText: string): Promise<string>; // JSON text -> TOON text
  decode(toonText: string): Promise<string>; // TOON text -> JSON text
}

Working on text means the harness never has to hold any language's native value model. Adding a new language is one adapter. The driver loop is just:

for (const X of adapters)
  for (const Y of adapters)
    for (const c of cases) {
      const toon = await X.encode(c.jsonText);
      const back = await Y.decode(toon);
      if (!oracle.equal(c.value, ingest(back))) report(X, Y, c);
    }

All the difficulty is hiding in one word: equal.

The hard part: an oracle that doesn't corrupt the data it's judging

Here's the trap. You want to compare the original value against the round-tripped value. The obvious way is to parse both back into native objects and compare. In JavaScript that means JSON.parse. And JSON.parse will quietly destroy the exact cases you most want to test.

Consider the integer 9007199254740993, which is 2⁵³ + 1:

JSON.parse("9007199254740993")   // -> 9007199254740992   (!!)

It comes back as 2⁵³, off by one, because it rounded through an IEEE-754 double. If your oracle parses values this way, then when one implementation preserves the integer and another rounds it, your oracle rounds both and reports a false PASS — on the single most important case in the suite. The comparator silently corrupts the evidence.
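The false PASS is easy to reproduce outside JavaScript. The sketch below uses Python's json module to imitate both kinds of oracle. The function names are mine, and the "lossy" parser is a stand-in for any comparator that routes numbers through a double.

```python
import json
from decimal import Decimal

def lossy_ingest(text):
    """Mimics a float-only parser: every number goes through an IEEE-754 double."""
    return json.loads(text, parse_int=float, parse_float=float)

def lossless_ingest(text):
    """Keeps integers exact and decimals as Decimal, never as a double."""
    return json.loads(text, parse_int=int, parse_float=Decimal)

original = '{"id": 9007199254740993}'
rounded = '{"id": 9007199254740992}'  # what a rounding implementation hands back

# The lossy oracle rounds both sides the same way and sees no difference.
false_pass = lossy_ingest(original) == lossy_ingest(rounded)
# The lossless oracle compares the integers as written and catches it.
true_fail = lossless_ingest(original) != lossless_ingest(rounded)

print("lossy oracle says equal:", false_pass)
print("lossless oracle says different:", true_fail)
```

Both flags come out true: the lossy comparator reports a match on two documents that plainly differ, and only the lossless one notices.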

My first version "solved" this by quarantining such cases — detecting numbers that couldn't survive a float and benching them. But that benches exactly the inputs where implementations with different number models (JS f64, Python arbitrary-precision int, Rust i64/u64/f64) are guaranteed to diverge. You're throwing away your strongest evidence to protect a broken comparator.

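The fix is to never let a number touch a float. The JSON.parse source text access proposal (a TC39 proposal, not part of ES2023, but already shipping in recent Node and browser releases) adds a context argument to the JSON.parse reviver that hands you the exact source lexeme of each value, before any numeric conversion: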

const NUM = Symbol("num"); // collision-proof: JSON.parse can't produce a Symbol key

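export function ingest(rawText: string): Node {
  return JSON.parse(rawText, (_key, value, ctx?: { source?: string }) => {
    if (typeof value === "number") {
      // ctx.source is the raw digits as written: "9007199254740993",
      // captured BEFORE the lossy f64 conversion that produced `value`.
      // Runtimes without source text access pass no context; fail loudly
      // rather than fall back to the already-rounded `value`.
      const source = ctx?.source;
      if (source === undefined) throw new Error("JSON.parse reviver context (source) is not available");
      return { [NUM]: canonicalNumber(source) };
    }
    return value;
  });
}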

ctx.source is the string "9007199254740993" — the actual characters from the input — even though value is already the rounded double. We ignore value entirely and keep the digits. Numbers are stored as a Symbol-tagged node so they can never collide with a real object that happens to have a "__num" key.

canonicalNumber then reduces the lexeme to a canonical value form using arbitrary-precision string arithmetic — never an f64 — so 2⁵³+1 stays itself all the way through the comparison.
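Since the reviver context only exists on runtimes that implement source text access, a harness may want to probe for it before trusting any verdict. A minimal sketch, assuming a hypothetical helper named `supportsSourceText` that is not part of the repo described here:

```typescript
// Probe: does this runtime pass the source-text context to JSON.parse revivers?
function supportsSourceText(): boolean {
  let seen = false;
  JSON.parse("1", (_key: string, value: unknown, ctx?: { source?: string }) => {
    // On supporting runtimes, ctx.source is the literal text "1" for this value.
    if (ctx !== undefined && ctx.source === "1") seen = true;
    return value;
  });
  return seen;
}

const sourceTextAvailable = supportsSourceText();
if (!sourceTextAvailable) {
  console.warn("JSON.parse source text access unavailable; large integers cannot be ingested losslessly");
}
```

Running the probe once at startup turns a silent downgrade into an explicit, early failure.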

Inside canonicalNumber: value identity without a float

The reviver gets us the raw digits; the remaining job is to map two different lexemes that denote the same number to the same string, without ever evaluating them numerically. "1.0", "1.00", and "1e0" must all become "1"; "1e-2" must become "0.01"; and "9007199254740993" must stay exactly itself. The whole thing is regex + string shifts:

export function canonicalNumber(lex: string): string {
  const m = /^([+-]?)(\d+)(?:\.(\d*))?(?:[eE]([+-]?\d+))?$/.exec(lex.trim());
  if (!m) return lex; // not a well-formed JSON number; fail safe, never throw

  const sign = m[1] === "-" ? "-" : "";
  const digits = m[2] + (m[3] ?? "");        // all significant digits, point removed
  const pointPos = m[2].length + (m[4] ? parseInt(m[4], 10) : 0); // where the point lands

  // Shift the decimal point by `pointPos`, padding with zeros when it falls
  // outside the digit run — pure string surgery, no parseFloat anywhere.
  let intStr: string, fracStr: string;
  if (pointPos <= 0)                  { intStr = "0"; fracStr = "0".repeat(-pointPos) + digits; }
  else if (pointPos >= digits.length) { intStr = digits + "0".repeat(pointPos - digits.length); fracStr = ""; }
  else                                { intStr = digits.slice(0, pointPos); fracStr = digits.slice(pointPos); }

  intStr  = intStr.replace(/^0+(?=\d)/, ""); // strip leading zeros, keep one
  fracStr = fracStr.replace(/0+$/, "");      // strip trailing zeros

  if (intStr === "0" && fracStr === "") return "0";       // canonical zero, sign dropped
  return fracStr ? `${sign}${intStr}.${fracStr}` : `${sign}${intStr}`;
}

Two things worth calling out. First, the if (!m) return lex line: a lexeme that doesn't match the JSON number grammar is returned untouched rather than throwing — the oracle should never crash on input, it should compare faithfully and let the result be the signal. Second, this is the exact spot where the value-vs-representation policy lives. Returning a value-normalized form here is what makes 1.0 == 1 and -0 == 0. If you instead wanted to flag representational drift — say, to surface that one implementation preserves -0 while another normalizes it — you'd return a representation-preserving form here (keep the trailing .0, keep the leading - on zero) and integers would still compare exactly. One function, one deliberate stance, documented in place.

The one real judgment call: compare by value, with exact integers

Once numbers survive ingestion intact, you have to decide what "equal" means. This is the only genuinely opinionated part of the oracle, and it's worth stating explicitly:

1.0     == 1        value-equal (RFC 8259: these denote the same number)
-0      == 0        value-equal (JSON's value model has no signed zero)
2^53+1  != 2^53     DIFFERENT integers — precision loss is a real divergence

In other words: compare by mathematical value (so representational noise like 1.0 vs 1 doesn't generate false positives), but preserve exact integer identity (so genuine precision loss is caught). That's the correct default for a "did the round-trip stay lossless?" question. Everything else in the oracle is strict: types don't coerce ("123" ≠ 123), array order matters, missing keys differ from explicit nulls, and there's no Unicode normalization (e + combining acute ≠ precomposed é).

How equality is actually computed

Rather than a recursive deep-equal with a pile of type checks, the oracle serializes each value tree to a single canonical string and compares the strings. The serialization is where the strictness is enforced structurally, so it can't be forgotten case by case:

export function canonical(node: Node): string {
  if (node === null)             return "null";
  if (typeof node === "boolean") return node ? "true" : "false";
  if (typeof node === "string")  return JSON.stringify(node);          // quoted form
  if (isNum(node))               return "#" + node[NUM];               // unquoted #-token
  if (Array.isArray(node))       return "[" + node.map(canonical).join(",") + "]";
  const keys = Object.keys(node).sort();                               // key order normalized
  return "{" + keys.map(k => JSON.stringify(k) + ":" + canonical(node[k])).join(",") + "}";
}

export const equal = (a: Node, b: Node) => canonical(a) === canonical(b);

The detail that does the heavy lifting: a string serializes to its quoted form (the string 123 becomes "123", quotes included) while a number serializes to a #-prefixed token (123 → #123). A quoted form always starts with a double quote and a number token always starts with #, so the two can never collide: the string "123" and the number 123 are structurally incapable of comparing equal, and type-strictness falls out of the representation instead of being a check you have to remember to write. Object keys are sorted so key order doesn't matter, but arrays aren't, so element order does. And because the leaves are the Symbol-tagged value-form numbers from canonicalNumber, exact-integer identity is already baked in by the time we get here.

This is also what makes the oracle cheaply self-provable: canonical has no dependency on TOON at all, so its self-test just asserts pairs of values that must (or must not) share a canonical string — 1.0/1 equal, 2⁵³+1/2⁵³ not, "123"/123 not — and runs before any adapter is touched.

Reading the matrix: the shape of the failures diagnoses the bug

With two adapters (TS, Python) and 13 cases, a run is 2×2×13 = 52 pair-checks. Most cells pass. But the interesting thing isn't that cells fail — it's which cells fail, because the pattern tells you what kind of bug you're looking at before you read a single value.

Here are the two cases that diverged, as 2×2 grids. Rows are the encoder, columns are the decoder; ✓ means the round-trip survived, ✗ means the oracle caught a divergence:

Case 013 — integer 9007199254740993 (2⁵³+1):

                decode_TS   decode_Py
  encode_TS        ✗            ✗
  encode_Py        ✗            ✓

Three of four pairs fail — including TS→TS, the self-pair that's supposed to be your control. That shape is the diagnosis: when even an implementation's round-trip with itself fails, the problem isn't a handoff between languages, it's that implementation's number model. TS holds the integer as an f64 the instant it touches it, so encode_TS has already lost the digit before any decoder runs; and decode_TS re-loses it even when Python encoded it faithfully. The only surviving cell is Py→Py, because Python's arbitrary-precision integers never round. A failure that includes the diagonal = an encode/decode-side capability limit, not a protocol disagreement.

Case 002 — empty array []:

                decode_TS   decode_Py
  encode_TS        ✓            ✗
  encode_Py        ✓            ✓

Exactly one cell fails: encode_TS → decode_Py. That's the signature of a true cross-implementation handoff bug. TS encodes the empty array as the bare [] — a form TS's own decoder happily reads back (so TS→TS passes and its conformance suite stays green), but which Python's decoder chokes on, returning the corrupted '['. Python's own output ([0]:, the spec-canonical form) is read correctly by everyone, so its whole row passes. A single off-diagonal failure points straight at "implementation A emits something only A can read."

This is the entire argument for differential testing in one picture. The 013 pattern (fails on the diagonal) and the 002 pattern (fails on exactly one off-diagonal cell) are different bug classes, and you can tell them apart by the geometry of the matrix without even inspecting the payloads. A single-implementation conformance suite only ever runs the diagonal — so it can see 013 (sort of, if it tests the boundary) but is structurally blind to 002, whose only failing cell is off-diagonal by definition.
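The geometric reading can be made mechanical. The classifier below is a sketch under the rule stated above (any diagonal failure means a capability limit, failures confined to off-diagonal cells mean a handoff bug). The type and function names are assumptions, not part of the harness.

```typescript
// Hypothetical classifier for one case's pass/fail grid.
// grid[i][j] is true when encoder i -> decoder j survived the round-trip.
type Shape = "all-pass" | "capability-limit" | "handoff-bug";

function classify(grid: boolean[][]): Shape {
  const failures = { diagonal: 0, offDiagonal: 0 };
  grid.forEach((row, i) => {
    row.forEach((passed, j) => {
      if (passed) return;
      if (i === j) failures.diagonal += 1;
      else failures.offDiagonal += 1;
    });
  });
  if (failures.diagonal > 0) return "capability-limit"; // a self-pair failed
  if (failures.offDiagonal > 0) return "handoff-bug";   // only cross pairs failed
  return "all-pass";
}

// Grids for the two cases: rows = encoder (TS, Py), columns = decoder (TS, Py).
const case013 = [[false, false], [false, true]];
const case002 = [[true, false], [true, true]];
console.log(classify(case013)); // capability-limit
console.log(classify(case002)); // handoff-bug
```

The label is a triage hint, not a verdict: it says which kind of investigation to start with, and the payloads still decide what is actually wrong.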

Scale this to three adapters and it's a 3×3 grid per case; the Rust adapter (different number model again) turns each case into 9 pair-checks and adds a whole new row and column of handoffs where divergences can hide.

Thirteen hand-designed probe cases, aimed at known fault lines: empty containers, almost-uniform tables, string-lookalikes, Unicode edge cases, whitespace, and the number boundaries around 2⁵³. First run, two real divergences.

1. Integer 2⁵³+1 — silent precision loss across the boundary

Input {"unsafe": 9007199254740993}. The TS path rounds it to ...992; the Python path preserves it exactly. A native TS self-round-trip still looks green, because the expected and actual values are rounded identically; the loss only becomes visible when the comparison is lossless or when you cross implementations, which is precisely what the matrix does.

The honest framing here matters: TOON's spec permits precision loss for numbers outside a host's safe range if the implementation documents it. So this isn't a flat bug — it's a documented-divergence boundary, and the right place to surface it is the tool's README, not an accusatory issue. Differential testing is what makes the boundary visible instead of theoretical.

2. Empty array — a genuine cross-implementation bug, both halves

This one's a real bug, and it has two sides.

The spec is explicit (§9.1): an empty array encodes as [0]:. But:

Encoder side (TS): encode([]) emits the bare [], with no length header — non-conformant to §9.1.
Decoder side (Python): decode("[]") returns the string '[' — a single character, with the ] silently dropped. Not the empty array, not an error, not even the literal string "[]". Just corrupted output flowing downstream with no signal.

decode("[]")        # -> '['            corrupted
decode("[0]:")      # -> []             correct (canonical form)

Both halves filed: toon-format/toon#322 (encoder) and toon-format/toon-python#61 (decoder). Each side individually "worked" — the bug only existed in the handoff.

Honest status: a differential probe, not yet a fuzzer

I want to be precise about what this is. Right now it's 13 curated inputs, not a generative fuzzer. The next step is a mutation-based generator that takes those seeds and pushes along the same fault lines — boundary integers, delimiter-adjacent strings, near-uniform tables, empty containers — then shrinks any failure to a minimal reproducer. That's what turns "I picked inputs that break things" into "the tool finds inputs nobody wrote." A third adapter (Rust, with its i64/u64/f64 split) widens the number-model coverage where the next divergences most likely hide.

The transferable idea

None of this is TOON-specific. If you maintain any format with multiple independent implementations — a serializer, a parser, a protocol codec — the same shape applies:

Cross, don't self-check. Run decode_B(encode_A(x)), not decode_A(encode_A(x)). Same-implementation round-trips hide boundary bugs by construction.
Don't let your oracle corrupt the evidence. If your comparison path rounds, normalizes, or coerces, it will mask the exact divergences you're hunting. Capture values losslessly (the ctx.source trick is a clean way to do it in JS) and decide your equality semantics deliberately.
Prove the judge independently. The oracle must pass its own self-test with zero dependency on the things it judges.

Repo: https://github.com/antrixy/toon-diff

Two real bugs, no fuzzer yet, on inputs a person hand-wrote. The interesting part wasn't the bugs — it was building a comparator honest enough to believe when it said FAIL.

Decades In, and a Date Field Still Got Me

Maverick Y — Wed, 01 Jul 2026 02:14:52 +0000

I've been writing software for a long time. Long enough to have shipped in languages people don't run anymore, long enough to have made just about every mistake in the book at least once. You'd think that by now the simple bugs would be behind me.

They're not. And this is a story about one of them — written, mostly, for the junior developers who think the dangerous bugs are the complicated ones.

They aren't. Let me explain.

What happened

A flow that very few people use, but where each transaction carries a serious dollar amount, quietly stopped letting anyone complete it. No crash. No alarm. No stack trace begging for attention. Just a validation check that calmly, confidently refused to let people finish.

The cause: a date comparison was treating the current month as if it were in the future. A perfectly valid entry got rejected as impossible. Nobody on the other end was doing anything wrong — the code had simply decided that "now" hadn't arrived yet.

Low traffic, high value. That combination matters. When millions of people hit a broken page, you find out in minutes. When a handful of people hit it — but each one represents real money on the line — the failure is quieter and the cost per failure is enormous. It's the kind of thing that doesn't trip your dashboards but absolutely lands on someone's desk.

And here's the part I have to own: I didn't find it. A user did. They were the ones who hit the wall, figured out something was wrong, and reported it. My tests were green. My dashboards were calm. My decades of instinct hadn't flagged a thing. The failure surfaced because someone on the outside ran into it and bothered to tell us.

Once it was in front of me, the diagnosis was fast and the fix took minutes. But I want to be honest about how it felt: not triumphant. The bug was an off-by-one in date logic, the kind of mistake I'd have sworn I was long past — and it took a user to point at it before I ever saw it.

The part I want the juniors to hear

Early in your career, you'll assume that big problems have big causes. That if something important breaks, the explanation must be appropriately sophisticated — a race condition, a subtle architectural flaw, something worthy of the damage.

I'm here, decades in, to tell you that's a comforting lie.

Software doesn't grade on a curve. A one-line date mistake and a thousand-line design failure can produce the identical outcome: a person blocked, money stalled, and no idea why. The user never sees whether the cause was elegant or embarrassing. They just see a door that won't open.

The bugs that have cost me — and the businesses I've worked for — the most over my career were almost never the clever ones. The clever ones get attention. They get design reviews, careful testing, three sets of eyes. The simple ones slip through precisely because they look too trivial to be wrong. Nobody scrutinizes a date check the way they scrutinize the scary concurrency code. That's exactly why the date check is the one that bites.

What I'd tell my younger self

Dates are a trap that never stops being a trap. Timezones, month boundaries, "today," inclusive vs. exclusive ranges, off-by-one at the edges. I have decades of experience and date logic still demands my full attention. Treat every comparison like it's hiding an edge case, because it is.
Low volume is not low risk. Don't measure a bug's severity by how many people hit it. Measure it by what each hit costs. A rarely used path guarding something valuable deserves more care than a busy path guarding something trivial.
The boring bugs are the dangerous ones. They survive code review because they look harmless. Your instinct will be to spend your scrutiny on the complex code. Fight that instinct and give the "obvious" code a second look.
Calm is a senior skill, and you can start building it now. I diagnosed this quickly not because I'm brilliant but because once it was in front of me I didn't burn twenty minutes panicking. I read the logic, traced the input, and trusted that the cause was probably mundane. It almost always is.
Your users are your last line of defense — which means your earlier lines failed. A user caught this, not me, not my tests, not my monitoring. That's a gift and a warning at the same time. Be grateful when someone reports a bug clearly; they just did your QA for free. But also ask the harder question: why did it have to reach them at all? Every user-reported bug is a quiet audit of the safety nets that didn't catch it.
Test the boundaries, not the middle. Bugs don't live at "obviously valid" or "obviously invalid." They live at the edges — the first of the month, the last second of the year, the value that's exactly equal. Write the test for the edge before you write the code for the middle.
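To make the last point concrete, here is a sketch of boundary tests for a month comparison. It is a hypothetical validator, not the code from the incident. The 0-based month returned by `Date#getUTCMonth()` is one well-known way a current month can end up looking like a future one, but that is an assumption chosen for illustration, not a claim about the actual cause.

```typescript
// Hypothetical validator; the real code from the incident is not shown in this post.
interface YearMonth {
  year: number;
  month: number; // 1-12, as a person would enter it
}

function toYearMonth(now: Date): YearMonth {
  // Date#getUTCMonth() is 0-based (January is 0). Forgetting the +1 makes
  // an entry for the current month compare as one month ahead of "now".
  return { year: now.getUTCFullYear(), month: now.getUTCMonth() + 1 };
}

function isInFuture(entry: YearMonth, now: YearMonth): boolean {
  if (entry.year !== now.year) return entry.year > now.year;
  return entry.month > now.month; // strictly greater: the current month is valid
}

// Boundary tests: the exactly-equal month and the year rollover.
const july2026 = toYearMonth(new Date(Date.UTC(2026, 6, 1)));
const december2026 = toYearMonth(new Date(Date.UTC(2026, 11, 31, 23, 59, 59)));
console.log(isInFuture({ year: 2026, month: 7 }, july2026));      // false
console.log(isInFuture({ year: 2026, month: 8 }, july2026));      // true
console.log(isInFuture({ year: 2027, month: 1 }, december2026));  // true
console.log(isInFuture({ year: 2026, month: 12 }, december2026)); // false
```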

The thing that took me years to accept

When you find a bug this simple, a voice shows up: how did I miss something so obvious?

I still hear that voice. The difference now is that I don't believe it anymore. Obvious-in-hindsight is the natural condition of nearly every bug ever fixed. Hindsight makes everything look inevitable.

The measure of a good engineer was never about avoiding the simple mistake. You will write it. I still write it, after all this time. The measure is whether you can stay calm, find it fast, and fix it before it costs someone something that matters.

To the juniors reading this: the unglamorous saves count. Quietly. Permanently. Most of the real work of this job looks exactly like this — not heroic, just careful. Get comfortable with that early, and the decades go a lot smoother.

Cut LLM prompt tokens on structured data — losslessly

Maverick Y — Sat, 27 Jun 2026 12:01:37 +0000

A small, dependency-free tool for shrinking logs, JSON, and CSV in prompts — without dropping a single byte.

Logs, JSON, and CSV are some of the bulkiest, most repetitive things we feed into LLMs. They're also where prompt-token costs quietly pile up.

The trouble with lossy compression

The usual fix is semantic compression: have a model summarize the input and drop "low-information" tokens. It works — until the question needs the data that got dropped.

Ask:

"How many errors are in this log?"
"What's the total across these 400 rows?"

…and a lossy compressor can hand back a confident, wrong answer — because the rows it discarded were exactly the ones you needed. The compression looks great. The answer is broken.

A different bet: lossless or no-op

ctxfold takes the opposite approach. Its single rule:

Lossless or no-op. Never lossy.

Instead of summarizing, it re-encodes structure. Logs, JSON arrays, and CSV are tables in disguise — the same keys, prefixes, and templates repeat on every line. ctxfold lifts those repeated parts into a one-time header and keeps only what varies per row, producing a compact, self-labeling table the model reads directly. Nothing is dropped.

The guarantee is enforced in code: every encoder ships with a decoder, and compress() verifies that decoding its output reproduces the input before returning it. If it can't, you get your original text back, untouched. It can't corrupt your data — worst case, it does nothing.
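The verify-before-return rule can be sketched generically. Everything below is an illustration under assumptions: the `Codec` shape, the `losslessOrNoop` wrapper, and the toy codecs are hypothetical and are not ctxfold's actual internals or output format.

```typescript
// Hypothetical codec shape; ctxfold's real internals may differ.
interface Codec {
  encode(input: string): string;
  decode(encoded: string): string;
}

interface FoldResult {
  text: string;     // what to send onward
  applied: boolean; // false means the input came back untouched
}

function losslessOrNoop(input: string, codec: Codec): FoldResult {
  try {
    const encoded = codec.encode(input);
    // Accept the encoding only if decoding reproduces the input exactly.
    if (codec.decode(encoded) === input) return { text: encoded, applied: true };
  } catch (_err) {
    // A throwing codec is treated the same as a failed round-trip.
  }
  return { text: input, applied: false };
}

// Toy lossless codec: lift the prefix shared by every line into a one-line header.
function commonPrefix(lines: string[]): string {
  let prefix = lines[0] ?? "";
  lines.forEach((line) => {
    while (!line.startsWith(prefix)) prefix = prefix.slice(0, -1);
  });
  return prefix;
}

const prefixFold: Codec = {
  encode(input) {
    const lines = input.split("\n");
    const prefix = commonPrefix(lines);
    return JSON.stringify(prefix) + "\n" + lines.map((l) => l.slice(prefix.length)).join("\n");
  },
  decode(encoded) {
    const rows = encoded.split("\n");
    const prefix = JSON.parse(rows.shift() ?? '""') as string;
    return rows.map((r) => prefix + r).join("\n");
  },
};

// Toy lossy codec: collapsing whitespace cannot be undone.
const lossy: Codec = {
  encode: (input) => input.replace(/\s+/g, " "),
  decode: (encoded) => encoded,
};

const sampleLog = "2026-07-01 ERROR a\n2026-07-01 ERROR b\n2026-07-01 ERROR c";
const folded = losslessOrNoop(sampleLog, prefixFold);
const refused = losslessOrNoop(sampleLog, lossy);
console.log(folded.applied, refused.applied); // true false
```

The design choice is that the wrapper never has to trust the encoder: a buggy or lossy codec degrades to a no-op instead of corrupted output.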

Does the model still read it?

For logs and JSON — yes. On real data, ctxfold cuts ~35–40% of tokens on templated logs and JSON arrays, fully losslessly. And because the output is plain, labeled text, the model reads it as well as the raw input — in lookup tests against GPT-4o-mini, answers off the compressed form matched answers off the raw data, field for field.

(Readability is validated per format, on GPT-4o-mini; the lossless guarantee is model-independent. CSV turned out differently — see the update below.)

Update (July 2026)

When I wrote this, CSV readability was the one unvalidated cell in the benchmark table. I've since measured it with the same harness used for JSON and logs — and it failed: folded CSV scored 0/24 on GPT-4o-mini and 6–9/24 on GPT-4o, against 24/24 raw.

The root cause is structural: JSON and logs fold syntax (keys, braces, templates) and every value stays verbatim in the row. CSV has no syntax to remove, so its folding factors the data itself — and models don't reliably reconstruct values through a header at read time.

So as of v0.1.4, CSV folding is documented as pipeline-mode: fold for lossless transit, decompress() before the model reads it. For direct model reading, send CSV raw. JSON and logs remain validated direct-readable.

v0.2.0 also shipped ctxfold --profile — it shows where your prompt's characters go and what folding would save, with the same measured-claims rules. Zero-setup demo: node examples/profile-demo.js after cloning the repo.

Full story of the CSV result and the profiler: I built a readability test for my own compression format. It scored 0/24.

Try it

npm install ctxfold

const { compress } = require("ctxfold");

const { text, stats } = compress(bigLogOrJsonOrCsv);
// send `text` instead of the original
console.log(`${(stats.tokenRatio * 100).toFixed(0)}% fewer tokens, lossless: ${stats.lossless}`);

It's a pure text transform — no API calls, no model, zero dependencies — so it works with any LLM.

Not a replacement — the other half

ctxfold isn't a competitor to semantic compression; it's the complement. Summarize to extract a subset; ctxfold to shrink repetition without losing anything. It shines on structured data, not prose.

Why I built it

This started from a simple frustration: lossy prompt compressors gave impressive token savings, but on aggregate questions — counts, totals, "find this record" — the answers came back wrong, because the data needed to answer had been summarized away. Great compression, broken results. The fix wasn't a smarter summarizer; it was to stop dropping data at all. Repetitive structured text is compressible losslessly — you just have to treat it as structure instead of prose.

If you push a lot of logs, JSON, or CSV into prompts, I'd genuinely like to know what your payloads look like and whether the lossless tradeoff fits your use case. What's eating the most tokens in your prompts right now? Questions, critique, and edge cases that break it are all welcome in the comments.

Repo & docs: https://github.com/antrixy/ctxfold · npm: npm install ctxfold · MIT licensed.