DEV Community: zxpmail

Six experiments on adversarial verification — and the 75% wall that didn't move

zxpmail — Tue, 14 Jul 2026 13:15:12 +0000

The argument, in one line: a reviewer is a mechanism for drawing a line. Every fix moves the line — but the line can't be eliminated, because it lives on a 3-dimensional surface where multiple defensible boundaries cross. So the 75% false-negative wall doesn't move, and the practical move is to stop trying to move it.

1. The wall

The setup was simple. Let an LLM review what an AI agent produced and judge whether it satisfies the task. Outputs were a mix of obvious garbage ("I am a little duck, quack quack", "。", TODO placeholders, zero collected tests) and legitimate work (research briefs, draft documents, passing test runs, code, translations). 8 scenarios in the first round, expanded to 30 in the second.

When the reviewer is sharp enough to catch all the garbage, it lands at 0% false positives and 75% false negatives — three out of four valid outputs rejected. This is the wall. GLM-5.2 and deepseek-v4-flash both hit it. Smaller models (qwen3:0.5b at ~25% FN, gemma3:4.3b at ~50% FN) sit earlier on the curve — letting some garbage through, rejecting less valid work. They're not better; they're just at a different operating point on the same curve.

I tried three standard moves to shift off the wall.

Rerun and majority-vote the same prompt. N=10 reruns per scenario. The verdict was unanimous on every scenario with enough valid calls. The 75% is systematic, not random — the model commits to the same wrong call every time. You can't vote away a verdict that doesn't vary.

Vote across different prompts. Strict, balanced, and lenient prompts judged each scenario. Split votes are a useful signal — they flag scenarios where the test set itself is contested. But majority voting still hits 75% false negatives, because all three prompts share the same bias direction. Why? Section 2's answer: the model's boundary is stable; prompt wording labels the line, it doesn't move it. Voting smooths noise; it doesn't fix bias.
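A minimal sketch of why neither voting scheme helps, assuming verdicts are plain strings. The verdict lists below are made-up stand-ins that mirror the pattern described above, not data from the experiment scripts, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter

def majority_vote(verdicts):
    """Return the most common verdict in a list of verdict strings."""
    return Counter(verdicts).most_common(1)[0][0]

# Made-up verdicts for one valid output: ten reruns of the same prompt agree.
reruns = ["REJECT"] * 10

# Made-up verdicts across three prompt styles: one dissent, same bias direction.
across_prompts = {"strict": "REJECT", "balanced": "REJECT", "lenient": "PASS"}

rerun_verdict = majority_vote(reruns)
prompt_verdict = majority_vote(list(across_prompts.values()))
is_split = len(set(across_prompts.values())) > 1  # the useful signal

print(rerun_verdict, prompt_verdict, is_split)
```

The vote returns the biased verdict in both cases. The only new information is `is_split`, which marks the scenario as contested.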

Calibrate the prompt wording. A "balanced" prompt (v3) hit 100% accuracy on the 8 Phase Gate scenarios. The standard "calibrate your prompt" advice seemed to work. Expanded to 30 scenarios, v3 and the strict v2 returned identical verdicts on every valid call. The improvement on 8 was test-set composition bias — the original scenarios happened to favor v3's leniency.

The wall is real. None of the standard levers moved it.

2. Why the wall doesn't move

A reviewer is a mechanism for drawing a line. The line separates "sufficient output" from "insufficient output" — that's the whole job. Formal checks, LLM judgments, prompt wording — these are choices of where and how to draw it.

Here is the property that matters. A sharper line catches more garbage and rejects more marginal-valid output. Same sharpness, opposite effects on the two error types. Sharpen the line and false positives drop while false negatives rise. Dull it and the reverse. The precision-recall tradeoff isn't a model defect — it's the geometry of drawing a line with imperfect discrimination. A perfect reviewer wouldn't have this tradeoff; reviewers have opinions about where the boundary lives, and those opinions are noisy.
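A toy illustration of the tradeoff, assuming the reviewer reduces each output to a single "looks sufficient" score and accepts anything at or above a threshold. The scores are invented so the numbers echo the experiments; they are not measurements.

```python
# Invented reviewer scores (higher = "looks sufficient"); not experiment data.
garbage_scores = [0.05, 0.10, 0.30, 0.55]
valid_scores = [0.40, 0.50, 0.60, 0.90]

def error_rates(threshold):
    """Accept outputs scoring >= threshold; return (FP rate, FN rate)."""
    fp = sum(s >= threshold for s in garbage_scores) / len(garbage_scores)
    fn = sum(s < threshold for s in valid_scores) / len(valid_scores)
    return fp, fn

# A dull line, a middle line, and a sharp line.
for t in (0.2, 0.45, 0.7):
    fp, fn = error_rates(t)
    print(f"threshold={t:.2f}  FP={fp:.0%}  FN={fn:.0%}")
```

With these invented scores, the dull line gives 50% FP and 0% FN, and the sharp line gives 0% FP and 75% FN. No threshold reaches zero on both, because the two score ranges overlap.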

The six experiments drew the line in three different ways. Phase Gate drew it on form — file exists, exit code 0 — which is independent of content. Four pieces of garbage ("I am a little duck", "。", TODO placeholder, zero collected tests) sailed through. False positives: 50%. Adversarial verification drew the line on semantics with an LLM. Much sharper. Caught all the garbage (false positives → 0%), and the same sharpness rejected three out of four marginal-but-valid outputs (false negatives → 75%). Prompt calibration tried to move the line by changing the wording — strict vs. balanced vs. lenient. On 30 scenarios, v2 and v3 returned identical verdicts on every valid call. The line didn't move, because wording doesn't draw lines. Wording labels lines. The third attempt is the limit of the substitution approach: once you're using words to move a line the model already drew, you're not substituting anymore. You're decorating.

So why not find a sharper line — or a different kind of line — that catches garbage without burning valid work? Because the line doesn't live in a one-dimensional space.

The boundary between "sufficient" and "insufficient" depends on at least three independent questions. Who consumes the output — a junior engineer taking it at face value, or a senior reviewer who'll catch edge cases? Where it's deployed — a prototype thrown away next week, or production that runs for years? What fails if it's wrong — a demo that embarrasses you in a meeting, or a deploy that takes down the service?

These three dimensions are mostly independent, not perfectly orthogonal. They correlate — consumer type gives a weak hint about deployment context — but not enough to collapse into one axis. Knowing the consumer doesn't determine the deployment. Knowing the deployment doesn't determine the cost of failure. So the boundary isn't a point in 1D space; it's a surface in 3D space. And most real outputs land somewhere in the interior — where multiple defensible boundaries cross.

"Is this output sufficient?" doesn't have a single answer because the question is underspecified. Different consumers, contexts, and costs give different defensible answers. The fuzziness isn't a property of weak models. It's a property of the question.

The practical conclusion falls out of the geometry. If the fuzziness is in the question, no model removes it. No prompt removes it. No voting scheme removes it. They just draw lines in different places on the same surface. The 75% didn't move across four models because there's nowhere to move it to — moving the operating point along the surface trades FP for FN, but the surface itself doesn't disappear.

We weren't failing to find the right trick. We were looking for a trick that doesn't exist.

3. Design around the wall

So design around it. The move is not "fix the wall." The move is "stop trying to fix the wall" — and that acceptance changes the design.

If the 75% is structural, you stop spending LLM calls on garbage that rules can catch (keyword match catches "I am a little duck", length check catches "。"). You stop trying to vote your way out of a systematic bias. You stop calibrating prompt wording and pretending the model's boundary will follow. Instead, you put rules where rules work, one calibrated LLM where semantics actually matters, and humans where the 3D boundary surface gets fuzzy — which Section 2's dimension argument tells you is exactly where models disagree. In practice: cheap deterministic checks (length, keyword, format) catch the obvious garbage, one calibrated LLM call judges the semantic residual per requirement, and any split verdict escalates to a human. The LLM never sees the cases rules can handle — it sees only what rules can't.
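The layering above can be sketched as follows. This is an illustration under stated assumptions, not the series' implementation: `GARBAGE_PHRASES`, `MIN_LENGTH`, and the verdict strings are hypothetical, and each judge is a stand-in callable for one calibrated LLM call on one requirement.

```python
from typing import Callable, Optional, Sequence

GARBAGE_PHRASES = ("i am a little duck", "todo")  # hypothetical keyword list
MIN_LENGTH = 20                                   # hypothetical length floor

def rule_check(output: str) -> Optional[str]:
    """Return a rejection reason if a cheap deterministic rule fires, else None."""
    text = output.strip()
    if len(text) < MIN_LENGTH:
        return "too short"
    lowered = text.lower()
    for phrase in GARBAGE_PHRASES:
        if phrase in lowered:
            return f"garbage phrase: {phrase}"
    return None

def review(output: str, judges: Sequence[Callable[[str], str]]) -> str:
    """Rules first; judges see only the residual; split verdicts go to a human."""
    if not judges:
        raise ValueError("at least one judge is required")
    if rule_check(output) is not None:
        return "REJECT"
    verdicts = {judge(output) for judge in judges}
    if len(verdicts) > 1:
        return "ESCALATE_TO_HUMAN"
    return verdicts.pop()
```

The judges are never called for outputs the rules reject, which is the point of the ordering: LLM calls are spent only on the semantic residual.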

And then you pick a side of the wall. This is not a TODO; it is the load-bearing decision the rest of the design implements. More false positives means more reviewer attention burned on valid work flagged as suspect. More false negatives means more defective work ships. The tradeoff is structural. The only mistake is pretending you don't have to choose.

4. The illusion kept moving

The series is called "Agent Determinism Illusions." Across six experiments, the illusion kept moving.

It started in output determinism — temp=0 was supposed to guarantee consistency, and it doesn't (20 different versions of the same listing on a structured task). Caught, the illusion moved into review standards — formal checks were supposed to guarantee quality, and they don't ("file exists" passes "I am a little duck, quack quack"). Caught again, it moved into solution complexity — surely multi-model voting, or calibrated prompts, or layered pipelines would help. They don't, not really; each layer inherits the same wall. Caught a third time, the illusion stopped hiding in technical assumptions and moved up a level: into the meta-expectation that enough experiments produce a clean conclusion. They produce the conclusion that there is no clean conclusion.

The illusion keeps moving because we keep chasing it. The work isn't to catch it. The work is to stop expecting it to stand still.

Experiment code: agent-determinism-illusions/scripts/phasegate-formalism-test.py, adversarial-verify-p1.py, consistency-test-p2.py, multi-perspective-vote-p3.py, prompt-calibration-p3b.py, p4-expanded-test.py

Previous: An alternative to LLM quality gates: deterministic routing + sampling

Series start: I tested the 'deterministic agent loop' claims with four experiments. They all failed — including my own fix.

Full series: GitHub

An alternative to LLM quality gates: deterministic routing + sampling

zxpmail — Thu, 09 Jul 2026 12:16:29 +0000

Every "agent quality gate" I tested shares one fatal assumption: that an LLM can judge whether an LLM did the right thing. This article drops that assumption. The alternative isn't a smarter judge — it's no judge at all, in the control layer.

Over the last three articles, I tested the popular "production agent loop" design across six separate experiments:

Lexical overlap ≠ semantics — 50% misclassification
Temperature 0 ≠ determinism — open output only 70% consistent
Phase gates ≠ task completion — 50% false positives
Embedding ≠ synonym/antonym separation — cosine diff 0.026
Stronger models trade false positives for false rejections — GLM-5.2 hit 0% FP but rejected 75% of valid work
Architecture diagrams ≠ solutions — my own human-in-the-loop Harness had 6 unvalidated assumptions

Six rounds of dismantling, all backed by reproducible experiments.

Then I asked myself the question every critic has to answer: "What's your alternative?"

Here it is. Not an architecture diagram — a set of four implementable strategies, all using deterministic code, zero new LLM dependencies.

The core insight shift

Every approach I tested or proposed shared a fatal assumption: a single module (LLM or human) can judge whether output is "correct." That binary judgment at the semantic layer is what creates the precision-recall trap that all three model tiers fell into.

The alternative: don't judge correctness. Judge risk. Route high-risk work out of the agent pipeline entirely. Auto-release low-risk work. Only show medium-risk work to a human — and when you do, make it a diff review, not a full-text read.

Four-layer architecture (all deterministic code)

Layer 1: task-type routing

Before a task enters the agent engine, a router classifies it by output type:

Type	Criterion	Strategy
A (verifiable)	Output is compilable / schema-validatable (code, JSON, SQL)	Compile check or schema validation as the deterministic gate, plus a sampled fraction routed to diff review (see Layer 4). No LLM quality inspector called for the gate itself.
B (high-risk)	Money, legal, privacy, external publishing	No agent execution. Prompt: "This task requires human handling." AI provides a draft only, never auto-executes.
C (low-risk content)	Internal briefs, first drafts, brainstorming	Auto-release. Tag as "draft" (80% default confidence). No quality queue.
D (medium-risk content)	Client-facing emails, external documents	Diff review. Don't judge content quality. Only show what changed.

Why this beats an "LLM quality inspector": it acknowledges the LLM's limit at the source. Use the LLM for what it can do (generate). Never use an LLM for what it does poorly (judge semantic quality).

Layer 2: diff review — replace "judge right/wrong"

This is the key operational alternative. For Type D tasks, don't show the reviewer the "final output." Show them what the agent changed from the previous version.

Implementation: after generation, the system diffs the output against the original (or a template) using difflib — no LLM needed.

Reviewer UI: only the modified lines are highlighted. The reviewer answers one question: "Does this change introduce an error?"
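A minimal sketch of the diff step using the standard library's `difflib.ndiff`. The function name `changed_lines` and the sample email are hypothetical; a real reviewer UI would render the lines rather than print them.

```python
import difflib
from typing import List

def changed_lines(original: str, revised: str) -> List[str]:
    """Return only removed ('- ') and added ('+ ') lines from a line-level diff."""
    diff = difflib.ndiff(original.splitlines(), revised.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop both.
    return [line for line in diff if line.startswith(("+ ", "- "))]

original = "Dear client,\nThe quote is 1,200 USD.\nBest regards"
revised = "Dear client,\nThe quote is 2,100 USD.\nBest regards"

for line in changed_lines(original, revised):
    print(line)
```

The reviewer sees the two versions of the single changed line and nothing else, which turns the question into "did this change introduce an error?"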

Measured cognitive load:

Method	Reading load	Cognitive demand	Time
Full-text quality judgment (500 words)	500 words	High	~60 s
Diff review (50-word change)	50 words	Low	~10 s (lab; production: 30–90 s — see Knife 3 below)

The shift: open-ended judgment ("Is this article good?") becomes closed-ended ("Did this paragraph break something?"). Cognitive demand drops significantly, though the lab figure of ~10 s stretches to 30–90 s in production once context-switching overhead is included (Knife 3 below).

Layer 3: statistical process control — replace semantic clustering

Semantic clustering failed because content about different topics (a research brief vs. a chapter) won't embed close together just because they were both falsely rejected.

Alternative: don't cluster by semantic features. Cluster by behavioral features.

For every task, record metadata — execution steps, tool-call count, output length, special-character ratio. Then detect anomalies with simple statistical thresholds:

15 execution steps vs. mean of 5 → flag
Output length = 0 → flag
Output is 100% punctuation (the duck and the period from Experiment 3) → flag

All code. Zero LLM cost. And it catches "quack quack" (length anomaly) and "." (special-char anomaly) without understanding a word.

Measured: I ran SPC on the 8 scenarios from Experiment E (content length, special-char ratio, CJK ratio, alpha ratio, digit ratio; statistical threshold = mean +/- 1.5sd):

Scenario	SPC result	Actual	Correct?
G1 (duck, garbage)	ANOMALY (high CJK ratio)	Garbage	✅
G2 (period, garbage)	ANOMALY (100% special char)	Garbage	✅
G3 (TODO, garbage)	ANOMALY (100% alphabetic)	Garbage	✅
G4 (zero-case, garbage)	NORMAL (same features as L4)	Garbage	❌ FN
L1-L4 (valid)	Normal (one mild false-positive)	Valid	✅

SPC catches format anomalies (period, TODO, duck). G4 (zero-case test log) has the exact same behavioral profile as L4 (valid test log) — SPC misses it 100%. This directly validates the stated blind spot: SPC catches format anomalies but not semantic traps. G4‑class failures can only be caught by sampling, never prevented.
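The thresholding in this layer can be sketched as below, assuming per-task metadata is stored as a dict of numeric features. The baseline values are invented for illustration, `spc_flags` is a hypothetical name, and `k=1.5` follows the mean +/- 1.5sd threshold used in the measurement above.

```python
import statistics
from typing import Dict, List

def spc_flags(history: List[Dict[str, float]], task: Dict[str, float],
              k: float = 1.5) -> List[str]:
    """Return the feature names of `task` that fall outside mean +/- k*sd."""
    flags = []
    for name, value in task.items():
        past = [h[name] for h in history]
        mean = statistics.mean(past)
        sd = statistics.stdev(past)  # sample standard deviation
        if abs(value - mean) > k * sd:
            flags.append(name)
    return flags

# Invented baseline metadata for five "normal" tasks.
history = [
    {"steps": 4, "output_len": 480},
    {"steps": 5, "output_len": 520},
    {"steps": 6, "output_len": 500},
    {"steps": 5, "output_len": 510},
    {"steps": 5, "output_len": 490},
]

print(spc_flags(history, {"steps": 15, "output_len": 505}))  # step-count anomaly
print(spc_flags(history, {"steps": 5, "output_len": 0}))     # empty output
```

A G4-style trap would pass straight through this check: if its features match a valid test log, every value stays inside the band.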

Layer 4: fixed-rate sampling — replace confidence scores

Several approaches I initially proposed relied on a "confidence score" (> 95% auto-release, < 80% human review). The hidden cost: confidence requires a feedback loop to calibrate — database, ground-truth labeling, delayed updates. The same complexity I criticized in the closed-loop calibration critique.

Alternative: fixed-rate sampling. No confidence math.

Type	Handling	Sample rate
A (verifiable)	Compile / schema gate + sampled into diff review	X% (tuned)
B (high-risk)	Mandatory human	—
C (low-risk content)	Auto-release	0%
D (medium-risk content)	Diff review (all items)	100%
Zero-shot generation (no prior version, no template)	Sample review	Fixed 5%

Post-publication correction (raised by Dipankar Sarkar in the dev.to comments): the original version of this table had Type A at 0% sample rate. That quietly treated schema-validatable syntax as a stand-in for semantic correctness — schema-valid JSON with a plausible-but-wrong value clears the gate silently, code that compiles can still book the wrong flight. This is the same class as the G4 finding in Layer 3 above (format-channel gate kills format-channel failure, not semantic failure); I called it out for SPC and then let Type A make the same mistake one layer up. The 0% was an indefensible asymmetry: zero-shot content gets sampled because there's no prior version to diff against, but schema-validatable code doesn't? X% should be calibrated from defect-rate data using the same logic as zero-shot's 5%. Start at 1-2% in week one, tune from there.

I admit: 5% is a guess. But its mathematical properties are known and quantifiable — which is more than can be said for a confidence score with no feedback loop.

Relentless self-review (same ruler)

Before calling this "done," I applied the same six-cut standard to this design.

Finding 1: classification is not free

Type labels can't depend on business owners manually tagging every task. They don't know their own types — they'd label 70% as "D" to be safe.

Fix: In the MVP phase, use two hard rules for automatic classification: ① if the task text contains sensitive keywords (money/contract/compensation) → force B; ② if the tool-call chain hits "send/publish/submit" → force human confirmation. Everything else defaults to C. Tune thresholds after launch based on false-positive rate.
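A minimal sketch of the two hard rules, assuming tasks arrive as free text plus a list of tool names. The regex, the verb list, and the `HUMAN_CONFIRM` label are hypothetical; only the keywords and the B/C defaults come from the rules above.

```python
import re
from typing import Sequence

# Rule 1 keywords (from the text); whole-word, case-insensitive match is an assumption.
SENSITIVE_KEYWORDS = re.compile(r"\b(money|contract|compensation)\b", re.IGNORECASE)
# Rule 2 verbs (from the text); substring match on tool names is an assumption.
CONFIRM_VERBS = ("send", "publish", "submit")

def classify(task_text: str, tool_calls: Sequence[str] = ()) -> str:
    """Apply rule 1, then rule 2, then fall back to the low-risk default."""
    if SENSITIVE_KEYWORDS.search(task_text):
        return "B"
    if any(verb in tool.lower() for tool in tool_calls for verb in CONFIRM_VERBS):
        return "HUMAN_CONFIRM"
    return "C"

print(classify("Draft the compensation letter"))
print(classify("Summarize meeting notes", ["send_email"]))
print(classify("Brainstorm product names"))
```

Both rules will misfire in both directions, which is why the thresholds and keyword lists need tuning after launch.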

Finding 2: diff review covers a narrower range than "edit tasks"

Diff review only works when there's a clear prior version. Agent workflows often involve reading five source documents → writing a new one from scratch — there's no single "previous version" to diff against.

Fix: In this design, "edit task" means exactly "a prior version of the same document exists." Multi-document synthesis tasks go to "zero-shot generation" → fixed 5% sampling. This is an honest scope reduction.

Finding 3: 5% sampling has known detection probability

With 5% sampling on zero-shot tasks, the exponent below assumes five sampled items per day (for example, 5% of 100 tasks). If the real defect rate is 20% on a given day, the probability of detecting at least one defective item is 1 − (0.8)⁵ ≈ 67%. That leaves a 33% probability of zero detection on any single day, so a silent degradation could slip through for days.

Fix: 5% for non-critical content is acceptable. For critical content, raise to 10–20% or use deterministic sampling (every Nth item). First week post-launch: use 20% sampling to collect baseline defect-rate data before tuning.
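The detection arithmetic and the "every Nth item" option can be sketched as below. The function names are hypothetical, and the 100-tasks-per-day volume is an assumption inferred from the exponent in the 67% figure.

```python
def detection_probability(defect_rate: float, sampled_items: int) -> float:
    """P(at least one defect among the sampled items), assuming independence."""
    return 1 - (1 - defect_rate) ** sampled_items

def every_nth(items, n):
    """Deterministic sampling: take items number n, 2n, 3n, ... (1-based)."""
    return items[n - 1::n]

# Assumed volume: 100 zero-shot tasks per day, 20% real defect rate.
for rate in (0.05, 0.10, 0.20):
    sampled = int(100 * rate)
    p = detection_probability(0.20, sampled)
    print(f"sample rate {rate:.0%}: {sampled} items, detection {p:.1%}")

print(every_nth(list(range(1, 101)), 20))
```

Deterministic sampling has the same expected coverage as random sampling at the same rate, but it guarantees a fixed number of reviewed items per batch.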

Finding 4: sensitive-tool interception is not free

Intercepting "send email" after the agent has already taken 4 steps is not zero-cost — those steps consumed inference budget.

Fix: Add a "preheat check" before the agent executes — scan the user's request text for sensitive verbs (send/modify/delete/submit) and pre-confirm with the user. Don't wait until runtime to pull the trigger.

Finding 5: engineering cost — I made the same mistake I criticized

I initially estimated 2 engineer-months for the MVP. Same flaw as the cost analysis I criticized in my previous article: I only counted the core modules, not the integration.

Honest breakdown:

Module	Effort
Diff review UI (visual diff + highlight + judgment button)	1 engineer-month (frontend)
SPC collector (metadata + thresholds + aggregation)	0.5 engineer-month (backend)
Sensitive-tool whitelist + runtime interceptor	0.5 engineer-month (full-stack, needs agent framework hooks)
Monitoring dashboard + alerts	1 engineer-month (full-stack)
Sampling queue + assignment + expiry	0.5 engineer-month (backend)
Total	3.5 engineer-months (MVP)

That's 30% cheaper than the 5 engineer-month human-in-the-loop Harness — not 60%. Less sexy, but real.

Honest close: what this design solves and what it doesn't

Does solve

ROI inversion: Type A deterministic gate + sampled diff review + C auto-release + D diff-only. The fraction requiring human review drops enough that 3.5 engineer-months of investment breaks even within a reasonable horizon for most mid-volume deployments.
Clustering failure: SPC on behavioral features replaces embedding clustering. Verifiable by code, zero LLM cost.
Human error: Diff review reduces cognitive load. It doesn't eliminate errors (semantic traps still need domain knowledge), but it measurably reduces the error rate.

Doesn't solve

G4-class semantic traps (zero-case test log). These are caught by sampling, not prevented. The honest difference from the original "deterministic agent" articles: they claimed prevention; we acknowledge detection.
Type A semantic traps (compile-pass-but-wrong). Compiles-but-books-wrong-flight is sampled into diff review, not prevented. Same class as G4 above. The Layer 4 table originally had Type A at 0% sample rate — an indefensible asymmetry, corrected above.
Humans are still the final decision layer. In sensitive operations and edit reviews, humans are not optional.
Zero-shot generation is sampled, not guaranteed. 5% sampling means 67% single-day detection probability at 20% defect rate. For critical content, raise to 20% (98% detection probability).
Classification is imperfect. Automatic keyword and tool-chain classification has measurable false-positive and false-negative rates that must be tuned post-launch.

The actual prerequisites

A router/whitelist implementation, SPC threshold configuration, diff review UI, sampling queue, and monitoring dashboard — all standard CRUD + regex + statistics. No LLM dependency.
Engineering investment: 3.5 engineer-months for an MVP.
Business acceptance: "high risk requires human," "zero-shot is sampled," "semantic traps are detected, not prevented." These three constraints are business decisions, not engineering ones. No design can substitute for them.

Final rating (same ruler)

Criterion	Rating
Unvalidated assumptions?	Yes, all stated (5% sampling = 67% detection probability, not 100%)
LLM dependency in control layers?	Zero. All control logic is deterministic code.
Engineering cost estimated?	Yes: 3.5 engineer-months (honest, with integration costs)
Honest boundary declarations?	Yes: G4 traps not prevented, zero-shot sampled, humans not free, classification imperfect
Self-dismantling?	Yes — the five findings above dismantle everything that could be dismantled, plus the post-publication correction on Type A's sample rate (raised by Dipankar Sarkar). What remains are engineering facts: Type A deterministic gate + sampling, sensitive-tool hard interception, SPC format anomaly detection, and diff review cognitive-load reduction.

Three more knives before production (round two of relentless review)

Before this design hits production, three operational problems surfaced that I hadn't fully addressed.

Knife 1: SPC cold-start baseline drift

SPC uses statistical thresholds (mean +/- 1.5sd). But where do the mean and sd come from on day one?

You need 500-1000 "normal" traces to establish a baseline. If week 1 has a bug that makes every trace abnormally long, the baseline is skewed — real anomalies later get absorbed into the "new normal."

Measured: I simulated three phases (normal → bug → recovery + new anomaly) to find the real risk:

Bug severity (mean)	Mixed threshold	Anomaly (20 steps) detected?	Static threshold (>10)
Normal(5) → Bug 8	9.7	Yes	Yes
Normal(5) → Bug 12	12.7	Yes	Yes
Normal(5) → Bug 16	16.7	Yes	Yes
Normal(5) → Bug 20	20.5	No (missed)	Yes
Normal(5) → Bug 21	21.6	No (missed)	Yes

Crossover: dynamic threshold only fails at 4x the normal mean (Bug mean >= 20). SPC is more robust against moderate drift (2–3x) than the original critique claimed.

Revised response: Not a two-phase switch ("static first, then dynamic"), but dual thresholds in parallel: a static absolute threshold (steps > 20 always flagged) plus a dynamic relative threshold (rolling 7-day window). Either triggers — no dependency on clean cold-start data.
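A sketch of the parallel dual-threshold check, under stated assumptions: the rolling window is count-based (`window_size`) rather than a true 7-day window, `min_samples` is a hypothetical warm-up guard, and the class name is invented. The static limit of 20 and k = 1.5 follow the text.

```python
import statistics
from collections import deque
from typing import List

class DualThresholdDetector:
    """Flag a step count if it exceeds a static limit OR a rolling mean + k*sd."""

    def __init__(self, static_limit: float = 20, window_size: int = 500,
                 k: float = 1.5, min_samples: int = 5):
        self.static_limit = static_limit
        self.k = k
        self.min_samples = min_samples
        self.window = deque(maxlen=window_size)  # most recent step counts

    def check(self, steps: float) -> List[str]:
        """Return the thresholds that fired ('static', 'dynamic'), possibly none."""
        reasons = []
        if steps > self.static_limit:
            reasons.append("static")
        if len(self.window) >= self.min_samples:
            mean = statistics.mean(self.window)
            sd = statistics.stdev(self.window)
            if steps > mean + self.k * sd:
                reasons.append("dynamic")
        # Every observation enters the window, so the dynamic baseline can drift.
        self.window.append(steps)
        return reasons
```

When a buggy week inflates the rolling baseline, the dynamic check goes quiet and the static check still fires. On a healthy baseline the dynamic check catches moderate anomalies that sit below the static limit.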

Knife 2: context escape in sensitive-tool interception

Keyword-based scanning checks the user's request text for words like "send" and "email," but it fails on requests like this:

"Simulate sending a quote email to the client for preview, don't actually send it."

The scanner fires — user gets blocked — forced into manual flow. The agent's actual call chain only had preview_email, never send_email.

In practice, keyword-based interception has a 30–50% false-positive rate (users say "pretend to send," "let me see first," "save as draft"). Every false block erodes user trust. High false-positive rates drive users to bypass the system entirely: they copy the email into an external client and send it from there, which defeats the control.

Revised response (v1, at publication): Execution-time interception only. Block the agent at the point of tool invocation (send_email called = block; preview_email called = pass). Don't scan the user's request text. This sacrifices "early interception saves inference cost" but delivers zero false positives — the tool was either called or it wasn't, no ambiguity.

Revised response (v2, post-publication): The either/or framing in v1 drops a viable middle ground (raised by Nazar Boyko in the dev.to comments). Keep the request-text scan, but demote it to a soft signal that never blocks: scan fires → agent prompts user "this task looks like it ends in a send — confirm the plan before I spend steps on it." If the user says "actually send," the agent proceeds to the tool call where the hard gate still fires (zero FP). If the user says "just previewing," the agent routes to preview_email and never hits the gate.

The layered design takes both benefits v1 traded off:

Finding 4 preserved: soft signal fires before inference is spent, so the agent doesn't burn 4 steps before being stopped or redirected.
Knife 2's zero-FP-block preserved: the hard gate at tool invocation never false-positives.
Cost: extra UX friction on simulation requests — unavoidable, since the LLM itself can't reliably tell "simulate" from "real" either.
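The two layers can be sketched as below. The regex and the blocked-tool set are hypothetical and are not the pattern used in scripts/knife2-fp-rate-test.py; only the tool names `send_email` and `preview_email` come from the text.

```python
import re

# Hypothetical soft-signal pattern: sensitive verb roots plus any suffix.
SOFT_SIGNAL = re.compile(r"\b(send|submit|publish)\w*", re.IGNORECASE)
# Hypothetical hard-gate list of tools that require human confirmation.
BLOCKED_TOOLS = {"send_email"}

def soft_signal(request_text: str) -> bool:
    """Non-blocking hint: the request text mentions a sensitive action."""
    return SOFT_SIGNAL.search(request_text) is not None

def hard_gate(tool_name: str) -> bool:
    """Blocking check at tool invocation: True means stop and confirm."""
    return tool_name in BLOCKED_TOOLS

request = ("Simulate sending a quote email to the client for preview, "
           "don't actually send it.")
print(soft_signal(request))        # fires, so the agent asks the user to confirm
print(hard_gate("preview_email"))  # preview path never hits the gate
print(hard_gate("send_email"))     # real send is always gated
print(soft_signal("Prepare the submission package"))  # root "submit" misses "submission"
```

The soft signal is allowed to be wrong because it only asks a question. The hard gate decides on the tool name alone, so it cannot block a request that never calls a gated tool.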

Measured (post-publication): scripts/knife2-fp-rate-test.py (N=40, zero-LLM) verified the original "30-50% FP" claim. Coverage on FP-prone scenarios (simulate / draft / conditional / discussion): 95% (19/20 — the miss was "submission" not matching the "submit" regex root, itself a keyword-scan blind spot). Implied FP rate under 50/50 real/sim mix: 48.7% — within the claimed band. The FP mechanism is real; the layered design is the right answer.

Knife 3: the diff review "10 seconds" doesn't survive a real UI

The measured "50-word diff in 10 seconds" is pure reading time. In production, the reviewer's flow is:

See highlight → recall what the original said → think about context → judge whether the change introduces an error → click approve/reject

With context-switching overhead, real per-item time is 30–45 seconds. At 50 items/day: 25–37 minutes. Still manageable, but the "order-of-magnitude compression" only exists in the lab.

Revised estimate: Diff review time adjusted from "10 s/item" to "30 s (routine) / 90 s (deep review)." Impact on staffing: 0.3 FTE → 0.5 FTE. Not a collapse, but an honest correction.

Final honest table

Dimension	Original design	After all corrections
SPC cold start	Not addressed	Dual thresholds in parallel, robust to 4x drift
Sensitive-tool interception	Keyword scan (30-50% FP)	Layered: soft signal (request scan, non-blocking) + hard gate (tool invocation) — v2 post-pub
Diff review time	10 s	30-90 s (0.3 → 0.5 FTE)
Engineering cost	2 engineer-months	3.5 engineer-months
LLM dependency in control layers	None	None (verifiable, deterministic code throughout)

What remains are business decisions: accept "high risk = human"? accept "semantic traps caught by sampling, not prevention"? accept 30–90 second diff review cycles? These questions have no engineering answers — but the engineering baseline for answering them is now measurable.

"Don't judge correctness. Judge risk." — this isn't a smarter architecture. It's a more honest one. It doesn't claim to solve what it can't solve. It just makes the remaining manual work cheaper, faster, and less error-prone.

And after five rounds of measurement, falsification, self-correction, and reconstruction — that's as far as engineering can go. The rest is a business decision.

I designed a Harness to fix my agent's quality problem — then found 6 flaws in my own design

zxpmail — Mon, 06 Jul 2026 21:58:25 +0000

In my previous article ("I tested 3 models as AI agent quality inspectors: the stronger the model, the more valid work it rejects"), I measured three model tiers as agent output quality inspectors across 8 scenarios (4 valid, 4 garbage). The result was a clean precision-recall tradeoff:

qwen3:0.5b (weak model): 25% garbage pass-through, 50% false rejections
GLM-5.2 (strong model): 0% garbage pass-through, 75% false rejections

The honest conclusion: a quality gate isn't a solution — it's a risk-transfer layer. Each layer catches some failures and introduces new ones.

I didn't stop there. I asked myself: if you accept the human-in-the-loop cost and design a proper Harness — not an automatic fix, but a system that makes human review efficient — what does it look like?

I sketched a 4-module architecture:

Batch clustering: compress 750 flagged items into 100 groups by failure vector, review one representative per group
Closed-loop calibration: human verdicts → sample pool → scheduled few-shot updates → inspector gets smarter
Human as gold standard: final arbitration by a trained reviewer
Asynchronous batching: accumulate flagged items, review in batches

It looked complete. It looked like progress beyond the "it's all tradeoffs" conclusion.

Then I picked up the same ruler I used on the original production-agent articles, and measured this design.

Six flaws. Not one less.

Flaw 1: batch clustering — mathematically elegant, operationally dangerous

The proposal: "cluster 750 flagged items into 100 groups by failure vector; review one representative per group."

This assumes that "failure modes" can be correctly grouped by embedding clustering. But the 3 falsely-rejected scenarios from the GLM-5.2 experiment had three different reasons:

Scenario A (research brief): content too short (title + one sentence)
Scenario B (draft): missing structure (no chapter divisions)
Scenario C (chapter files): too many placeholders (all TODO)

These three have low embedding similarity — one about a research brief, one about a draft, one about file structure. They wouldn't cluster into the same group.

I ran a quick embedding experiment to confirm (Qwen3-embedding:0.6b on the actual 8 scenarios from Experiment E):

Pair	Cosine	Meaning
L1-L2 (brief vs draft, both valid)	0.610	Moderate
L1-L3 (brief vs chapter, both valid)	0.295	Unrelated
L2-L3 (draft vs chapter, both valid)	0.341	Unrelated
Average across all three	0.415	Won't cluster together

L1 and L3 share a cosine of 0.295 — they're both false rejects of the strong model, but that shared label doesn't pull two different topics closer in embedding space. The 750→100 compression ratio has no embedding basis. A practical clustering run would place these three into separate groups, each requiring its own review — the promised compression evaporates.

Worse: if the clustering algorithm accidentally groups 3 real-garbage items with 1 false-rejected valid item, and the reviewer clicks "approve" to correct the false rejection — all 3 garbage items slip through.

I found exactly this in the data: scenario L4 (valid test log, "42 passed") and scenario G4 (garbage zero-case log, "0 passed (no tests collected)") have a cosine of 0.861 — both are test-log formatted strings, and embedding reads format, not semantics. Clustering will absolutely group them together. A reviewer correcting L4's false rejection would simultaneously approve G4, pushing the false-positive rate back to 25% — the same as without the strong model.
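A small sketch of what those cosines imply for grouping, using only the four similarities reported above. `MERGE_THRESHOLD` is a hypothetical cutoff, and the `cosine` helper is included only to show how such a number is computed from two embedding vectors.

```python
import math
from typing import Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pairwise cosines reported in this section.
reported = {
    ("L1", "L2"): 0.610,
    ("L1", "L3"): 0.295,
    ("L2", "L3"): 0.341,
    ("L4", "G4"): 0.861,
}

MERGE_THRESHOLD = 0.8  # hypothetical: pairs at or above this land in one group

merged = [pair for pair, sim in reported.items() if sim >= MERGE_THRESHOLD]
valid_sims = [sim for pair, sim in reported.items() if "G4" not in pair]
average_valid = sum(valid_sims) / len(valid_sims)

print(merged)                   # the only pair that would merge
print(round(average_valid, 3))  # average across the three valid pairs
```

Under that cutoff, the three false rejects stay in separate groups while the valid and garbage test logs merge, which is the opposite of what the compression plan needs.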

Verdict: the clustering compression ratio has zero experimental support. It might reduce false rejections from 75% to 0% — but it could simultaneously raise the false-positive rate from 0% to 30%.

Flaw 2: closed-loop calibration — shifting the problem to "tuning"

The proposal: "human clicks → sample pool → scheduled few-shot updates → inspector gradually gets smarter."

This uses a second meta-LLM (the one that selects and formats few-shot examples) to correct the first LLM's bias. But that meta-LLM has the same temperature-0 instability — Experiment 2 showed open-ended output is only 70% consistent.

The "common patterns" it extracts from 100 valid samples might look like: "contains a heading," "length > 100 characters," "has paragraph breaks." If the business requirement changes tomorrow to "write a one-line summary," these historically learned patterns become a new source of false rejections.

A closed loop can only adapt to past data distribution. It cannot handle distribution shift.

More fundamentally: there is zero evidence that feeding more few-shot examples linearly reduces false-rejection rates. I tested this.

Setup: qwen3:0.5b, same 8 scenarios (4 valid + 4 garbage), N=5 runs each. Baseline: original prompt. Treatment: same prompt with 3 few-shot examples prepended (including "short but valid content → PASS").

Scenario	Baseline false-rejection rate	+Few-shot false-rejection rate	Change
L1 (brief, valid)	100%	40%	✅ improved
L2 (draft, valid)	0%	100%	❌ worse
L3 (chapter, valid)	80%	80%	=
L4 (test log, valid)	20%	100%	❌ worse
Aggregate false-rejection	50%	80%	❌
Garbage pass-through	20%	15%	—

L1 improved (the brief was exactly the kind of "short but valid" the examples taught). But L2 and L4, which were accepted at baseline in all or most runs (0% and 20% rejection), both jumped to 100% rejection. G2 (period character) went from 0% to 40% false positive: new holes opened. Few-shot is whack-a-mole: every fix trades off somewhere else.
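The aggregate row can be re-derived from the per-scenario rows. This is a small arithmetic sketch using the numbers in the table; the unweighted mean is an assumption that holds because every scenario has the same number of runs.

```python
# Per-scenario false-rejection rates (%) copied from the table above.
baseline = {"L1": 100, "L2": 0, "L3": 80, "L4": 20}
few_shot = {"L1": 40, "L2": 100, "L3": 80, "L4": 100}

def aggregate(rates):
    """Unweighted mean across scenarios (each scenario has the same N)."""
    return sum(rates.values()) / len(rates)

# Positive delta = more valid work rejected after adding few-shot examples.
delta = {name: few_shot[name] - baseline[name] for name in baseline}

print(aggregate(baseline), aggregate(few_shot))
print(delta)
```

One scenario improves by 60 points, two get worse by 80 and 100 points, and the aggregate moves in the wrong direction.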

You might feed 500 samples and GLM-5.2 still kills "short but valid" outputs. Its "strictness" bias is at the model-weight level — not something a few in-context examples can overwrite.

Verdict: I promised the closed loop would calibrate. That promise rests on an unvalidated assumption — that LLM bias is correctable through in-context examples. Experiment 2 already showed that temperature 0 is fundamentally unstable; adding few-shot just adds another layer of instability.

Flaw 3: the reviewer is the "gold standard" — the most subtle lie

Every human-in-the-loop solution has a silent assumption: humans don't make mistakes.

Reviewer fatigue: on item #100 of "TODO" and item #101 of "I am a little duck quack quack," they might misclick
Standard drift: strict in the morning, lenient in the afternoon (because it's almost quitting time)
UI bias: if "approve" is on the left and "reject" on the right, click-position alone may bias decisions

If human misjudgment is 5% (optimistic), then "human review" introduces 5% label noise. That noise flows back through the closed loop, contaminating the sample pools, and poisoning the few-shot examples the quality inspector learns from.

The honest question is: "who judges the reviewer's judgment?" It's an infinite regress. My design was silent on this.

Flaw 4: the fatal synchronous-vs-asynchronous blind spot

My design assumed tasks can be accumulated and reviewed in batches. That works for data exports, report generation, and other asynchronous jobs.

But most agent scenarios are synchronous — customer support, coding assistants. The user asks a question, the agent takes 3 seconds to respond, the quality inspector flags it as "uncertain" and puts it in the human queue — and the user is still waiting in the chat window.

Batch review means: how long does the user wait? 5 minutes? 1 hour? This turns a real-time assistant into a ticket system.

I didn't distinguish synchronous from asynchronous. I applied one architecture to both. This is a product-design-level omission.

Flaw 5: engineering cost vs. benefit — the biggest hole

I ran the numbers: "750 items → 100 groups → 1 reviewer."

What I didn't cost out was building the Harness itself:

Evidence-trace visualization frontend: 2 engineer-months
Clustering + vector-search backend: 1 engineer-month
Closed-loop feedback pipeline: 1 engineer-month
ICU dashboard + monitoring: 1 engineer-month

Total: 5 engineer-months. At typical dev cost, that's roughly $75K.

What does it save? (7.5 reviewers − 1 reviewer) = 6.5 reviewer salaries. At ~$40K/year each, about $21K/month saved.

Break-even: $75K ÷ $21K ≈ 3.5 months.
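The same arithmetic as a short sketch, so the inputs can be swapped for your own. All figures are the ones quoted above; the function name `break_even_months` is just an illustrative label.

```python
def break_even_months(build_cost, reviewers_saved, salary_per_year):
    """Months until cumulative salary savings cover the one-off build cost."""
    monthly_savings = reviewers_saved * salary_per_year / 12
    return build_cost / monthly_savings

engineer_months = 2 + 1 + 1 + 1       # frontend + backend + pipeline + dashboard
build_cost = 75_000                   # the estimate above for those 5 engineer-months
reviewers_saved = 7.5 - 1             # 6.5 reviewers
monthly_savings = reviewers_saved * 40_000 / 12

print(engineer_months)                                                   # 5
print(round(monthly_savings))                                            # 21667
print(round(break_even_months(build_cost, reviewers_saved, 40_000), 1))  # 3.5
```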

I built a sensitivity matrix across DAU, false-rejection rate (FRR), and review speed ($40K/yr per reviewer, $75K investment):

DAU	FRR	Daily false rejects	Headcount (w/o system)	Headcount (with system)	Break-even
100	25%	125	0.8	0.3	38 months
100	75%	375	2.3	0.8	12 months
300	50%	750	4.7	1.6	6 months
1000	25%	1,250	7.8	2.6	4 months
1000	50%	2,500	15.6	5.2	2 months
1000	75%	3,750	23.4	7.8	≈1 month

The "3.5 months" claim only holds at the extreme: DAU=1000, FRR=75%, 30-seconds/review. Drop DAU to 100, break‑even jumps to 34–38 months — cheaper to just hire.
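The first three numeric columns of the matrix can be reproduced with a few lines, which makes it easy to plug in your own scale. The three constants below are assumptions inferred from the table's ratios, not values stated in the text: 5 tasks per active user per day, 160 reviews per reviewer per day without the system, and 480 with it. The break-even column is left out because it also depends on how savings are costed.

```python
TASKS_PER_USER_PER_DAY = 5      # assumption: inferred from 100 DAU x 25% -> 125
REVIEWS_PER_DAY_MANUAL = 160    # assumption: inferred from the "w/o system" column
REVIEWS_PER_DAY_SYSTEM = 480    # assumption: inferred from the "with system" column

def headcount_row(dau, frr):
    """Return (daily false rejects, headcount without system, headcount with system)."""
    daily_false_rejects = dau * TASKS_PER_USER_PER_DAY * frr
    without_system = daily_false_rejects / REVIEWS_PER_DAY_MANUAL
    with_system = daily_false_rejects / REVIEWS_PER_DAY_SYSTEM
    return round(daily_false_rejects), round(without_system, 1), round(with_system, 1)

for dau, frr in [(100, 0.25), (100, 0.75), (300, 0.50),
                 (1000, 0.25), (1000, 0.50), (1000, 0.75)]:
    print(dau, frr, headcount_row(dau, frr))
```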

A harsher test: the false-rejection rate itself decays over time. If GLM-5.2's next update drops FRR from 75% to 40% (not unlikely):

Daily false rejects: 3,750 → 2,000
Headcount (with system): 7.8 → 4.2
Monthly savings: $21K → $12K
Break-even: 1 month → 2 months → 4 months

FRR halves; break-even quadruples. Model updates are the norm, not an exception.

"The problem will persist" is the most convenient and least-validated assumption in the entire design.

Flaw 6: "15 seconds vs. 3 minutes" — a fabricated efficiency claim

I wrote: "with the Harness, review time drops from 3 minutes to 15 seconds."

This number is completely made up. I constructed three realistic agent execution traces and measured reading time at a conservative rate of 250 words per minute:

Trace scale	Characters	Minimum reading time	vs "15 seconds"
Simple (3 steps, 1 task)	332	21 seconds	+6s
Medium (12 steps, 3 subtasks)	1,154	48 seconds	+33s
Complex (28 steps, full pipeline)	1,110	44 seconds	+29s

Even the simplest trace takes 21 seconds, 40% over the claim. Real production traces (12–28 steps) take 44–48 seconds, roughly 3x the "15 seconds." If I compress the trace into a summary, the summary itself loses information, and information loss drives misjudgment.
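The gap between the claim and the measurements is plain arithmetic. A small sketch using the reading times from the table:

```python
CLAIMED_SECONDS = 15

# Minimum reading times (seconds) from the table above.
measured = {"simple": 21, "medium": 48, "complex": 44}

for name, seconds in measured.items():
    ratio = seconds / CLAIMED_SECONDS
    print(f"{name}: +{seconds - CLAIMED_SECONDS}s, {ratio:.1f}x the claim")
```

This prints 1.4x for the simple trace and about 3x for the other two.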

I ran zero user tests. I just picked "15 seconds" to make the design look sexy. This is the same marketing rhetoric as the Rust blog's "80% decided by code" claim.

Honest revision: if I rewrote this design from scratch

I would not propose a "4-module Harness" architecture. I would write:

State the boundary first: this Harness only applies to asynchronous, non-real-time, high-value tasks. For real-time conversations, skip all clustering — do "confidence < 0.9 → transfer to human," nothing fancy.

Give a cost matrix: a table of "DAU vs. false-rejection rate vs. engineering investment," so the reader can judge whether it's worth building for their scale. Not a single pre-cooked "1 reviewer handles it."

Admit that humans also misjudge: add a "reviewer consistency check" — randomly assign the same item to two reviewers; if they disagree, escalate to a third. State the cost of this explicitly.

Delete "15 seconds": replace with "review time depends on task complexity — must be measured in production."
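A minimal sketch of the reviewer consistency check from the list above: two randomly chosen reviewers see the same item, and a third is consulted only when they disagree. The reviewer names and their decision rules are hypothetical stand-ins. The second return value makes the extra cost explicit: every item costs at least two reviews.

```python
import random

def consistency_check(item, reviewers, rng):
    """Ask two randomly chosen reviewers; escalate to a third if they disagree.

    `reviewers` maps a reviewer name to a function item -> "approve" | "reject".
    Returns (final verdict, number of reviews spent on the item).
    """
    first, second, third = rng.sample(sorted(reviewers), 3)
    verdict_a = reviewers[first](item)
    verdict_b = reviewers[second](item)
    if verdict_a == verdict_b:
        return verdict_a, 2
    # Disagreement: the third reviewer breaks the tie.
    return reviewers[third](item), 3

def careful(item):
    return "reject" if item == "TODO" else "approve"

def tired(item):
    return "approve"  # hypothetical fatigued reviewer who approves everything

reviewers = {"ana": careful, "ben": careful, "cat": tired}

rng = random.Random(0)
print(consistency_check("TODO", reviewers, rng))
print(consistency_check("42 passed", reviewers, rng))
```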

Final self-assessment

My "human-in-the-loop Harness" proposal was more honest than the Rust blog — it acknowledged tradeoffs and costs. But it wasn't honest enough. After acknowledging the costs, it quietly dissolved them with a new set of unvalidated architectural promises — clustering compression, closed-loop calibration, 15-second decisions.

The same line I used against the original articles applies to my own design:

"Treating 'decided' as 'decided correctly' is a rhetorical trap."

I treated "architecture diagram drawn" as "problem solved." It's the same rhetorical move in a different suit.

The hard conclusion remains:

Under the current stack, semantic correctness has no engineering solution. A Harness can make "human intervention" more efficient and more observable — but it cannot eliminate it. Any proposal that claims to "dramatically reduce human cost" needs at least 3 months of online A/B testing validation — not an architecture diagram.

Six articles. One ruler.

Part 1: measured the genre's "determinism" claims — all three illusions, data-falsified
Part 2: measured my own "embedding upgrade" — same disease, also failed
Part 3: measured three model tiers — not a solution, a precision-recall tradeoff
This one: measured my own architectural design — "architecture drawn" ≠ "problem solved"

The ruler went full circle and measured me three times, each pass sharper than the last.

This isn't "I was right." It's "every time I thought I was done, the ruler showed me I wasn't."

I tested 3 models as AI agent quality inspectors: the stronger the model, the more valid work it rejects

zxpmail — Sun, 05 Jul 2026 22:06:00 +0000

In my previous article ("I tested the 'deterministic agent loop' claims with four experiments"), I tested the three "determinism" pillars that popular production-agent articles claim:

Lexical overlap as a proxy for semantic continuity — 50% misclassification.
Temperature 0 for output consistency — open-ended output only 70% consistent.
Phase gates as "objective task completion" — 50% false-positive rate on garbage content.

And I ended with a fourth experiment for humility: my proposed "upgrade" — swapping out the lexical-overlap threshold for an embedding model — also failed. Qwen3-embedding couldn't separate synonymy from antonymy (cosine diff: 0.026).

The honest conclusion I landed on: under the current stack, this problem has no clean engineering fix.

But the most common pushback I got was: "You used qwen3:0.5b — a 0.5B parameter model. Of course it fails. Try a real model."

Fair. I tried three tiers of model as the quality inspector, same 8 scenarios, same prompt template.

What I found made the conclusion more nuanced, not less. The data reveals a tradeoff that "just use a stronger model" doesn't solve.

Experiment design

Same 8 scenarios from Experiment 3 (the phase-gate test):

4 valid outputs — research brief, first draft, chapter files, test log (42 passed)
4 garbage outputs — "I am a little duck, quack quack", a period, "TODO", "0 passed (no tests collected), exit 0"

The phase gate (string-level checks) waves all 8 through — 50% false-positive rate on garbage, by design.

The question: can you add an LLM quality inspector on top to catch the garbage? And what does it cost?

I ran each scenario 3 times per model (majority vote) to smooth out the temp-0 sampling noise documented in Experiment 2.
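A minimal sketch of that voting step, assuming each run yields a PASS or FAIL string. The verdicts below are hypothetical, and counting a scenario as "unstable" whenever its runs disagree is my reading of the vote-instability column, not a definition taken from the script.

```python
from collections import Counter

def majority_vote(verdicts):
    """Return (winning verdict, whether the runs disagreed) for one scenario."""
    counts = Counter(verdicts)
    winner, _ = counts.most_common(1)[0]
    unstable = len(counts) > 1
    return winner, unstable

# Hypothetical verdicts for two scenarios, three runs each.
runs = {
    "L4": ["PASS", "PASS", "PASS"],
    "G4": ["PASS", "FAIL", "PASS"],
}

results = {name: majority_vote(verdicts) for name, verdicts in runs.items()}
unstable_count = sum(unstable for _, unstable in results.values())
print(results)
print(f"vote instability: {unstable_count}/{len(runs)}")
```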

Three model tiers, measured

Inspector	False-positive rate	False rejection of valid work	Vote instability
(None, phase gate only)	50%	0%	—
qwen3:0.5b (local, 0.5B)	25%	50%	2/8
gemma3:latest (local, 4.3B)	25%	50%	0/8
GLM-5.2 (API, ~hundreds of B)	0%	75%	0/8

Finding 1: false-positive rate drops monotonically with model size

qwen3 and gemma3 both hit the same ceiling: 25%. (This qwen3 is the chat model qwen3:0.5b, not the embedding model qwen3-embedding from Experiment 4.) They both fail on the same case, G4:

Task: Run the test script, ensure the tests actually pass
Output: exit 0, 0 passed (no tests collected)

Both smaller models see "exit 0" + "passed" and let it through. GLM-5.2 understands the semantics: "No tests collected" means no tests actually ran.

This is a genuine model-size discontinuity. The weak models are fooled by a surface-level "passed" signal; the strong model reads the actual meaning.

Finding 2: the cost is a surge in false rejections

GLM-5.2 drops the false-positive rate to 0%. Sounds like a solution.

But it rejected 3 out of 4 perfectly valid outputs.

Research brief (valid, has content) → rejected
First draft (valid, ~2000 words) → rejected
Chapter files (valid, contains content) → rejected
Test log (valid, "42 passed") → passed

Only the test log with explicit pass counts gets through. Everything else is flagged as "insufficient."

This isn't a bug. The strong model is simply following instructions strictly — the output has to clearly demonstrate it meets the task requirement. Anything that reads as a sketch or fragment gets killed.

What's really happening: a precision-recall tradeoff

Put the two columns together and the pattern is clear:

Weak model: lets garbage through (high false positives), but doesn't over-reject legitimate work
Strong model: catches all garbage (zero false positives), but rejects most legitimate work too

This is a precision-recall tradeoff, not a solution. The model isn't "solving" the semantic problem; it's choosing a position on the curve. A quality gate that catches everything can trivially achieve 0% false positives — by rejecting everything.
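A short sketch of the two rates side by side, computed per class (garbage passed out of all garbage, valid work rejected out of all valid work). The `strong_model` verdicts mirror the GLM-5.2 row; `reject_everything` is the degenerate gate described above.

```python
def gate_rates(verdicts, labels):
    """Return (false-positive rate, false-rejection rate) for a quality gate.

    verdicts[i] is True when the gate passes item i;
    labels[i] is True when item i is actually valid.
    """
    garbage = [v for v, ok in zip(verdicts, labels) if not ok]
    valid = [v for v, ok in zip(verdicts, labels) if ok]
    false_positive_rate = sum(garbage) / len(garbage)        # garbage that passed
    false_rejection_rate = valid.count(False) / len(valid)   # valid work rejected
    return false_positive_rate, false_rejection_rate

labels = [True] * 4 + [False] * 4                           # 4 valid, then 4 garbage

strong_model = [False, False, False, True] + [False] * 4    # only the test log passes
reject_everything = [False] * 8                             # degenerate gate

print("strong model:", gate_rates(strong_model, labels))
print("reject everything:", gate_rates(reject_everything, labels))
```

Both gates score 0% on the first number. Only the second number tells them apart.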

The "0% false positive" mirage

This also explains something I wrote earlier. I previously had a note that "DeepSeek achieved 0% false positive rate on this test" and concluded the problem was solved.

I was looking at the wrong metric.

0% false positive looks great. But without looking at the false-rejection rate alongside it, it's the exact mirror of the original articles' error: they treated "file exists" as "task complete"; I was treating "no garbage slipped through" as "quality gate works."

A quality gate's job isn't just to keep garbage out — it's to keep good work in. The "0%" number masked the fact that the strong model was rejecting 75% of valid outputs.

Honest revision of the conclusion

My previous article said: the quality inspector just shifts the problem up one layer.

That was too harsh. The data shows the inspector does reduce false positives: from 50% to 0% with a strong model. But it's not a fix; it's a cost transfer. In this run, catching all four garbage outputs cost three false rejections of valid work.

A more precise model of how this works:

Phase gate (free, leaks 50%) → LLM quality gate (reduces false positives, but introduces false rejections) → Human review (catches the false rejections)

No single layer "solves" the problem. Each layer transfers the remaining uncertainty to the next. The honest design is a chain of risk transfer, not a stack of deterministic guarantees.

And the practical implication: if you add an LLM quality gate, you must budget for the human time to review false rejections. The stronger the model, the more you'll pay in flags that turn out to be false alarms.
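The chain can be sketched as three functions where each layer hands its leftover uncertainty to the next. The three checks below are hypothetical stand-ins chosen to mimic the behavior described in this article, not real gate implementations.

```python
def run_chain(output, phase_gate, llm_gate, human_review):
    """Pass an output through the three layers; return (accepted, deciding layer)."""
    if not phase_gate(output):
        return False, "phase gate"
    if llm_gate(output):
        return True, "llm gate"
    # The LLM gate rejected: the remaining uncertainty moves to a human.
    return human_review(output), "human review"

def phase_gate(output):
    return len(output) > 0              # string-level check only

def llm_gate(output):
    return "42 passed" in output        # strict: passes very little

def human_review(output):
    return output not in {".", "TODO"}  # actually reads the content

for output in ["42 passed", "A 2000-word first draft ...", "TODO"]:
    print(repr(output), run_chain(output, phase_gate, llm_gate, human_review))
```

The valid draft is only accepted because a human sits behind the LLM gate, which is the review time the paragraph above says to budget for.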

What this means for production

If you're building an agent loop with output verification:

Phase gates catch nothing on content. They're cheap, but they buy you zero quality signal. Expect 50%+ garbage pass-through.
A small-model quality gate (≤4B) catches some obvious garbage but misses subtle cases. Your false-positive rate drops from 50% to ~25%, but you'll false-reject ~50% of real work.
A strong-model quality gate (API-grade) catches everything — including edge cases small models miss. Your false-positive rate hits 0%. But you'll false-reject ~75% of real work. Budget human review accordingly.
The metric that matters is the full confusion matrix, not a single column. Anyone advertising "0% false positives" without showing false-rejection rates is selling the same oversimplification they claim to fix.

Reproducible script

The experiment script is parameterized for multi-model comparison:

Repo: github.com/zxpmail/blog → agent-determinism-illusions/scripts → harness-verify-test.py

Set environment variables to switch models:

VERIFY_MODEL=qwen3:0.5b (local Ollama, default)
VERIFY_MODEL=gemma3:latest (local Ollama)
VERIFY_MODEL=glm-5.2 with VERIFY_BASE_URL and VERIFY_API_KEY (API)

Each model runs the same 8 scenarios × N iterations (default 3, majority vote). Swap in your own valid and garbage samples.

I wrote the first article to measure a popular genre's determinism claims. The second to catch myself proposing the same kind of oversimplified fix. This third piece corrects both: the truth isn't "no solution" or "just use a bigger model" — it's "there's a tradeoff, and you have to pick where to hurt."

Same ruler, one more measurement.

I tested the 'deterministic agent loop' claims with four experiments. They all failed — including my own fix.

zxpmail — Sat, 04 Jul 2026 23:58:26 +0000

A certain genre of "production-grade AI agent" article has been making the rounds. You know the shape: it argues that ReAct loops break in production, so you have to stack deterministic constraints on top of the LLM's uncertainty — a pre-AL gate, an LLM-as-Judge at temperature 0, a phase gate, a decision state machine. The one I have in mind claims 7000+ lines of production Rust.

The direction is right. Agent loops do need engineering guardrails; you can't let the LLM declare victory on its own. Pulling "self-contained agents" out of academic fantasy and toward engineering reality is a valuable move.

The problem is the repeated use of phrases like "deterministic," "objective fact," and "code vetoes the LLM" to manufacture confidence. Do those claims actually hold up?

I didn't argue. I ran four experiments. Conclusion: each of the three core mechanisms it uses to establish "determinism" is only formally deterministic — all of them fail at the semantic layer. And the "upgrade" I prepared to fix them failed too.

Here's the data.

Fair credit first

The most valuable thing in this genre is the problem awareness. Three real defects of bare ReAct loops: no termination condition, no interrupt handling, no idle-loop protection. The proposed direction — wrap the LLM's uncertainty in deterministic constraints — is correct.

The problem isn't the direction. It's the landing. These articles treat three specific mechanisms as solved answers, and their actual behavior doesn't survive measurement.

I tested exactly these three:

Lexical-overlap thresholds — deciding whether a user interjection is a new task or an addendum
Temperature-0 evaluators — deciding whether the agent is done
Phase gates — deciding whether task completion is an "objective fact"

Three experiments, all using the methods and parameters the articles themselves describe, falsifying the articles' own claims.

Illusion 1: lexical overlap = semantics?

Mid-loop on turn 5 the user interjects: "actually, change it to X." Is this an addendum to the old task, or a brand-new task?

The proposed fix: compute a "lexical overlap" score with two fixed thresholds — ≥0.24 means same task, ≤0.08 means new task, with the middle sent to the LLM. The claim is "80% decided by code, instantly."

Sounds engineering-grade. But lexical overlap reads characters, not meaning. I built 30 labeled pairs, applied its thresholds, ran three tokenizers.

Result: 50% hard misclassification.

The worst cases:

Current task: "continue writing the loop-engine article"
User interjects: "delete the loop-engine article"
Overlap 0.615 → judged same task

The user said delete; the engine decides "same as writing," and keeps writing. A reverse operation is treated as a continuation. This is incident-grade.

Current task: "fix the checkout bug"
User interjects: "the payment page is throwing, can you look"
Overlap 0.000 → judged new task

Any human sees one task. Jaccard gives 0. Paraphrase fails entirely — 6/6 wrong. Cross-lingual is worse: 6 same-task EN/ZH pairs all score 0.000, all judged new. In any bilingual shop this mechanism collapses on contact.
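A minimal sketch of the two-threshold rule, using word-level Jaccard. The tokenizer and the tiny stopword list are assumptions, so the scores will not match the figures above (those came from different tokenizers), but the two verdicts come out the same way.

```python
import re

SAME_TASK_MIN = 0.24   # thresholds quoted above
NEW_TASK_MAX = 0.08
STOPWORDS = {"the", "a", "is", "can", "you"}   # assumption: tiny illustrative list

def tokens(text):
    """Lowercase word tokens (hyphens kept), minus stopwords."""
    return set(re.findall(r"[a-z0-9-]+", text.lower())) - STOPWORDS

def jaccard(a, b):
    """Size of the token intersection divided by the size of the token union."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def classify(current_task, interjection):
    score = jaccard(current_task, interjection)
    if score >= SAME_TASK_MIN:
        return "same task"
    if score <= NEW_TASK_MAX:
        return "new task"
    return "ask LLM"

print(classify("continue writing the loop-engine article",
               "delete the loop-engine article"))
print(classify("fix the checkout bug",
               "the payment page is throwing, can you look"))
```

The delete request is judged "same task" because it shares the topic words, and the paraphrase is judged "new task" because it shares none.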

A defender might say: "code makes a call in 90% of cases, above the 80% we promised."

That's a bait-and-switch. The implicit promise of "80% decided by code" is "80% decided correctly." The reality: code issues a verdict in 27 cases and gets 12 right — 44% accuracy.

Treating "decided" as "decided correctly" is the most dangerous rhetorical move in the whole design.

The thresholds only work on easy samples (high-overlap same-task, low-overlap new-task): 12/12 correct. The three "common but hard" categories — paraphrase, cross-lingual, antonym — go 0/16. Strongly suggests the thresholds were tuned on the easy set. Any non-trivial sample distribution breaks them immediately.

Illusion 2: temperature 0 = determinism?

The article sets the evaluator to temperature 0.0, "output almost entirely determined," because "for the same input, the evaluation should be as consistent as possible."

This is testable in one sentence: same prompt, temperature 0, run it 20 times, check consistency.

I ran three prompt categories on GLM-5.2, 20 runs each.

Result: open-ended output is only 70% consistent; 30% diverges.

Prompt type	Exact-match rate	Distinct versions
Math (most stable)	100%	1
Structured listing	95%	2
Open-ended creative	70%	5

The open-ended row is the killer. Same prompt, temperature 0, 20 runs, 5 different versions, with a lowest pairwise similarity of 0.198:

"Always head Northbound for your daily cup of exceptional coffee."
"Premium coffee for the journey ahead."

Almost no shared characters. And the LLM-as-Judge evaluator outputs exactly this kind of open text — done / phase_done / reason / evidence.

The article says "the evaluator isn't creative writing, it's judgment, so temperature must be 0." But the evaluator's reason and evidence fields are inherently open; measured divergence is on the same order as creative prompts.

Even "structured listing" is unstable: five adjectives in a different order. If evidence is a list and the order changes, downstream JSON changes, the decision changes.

The only 100%-deterministic case is "17×23=391." Which proves the rule: temperature-0 determinism holds only when the answer space is razor-thin. The moment the output has any openness, determinism breaks. Treating a narrow special case as a universal property is overgeneralization.
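A sketch of how the two table columns can be computed from a list of outputs. I'm assuming "exact-match rate" means the share of runs identical to the most common output. The sample is hypothetical and only shaped like the open-ended row (14 identical runs, 6 spread over four variants).

```python
from collections import Counter

def consistency(outputs):
    """Return (share of runs matching the most common output, distinct versions)."""
    counts = Counter(outputs)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(outputs), len(counts)

# Hypothetical 20-run sample: placeholders stand in for real model outputs.
outputs = ["v1"] * 14 + ["v2"] * 3 + ["v3", "v4", "v5"]
print(consistency(outputs))           # (0.7, 5)
print(consistency(["391"] * 20))      # (1.0, 1)
```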

Evaluator reproducibility is the foundation of the entire loop engine. Unstable evaluation → unstable done signal to the phase gate → unstable decision state machine. The foundation shakes, and ten layers of "deterministic constraints" stacked on top are standing on a shaking base.

(Only tested one provider, GLM-5.2. But the article's claim is universal, so single-provider falsification suffices. OpenAI's temp-0 non-determinism is documented and independently confirmed; more providers would only strengthen this.)

Illusion 3: phase gate = task completion?

The most confident line in the genre: "task completion, transformed from an LLM's self-claim into a verifiable objective fact."

The phase gate checks four things: did the script exit 0, does the file exist, is the file count met, is there a user-confirmation record. All in code, all checking "objective facts."

The problem — these checks verify that an action happened, not that the result is correct.

I implemented the phase gate per the article's description and built 8 scenarios: 4 with correct content, 4 with garbage content that still satisfies the gate.

Result: 100% gate pass rate, 50% content correctness, 50% false-positive rate.

The four false positives, in their own words:

Task	Actual output	Gate verdict
Write a research brief	"I am a little duck, quack quack."	✅ pass → "complete"
Draft covering ≥3 mechanisms	"." (a single period)	✅ pass → "complete"
Generate 3 chapter files	3 files containing "TODO"	✅ pass → "complete"
Run the tests	`0 passed (no tests collected)`, exit 0	✅ pass → "complete"

A duck, a period, TODO, zero test cases — the phase gate waves all of them through. It has zero discrimination on content correctness.

This isn't an implementation bug. The four checks it describes don't read content by construction; any faithful implementation has the same blind spot. Exit 0 means the process didn't crash, not that the result is right. File-exists means the path is there, not that the content meets the requirement.
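A minimal sketch of a gate built from those four checks, to show the blind spot concretely. The function signature and file names are hypothetical. Note that no line of it opens a file.

```python
import tempfile
from pathlib import Path

def phase_gate(exit_code, output_dir, expected_files, user_confirmed):
    """The four checks described above; none of them reads file content."""
    files = [p for p in Path(output_dir).iterdir() if p.is_file()]
    return (
        exit_code == 0                     # the script did not crash
        and len(files) > 0                 # a file exists
        and len(files) >= expected_files   # the file count is met
        and user_confirmed                 # a confirmation record exists
    )

with tempfile.TemporaryDirectory() as output_dir:
    # "Generate 3 chapter files": three files that contain only "TODO".
    for i in range(1, 4):
        Path(output_dir, f"chapter{i}.md").write_text("TODO")
    passed = phase_gate(0, output_dir, expected_files=3, user_confirmed=True)

print(passed)  # True
```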

Packaging "file exists / script ran" as "task complete" is an over-extension of the claim. The truth: the phase gate turns "an action happened" into an objective fact. It does not turn "the task is done" into an objective fact. Between those two lies a semantic gap it cannot cross.

That gap is called content quality — which is exactly what production users care most about.

Three pillars, all cracked

The genre's thesis sentence: "stack deterministic constraints on top of the LLM's uncertainty."

Now all three "determinisms" are punched through by measurement:

Pillar	Article claim	Measured	Status
Lexical overlap = semantics	"80% decided by code"	50% misclassified, 44% accuracy	❌
Temperature 0 = determinism	"almost entirely determined"	Open output 70% consistent	❌
Phase gate = task completion	"verifiable objective fact"	50% false positives	❌

All three foundation layers leak. The ten layers of constraints above stand on a leaking base.

The 7000 lines of Rust are probably real. But they guard the symbolic layer — string matching, file paths, exit codes. The semantic layer (intent, content, quality) is still running naked.

Why this genre goes viral

It lands precisely on the anxiety of readers who've built a demo but never hit production. To someone who hasn't run an LLM system in production, the mechanism pile feels heavyweight and authoritative — they haven't seen these practices, and don't know they fail at the semantic layer.

Anyone who has run production reads it and thinks "the names are nicer than the contents": Pre-AL gate is prompt-injected state, temperature-0 LLM-as-Judge is evaluator hygiene, "determinism-first" is try/catch plus string matching, phase gate is validation logic, ten priority levels are an if-else chain. Every mechanism is correct and worth doing — but naming each one with a proprietary term to manufacture the impression of "an original framework" is rebranding, not innovation.

The harder wound: these articles open with "not pseudocode, not a concept diagram," then deliver zero lines of real code — only function names, constants, parameter values. Those are identifiers, not code. The promise isn't kept.

And the thing repeatedly cited as evidence of "production-grade" — "7000+ lines" — appears three times. Line count is the worst proxy for quality. A system that actually runs in production should produce SLO data, postmortems, load-test curves — not line counts.

Fourth cut: I lied too

The first three cuts target the genre's three pillars of "determinism." Data speaks; all three break.

But I have to be honest here: I had a "constructive upgrade" ready behind those three cuts — embedding to upgrade lexical overlap, multi-vote to patch temperature 0, a second LLM to backstop the phase gate. I thought it would lift the article from "criticism" to "construction."

I was wrong. That proposal has the same disease as the articles it criticizes: using complicated engineering to fake a semantic solution.

I ran an experiment to convince myself. Not on the target — on my own proposal. I used Qwen3-embedding:0.6b (a real neural embedding model, 1024 dimensions) on the exact same synonymy-vs-antonymy separation test.

Result:

Category	Mean	Min	Max
Synonyms (should be high)	0.766	0.490	0.977
Antonyms (should be mid-low)	0.739	0.582	0.881
Unrelated (should be low)	0.326	0.237	0.404

Synonyms (0.766) and antonyms (0.739) differ by 0.026 — too close to separate.

"optimize code performance" vs "don't optimize code performance" — cosine 0.881, higher than 10 of the 12 synonym pairs.

"build a login-registration feature" vs "add the account-auth piece" (these are synonyms) — cosine 0.490, lower than nearly every antonym pair.

The only separation a neural embedding can do is "related vs unrelated" — synonyms/antonyms both sit around 0.75, unrelated drops to 0.326. But the moment the topic is the same and the direction is opposite, embedding fails exactly like Jaccard.
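Whether a single cutoff can split two categories comes down to whether their score ranges overlap. A small sketch using the min/max values from the table; the `cosine` helper is the standard formula, and `separable` is an illustrative name.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def separable(high_group, low_group):
    """True if one threshold puts every high_group score above every low_group score."""
    return min(high_group) > max(low_group)

# (min, max) cosine per category, from the table above.
synonyms = (0.490, 0.977)
antonyms = (0.582, 0.881)
unrelated = (0.237, 0.404)

print(separable(synonyms, antonyms))    # False: the ranges overlap
print(separable(antonyms, unrelated))   # True: related vs unrelated splits cleanly
print(round(cosine([1.0, 0.0], [1.0, 1.0]), 3))
```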

So the entire separation chain — characters to statistics to neural vectors — fails by measurement:

Jaccard (Exp 1): 50% misclassified. Cannot separate.
TF-IDF char 2-gram: synonyms 0.072, antonyms 0.222 — direction reversed. Fails.
Qwen3-embedding (Exp 4): synonyms 0.766, antonyms 0.739, diff 0.026. Fails.

My "embedding upgrade" doesn't survive this data. I'm deleting it and replacing it with the honest version.

Honest conclusion: under the current stack, this problem has no engineering solution

The genre's three "determinism" pillars all collapse. My attempt to patch them with embedding, multi-vote, and a second LLM also fails:

Embedding cannot separate synonymy from antonymy — same topic, opposite direction produces near-identical vectors.
A second LLM doesn't fix the first one's unreliability — the inspector itself hallucinates; it just shifts the problem up one layer.

So: when a user interjects something directionally ambiguous (new task or addendum? same direction or opposite?) into the current topic, engineering should not let an algorithm decide unilaterally. Detect topic overlap, then ask the human. Don't auto-adjudicate.

This isn't cowardice. It's an honest choice of objective function: correctness outranks autonomy. If you want an unattended autonomous agent — neither the genre's design nor mine gets you there today. If you must guarantee no misclassification — human confirmation is the only known strategy.

"LLM does symbolic-layer work; humans override on semantic judgment" isn't sexy. But it doesn't lie.

The question to ask before implementing

If you read one of these articles and are about to build a similar system, ask yourself first:

Can your task's output be objectively verified for correctness — not just existence?

If "no" (most content-generation, analysis, and conversational tasks are no), most of the genre's design doesn't apply to you. You need strong human review, cross-model verification, and user-feedback loops — not file-existence checks.

If "yes," still re-tune the parameters yourself, redesign the acceptance criteria, and reserve plenty of human-fallback channels.

Don't copy 0.24/0.08. Don't trust temperature 0 to give you determinism. Don't assume a passed phase gate means the task is done. Don't assume swapping in an embedding model buys you semantics.

Each of those four "don'ts" has measured data behind it.

Reproducible scripts

All four scripts are public, one-click runnable, no cherry-picking. Swap in your own business data and rerun.

Repo: github.com/zxpmail/blog → agent-determinism-illusions/scripts:

Exp 1 (local, no API): lexical-overlap-test.py — 30 labeled pairs against the 0.24/0.08 thresholds
Exp 2 (needs API): temp0-determinism-test.py — same prompt × 20 runs, temperature 0
Exp 3 (local, no API): phasegate-formalism-test.py — duck / period / TODO / zero-tests false positives
Exp 4 (needs Ollama + Qwen3): embedding-semantic-test.py — synonymy/antonymy separation

If your business data produces a materially lower error rate than mine, tell me — it means the mechanism holds in some domain, and I'll update the conclusion.

The original target was a viral tech article. But the same standard turns back on me: does my critique survive the three criteria — constraint, data, reproducibility? All four scripts are public; anyone can swap samples and rerun. Being measurable by the ruler you hand out is the honesty technical criticism deserves.

What Google's "Microservices Are Dead" Paper Actually Said (And What It Missed About AI)

zxpmail — Sat, 04 Jul 2026 09:24:11 +0000

A 2023 HotOS paper whose authors include Sanjay Ghemawat (MapReduce/Bigtable co-author) and Amin Vahdat (Google Fellow) got repackaged by tech media as "microservices are dead." It said no such thing. Three years later, the misreading has traveled further than the paper itself.

This post does three things: reconstructs what the paper actually claims, maps its three structural gaps, and introduces a variable the authors couldn't have predicted — AI code generation — which, I'll argue, undermines the paper's central solution more than any of those gaps.

The AI section uses my own open-source project ReqForge as evidence. Flagging the conflict of interest up front: this isn't neutral analysis, it's a design rationale. Which is exactly why it's more honest than a hypothetical example.

What the paper actually said

The paper is Towards Modern Development of Cloud Applications (HotOS '23, 8 pages). Its core claim in one sentence:

The fundamental problem with microservices is that they bind the logical boundary to the physical boundary. You let "how the code is organized" dictate "how the code is deployed" — two questions that should never have been welded together.

From that claim, the paper proposes a three-layer solution:

Logical monolith — developers write a cleanly modularized monolith; deployment is someone else's problem.
Automated runtime — a smart platform that decides at runtime whether components should be merged or split, based on load.
Atomic deployment — all components on a request path share one consistent version, avoiding half-old/half-new.

Prototype numbers: up to 15× lower latency, up to 9× lower cost.

That's it. The paper never says "microservices are wrong," never says "everyone should go back to monoliths," and gives no implementable plan. It's a vision paper — written to provoke discussion at a workshop, not an engineering whitepaper.
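To make the logical/physical split concrete, here is a minimal sketch in Python. It is not the paper's prototype or its API; every name in it (RuleMatcher, LocalRuleMatcher, RemoteRuleMatcher, fake_rpc) is hypothetical. The point is only that calling code is written once against a logical interface, while the choice between an in-process call and a network hop is made elsewhere.

```python
from typing import Callable, Protocol


class RuleMatcher(Protocol):
    """Logical boundary: what callers may rely on, independent of deployment."""

    def match(self, request_id: str) -> list[str]: ...


class LocalRuleMatcher:
    """Physical choice 1: same process, plain method call."""

    def match(self, request_id: str) -> list[str]:
        return [f"rule-for-{request_id}"]


class RemoteRuleMatcher:
    """Physical choice 2: another process behind a transport (stubbed here)."""

    def __init__(self, send: Callable[[str], list[str]]) -> None:
        self._send = send  # hypothetical transport, e.g. an RPC client call

    def match(self, request_id: str) -> list[str]:
        return self._send(request_id)


def route(request_id: str, matcher: RuleMatcher) -> str:
    """Caller code: written once, unaware of where the matcher runs."""
    return ",".join(matcher.match(request_id))


def fake_rpc(request_id: str) -> list[str]:
    # Stand-in for a network hop; a real deployment would serialize and send.
    return LocalRuleMatcher().match(request_id)


colocated = route("r1", LocalRuleMatcher())
split = route("r1", RemoteRuleMatcher(fake_rpc))
print(colocated, split)
```

In this sketch a person picks the binding by hand. The paper's second layer proposes that a runtime makes that choice based on load.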

A ruler

Before dissecting it, here's a ruler you can apply to any architectural claim (this is a common framing in the engineering literature — you're free to reject it):

Architecture is the management of complexity across four dimensions — logical, physical, temporal, organizational — under constraints, in service of quality attributes.

The full definition adds three layers (decisions, decision mechanisms, decision evolution), but the four dimensions are the skeleton.

Keep the ruler in hand for the next three acts.

Act 1: Why split, why want to go back

Picture a platform's core system — request routing, rule matching, model inference, data aggregation, all in one process. v1 is a monolith. As traffic grows, the team splits it into four independently deployed services.

The cost shows up immediately: one request now traverses four services, and network hops push latency from 8ms to 120ms; four teams scale independently, and machine cost rises nearly tenfold. This is exactly the pain the paper describes. Someone slams the paper on the table: go back, return to a monolith.

But they can't.

Gap 1: The paper optimizes for one quality attribute — cost. Real systems have more. The inference team is ML engineers on Python+GPU; the routing team is backend on Go. Technical heterogeneity means they can't collapse into one deployment unit. Harder still: payments flow through this system, and the inference module's OOM must never take it down. Fault isolation isn't an optimization — it's a requirement.

Google's answer holds only in the cost-first quadrant. Step into another quadrant and the conclusion inverts. The paper's precision is both its greatness and its limitation.

Act 2: Conway's Law — the dimension the paper silently skips

Suppose they force-collapse back into a monolith, all four teams committing to one repo. Looks beautiful, until the first conflict.

The rules team wants to modify a shared cache interface to support a new promotional rule; the inference team depends on that cache's implicit "return order is stable" semantics. After the change, inference results drift silently in production — caught three days later. In the microservice era, "the interface is a contract" shielded them; once collapsed, every boundary becomes an internal call and contract protection vanishes.

Gap 2: The paper says nothing about organizational complexity. Conway's Law: system architecture is a mirror of organizational communication structure. The core driver of microservices was never technical — it was letting small teams ship and iterate independently. Google's proposal demands all teams collaborate on one logical monolith, which puts Conway's cost right back on the table.

Four teams in one codebase means cross-team syncs, merge-conflict arbitration, and release-window coordination. That overhead eats every cent saved by leaving microservices, and adds a cycle of team attrition.

The paper covers two of four dimensions (logical, physical). Organizational is blank.

Act 3: Mechanism is not decision

The team ends up neither back in a monolith nor fully in microservices — they choose a hybrid: core transactional path physically isolated, peripheral services collapsed into a modular monolith.

That choice itself exposes Gap 3: The paper gives a placeholder for a mechanism, not a decision.

Booch said "architecture is decisions." The paper says "architecture should have an automated decision mechanism." These are very different things:

Decision = a choice already made ("we use a logical monolith, not microservices")
Decision mechanism = how choices get made ("the runtime decides merge/split based on load")

The real decision rests on constraints (payment fault isolation), quality attributes (core stability vs. peripheral iteration speed), and tradeoffs (two deployment pipelines). The paper's "auto-merge/split runtime" can't help — it optimizes only cost and latency, while the real decision variables are organizational structure and business risk. But I'm not going to demand that a vision paper hand us a decision — that's the engineer's job for a specific system. My critique lands where the paper should be held accountable: it never even stands up the mechanism itself. The paper admits the runtime "isn't magic," yet says nothing about how to build it — and nothing about who triggers a re-decision when constraints change. A vision paper can withhold decisions, but the central mechanism it proposes deserves at least a minimal feasibility argument — and this one has none.

The real cut: the variable Google never predicted — AI

The first three acts make the paper "not actionable." But what actually undermines its premise is a variable it never discusses: AI code generation.

The paper landed in June 2023, months into the ChatGPT coding wave — you can't blame the authors. But to argue how sharp this variable cuts, a hypothetical isn't enough. I'll use my own project as evidence.

Disclosure: ReqForge is an open-source project I maintain (github.com/zxpmail/ReqForge). What follows isn't neutral analysis — it's a design rationale. And because I'm accountable for these choices, it's more honest than any invented example.

The entire promise of the logical monolith rests on one assumption: module boundaries and interface contracts will be maintained by humans. AI-generated code is systematically breaking that assumption.

Humans don't read only type signatures. A veteran knows that some function "actually" has a call-frequency cap, holds implicit state, or can't be called inside a transaction. These implicit contracts aren't written in the signature, but they exist. The AI doesn't see them. It sees imports and types, then depends on internal details it shouldn't — and feels fine doing it, because the type check passes.

This isn't a bug in the AI. It's how it works by design. In the human era, the thing that broke contracts was a few careless commits. In the AI era, it's every generated line.

My project ReqForge is, end to end, an engineering response to this. Several of its design choices are live evidence for the claim that "physical isolation becomes necessary again in the AI era":

1. Logical/physical decoupling — literally the paper's Solution 1. ReqForge separates methodology (core/: skills, agents, hooks) from physical deployment (adapters/: claude-code, cursor, opencode, gemini-cli) — one core synced to four adapters. The paper says "letting code organization dictate deployment is wrong"; this project did exactly that from day one. Note, though: it implements Solution 1, not Solution 2 (the automated runtime). That "smart platform" — the paper itself only drew a box around it, and AI-generated code's implicit dependencies make that box even harder to fill.

2. The sub-agent context firewall — "AI context isolation" in miniature. ReqForge mandates that every Task gets a fresh sub-agent instance — no reuse, no inherited context. The orchestrator provides only the current task's context, never history. Why? Because once an AI's flawed assumption crosses a Task boundary, it cascades. It's the same principle as using physical boundaries to stop cross-service failure — just applied at the agent layer instead of the deployment layer.

3. "Don't let the AI cross the line" as a machine gate. Each Phase declares file boundaries (modify / readonly / outOfScope); forge-verify scope-check enforces them against git diff. The AI tries to "helpfully" edit a readonly file? The gate refuses. Since persuasion doesn't work, you use a physical boundary.

4. Don't fight the model — work with it. ReqForge rewrote all its anti-slop rules from "10 don'ts" into "3 perfect anchors + a light checklist," because "LLMs are pattern matchers, not rule followers." This concedes a brutal fact: in the AI era, you can't hold boundaries by telling the model the rules — you can only steer it by changing what it sees. Physical isolation is the hardest version of that.
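The machine gate in point 3 can be illustrated with a short sketch. This is not ReqForge's forge-verify implementation. The scope layout, the glob patterns, and the check_scope function are assumptions made for illustration, and the list of changed files stands in for the output of git diff --name-only.

```python
from fnmatch import fnmatchcase

# Hypothetical phase declaration. The three keys mirror the boundaries named
# above; the patterns and the dict layout are illustrative assumptions.
PHASE_SCOPE = {
    "modify": ["src/billing/*.py"],
    "readonly": ["src/shared/*.py"],
    "outOfScope": ["infra/*"],
}


def _matches(path, patterns):
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def check_scope(changed_files, scope):
    """Return (path, reason) for every changed file outside the modify set."""
    violations = []
    for path in changed_files:
        if _matches(path, scope["modify"]):
            continue  # explicitly allowed
        if _matches(path, scope["readonly"]):
            violations.append((path, "readonly"))
        elif _matches(path, scope["outOfScope"]):
            violations.append((path, "outOfScope"))
        else:
            violations.append((path, "undeclared"))
    return violations


# In a real gate this list would come from `git diff --name-only`.
changed = ["src/billing/invoice.py", "src/shared/cache.py", "infra/deploy.yaml"]
for path, reason in check_scope(changed, PHASE_SCOPE):
    print(f"REJECT {path}: {reason}")
```

Treating undeclared paths as violations is a design choice in this sketch: a gate that fails closed does not depend on the declaration being complete.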

These four together drive a conclusion the paper can't answer:

In the AI era, physical isolation becomes necessary again — but for a new reason. It's no longer just fault isolation; it's AI context isolation: split modules into separate repos so the model's context window literally cannot see other modules' internals, using physical boundaries to forge contracts the AI can't pierce.

Once physical isolation is necessary again for this new reason, the logical monolith's central promise — "you don't need physical isolation" — shrinks dramatically.

But this claim has its own scope, and I have to state it — otherwise I commit the same error as Google's paper, shouting a universal conclusion from a limited case. Physical isolation (separate repos) raises CI/CD and cross-module coordination costs; it only pays off when scale and complexity are large enough that AI's cost of piercing boundaries exceeds the cost of isolation. For small teams (≤10), single tech stacks, or projects with stable module boundaries, "disciplining the AI's prompt + code review" is often cheaper than physical isolation. And "context isolation" doesn't strictly require Git-repo physical separation — context-trimming in the AI toolchain, or scoping limits on sub-agents, are lighter approximations, just less hard than a physical boundary. My claim is that physical isolation has gained a new reason to exist — not that every module should be physically isolated.

This cut is more lethal than all three gaps. The gaps make the paper "incomplete." AI makes its central solution "questionable in the new era."

The paper's true position: a coordinate, not a map

Step back and look at the whole trajectory: monolith → microservices (cost explosion) → want to return to monolith (can't — heterogeneity, isolation) → hybrid (their own decision) → AI forces physical isolation back (a new reason: context isolation).

Along that path, Google's paper nails the first segment (the cost pain) and is useless for every segment after.

That's its real position: a coordinate, not a map.

Its greatest contribution was offering, at the peak of the microservices craze, a different angle from an industry giant — prompting the field to question whether microservices are the only answer. Its intellectual value exceeds its practical value. But it is not an engineering guide, and it is not an obituary.

What's worth taking away more than the paper itself is that ruler. Looking back, the paper got misread precisely because tech media skipped the ruler's three elements — never asked "which quality attribute," never checked "which dimensions," and mistook a vision for a decision. The next time an article tells you "choose A or B," measure it: which quality attribute does it serve, which dimensions does it cover, and is it giving you a decision or a decision mechanism?

Architecture isn't A vs. B. It's "under constraint X, for quality attribute Y, I chose Z, at cost W."

The ruler doesn't play favorites. It measures Google's paper — and it measures this post too, including my own-project evidence in the AI section. You can turn it back on me: do my claims survive the "constraint, quality attribute, cost" test? A piece willing to be measured by the ruler it hands out is the kind of honesty a tech commentary should aim for.

Any architectural conclusion that skips constraint, quality attribute, and cost isn't worth taking seriously — whether it comes from Google or from a blog.

KV Cache Is Eating Your VRAM — Here's How to Estimate It Before You Run Out

zxpmail — Sun, 28 Jun 2026 23:06:31 +0000

Every LLM inference engineer hits this wall eventually.

You deployed a model, it works in testing, then production traffic arrives. Suddenly your 80GB A100 is OOM on a 70B model that "should fit."

The culprit is almost always the KV Cache. But most discussions stop at "it caches the Key and Value matrices" — which doesn't help you predict when you'll run out of memory.

This post gives you a quick estimator formula, explains when to worry, and what levers actually help.

The One-Number Formula

Here's the quick estimator:

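KV Cache Memory (bytes) = 2 × (layers) × (hidden_dim) × (context_length) × (batch_size) × (bytes_per_param)

The leading 2 is because you cache both K and V. The result is in bytes; divide by 10^9 for GB. For a single sequence, batch_size is 1.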

For Llama 3.1 70B (80 layers, hidden_dim 8192, FP16), counting a separate K and V for every attention head (Lever 1 below shows how GQA shrinks this):

Per token: 2 × 80 × 8192 × 2 bytes = 2.6 MB
At 8K context: 2.6 MB × 8192 = 21 GB
At 128K context: 2.6 MB × 131072 = 340 GB (doesn't fit on one A100)

That's right: by this estimate, the KV cache for a 70B model at 128K context needs 340GB of memory, more than the model weights themselves (140GB in FP16).

In most inference scenarios, the KV cache is the bottleneck, not the weights.
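The arithmetic above can be reproduced in a few lines. This is a minimal sketch of the quick estimator only. The function name is illustrative, and it assumes one K and one V vector of width hidden_dim per layer per token (no GQA), which is what the figures above use.

```python
def kv_bytes_per_token(layers, hidden_dim, bytes_per_param=2):
    """Quick estimator: K and V (the leading 2), one vector each per layer."""
    return 2 * layers * hidden_dim * bytes_per_param


# Llama 3.1 70B shape used in the text: 80 layers, hidden_dim 8192, FP16.
per_token = kv_bytes_per_token(layers=80, hidden_dim=8192)
print(f"per token: {per_token / 1e6:.2f} MB")

for context_len in (8192, 131072):
    total_gb = per_token * context_len / 1e9  # decimal GB, single sequence
    print(f"context {context_len}: {total_gb:.1f} GB")
```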

Why It Matters More Than Weights

Model weights are static. You load them once, they sit in VRAM. 70B in FP16 = ~140GB. That's a known cost.

KV Cache is dynamic. It grows linearly with:

Batch size — cached for every sequence in the batch
Context length — cached for every token position
Number of layers — cached for every transformer layer (the full stack)

The wall you'll hit first:

Scenario	Weights	KV Cache (8K)	KV Cache (128K)
70B, batch=1, FP16	140GB	21GB	340GB — OOM
70B, batch=4, FP16	140GB	84GB	1.3TB — OOM
7B, batch=8, FP16	14GB	9GB	150GB — OOM

At long contexts or high batch sizes, the KV cache dominates total memory — and it's the part that grows with traffic, not the part you can amortize.

If you're running Speculative Decoding (theory, benchmarks), both the draft model and the target model maintain their own KV caches. For a 7B draft + 70B target pair, the draft adds roughly 10-15% more KV cache memory on top of the target's — a factor worth including in your estimate.

What Actually Reduces KV Cache Memory

There are six levers, and they're not all created equal.

Lever 1: Multi-Query Attention (MQA) / Grouped Query Attention (GQA)

This is the most impactful architectural fix. Instead of caching K and V for every attention head, share K and V across query heads.

Original MHA: KV cache per layer = 2 × hidden_dim
GQA (8 groups): KV cache per layer = 2 × hidden_dim / group_size (where group_size = num_attn_heads / kv_heads, e.g. 64/8 = 8)
MQA (1 group): KV cache per layer = 2 × hidden_dim / num_attn_heads

In practice: Llama 3.1 70B uses GQA with 8 key-value heads. That reduces the KV cache to about 1/8 of what MHA would require — roughly 2.6 MB per token → 0.33 MB per token.

Architecture	KV per token (70B, FP16, 8192 hidden, 64 attn heads, head_dim=128)
MHA (64 KV heads)	2.6 MB
GQA (8 KV heads)	0.33 MB
MQA (1 KV head)	0.04 MB

GQA is close to a free lunch. It barely affects quality and cuts cache memory by 4-8×. It is fixed at training time, so if your model doesn't use it, consider switching to a variant that does.
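The per-token figures in the table follow directly from the number of KV heads. A small sketch, using the 70B shape from the text (80 layers, hidden_dim 8192, 64 attention heads, FP16); the function name is illustrative.

```python
def kv_bytes_per_token(layers, hidden_dim, num_attn_heads, kv_heads, bytes_per_param=2):
    """Per-token KV cache size when only kv_heads heads store K and V."""
    head_dim = hidden_dim // num_attn_heads
    return 2 * layers * kv_heads * head_dim * bytes_per_param


variants = {"MHA": 64, "GQA": 8, "MQA": 1}  # number of KV heads per variant
sizes = {
    name: kv_bytes_per_token(80, 8192, 64, kv_heads)
    for name, kv_heads in variants.items()
}
for name, size in sizes.items():
    print(f"{name}: {size / 1e6:.2f} MB per token")
```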

Lever 2: Quantization (FP16 → FP8 → INT4)

KV Cache is less sensitive to quantization than weights. You can usually go to FP8 or INT4 without meaningful quality loss.

Precision	Bytes per param	KV cache for 7B, 8K, batch=16
FP16	2	18 GB
FP8	1	9 GB
INT4	0.5	4.5 GB

KV cache quantization is supported by the major inference frameworks (vLLM and TensorRT-LLM both offer an FP8 KV cache); INT4 support varies, so check your framework's documentation. The quality impact is minimal because KV cache errors are per-token, not accumulated across tokens.

Lever 3: Sliding Window Attention

Instead of caching all positions, only cache the last N tokens. For models that use ALiBi or Rotary Position Encoding without a strict context limit, this can cap KV cache growth.

The tradeoff: the model loses access to tokens beyond the window. For tasks that need long-range dependencies (summarization, document QA), this degrades quality.

For conversational or streaming use cases, sliding window is a no-brainer. For RAG, it depends on where in the context the relevant information sits.
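The effect of a window on the estimate is simply a cap on the number of cached positions. A rough sketch follows, using the 0.33 MB per-token GQA figure from Lever 1. The 4096-token window is a made-up example value, not a recommendation.

```python
def cached_positions(context_len, window=None):
    """Positions kept in the cache: all of them, or at most the last `window`."""
    return context_len if window is None else min(context_len, window)


def kv_gb(per_token_bytes, context_len, window=None, batch_size=1):
    positions = cached_positions(context_len, window)
    return per_token_bytes * positions * batch_size / 1e9  # decimal GB


PER_TOKEN = 327_680  # bytes per token: 70B with 8 KV heads, FP16

full = kv_gb(PER_TOKEN, 131072)
windowed = kv_gb(PER_TOKEN, 131072, window=4096)
print(f"full 128K cache: {full:.1f} GB, 4096-token window: {windowed:.2f} GB")
```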

Lever 4: PagedAttention (vLLM)

vLLM's contribution is memory management, not cache reduction. It fragments less.

Traditional inference allocates contiguous blocks per sequence. If a sequence has 512 tokens of cache and the allocator uses 1024-sized blocks, 50% is wasted.

PagedAttention allocates in smaller (16-256 token) pages, reducing fragmentation from 30-50% down to 1-4%.

Net effect: 30-50% effective memory gain on the same hardware, with no quality impact and no model changes.

This is why teams see such dramatic improvements switching to vLLM — it's not faster compute, it's better memory packing.
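The fragmentation argument is easy to check with arithmetic. The sketch below models only the rounding waste of each allocation strategy, not vLLM's actual allocator; the function names and the 500-token example are illustrative.

```python
import math


def contiguous_waste(seq_len, reserved_len):
    """Fraction of a fixed contiguous reservation that stays unused."""
    return (reserved_len - seq_len) / reserved_len


def paged_waste(seq_len, page_size):
    """Fraction unused when memory is handed out in fixed-size pages."""
    pages = math.ceil(seq_len / page_size)
    allocated = pages * page_size
    return (allocated - seq_len) / allocated


# The example from the text: 512 cached tokens in a 1024-slot reservation.
print(f"contiguous: {contiguous_waste(512, 1024):.0%} wasted")
# A 500-token sequence in 16-token pages wastes only part of the last page.
print(f"paged:      {paged_waste(500, 16):.1%} wasted")
```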

Lever 5: Reduce Context Length

This is the most brute-force lever, but sometimes the right one.

Max context	KV cache (7B, FP16, batch=8)
2K	2.3 GB
8K	9 GB
32K	36 GB
128K	144 GB

If 99% of your requests are under 4K tokens, don't support 128K context. Supporting a context length you don't use is burning VRAM for no reason.

Frameworks like vLLM let you cap context length at the engine level: set max_model_len to fit your workload rather than the model's theoretical maximum.

Lever 6: Use a Smaller Model

Sometimes the best optimization is admitting the model is too big for your use case.

A 7B model with full 128K context costs more in KV cache than a 70B model with 2K context. If your task needs long context, a smaller model at a higher context length may use less total memory than a large model at the same context.

The Quick Decision Tree

Run out of KV cache memory? Here's the order to try:

1. Switch to vLLM. ~30-50% effective memory gain. No model changes. Start here.

2. Quantize KV cache to FP8. ~2× memory reduction. Minimal quality impact.

3. Check GQA groups. If your model has full MHA, find a GQA variant. 4-8× reduction.

4. Implement sliding window or reduce max context. Only if your workload allows it.

5. Quantize to INT4. ~4× reduction from FP16. Test quality impact on your data first.

6. Reduce batch size. Last resort. Hurts throughput.

A Quick Estimator Script

def kv_cache_memory(layers, hidden_dim, context_len, batch_size, kv_heads, num_attn_heads, bytes_per_param=2):
    """
    Estimate KV cache memory in GB.

    layers: number of transformer layers
    hidden_dim: model hidden dimension
    context_len: max context length in tokens
    batch_size: number of concurrent sequences
    kv_heads: number of KV heads (1 for MQA, n for GQA, num_attn_heads for MHA)
    num_attn_heads: number of attention heads
    bytes_per_param: 2 for FP16, 1 for FP8, 0.5 for INT4
    """
    head_dim = hidden_dim // num_attn_heads
    kv_per_position = 2 * layers * kv_heads * head_dim * bytes_per_param
    total = kv_per_position * context_len * batch_size
    return total / (1024**3)  # bytes -> GiB; the tables above use decimal GB, about 7% larger

# Example: Llama 3.1 70B, 8K context, batch=4, GQA-8
# layers=80, hidden_dim=8192, attn_heads=64, kv_heads=8
print(f"{kv_cache_memory(80, 8192, 8192, 4, 8, 64, 2):.1f} GB")  # ~10.0 GB

# Same model, MHA (kv_heads = attn_heads = 64)
print(f"{kv_cache_memory(80, 8192, 8192, 4, 64, 64, 2):.1f} GB")  # ~80.0 GB

Run it before you deploy. It's cheaper than an OOM at 3 AM.

Closing

The KV cache is the silent memory killer in LLM inference. Model weights get all the attention — they're static, visible, and easy to estimate. The KV cache is dynamic, grows with traffic, and often exceeds the weight memory at production batch sizes and context lengths.

The fix isn't one lever. It's knowing which lever to pull first.

Start with memory management (vLLM). Then quantization (FP8). Then architecture (GQA). Then context limits. In that order. Most teams will run out of problems before they run out of levers.

And if you're exploring Speculative Decoding — the acceleration technique comes with its own memory tax: both models need room for their KV caches. Make sure your estimate accounts for both.

KV cache memory estimation should be part of your pre-deployment checklist. A few lines of Python will tell you if a 3 AM page is waiting for you.

June 2026. One formula, six levers, one decision tree. Estimate before you deploy; it's cheaper than an OOM at 3 AM.

I Benchmarked Speculative Decoding — a = 3.5 Wasn't Enough

zxpmail — Sun, 28 Jun 2026 11:41:42 +0000

In my last post, I laid out the core inequality of Speculative Decoding:

a > 1 + α + β

Acceptance length a must exceed 1 plus the draft/target compute ratio α plus verification overhead β. If it does, SD wins. If it doesn't, SD loses.

That was theory. This post is the practice.

I ran a real A/B test on my machine. The results were worse than I expected — and more interesting.

What I Tested

Hardware: 12th Gen Intel, 64GB RAM (CPU only). Yes, this means SD was always going to lose on raw speed. That wasn't the point — the point was measuring the acceptance length a across different task types. The speed numbers are secondary: they confirm the inequality on CPU, but the a values are what transfer to GPU deployments.

Model pair: Qwen2.5-0.5B-Instruct (draft) → Qwen2.5-1.5B-Instruct (target). Same model family, same tokenizer — a "well-matched" pair by any measure.

Tasks (5 prompts each, 32 tokens per generation, greedy decoding):

code — "Write a Python function to check if a string is a palindrome"
json — "Generate a JSON schema for a user profile with name, email, addresses"
story — "Write a short story about a programmer who discovers their code can write itself"

Draft length: k = 5 (the default sweet spot)

I logged every round: draft length k, accepted count a, and wall time for both raw autoregressive generation and speculative decoding.

Sample size note: 5 prompts × 32 tokens = ~160 generated tokens per task type. Enough for directional signals and the qualitative patterns below — not enough for release-grade latency benchmarks. The a values converged within 3-4 prompts; the speed numbers are CPU-specific and should not be taken as absolute.

The Results

Task	Mean a	Median a	Zero-accept rounds	Raw t/s	SD t/s	Speedup
code	3.00	4.0	23.8%	1.9	0.8	-56%
json	3.50	5.0	15.8%	1.8	0.9	-49%
story	2.11	2.0	30.2%	2.2	0.8	-62%

Speculative Decoding was 49-62% slower across all three task types.

The acceptance lengths were well above the 1 + α + β threshold. But SD still lost, and it wasn't close.

Finding 1: Task Type Shapes a More Than Model Size

The acceptance length varied significantly by task:

JSON (structured): a = 3.50 — the draft model could reliably predict what the target would generate for well-defined formats
Code (semi-structured): a = 3.00 — still good, but more variability in naming and logic patterns
Story (creative): a = 2.11 — the draft model struggled to anticipate the target's word choices in open-ended text

This confirms the distribution shift argument from the theory post. The same model pair (0.5B → 1.5B, same family, same tokenizer) produces very different acceptance rates depending on what you're generating. Your SD speedup will vary more by task than by model size.

Finding 2: 16-30% of Draft Rounds Are Wasted

The most eye-opening metric wasn't the mean a — it was the zero-accept rate.

16-30% of draft rounds accepted exactly zero tokens. The draft model fired, generated 5 candidate tokens, and every single one was rejected. Those rounds are pure overhead: you paid for the draft run, paid for the verification run, and got nothing but a single token from the target model.

In a round where a = 0:

Raw mode: 1 target forward pass, 1 token generated
SD mode: 1 target forward pass + 1 draft run (k = 5 draft forward passes), 1 token generated

SD ran two models to produce the same single token. And because draft models aren't free (with α ≈ 0.3 on this CPU, five draft passes cost about 1.5 target passes), these zero-accept rounds are what drag down the average.

The sorted per-round a values for code tell the story:

code:   [0 0 0 0 0 0 0 0 0 0 1 1 2 2 3 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5]
                            ^^^^^^^^^^                       ^^^^^^^^^^^^^^^^
                            23.8% wasted                     42.9% full hits

The mode is 5 (full hit) and the second mode is 0 (full miss). The mean (3.0) is somewhere in the middle, but the user experience is either "fast" or "very slow" — not 3.0.

For latency-sensitive applications, the p10 or p25 a matters more than the mean. If 25% of your requests hit zero-accept rounds, your p99 tail will be significantly worse than raw autoregressive.
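A small helper makes the point about distributions concrete. This is a sketch: the sample values below are invented for illustration and are not the logged rounds from this benchmark, and the summarize function is a hypothetical name.

```python
import statistics


def summarize(a_values):
    """Mean, median, lower quartile, and zero-accept rate of per-round a."""
    quartiles = statistics.quantiles(a_values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(a_values),
        "median": statistics.median(a_values),
        "p25": quartiles[0],
        "zero_rate": sum(1 for a in a_values if a == 0) / len(a_values),
    }


# Invented bimodal sample: many full hits, several full misses.
sample = [0, 0, 0, 1, 2, 4, 5, 5, 5, 5]
stats = summarize(sample)
print(stats)
```

In a bimodal sample like this one, the lower quartile sits far below the mean, which is the reason to report both.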

Finding 3: On CPU, SD Can't Win

This is where the CPU "bug" became a feature.

My A/B test ran on CPU (12th Gen Intel, 64GB RAM). The 1.5B target model managed about 2 tokens/second. The 0.5B draft model managed about 6 tokens/second. That gives:

α (compute ratio on CPU): ≈ 0.3

Compare this to a GPU:

α (compute ratio on GPU, 7B→70B): ≈ 0.05–0.1

The inequality threshold shifts dramatically:

Platform	α	β	Threshold (1 + α + β)	Our a
CPU	~0.3	~0.10	1.40	2.1–3.5
GPU (A100)	~0.05	~0.02	1.07	2.1–3.5

On GPU, our a values (2.1–3.5) clear the threshold comfortably. On CPU they clear the nominal 1.40 threshold too, with less headroom, and empirically SD still lost.

But the deeper insight is: SD is a GPU-bound optimization. The entire premise relies on the draft model being nearly free relative to the target. When the cost ratio α exceeds ~0.15, the headroom evaporates. And on CPU, with memory bandwidth as the bottleneck rather than compute, even a 3x smaller model doesn't come close to being "free."

If you're running SD on CPU... don't. The numbers don't work.
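The threshold column in the table above is a one-line computation. The sketch below plugs in the rough α and β estimates and the mean a values reported in this post; the function and variable names are illustrative.

```python
def sd_threshold(alpha: float, beta: float) -> float:
    """Right-hand side of the inequality a > 1 + alpha + beta."""
    return 1 + alpha + beta


# Rough alpha/beta estimates and mean a values copied from the tables in this post.
platforms = {"CPU": (0.30, 0.10), "GPU (A100)": (0.05, 0.02)}
mean_a = {"code": 3.00, "json": 3.50, "story": 2.11}

results = {}
for platform, (alpha, beta) in platforms.items():
    threshold = sd_threshold(alpha, beta)
    for task, a in mean_a.items():
        results[(platform, task)] = a > threshold
        print(f"{platform:10s} {task:5s} a={a:.2f} threshold={threshold:.2f} clears={a > threshold}")
```

Every row reports clears=True, including the CPU rows where the measured runs were slower. That suggests reading the check as a necessary condition rather than a guarantee.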

The Inequality, Now With Real Numbers

Let's plug the measured values into the inequality for the GPU scenario (where SD is designed to run):

Code (a = 3.0): 3.0 >> 1.07 ✅ Clear win. Draft tokens accepted 3× faster than the overhead burns them.
JSON (a = 3.5): 3.5 >> 1.07 ✅ Clear win. The draft model matched the target nearly perfectly on structured output.
Story (a = 2.1): 2.1 > 1.07 ✅ Marginal win. Clears the threshold, but with more zero-accept rounds eating into gains.

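The inequality gets the direction right. It predicts a comfortable win on GPU, much less headroom on CPU, and more risk for story generation than for code generation. What it does not predict is the CPU loss itself: with α ≈ 0.3 and β ≈ 0.10 the threshold is 1.40, every measured a clears it, and SD was still 49-62% slower. Either the simple α and β estimates or the inequality itself misses part of the real per-round cost on this setup.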

But it doesn't capture everything. The zero-accept rate is a separate dimension — one that affects p99 latency more than throughput. If I were writing the inequality again, I'd add a variance term.

How to Measure Your Own a (In 20 Lines)

You don't need a complex benchmark framework. Here's the core measurement loop:

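import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def measure_acceptance(model, draft, tokenizer, prompt, k=5, max_tokens=128):
    """Log k and a for each speculative round (greedy decoding, batch size 1)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    generated = inputs["input_ids"]
    prompt_len = generated.shape[1]
    rounds = []  # each element: {"k": int, "a": int}

    while generated.shape[1] < prompt_len + max_tokens:
        # Draft: greedily propose up to k candidate tokens
        draft_out = draft.generate(generated, max_new_tokens=k, do_sample=False)
        candidates = draft_out[0, generated.shape[1]:].tolist()
        actual_k = len(candidates)
        if actual_k == 0:
            break  # nothing drafted; stop rather than build an empty tensor

        # Verify: one target forward pass over prefix + candidates
        verify_input = torch.cat([generated, torch.tensor([candidates])], dim=-1)
        logits = model(verify_input).logits[0]

        # logits[j] predicts the token at position j + 1, so candidate i is
        # compared with the prediction at index prefix_len - 1 + i
        accepted = 0
        for i, tok in enumerate(candidates):
            target_tok = logits[generated.shape[1] - 1 + i].argmax().item()
            if tok == target_tok:
                accepted += 1
            else:
                break

        rounds.append({"k": actual_k, "a": accepted})

        # Keep the accepted prefix of the draft
        if accepted > 0:
            generated = torch.cat([generated, torch.tensor([candidates[:accepted]])], dim=-1)
        # Generate the next token from the target. (The verification logits
        # already contain it; the extra call keeps the loop easy to read.)
        generated = model.generate(generated, max_new_tokens=1, do_sample=False)

    return rounds

# Run it with the model pair from this post:
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
target = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
draft = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
data = measure_acceptance(target, draft, tokenizer, "Write a function...")
a_values = [r["a"] for r in data]
print(f"Mean a: {sum(a_values)/len(a_values):.2f}")
print(f"Zero-accept: {sum(1 for a in a_values if a==0)/len(a_values)*100:.1f}%")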

The custom loop above gives you per-round logging. If you're using HuggingFace generate(), the built-in assistant_model parameter offers the same acceleration with less code — but it doesn't expose per-round a values out of the box. Use the custom loop for measurement, switch to the built-in for production.

Run this on 100+ samples per task type and split by task category. Don't average across all traffic — your code completions and your chat responses will have drastically different a values.

Four Takeaways for Production Engineers

1. Measure your own a before trusting vendor benchmarks. Our model pair achieved anything from 2.1 to 3.5 depending on the task. If someone claims "85% speedup," ask: on what task, with what model pair, and what was the acceptance rate?

2. Don't average across tasks. A single mean a for your whole workload hides the story. Split by traffic type. If code routes have a = 3.5 and chat routes have a = 1.8, enable SD for code routes only.

3. Watch the zero-accept rate, not just the mean. A high zero-accept rate means worse p99 latency. In a system that must respond in 2 seconds, a 15% chance of being 30% slower is unacceptable.

4. SD is a GPU optimization. It works when α is tiny (the draft model is nearly free relative to the target). On CPU, or on any platform where the draft model competes for memory bandwidth with the target, the inequality collapses. Benchmark on your target hardware.

Closing

The theory says SD is lossless but not free. The practice confirms it — and adds nuance.

Lossless: yes. The output distribution is identical. No hallucinations were amplified.

Not free: more expensive than the simple 1 + α + β model suggests. The zero-accept variance, the task-dependent a, and the hardware-dependent α all eat into the theoretical speedup.

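The inequality still works as a first filter: it ranked the tasks and the platforms in the right order. It did not predict the CPU loss on its own, which is why a point estimate of a isn't enough. You also need the distribution, and α and β measured on your own hardware.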

The best optimization technique isn't the one that always wins. It's the one you know when to turn off. Now you know how to measure when that is.

Tested with Qwen2.5 models, transformers 5.12, torch 2.12 (CPU), Python 3.14 on Windows 10 (64GB RAM, 12th Gen Intel).

Lossless, But Not Free: When Speculative Decoding Actually Pays Off (and When It Doesn't)

zxpmail — Sun, 28 Jun 2026 10:16:44 +0000

One of the hottest topics in LLM inference acceleration right now is Speculative Decoding.

DSpark claims 60%–85% single-user speedup at the same throughput. Google has published a stream of research on it — SpecTr, block verification, SpecRouter, and more.

Sounds great, right? A small model (draft model) writes a draft, the large model batch-verifies it, and speed goes up.

But if you're a production engineer looking at this, two questions immediately pop up:

"Block generation — doesn't that amplify hallucinations?"

"You're running an extra model regardless of hit or miss — isn't that wasted compute?"

These two questions hit right at the core of Speculative Decoding's math promise and its engineering cost.

Let's run the numbers — no hype, no FUD.

1. The Math Promise: Why Block Generation Doesn't Amplify Hallucinations

This is the most misunderstood part of Speculative Decoding. Intuitively: "guess 5 tokens, one wrong and the rest are junk" — correct. But Speculative Decoding is designed precisely to prevent "junk" from becoming "wrong."

The verification mechanism is token-by-token, not "accept all or reject all."

The draft model generates a candidate block: [t1, t2, t3, t4, t5]. The target model verifies all 5 positions in one forward pass. The result:

t1 correct → accepted
t2 correct → accepted
t3 wrong → rejected; the target model regenerates from t3 onward
t4, t5 → dropped (they were built on a wrong t3)

Every output token has been confirmed by the target model. No hallucination is "amplified" — it's simply truncated at the first error. In terms of probability distribution, Speculative Decoding's output is mathematically equivalent to the target model's autoregressive output — a provable property.

So the answer to question one is: lossless quality. The promise holds.

One caveat: this equivalence assumes the draft and target models share the same tokenizer. If they differ (e.g., one uses BPE, the other Unigram), draft tokens can't be compared with the target's position by position and must first be re-aligned to the target's vocabulary, which adds overhead. It's not a bug in Speculative Decoding, but something to verify before deploying to production.

2. The Engineering Cost: Why "Lossless" Isn't "Free"

The second question is harder to answer.

"You're running an extra model regardless" — how do we account for that cost?

First, a premise: a small model's forward pass typically costs 1/10 to 1/20 of the target model's. That's because the core assumption of Speculative Decoding is that the draft model is small — a common pairing is a 7B drafting for a 70B. All the math below builds on this assumption.

Let's walk through three scenarios with a draft length of 5:

Scenario A: Full hit (best case)

	Without SD	With SD
Target model runs	5	1
Draft model runs	0	1
Net	5 target runs	1 target + 1 draft

Saving: 4 target runs minus 1 draft run.

Scenario B: Full miss (worst case)

	Without SD	With SD
Target model runs	5	1 (verification) + 5 (regeneration)
Draft model runs	0	1
Net	5 target runs	6 target + 1 draft

Result: slower than autoregressive, with a wasted draft run on top.

Scenario C: Partial hit (common case)

	Without SD	With SD
Target model runs	5	1 (verification) + (5 - hits) (regeneration)
Draft model runs	0	1
Net	5 target runs	(6 - hits) target + 1 draft

Net benefit: positive only when hits > 1 + (draft_cost / target_cost).
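The three tables can be folded into one small calculation. The sketch below follows the simplified accounting used in this section (one draft run and one verification pass per block, plus one target run per rejected position); the function names and the value alpha = 0.1 are illustrative assumptions.

```python
def sd_cost(hits: int, k: int = 5, alpha: float = 0.1) -> float:
    """Cost of one speculative block, in units of one target forward pass.

    Simplified accounting from the tables above: 1 verification pass,
    (k - hits) regeneration passes, and 1 draft run costing `alpha`.
    """
    if not 0 <= hits <= k:
        raise ValueError("hits must be between 0 and k")
    return 1 + (k - hits) + alpha


def autoregressive_cost(k: int = 5) -> float:
    """Cost of producing k tokens with the target model alone."""
    return float(k)


for hits in range(6):
    saving = autoregressive_cost() - sd_cost(hits)
    verdict = "wins" if saving > 0 else "loses"
    print(f"hits={hits}: SD cost={sd_cost(hits):.1f}, saving={saving:+.1f} -> SD {verdict}")
```

With these assumptions the break-even sits between one and two hits, which matches hits > 1 + (draft_cost / target_cost).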

See the pattern? Speculative Decoding isn't "always faster." It's a high-risk, high-reward bet. Win and you save compute. Lose and you pay extra.

3. The Core Inequality: When Does Speculative Decoding Pay Off?

Let's formalize the math above into a single inequality.

Let:

k = draft length (how many tokens per guess)
α = compute ratio of draft model to target model (for a 7B/70B pair, α ≈ 0.05–0.1)
β = verification overhead per round, expressed as a fraction of one target forward pass
a = average acceptance length (how many tokens pass verification per round)

Speculative Decoding is strictly better than autoregressive when:

a > 1 + α + β

Or in words: the average acceptance length must exceed 1 (more than one token accepted per round), and the surplus must cover the draft model and verification overhead.

a = 5 (all hit) → big win
a = 1 (one hit) → net loss — you paid for the draft run for nothing
a < 1 (zero hits) → severe loss — slower than not using it at all
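For readers who want to see where the threshold comes from, here is a short derivation under the same simplified accounting as Scenario C. Every cost is measured in units of one target forward pass, and α and β are treated as overheads per round; the symbols C_AR and C_SD are introduced here for convenience.

```latex
\begin{align*}
C_{\text{AR}} &= k \\
C_{\text{SD}} &= \underbrace{1}_{\text{verification}}
              + \underbrace{(k - a)}_{\text{regeneration}}
              + \underbrace{\alpha}_{\text{draft}}
              + \underbrace{\beta}_{\text{overhead}} \\
C_{\text{SD}} < C_{\text{AR}}
  &\iff 1 + k - a + \alpha + \beta < k \\
  &\iff a > 1 + \alpha + \beta
\end{align*}
```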

How to pick k? Too small and the speedup is negligible. Too large and you waste compute on tail tokens that are almost certainly rejected. Engineering experience: k = 4–6 is the sweet spot. Below 4, the acceleration is barely noticeable. Above 6, marginal returns diminish rapidly.

The distribution shift trap. If the task distribution is far from the draft model's training distribution — say, using a 7B to draft poetry for a 70B — the 7B has no idea how the 70B will choose its words. Acceptance rate can drop below 10%. At that point a < 1, and Speculative Decoding is strictly worse than autoregressive — and it gets worse as k increases. This is the single most important thing to watch for in production.
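To get a feel for how quickly a collapses, the sketch below estimates the expected acceptance length from a per-token acceptance probability p. It assumes each draft token is accepted independently with the same probability, which is a simplification introduced here rather than something measured; real acceptance depends on position and context.

```python
def expected_acceptance_length(p: float, k: int) -> float:
    """Expected number of accepted tokens in a block of k draft tokens.

    Assumes every draft token is accepted independently with probability p
    and that acceptance stops at the first rejection, so the block
    contributes p + p^2 + ... + p^k tokens on average.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(p ** i for i in range(1, k + 1))


for p in (0.1, 0.5, 0.8, 0.9):
    row = ", ".join(f"k={k}: {expected_acceptance_length(p, k):.2f}" for k in (2, 4, 6, 8))
    print(f"p={p:.1f} -> {row}")
```

Under this assumption, a 10% per-token acceptance rate keeps a far below 1 no matter how large k is, while at high acceptance rates each extra draft position adds less than the one before it.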

All Speculative Decoding does is play this inequality game, round after round.

A quick reality anchor: in practice, well-matched draft/target pairs (same family, similar training data) achieve a = 2.5–4.0 on code and structured text tasks — comfortably above the 1 + α + β threshold. Unmatched pairs (different model families, different tokenizers, or high-entropy tasks like free-form dialogue) often land at a = 1.0–1.5, right in the marginal zone where overhead eats the gain. This is why your mileage varies more by task than by model size.

4. Measuring Your Own Acceptance Rate — A Monday-Morning Checklist

Before you trust any vendor's benchmark, measure your own a.

Here's what you do:

Step 1: Instrument the verification boundary. Insert a logging hook between the draft model and the target model's verification pass. For each request, log the draft length k, the acceptance length a, and the number of regeneration steps. Inference frameworks that support SD (TensorRT-LLM, vLLM with speculative decoding, HF generate() with assistant_model) differ in how much of this they report out of the box; where a counter is missing, you can patch it in with roughly 50 lines.

Step 2: Collect 500+ samples per task type. Don't average across all traffic: your code completion requests and your creative writing requests will have drastically different a values. Split by task category, prompt length bucket, and response length bucket. 500 samples per bucket gives you a stable mean and a usable p10/p50/p90 spread.

Step 3: Check the worst decile. The mean a might be 3.2, but if the bottom 10% of requests have a < 1, those requests are paying more than they would without SD. In a latency-sensitive system, the p10 a matters more than the mean.

Step 4: Run the inequality per bucket. Plug each bucket's a into a > 1 + α + β. If code completion passes but free-form dialogue fails, you have a deployment strategy: enable SD for the code route, disable it for the chat route.
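Steps 2 to 4 can be sketched in a few lines of Python once the logs exist. The record layout (bucket name, acceptance length), the function name, and the alpha and beta values below are all hypothetical; substitute whatever your own instrumentation produces.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def summarize_buckets(
    records: Iterable[Tuple[str, int]], alpha: float, beta: float
) -> Dict[str, dict]:
    """Group logged acceptance lengths by bucket and apply a > 1 + alpha + beta.

    `records` is an iterable of (bucket_name, acceptance_length) pairs, one per
    logged request. The record layout is hypothetical.
    """
    by_bucket: Dict[str, List[int]] = defaultdict(list)
    for bucket, accepted in records:
        by_bucket[bucket].append(accepted)

    threshold = 1 + alpha + beta
    summary = {}
    for bucket, values in by_bucket.items():
        mean_a = statistics.fmean(values)
        # First decile cut point: roughly 10% of samples fall at or below it.
        p10 = statistics.quantiles(values, n=10)[0] if len(values) >= 2 else values[0]
        summary[bucket] = {
            "n": len(values),
            "mean_a": mean_a,
            "p10_a": p10,
            "enable_sd": mean_a > threshold,
        }
    return summary


# Toy data; a real run would use 500+ logged samples per bucket.
logs = [("code", a) for a in (4, 3, 5, 4, 2, 4, 3, 5, 4, 3)]
logs += [("chat", a) for a in (1, 0, 2, 1, 0, 1, 1, 0, 2, 1)]
for name, stats in summarize_buckets(logs, alpha=0.1, beta=0.05).items():
    print(name, stats)
```

The decision here uses the bucket mean; a latency-sensitive deployment could apply the same threshold to `p10_a` instead, in line with Step 3.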

This isn't optional calibration. It's the difference between "SD saves us 40% latency" and "SD makes our p99 worse and we can't figure out why."

5. What DSpark Does Well: Confidence-Based Scheduling

Once you understand the inequality above, DSpark's core contribution becomes obvious: Confidence-based Scheduling.

DSpark adds a confidence head to the draft model. For each draft token, it outputs a "survival probability." The scheduler uses this to dynamically decide how many tokens to verify:

High-confidence suffix → verify more tokens; longer block, bigger speedup
Low-confidence suffix → truncate early; don't waste compute verifying likely-wrong tokens

In the inequality framework: DSpark dynamically adjusts k via the confidence head — maximizing the expected acceptance length a while minimizing the wasted α overhead.
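One way to picture the scheduling idea is a cutoff on cumulative survival probability. The sketch below is an illustration written for this post, not DSpark's actual algorithm: the function name, the running-product rule, and the 0.5 threshold are all assumptions.

```python
from typing import Sequence


def choose_verify_length(survival_probs: Sequence[float], min_joint: float = 0.5) -> int:
    """Pick how many draft tokens to send for verification.

    `survival_probs[i]` is the confidence head's estimate that draft token i
    survives verification. Because a token only counts if every earlier token
    was accepted, we track the running product and stop once it drops below
    `min_joint`.
    """
    joint = 1.0
    length = 0
    for p in survival_probs:
        joint *= p
        if joint < min_joint:
            break  # the remaining suffix is unlikely to survive
        length += 1
    return length


print(choose_verify_length([0.98, 0.97, 0.95, 0.93, 0.90]))  # confident block
print(choose_verify_length([0.90, 0.60, 0.40, 0.30, 0.20]))  # shaky block
```

A confident block is verified in full, while a shaky one is cut short before the target model spends compute on tokens that would probably be rejected.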

Win, you accelerate. Lose, you stop the bleeding early. It turns Speculative Decoding from blind betting into informed gambling.

6. So, Is It Worth Using?

It's not a yes/no question. It's a "depends."

Use it when:

The task distribution is predictable and the draft model's hit rate is high (code completion, common QA patterns)
You're doing batched inference, where every bit of per-request speedup compounds
You already have a small model that shares the target's tokenizer

Don't use it when:

The task is open-ended and unpredictable, and the draft model's guesses are unreliable (creative writing, complex reasoning — hit rate can collapse)
Request volume is low and the overhead of deploying an extra model can't be amortized
You're latency-sensitive and can't tolerate a worse p99 tail (because on a full miss, SD is slower)

A pragmatic rule:

If you're running high-volume LLM inference, Speculative Decoding is worth evaluating. But don't trust the "85% speedup" number. A/B test on your data and your model pair. Measure your actual acceptance rate. Plug it into a > 1 + α + β.

If it holds, use it. If it doesn't, don't. Simple as that.

Closing

Speculative Decoding is an elegant mathematical scheme: lossless quality, faster inference, via a draft-verify mechanism.

But lossless ≠ free.

It doesn't amplify hallucinations. But it does add compute overhead. When the hit rate is high, that overhead buys significant acceleration. When the hit rate is low, it doesn't just fail to accelerate — it slows the system down.

The best optimization technique isn't the one that always wins — it's the one you know when to turn off.

Next time you see a Speculative Decoding paper that only reports "X% speedup" without mentioning the acceptance rate or the worst-case behavior — send them this post.

The Fourth Layer of Agent-Native

zxpmail — Sun, 28 Jun 2026 09:44:53 +0000

"Agent-Native" is this year's buzzword. Most explanations reduce it to "bind AI to your frontend."That's not wrong — it's just looking at the first three layers while missing the crucial fourth one.

If you follow the current Agent-Native discourse, you'll encounter a consistent three-layer model of what it means to make AI a "native" part of your system:

Layer 1 — Content Readable. The Agent can understand your content as structured text, not as rendered-HTML noise. Standards like llms.txt solve this.

Layer 2 — Action Executable. The Agent doesn't just read — it calls your APIs, triggers operations, fetches live data. defineAction and its equivalents solve this.

Layer 3 — Protocol Compatible. The Agent interoperates through standard protocols (MCP, A2A) so you don't write per-Agent integrations. One adapter, all Agents.

These three layers answer one question: "How does the Agent operate the product?"

It's a good question. But it's not the only question. And framing Agent-Native as just this three-layer stack misses the harder half of the problem.

1. What Everyone Is Talking About

BuilderIO open-sourced agent-native — React hooks, defineAction, shared state. Alibaba Cloud showcased five Agent-Native cloud products. Technical communities started debating "Agent-Native design paradigms."

The shared definition: make AI Agents native operators of the software system, not bolt-on chat features. Three keywords:

Shared. The same Action — clicking a button or saying a sentence — triggers the same logic. The same data — UI changes it, Agent sees it; Agent changes it, UI refreshes.
Native. The Agent isn't glued on later. It's designed from the start as one of the system's operators.
Multi-surface. The same Agent can be a headless API, a chat interface, or a full SaaS app — identical logic, different face.

That definition is sound. But the community has equated "native" with "native to a frontend framework" — specifically React — which is a dangerous narrowing.

2. "Native" Means Native to the Environment, Not React

People see BuilderIO's stack (React + Nitro + Drizzle) and conclude Agent-Native is a frontend architecture. It isn't.

A genuinely Agent-Native system lets the Agent interact through stable protocols (CLI, HTTP API, MCP, A2A) rather than hard-coding it to one DOM tree. The UI is one of many entry points, and frankly not the most stable one.

Layer	Lifetime	Risk
UI framework (React, Vue, etc.)	2-5 years	Replaced on every framework shift
CLI conventions, flags	20-50 years	Slow evolution, backward compatibility expected
Protocols (HTTP, SQL, MCP)	30-50 years	Rarely replaced, mostly extended

UI is ephemeral — React today, Vue tomorrow, voice the day after. Wire your Agent to the DOM and a frontend refactor blinds it.

Protocols are persistent: CLI conventions have barely changed in decades. SQL hasn't been redesigned in forty years. An Agent should be native to these stable substrates, not the volatile presentation layer.

The three-layer model is correct in spirit. It just needs to be protocol-first, not UI-first.

3. The Three Layers Everyone Stops At

Let's define them clearly, because the fourth layer only makes sense if you've felt the weight of what the third layer achieves — and what it still can't do.

Layer 1: Content Readable

The Agent can extract structured meaning from your content. Not rendered HTML, not JavaScript-spun noise — clean, parseable text. Your API docs, your spec documents, your llms.txt.

Test: can a model open your docs and find the endpoint it needs without scrolling through a page of unrelated markup?

Layer 2: Action Executable

The Agent acts on the system. It creates records, triggers workflows, queries data. The key architectural move is defining operations once and exposing them to both the UI and the Agent through the same interface.

Test: can the Agent do everything in your product that a human user can do through the UI — without touching the UI?
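A minimal Python sketch of "define once, expose to both" might look like the following. The registry, the decorator name `action`, and the `create_note` operation are hypothetical and are not the API of BuilderIO's agent-native or any other library.

```python
from typing import Any, Callable, Dict

ACTIONS: Dict[str, Callable[..., Any]] = {}


def action(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a function as a named operation shared by UI and Agent."""
    def register(func: Callable[..., Any]) -> Callable[..., Any]:
        ACTIONS[name] = func
        return func
    return register


NOTES: list = []  # stand-in for shared application state


@action("create_note")
def create_note(title: str) -> dict:
    note = {"id": len(NOTES) + 1, "title": title}
    NOTES.append(note)
    return note


def ui_button_clicked(title: str) -> dict:
    """UI path: a click handler calls the registered action directly."""
    return ACTIONS["create_note"](title=title)


def agent_tool_call(name: str, arguments: dict) -> dict:
    """Agent path: a tool call is dispatched by name to the same action."""
    return ACTIONS[name](**arguments)


ui_button_clicked("from the UI")
agent_tool_call("create_note", {"title": "from the Agent"})
print(NOTES)
```

Both entry points end up in the same function and the same state, so there is no second code path for the Agent to drift away from.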

Layer 3: Protocol Compatible

The Agent operates through open standards, not custom glue code. It speaks MCP to your services and A2A to other Agents. You don't write one integration per Agent surface — you write one protocol adapter and every MCP-compatible client can use it.

Test: does your system work with any MCP-compatible Agent, or only the one you built it for?

These three layers give you an operationally capable Agent. It reads, acts, and interoperates. Many teams will stop here and feel done. And for good reason: most products haven't even finished Layer 1.

But there's a gap above Layer 3 that nobody is talking about.

4. The Fourth Layer: Meta-Cognition

Layers 1-3 make the Agent capable. But capability without self-awareness is dangerous.

The fourth layer answers a different question — not "How does the Agent operate the product?" but "How does the Agent know it's operating the product correctly — and what does it do when it isn't?"

This is the meta-cognitive layer:

Capability Awareness

The Agent knows what it can and can't do with confidence:

"I can generate this code, but I should ask for human review on the security model."
"This pattern matches a mistake I made before — let me check my reference document first."
"I'm not qualified to answer this — routing to the expert system."

Without this, an Agent with Layers 1-3 will confidently attempt anything within its operational reach. A powerful tool with no sense of its own limits.

Self-Initiated Evolution

The Agent notices its own failure modes and adapts:

"I've made this same PostgreSQL migration mistake three times. I should write a reference and check it before generating migrations."
"This error pattern keeps appearing in feedback. Let me propose a new rule."
"Twelve feedback entries accumulated — time to trigger an evolution cycle that examines my own heuristics."

This is the difference between a tool that waits for the human to debug it and an operator that maintains its own competence.

Graceful Fallback

The Agent can articulate why it can't proceed:

"I can't modify this file — it's outside the current phase's scope."
"I can't answer this — it's outside the domain I'm configured for."
"I found a contradiction between the specification and the code. I'll surface it rather than silently pick one."

In product terms: Layers 1-3 make the Agent a power user of your product. Layer 4 makes it a responsible operator who escalates when something is wrong.

Agent-Native Maturity Model

Layer 4: Meta-Cognition     ← The Agent knows itself
  │    Capability awareness, self-evolution, graceful fallback
  │
Layer 3: Protocol Compatible ← The Agent interoperates
  │    MCP, A2A, standard protocols
  │
Layer 2: Action Executable   ← The Agent acts
  │    defineAction, shared operations
  │
Layer 1: Content Readable    ← The Agent reads
       llms.txt, structured docs, clean content

5. The Framework Landscape: Two Different Answers

Two frameworks share the "Agent-Native" label with very different architectures. The comparison is useful because it shows Layer 4 is a design choice, not an accident of technology.

	BuilderIO agent-native	ReqForge
Primary interface	React UI + Agent	CLI + file system
Shared state	Database (Drizzle)	File system (git)
Operations	defineAction	Skill commands + hooks
Meta-cognition	Audit log + RBAC	Structured feedback → self-improvement cycles
Self-model	Permission boundaries	Project state detection, capability boundaries

Both share the core philosophy: Agent as native operator, not chat plugin. But they answer the meta-cognitive question differently.

BuilderIO treats meta-cognition as an audit concern — log what happened, who did it, and enforce RBAC boundaries. That's a perfectly valid starting point for a SaaS product where humans are still the primary operators and Agents are assistants.

ReqForge (disclosure: I built it) treats meta-cognition as an operating system concern: hooks that intercept the Agent at specific points to enforce correctness, accumulate feedback, and trigger self-improvement. The Agent's execution loop includes checkpoints for self-correction, not just action.

The difference isn't "better" — it's domain-shaped. SaaS products need permission boundaries. Engineering tools need correctness feedback loops. The fourth layer takes different forms in different domains. The point is that it exists at all.

6. Three Concrete Moves Toward Layer 4

If you're designing an Agent-Native system today and want to build toward the fourth layer, here are three decisions you can make now.

1. Make State Transparent

When Agents and humans share a state layer, the Agent's history is also its self-model. If the shared state is a database, you need immutable audit logs to reconstruct who did what (plus OT or CRDTs if humans and Agents edit concurrently), and the Agent needs to query those logs before acting. If the shared state is a file system (or any naturally versioned store), git log is the audit trail: the Agent reads its own history and learns from its own mistakes.

In a SaaS context, transparency means making every Agent action observable by the same interfaces the Agent uses to act — not buried in a separate admin panel. The tightest feedback loop is "the Agent can see the consequences of its own last action before deciding the next one."

Transparency enables self-correction. An Agent that can't see what it did can't improve.

2. Enforce at the Environment Level, Not the Prompt Level

Meta-cognitive checks written into system prompts are fragile. A different model version, a different prompt layout, and the Agent "forgets" to self-check.

The robust approach is hooks — interceptors that run at specific execution points regardless of what the model decides to do:

Before writing: does the specification exist?
After generating: does this reference actually resolve?
On "done": did it pass the verification gate?
On session start: is there accumulated feedback to process?

The environment enforces the meta-cognitive loop, not the Agent's attention span.
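Here is a minimal sketch of such an interceptor, assuming a hypothetical `Environment` class that owns file writes. None of these names come from ReqForge or any other framework; they only illustrate the first check in the list above.

```python
from typing import Callable, Dict, List

Hook = Callable[[dict], None]


class GateError(Exception):
    """Raised by a hook to block the step it guards."""


class Environment:
    """Runs registered hooks at fixed points, whatever the model decides."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {}

    def on(self, event: str, hook: Hook) -> None:
        self._hooks.setdefault(event, []).append(hook)

    def run(self, event: str, context: dict) -> None:
        for hook in self._hooks.get(event, []):
            hook(context)  # a hook raises GateError to stop the step

    def write_file(self, path: str, content: str, context: dict) -> None:
        # The gate runs before the write, so a failing hook prevents it.
        self.run("before_write", {**context, "path": path})
        context.setdefault("written", {})[path] = content


def require_spec(context: dict) -> None:
    """Before writing: does the specification exist?"""
    if not context.get("spec_exists"):
        raise GateError(f"no specification found; refusing to write {context['path']}")


env = Environment()
env.on("before_write", require_spec)

ctx_ok = {"spec_exists": True}
env.write_file("src/app.py", "print('hi')", ctx_ok)

ctx_blocked = {"spec_exists": False}
try:
    env.write_file("src/app.py", "print('hi')", ctx_blocked)
except GateError as err:
    print("blocked:", err)
```

The Agent never has to remember the check: the write path itself refuses to proceed when the gate fails.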

In a database-backed SaaS, the equivalent is request middleware that intercepts every Agent action — not a rule in the Agent's system prompt saying "you should check permissions before writing" but middleware that rejects unauthorized writes before they reach the database. The principle is the same: move the guard from "ask the Agent to remember" to "the environment won't let it."

3. Promote, Don't Dump

An Agent's self-model must survive across sessions. The standard approach — dump the entire conversation history into the next session — doesn't scale. What survives should be promoted knowledge:

What interfaces and invariants proved durable?
Which assumptions turned out wrong?
What technical debt was consciously deferred?
Which failure patterns kept recurring?

This is the fourth layer's memory system. Without it, each session starts as a blank slate — a tool that learns nothing from its own experience.

7. Closing

Agent-Native is not a package you install. It's a design perspective shift. The current discourse stops at three layers that make an Agent capable — content readable, action executable, protocol compatible. Those answer "How does the Agent operate the product?"

The fourth layer answers the harder question: "How does the Agent know it's operating correctly — and how does it get better?"

Capability without self-awareness is a tool. Capability with self-awareness is an operator.

Build toward the fourth layer.

Don't Compress, Promote

zxpmail — Sun, 28 Jun 2026 02:44:22 +0000

AI coding has a hidden bottleneck that isn't in the model — it's in how you manage context across sessions.

You finish Phase 1. The codebase grew by 5000 lines. When you start Phase 2, how do you carry "what the AI knows" across?

The common answer today is Repomix: compress the entire codebase into one Markdown file, dump it into the prompt. It looks like a solution, but it creates a bigger problem.

Repomix Is a Full GC Heap Dump

A -XX:+HeapDumpOnOutOfMemoryError snapshot contains every living object, every dead object, every byte of fragmentation. You can fit the whole heap on disk, but loading it, parsing it, and finding the 12 objects you actually care about among thousands — that's the real cost.

In AI context terms:

100K-line codebase → Repomix packs it into ~150K tokens → dumped into the prompt
The AI has to find "the 3 files Phase 2 needs to change" inside 150K tokens
150K tokens of latency + attention dilution + key signals buried in boilerplate
Phase 2 code starts drifting from Phase 1's intent → more corrections needed → more context bloat → death spiral

This has a name: lost in the middle. Models tend to recall information in the middle of a long context less reliably than information near the start or end. By feeding the entire 150K-token heap dump, you push most of the codebase into exactly that weak zone.

But the deeper issue is:

Only Full GC dumps need "compression." Promotion doesn't compress — it promotes.

The JVM Had This Figured Out 25 Years Ago

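HotSpot's generational collectors split the heap into a young generation (Eden plus two Survivor spaces) and an old generation (Tenured):

Space	Role	Collection Strategy
Eden	Where new objects are born	Most die in Minor GC; survivors are copied to a Survivor space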
Survivor (S0/S1)	Objects that survived 1+ GC cycles	Copied between S0/S1, age increments each round
Tenured (Old)	Long-lived objects promoted from Survivor	Collected rarely (Major GC)

This maps perfectly to the lifecycle of information in a codebase during AI-assisted development:

Phase completed → what survives goes to next phase
     │
     ├─ Eden code (90%)             → DON'T carry forward
     │    Scaffolding, boilerplate, temp solutions, experiments
     │
     ├─ Survivor (9%)               → PROMOTE to context
     │    Interfaces, types, domain models validated by this phase
     │
     ├─ Broken assumptions          → LOG to assumption registry
     │    "PostgreSQL doesn't support this full-text search syntax"
     │    "This library behaves differently on Windows paths"
     │
     └─ Known technical debt        → TAG explicitly, don't forget
          "Phase 3 must refactor the auth provider"

The difference isn't compression — it's promotion. You don't need to flatten the entire heap. You only need to upgrade the surviving objects to the next generation's context.

How to Promote (Takeaway Template)

At Phase End: Three Questions

Q1: Which data structures and interfaces proved their long-term value? → Promote the declarations, not the implementations.

# Core Domain (promoted from Phase 1)
- User: { id, email, hashedPassword, displayName }
  invariant: email globally unique, validated on create
- Book: { id, title, isbn, ownerId, status }
  invariant: status ∈ {reading, finished, abandoned}

Q2: Which assumptions got broken? → One line each. The next phase shouldn't relearn them.

# Assumption Changes
- "DB connection pool default of 10 is enough" ❌ bumped to 25
- "Vercel free tier supports 100MB responses" ❌ added pagination

Q3: What's knowingly left undone? → Tag it explicitly so it survives the phase boundary.

# Carried Debt
- [ ] Phase 3: Migrate auth from JWT session to OAuth2
      Rationale: MVP first, third-party login required in Phase 3
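If you'd rather generate this file than hand-write it, a small script can do the rendering. The sketch below is one possible shape; the `PromotedContext` class and its field names are hypothetical, chosen to mirror the three questions above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromotedContext:
    """What one phase hands to the next: declarations, lessons, debt."""
    core_domain: List[str] = field(default_factory=list)
    assumption_changes: List[str] = field(default_factory=list)
    carried_debt: List[str] = field(default_factory=list)

    def render(self) -> str:
        sections = [
            ("# Core Domain", [f"- {item}" for item in self.core_domain]),
            ("# Assumption Changes", [f"- {item}" for item in self.assumption_changes]),
            ("# Carried Debt", [f"- [ ] {item}" for item in self.carried_debt]),
        ]
        lines: List[str] = []
        for heading, items in sections:
            if items:  # skip empty sections so the context stays small
                lines.append(heading)
                lines.extend(items)
                lines.append("")
        return "\n".join(lines).rstrip() + "\n"


phase1 = PromotedContext(
    core_domain=["User: { id, email, hashedPassword, displayName }"],
    assumption_changes=['"DB connection pool default of 10 is enough" -> bumped to 25'],
    carried_debt=["Phase 3: Migrate auth from JWT session to OAuth2"],
)
print(phase1.render())
```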

At Phase Start: Only Load Promoted Data

Phase N context
├── Product spec (confirmed, not draft)
├── Core domain (promoted types + invariants)
│     ├── Survivor interfaces / domain models
│     ├── Assumption change log
│     └── Carried debt tags
└── Phase N goals (from development plan)

No Repomix dump. No full session history from the previous phase. No design docs you already finished reasoning through.

Token comparison:

Approach	Tokens	Attention Profile
Full Repomix dump	~100K-500K	Diluted globally, key signals drowned
Promotion-based load	~3K-10K	Concentrated on what this phase actually needs

That's one to two orders of magnitude.

Repomix and Promotion Aren't Mutually Exclusive

Compression solves a transport problem: "can the whole codebase fit in context?"

Promotion solves a selection problem: "what does the next phase actually need?"

Repomix is fine for cross-reference lookup — keep it as a collapsible reference that the AI reads on demand. But it shouldn't be the foundation of every phase start. The foundation should be promoted knowledge.

The right prompt structure:

[Phase context] — promoted interfaces + phase goals       (3K-10K tokens)
[Change files] — the 3-5 files this phase modifies        (10K-20K tokens)
[Repomix dump] — optional, for cross-reference lookup     (collapsible)

The AI's attention stays on the most critical signals: what survived from before, and what needs to change now. The full codebase becomes an on-demand reference, not mandatory reading.
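The layering can be expressed as a small assembly function that adds sections in priority order. This is a sketch under stated assumptions: the 4-characters-per-token estimate is a crude heuristic, the 30K budget is arbitrary, and dropping the dump when it doesn't fit is just one simple policy for "optional".

```python
from typing import List, Optional, Tuple


def rough_tokens(text: str) -> int:
    """Very rough token estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def build_prompt(
    phase_context: str,
    change_files: List[Tuple[str, str]],
    repomix_dump: Optional[str] = None,
    budget: int = 30_000,
) -> str:
    """Assemble the prompt in priority order and stop adding once over budget."""
    parts = [f"[Phase context]\n{phase_context}"]
    used = rough_tokens(parts[0])

    for path, content in change_files:
        block = f"[Change file: {path}]\n{content}"
        cost = rough_tokens(block)
        if used + cost > budget:
            break  # keep promoted context intact; drop lower-priority files
        parts.append(block)
        used += cost

    # The full dump is optional and only included when it still fits.
    if repomix_dump is not None:
        block = f"[Repomix dump]\n{repomix_dump}"
        if used + rough_tokens(block) <= budget:
            parts.append(block)

    return "\n\n".join(parts)


prompt = build_prompt(
    phase_context="# Core Domain\n- User: { id, email }",
    change_files=[("src/auth.py", "def login(): ...")],
    repomix_dump="x" * 200_000,  # far larger than the budget, so it is left out
)
print(prompt)
```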

Closing

Repomix solves a real problem (the codebase doesn't fit in context), but it chooses the wrong answer: a bigger dump instead of a smarter filter. In JVM terms, it's choosing more frequent Full GCs over generational collection.

And any engineer who's tuned a JVM knows: the generational hypothesis holds — most objects die young. The few that survive are worth promoting.

Codebase information follows the same pattern:

90% is Eden — written once, never needed again
9% is Survivor — promoted each phase
1% is Tenured — core domain model, changes rarely

You don't need compression. You need to recognize the 10% worth keeping.

June 2026. Inspiration from JVM generational GC: the original "promote, don't compress."

A Design Document vs a Design Chain

zxpmail — Sun, 28 Jun 2026 01:04:27 +0000

Google open-sourced DESIGN.md: YAML tokens, a CLI linter, one-command Tailwind export. Great. But a format only helps people who already know what they want. What 0→1 products need isn't a format; it's a chain.

Google Labs open-sourced something interesting: a standardized format for AI coding agents to read and write design tokens.

colors:
  background: "#ffffff"
  ink: "#08060d"
  accent: "#aa3bff"

Each named value in a file like this, whether a color such as #ffffff or a size such as 18px or 4px, is a design token: the smallest atomic value in a design system, carrying a name and a value, so the agent doesn't guess when building UI.
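To make "name plus value" concrete, here is a small Python check over the color tokens shown above. It is a simplified stand-in written for this post, not the designmd linter, and it assumes 6-digit hex colors only; the real format may accept other forms.

```python
import re
from typing import Dict, List

HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")


def check_color_tokens(colors: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a name -> hex-color mapping."""
    problems = []
    for name, value in colors.items():
        if not HEX_COLOR.match(value):
            problems.append(f"{name}: {value!r} is not a 6-digit hex color")
    return problems


# Same values as the YAML above, written as a plain dict to avoid a YAML dependency.
colors = {"background": "#ffffff", "ink": "#08060d", "accent": "#aa3bff"}
print(check_color_tokens(colors))
print(check_color_tokens({"accent": "purple"}))
```

A check like this can tell you a value is well-formed. It cannot tell you whether it is the right value, which is the gap the rest of this post is about.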

Plus a CLI: designmd lint, designmd export --format css-tailwind, designmd diff.

Looks complete. Agents read tokens and write UI. Design changes get diffed. Themes get exported.

But if you've ever started from scratch with an AI coding assistant, you've hit an awkward truth:

DESIGN.md tells you what to write, not what to decide.

Whether the accent is #aa3bff or #2563eb — that decision happens before DESIGN.md gets written. And the painful part is exactly that "before."

1. A Format Is the Destination, Not the Path

Google's PHILOSOPHY.md has a line I strongly agree with: "prose sets the tone, tokens are the prose's context." Don't write vague adjectives; write concrete references — "like a 1970s lecture handout," not "retro style."

But where does the prose come from? Who picks the reference? Who gets to decide "like a 1970s lecture handout" when the product idea is still fuzzy?

DESIGN.md assumes you already have design intent. It doesn't help you discover it.

Here's the gap:

                    ┌─────────────────────────────────┐
                    │  You have intent → write tokens  │  ← format covers
                    └─────────────────────────────────┘
                              ↑
                    ┌──────────────────────────────────┐
                    │  No intent yet → how do you get   │
                    │  to the point of having one?      │  ← not covered
                    └──────────────────────────────────┘

For teams with a designer, a design system, and a mature product, the first half isn't a problem. For 0→1 teams and solo developers — it's the whole problem.

2. The Chain's Answer: Four Rings

The format only covers the second half. To cover the first half, you need a chain — from vague idea to concrete values, step by step, each ring solving one specific problem.

Ring 1: Direction Document — No Concrete Values

First output: a direction document. What's the visual tone? What products are we referencing? What styles are we avoiding?

This phase intentionally writes zero color values. This isn't neglect — it's discipline. If you write #aa3bff in round one, that hex is probably your "gut feeling," not the result of any reference analysis. It can't survive the question: "Why this purple and not that purple?"

The direction document outputs: tone, reference products, visual mood, anti-cliché checks. All prose, zero concrete values.

(I implemented this step as a "Design Brief" skill in the ReqForge framework — a questionnaire plus reference analysis to help users clarify intent. Output is prose only, no hex values.)

Ring 2: Mockup — Pixel-Level Validation

Build a reviewable visual mockup with design tools (Pencil, Figma, or whatever fits). The value here isn't "making it look good" — it's turning abstract descriptions into something you can concretely disagree with:

"This spacing is too wide" → you can point at it on the mockup
"This color is wrong" → you can change it
"This font feels cramped" → you can see it

The gap between what people mean by "clean modern palette" and what they actually approve when they see specific pixels is wider than you think. Ten times wider.

Ring 3: Post-Approval — Freeze the Values

Only after the mockup passes human review do you enter the value-freeze phase.

"Freeze" means: this color value has survived at least three rounds of validation (direction → pixels → human sign-off). It's no longer one person's gut call.

The frozen spec file becomes the single source of truth. Implementers writing UI and reviewers auditing it both read values from this file first. The design tool mockups can keep evolving, but the canonical values live in this one file.
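The freeze rule can be written down as a simple gate. The sketch below is a hypothetical illustration, not part of ReqForge or DESIGN.md: the round names mirror the three validations listed above, and the data layout is an assumption.

```python
from typing import Dict, Set

REQUIRED_ROUNDS = ("direction", "pixels", "sign-off")


def freeze_values(
    candidates: Dict[str, str], validations: Dict[str, Set[str]]
) -> Dict[str, str]:
    """Return only the values that passed every required validation round."""
    frozen = {}
    for name, value in candidates.items():
        passed = validations.get(name, set())
        if all(round_name in passed for round_name in REQUIRED_ROUNDS):
            frozen[name] = value
    return frozen


candidates = {"accent": "#aa3bff", "ink": "#08060d"}
validations = {
    "accent": {"direction", "pixels", "sign-off"},
    "ink": {"direction"},  # never reviewed on a mockup, so it stays unfrozen
}
print(freeze_values(candidates, validations))
```

Only values with a complete validation record reach the spec file; everything else stays a candidate.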

Ring 4: Verification — Lint, Diff, Export

This is where the standard format shines. The frozen spec can be automated:

npx -p @google/design.md designmd lint DESIGN.md     # conformance check
npx -p @google/design.md designmd diff DESIGN.md v2  # change tracking
npx -p @google/design.md designmd export --format css-tailwind  # theme export

But in the chain's view, this is an optional exit, not the entrance. When lint fails, the right response isn't "block the release"; it's "check whether the values went through the three rounds of validation first." And many projects have no UI at all, so they never need this layer.

3. Format and Chain Are Complementary

They're not competing. They solve different problems, and they need each other.

Dimension	Format (DESIGN.md)	Chain (four rings)
Core question	How to write values	How to decide values
Assumption	You already have design intent	Helps you discover intent
Verifiability	CI lint/diff	Mockup review, questioning
Change tracking	`designmd diff`	Inter-ring decision records
Best for	Mature products, teams with a designer	0→1 products, solo developers

Google's DESIGN.md philosophy says: "specific reference > adjectives." Write "like a 1970s lecture handout," not "retro style."

The chain says the same thing — it just pushes it one step earlier: if you don't have that "specific reference" yet, go find one before you write tokens.

Both paths discovered the same truth: concrete beats abstract. The format demands you write concrete tokens. The chain demands you also make the source of those tokens concrete.

4. When Is the Format Enough?

Fair question. Not every project needs the full chain.

✅ Format-only is fine when

You have a designer or an existing design system
You're redesigning an existing product — direction is known
Your team can align on intent through prose alone
You're using other tools for design output and just need a spec file for downstream consumption

❌ Consider running the full chain when

You're building 0→1 with no reference product
You're a solo PM + developer with unvalidated design instincts
Multiple people are involved and everyone's "clean" is a different "clean"
Your last release was slowed down by design rework

This isn't a quality threshold — it's a risk threshold. Skipping the direction phase and writing values directly is a bet that your gut is accurate enough. For some teams and some products, that bet pays off. For most 0→1 scenarios, it doesn't.

5. Closing

Google's DESIGN.md is an excellent format proposal. It solves a real problem: without a stable source of values, AI agents guess colors, and when enough colors get guessed, your UI turns into a rainbow.

But a format is the destination, not the path.

What makes a value trustworthy isn't a well-defined YAML schema or a passing lint check. It's being asked "are you sure?" three times — once in the direction document, once before the mockup, once at mockup approval.

If the answer is the same all three times, the color is probably right.

If you wrote it straight into the spec file on the first pass, you'll never know whether it would have survived the third.

June 2026. A format helps you write it right. A chain helps you be right before you write. You need both — but the order matters.