DEV Community: Webmaster Ramos

Frontier LLMs Get 2 of 3 Tax Returns Wrong - Stop Letting Them Decide

Webmaster Ramos — Tue, 30 Jun 2026 14:15:06 +0000

Everyone is wiring LLMs into checkout flows right now. I want to make the unpopular case that for the decisions which actually move money - tax, discounts, eligibility, pricing - the model should never have the final say. Not because the models are bad, but because I have the benchmark data showing exactly what happens when they do, and a pattern that fixes it without throwing the LLM out. Here is the evidence, the one condition where it falls apart, and where it pays off.

TL;DR. The common advice - "put an AI agent in the loop" - is most dangerous exactly where it sounds most useful: the money path. Tax, promo eligibility, discount stacking, cart rules, pricing. On those decisions a probabilistic model is the wrong final authority. The pattern that holds up is a division of labour: the LLM formalizes the natural-language rules into a small, auditable specification, and a sound deterministic engine executes it. You stop reviewing code you cannot read and start approving a contract you can. I ran this across 113 experiments and nine model families, and the shape of the result is consistent. Here is the evidence, the one hard condition, and where it actually pays off.

The decision you should never let a model guess at

Hand today's best AI models a stack of real tax returns and they get most of them wrong. That is not a thought experiment - it is a benchmark. TaxCalcBench took 51 real 2024 US tax returns with official IRS answers and asked the frontier models to file them. The best performer, Gemini 2.5 Pro, got 32% right under strict scoring. Claude Opus 4 managed 27%, Sonnet 4 just 23%. These were not rounding-error misses - they were wrong returns: wrong tax tables, arithmetic slips, eligibility mistakes, and, worst of all, different answers every time you asked. The benchmark's authors put it bluntly: there is a "continued need for deterministic tax calculation engines," because this output is "not acceptable for a task which needs consistently correct results with clear auditability."

Tax is the cleanest example because someone built the benchmark, but the shape generalizes to every money-path decision a merchant runs. Does this cart qualify for the promo? Can these two discounts stack? Which tax jurisdiction applies? What does this customer's loyalty tier unlock? These are not open-ended questions. Each has a correct answer defined by rules you already wrote down somewhere. Handing them to a model that pattern- matches probabilistically means accepting an error rate on decisions that move money and that a regulator, an auditor, or an angry customer can ask you to explain.

The reflex fix is "use a bigger model" or "prompt it harder." TaxCalcBench is the counter- argument: the frontier is already here and it is still at 23-32%. The problem is not model size. It is that you are using a probabilistic system as a deterministic one.

The pattern: formalize, don't decide

The fix is not to remove the LLM. It is to move it. Language models are extraordinary at one thing that traditional software is terrible at: turning messy natural language into structure. They are unreliable at the next thing: executing multi-step logic without drifting. So split the job along that exact seam.

Think of the LLM as a courtroom translator and the engine as the judge. The translator turns the contract from human language into something precise; the judge applies it the same way every time. You would never let the translator also decide the verdict - and you should not let the model that reads your promo terms also rule on whether this cart qualifies.

The fuzzy front end is the LLM's job: read the rules as written - the policy doc, the promo terms, the tax guidance - and translate them into a formal specification.
The strict back end is an engine's job: take that specification plus the instance data and compute the answer by mechanical, repeatable logic.

The payoff is not just accuracy. It is where the control point sits. When an LLM decides directly, the thing you would have to audit is a probability distribution over tokens - you cannot. When the LLM instead emits a small set of rules, the thing you audit is the rules. A human confirms "this specification matches our intent." The engine guarantees "the answer follows from the specification." Completeness of the spec does not vanish - it moves to a place a person can actually inspect.

This is not a new idea in the research literature, and the literature is worth knowing because it tells you the gain is real and not a quirk of one setup. Logic-LM - delegating the reasoning step to a symbolic solver - reported +39.2% over standard prompting and +18.4% over chain-of-thought. CLOVER, which translates to first-order logic and post- verifies with the Z3 solver, pushed harder still: 62.8% versus 42.4% on AR-LSAT, 75.4% versus 45.4% on ZebraLogic. Different teams, different tasks, same direction: when a sound engine does the executing, accuracy on rule-shaped problems jumps.

What is actually doing the reasoning - and why not just Python

If you delegate execution to an engine, which engine? In my lab the engine was Prolog - specifically SWI-Prolog, with a constraint solver (CLP(FD)) for the search-heavy problems.

If you have never touched it: Prolog is a language where you do not write how to find the answer - you state the facts and the rules and let the engine find it for you. You declare things like "a discount applies if the cart total is over X and the customer is in group Y," and the engine works out the rest, trying possibilities and throwing away the ones that break a rule. A program reads like a list of statements about the world, not a sequence of steps. It reads like a contract and runs like logic.

The obvious objection from any engineer is: I could write that in Python with a few if statements - why drag in a logic engine? The answer is not "Prolog the language beats Python the language." Python can express anything Prolog can - you could write the solver by hand. The point is what you are asking the LLM to produce, and that difference is concrete:

The search is the runtime, and it is complete. Ask "is there a valid configuration?" In Python you hand-write the search - loops, recursion, your own backtracking - and every line is a place to introduce a bug, with no guarantee you explored the whole space. In Prolog, backtracking is the engine. The checkable consequence: the engine can return UNSAT with a guarantee - "no such schedule exists" - while a Python script can only tell you "I didn't find one." Absence of evidence is not evidence of absence, and on a refund or a tax edge case that gap is the whole game.
Correctness is not tied to execution order. A declarative rule holds or it does not. There is no mutation, no early return, no off-by-one, no "the rules were right but the code that applied them was wrong." That entire bug class is gone.
A declarative ruleset is a verifiable target; imperative code is not. This is the one that matters. The LLM's output for a rule is a handful of lines that map one-to-one onto the business contract, and you can mechanically check it against probes. Arbitrary Python you can only test - which is precisely the stochastic, "hope the cases cover it" surface you were trying to escape. The Prolog artifact is the contract. Python is an implementation of a contract you still have to take on trust.

One honest caveat: the engine does not have to be Prolog. Any sound declarative engine fits the same role - CLP(FD) or MiniZinc for scheduling and search, SAT/SMT solvers like Z3 for policy checks, Datalog or ASP for closed-world rules. There is even tooling (the MCP Solver) that wires LLMs to MiniZinc, PySAT, and Z3 directly. Prolog is what I validated on, not a requirement of the idea.

As for running it: the LLM generates the program, and it executes in an isolated SWI-Prolog subprocess with timeouts and a stack limit - no model in the loop at execution time, only the deterministic engine. That isolation matters for the auditability claim: the answer comes from the engine or it does not come at all.

The evidence: a cheap model plus an engine reaches the frontier

I tested this across 113 experiments and nine model families - Claude Haiku, Sonnet, and Opus; llama-3.3-70b; mistral-large; deepseek; gpt-oss; gemini; qwen - logging every run, successful or not, to an append-only journal. The point of that breadth is to make sure I was looking at a pattern, not one lucky configuration.

The strongest, most replicated result is on constraint problems - the "zebra puzzle" family, where you satisfy a web of interlocking conditions. On these, the hybrid beat the LLM-only baseline by +63.3 percentage points with a cheap llama-3.3-70b and +19.2 points with Claude Haiku, at N=120, with 95% confidence intervals that exclude zero. I controlled for the obvious confound - that the gain came from leaking structure into the prompt - by re-running with a neutral, off-task example; the advantage held.

The most striking version of the result: models that solve roughly 0% of these puzzles on their own reach roughly 100% once they are formalizing for an engine instead of answering directly. A cheap model with a sound back end performs like a frontier model on this class. That tracks an independent finding too - ChatLogic reported that the gain from delegating to a solver concentrates in weaker models and deeper inference, which is exactly what you would expect if the engine is supplying the rigor the model lacks.

There is a deeper lesson hiding in the failures. When the hybrid was wrong, it was never the engine that was wrong. The engine executes the contract exactly, every time. Every failure lived in the formalization step - the model wrote a specification that did not match the intent. Which leads straight to the one condition you have to respect.

The one condition: you need a capable formalizer

The honest boundary is this: the method works when a capable model does the formalizing. Below a certain capability floor, a weak model writes rules that look right and are wrong - and because the engine faithfully executes whatever contract it is handed, a wrong contract produces a confident wrong answer. In my runs, every hybrid failure traced to this step, not to execution.

That is not "it works sometimes." It is "it works given a strong enough translator" - which is a measurable, controllable condition rather than a roll of the dice. It also reframes the model-selection question. You are not buying a model to be the decision engine. You are buying it to write the rules once, correctly. That is a narrower, more testable bar, and it is the right place to spend capability budget.

When the method applies - a three-part gate

Not every task belongs in this pipeline. Before you reach for it, check three things. The method applies when:

An independent gold oracle exists or can be derived. You need a source of truth for "correct" that is not the same model that produced the answer - a formal property, a human- verified set, or a separate tool. If your only checker is another LLM call, you get correlated errors and a system that confidently grades itself wrong.
The rules are specifiable. They can be written down from a spec, not "guessed by meaning." Promo terms, tax tables, eligibility logic - specifiable. "Sounds like a complaint" - not.
The task is finite and decidable. There is a bounded, computable answer.

Open-world tasks fail the gate: free-form natural language, forecasts with no spec, anything that turns on interpreting intent rather than applying a rule. That is not a weakness of the method - it is the edge of its class, and knowing the edge is what keeps you from misapplying it.

Where it pays off

Rank the opportunities by how badly the current approach hurts and how much auditability is demanded, and a clear order emerges.

Tax and payroll calculation sit at the top. The pain is measured (TaxCalcBench), the class is a clean fit, and people already build it this way: SARA, an academic system, translates statutes and facts into Prolog and runs them on SWI-Prolog so that auditors can inspect the exact reasoning path; OpenFisca is an open-source "rules as code" engine for tax and benefit law used by several national governments, sold explicitly on algorithmic transparency.

Credit underwriting and adverse-action decisions are the strongest case on the regulatory axis. Under the US Equal Credit Opportunity Act and its Regulation B, a lender must state the specific reasons for a denial - not "the algorithm said no." A verifiable rules engine satisfies that by construction; a black-box model does not.

E-commerce - promo, discounts, eligibility, cart rules, pricing, tax - is the same decidable class, and it is where most merchants will actually meet this problem as they wire agents into checkout. I will be honest about a gap here: there is no TaxCalcBench for e-commerce, no published benchmark that measures pure-LLM failure on promo logic specifically. The case rests on class-transfer from the measured domains, not on an external industry number. The architecture argument is strong; the industry-measured "pain" number, for now, is not there. Below that sit KYC/AML policy checks, fee and commission calculation, regulatory reporting, and returns policy - the class fits, and there are industry signals (Amazon's VeRAFI work on "neurosymbolic policy generation") even if no one has published a clean measurement.

The economics: the model is a compiler, not a calculator

There is a cost objection: capable models are not free, and if you call one on every decision the bill adds up. The answer is to stop thinking per-call.

When the output you want is not a single answer but a solver for a whole class of decisions, the right cost model is amortization. The LLM runs once - as a one-time compiler that turns the natural-language spec into a single verified, parametrized engine. After that, the engine runs on CPU at essentially zero marginal token cost; you inject each instance's data as facts and it computes. In my experiments the total-cost break-even landed around two to three instances - past that, the compiled engine is cheaper and more accurate than calling the model each time. The certified engines generalized cleanly to held-out instances: 32 of 32 on a scheduling class (including correctly proving the unsatisfiable ones), 30 of 30 on a promo-stacking class. This is the same shape as the emerging "Compiled AI" idea - call the model once at build time, then run the workflow as static code with zero tokens per transaction.

Two caveats keep this honest. First, certifying that engine is itself capability-gated: of the non-Anthropic model families I tested as the compiler, only one converged reliably; others started at the same place but could not act on the engine's counter-examples to repair their own rules. The gate is the ability to use feedback, not raw one-shot quality. Second, an LLM asked to re-solve the same problem on every call - rather than compile a solver once - is the wrong tool and the data says so. The win is the compiler framing, not the model as a repeated calculator.

The determinism layer agent commerce is missing

Most "add an AI agent to your store" advice quietly assumes the agent should make the decision. For the parts of commerce that move money, that is the one thing it should not do. The agent's real job there is to read the rules and write them down formally; a deterministic engine should decide. You get the model's fluency at the fuzzy edge and the engine's guarantees at the core - accuracy, completeness, and an answer you can defend line by line to an auditor or a customer.

That is a layer the agent-commerce conversation keeps skipping, and it is the one that decides whether any of this is safe to put near a checkout. If you are wiring agents into the money path, the question to ask is not "which model decides?" It is "what does the model formalize, and what executes the contract?" Get that seam right and the rest of the architecture has somewhere solid to stand.

I am currently building an e-commerce project of my own around cart-rule and discount logic - the money-path surface this whole article is about. Real promo rules are where the formalize-don't-decide seam gets stress-tested, and the first place I would scrutinize before letting any agent near a checkout. Designing and building this kind of deterministic layer for merchants is the work I do - if that is the seam you are wrestling with, that is where I can help.

I checked every Universal Cart merchant. None on Magento.

Webmaster Ramos — Tue, 02 Jun 2026 18:55:12 +0000

Google launched Universal Cart at I/O 2026 last week. An intelligent cart that follows users across Search, Gemini, YouTube, and Gmail. ALM Corp published the list of named early checkout merchants on May 20: Nike, Sephora, Target, Ulta Beauty, Walmart, Wayfair, and Shopify brands.

I read that list twice looking for a Magento store. None.

That's the article. Below: the five-protocol stack you'd otherwise have to read five different specs to understand, the one decision your existing payment processor has already made for you, and a thirty-day Magento-specific playbook to ship before agent-routed traffic starts flowing past your store.

If your store runs on Magento or Adobe Commerce, agent-routed traffic is going to flow past you - first in the US, then Canada and Australia "in the coming months," then the UK. The agent layer isn't going to wait for Adobe Commerce to ship native UCP support. The merchants in the first cohort had thirty days of head-start. Most of that window is already gone.

Here's what to ship before the rest of it closes.

The five-protocol stack, compressed

Four protocols define how an AI agent buys something on behalf of a user. A fifth ties payments together.

UCP - the discovery layer. Your store publishes a manifest at /.well-known/ucp declaring its capabilities, transports, and payment handlers.
MCP - the transport layer. Agents dispatch your commerce tool calls over MCP messages.
ACP - OpenAI and Stripe's checkout protocol. Stripe-led coalition.
AP2 - Google's payment-authorization protocol. Sixty-plus partners signed at launch: Adyen, American Express, Mastercard, PayPal, Coinbase, Revolut, Worldpay, and more.
MPP - Stripe's machine-payments protocol. Same family as ACP.

Benji Fisher's synthesis post on dev.to is the sharpest framing I've read: UCP discovers, MCP transports, ACP and AP2 authorize. Read it if you haven't.

The UCP spec itself is densifying fast. A loyalty extension landed on May 19 (#340). A schema-validated documentation harness landed on May 21 (#359), with AP2 mandate examples now machine-validated against the spec. The cadence suggests v1.0 GA in weeks, not months.

You don't have to learn five protocols. You have to make one decision.

The one decision

Pick an ecosystem to plug into first based on the payment processor your store already runs.

If your Magento store is on...	Ecosystem priority
Adyen, PayPal, Worldpay, Braintree	AP2 first (Google's coalition is broader)
Stripe	ACP first (Stripe-OpenAI's coalition is native)
Multiple processors	AP2 first, ACP second

The protocol war is coalition-driven, not technically segmented. Pick the side your existing rails are on. You can ship the other later.

The 30-day playbook

Five concrete steps. Each is a Magento-specific task, not a generic "audit your Merchant Center" instruction.

Day 1-3: publish `/.well-known/ucp`

Drop a JSON manifest at your store root. Skeleton for Magento 2.4.9:

{
  "ucp": {
    "spec": "https://ucp.dev/spec/2026-04-08",
    "schema": "https://ucp.dev/schemas",
    "endpoint": "https://your-store.com/rest/V1",
    "capabilities": ["catalog", "cart", "checkout", "order"],
    "transports": ["rest", "mcp"],
    "payment_handlers": ["ap2", "stripe"]
  }
}

Don't hand-write this and hope. Validate it against the upstream ucp-schema tool (step 5).

Day 3-10: map your REST API to MCP transport

Your Magento /rest/V1 endpoints already implement most of the UCP capability surface - catalog, cart, checkout, order. You're not rebuilding. You're wrapping those endpoints so an agent can call them over MCP's tools/call envelope.

This is where most of the engineering time goes. Plan for a focused two-week sprint with one senior backend engineer who knows your REST customizations. If you've stayed close to the default REST contract, the wrapping is mechanical. If you have heavy REST customizations, budget more.

Day 10-15: declare a payment handler that matches your processor

If you're on Adyen, publish an ap2 payment-handler namespace in your manifest. If you're on Stripe, publish an acp namespace. The handler tells the agent which authorization protocol your checkout speaks.

This is declaration only at the manifest level. The actual payment integration on the Magento side already exists in your payment module - you're advertising it, not rewriting it. The work is in your manifest, not in your checkout code.

Day 15-22: wire loyalty - optional but high-leverage

UCP's loyalty extension shipped this week. If you have a Magento Reward Points module or a custom loyalty integration, declare it in the loyalty extension namespace. Loyalty surfaces in agent UI as a price-impact signal - it's how an agent decides which store to recommend when two have the same product at the same list price.

Stores skipping this in 2026 will lose the price-tied-discount segment for the next three years.

Day 22-30: validate before you ship

A broken manifest costs you visibility. Agents that hit a 404 or a JSON parse error mark you unreachable and skip you on retry - you don't get a second chance until the next crawl cycle.

We built a free AI-readiness audit at webmaster-ramos.com/analyzer that probes /.well-known/ucp, validates the manifest against the published UCP profile schema, checks transport declarations, and flags missing payment-handler namespaces. It's KB-driven and rule-based - no LLM in the pipeline, no judgment calls, just deterministic checks against the spec.

Run it twice: once before you go live, once after. If you don't fire any of the UCP-readiness rules, you're shippable.

If you'd rather wire your own check, the manifest is a plain JSON document validatable against the UCP profile schema with any standard JSON Schema validator. The schema is published at https://ucp.dev/schemas/profile.json.

What Adobe is doing (and not doing)

Adobe shipped AI agents into AEM eight weeks ago - Experience Modernization Agent, Development Agent, Content Optimization Agent. They're shipping more every release cycle.

Adobe has shipped nothing agentic into Adobe Commerce.

That asymmetry is eight weeks deep and widening. If you're waiting for Adobe to ship native UCP support for Commerce, you're waiting on a roadmap that hasn't surfaced publicly. The merchants who show up in next year's Universal Cart case studies are the ones who built their own UCP layer this quarter.

You have thirty days.

Six Principles in Practice: How an Agentic E2E Found 11 Production Bugs in 8 Runs

Webmaster Ramos — Mon, 18 May 2026 07:23:46 +0000

Eight runs, eleven bugs

I ran my E2E testing system on a production ecommerce platform eight times in
a row – across five different business modules, in three different surface
configurations (admin / desktop storefront / mobile-first storefront). Across
those eight runs the system found eleven production bugs, each one
attached to a specific file and line via a root_cause_slug. Between runs
the knowledge base grew from 25 gotchas to 42 (+67% in nine days), and the
first-try pass rate (first_try_pass_rate) climbed from 14% to 95%.

One detail up front: the methodology was assembled in a side stream
alongside product work, not as a dedicated project. Calibration cycles were
interleaved between features, new-module sprints and routine support. Eight
runs is not "eight weeks of full-time work" but eight iteration points
accumulated in parallel with shipping production code. Most of that time I
was writing business logic, not agents.

This isn't a story about "which framework to pick". Most teams start with
E2E by asking exactly that question – and six months later they have a
flaky suite that quietly gets disabled in CI. The right question is on
what conditions these tests are entitled to exist at all, and what agent
architecture lets them compound instead of accumulating noise.

This article is a closing piece for the previous publication
Six Principles for Agent Systems That Don't Hallucinate.
There I worked through the principles as an abstraction. Here – what
happens when you apply them to a concrete task, in production, across two
independent stacks.

Premise: six principles applied to E2E

E2E testing is a convenient test bed for agent systems for three reasons.
First, the validator is deterministic – the test either passes or it
doesn't, and there is no room for probabilistic judgment. Second, the cycle
is short – one run takes minutes, not hours or days. Third, the domain
gives an explicit signal when the system has "learned" the stack –
first_try_pass_rate plateaus.

All three properties are the same ones the Six Principles are built on in
the general case: architecture over prompt-tweaking, deterministic context
over probabilistic retrieval, closed-loop validation with a hard signal,
three-category attribution, editorial gates instead of auto-promotion,
multi-run measurement as proof of compounding.

If these principles work in the general case, then on E2E they should
deliver a measurable effect. This essay is about the measured effect.

The contract: seven environment principles

E2E tests live or die by their relationship with the environment. Without
an explicit contract, every flaky-test debate converges on the same
question: is this a bug in the test, in the application, or in CI? – and
no one can answer, because there is no shared baseline.

ENVIRONMENT.md is a markdown document with seven numbered principles.
Each is one paragraph plus a short why. Three audiences read it: a human
during onboarding, an LLM agent during test generation, and the test
runner (the last one via playwright.config.ts, not directly).

The principles in short:

The container is an external dependency. Tests do not start or stop the application. If the instance is unavailable, the preflight check (principle 4) fails before any spec runs.
The database is dirty by default. Demo data is reused across runs. Test data is isolated via a prefix (e2e_*), seeds are idempotent through ON CONFLICT DO NOTHING.
Sequential execution. workers: 1, retries: 0, fullyParallel: false. This is not a performance compromise – it is a methodology commitment. Half of this principle – the no-retries doctrine – is the most load-bearing rule in the entire methodology.
Health check before everything. global-setup.ts makes one HEAD request to a health endpoint before any spec runs. Without the health check, the first failing test out of 50 produces an inscrutable timeout; with it, one clear error appears in five seconds.
Seed vs assertion separation. Seed specs configure state (tests/_seed/), assertion specs verify behavior (tests/modules/<feature>/). The underscore prefix is not stylistic; it is lexicographic sort order.
Host runner + MCP browser. Playwright runs on the host machine; during test generation the LLM agent has access to MCP browser tools – this lets it observe the real DOM rather than invent selectors.
Session caching with TTL. Login is cached to a file; TTL depends on the backend's nature (admin session with DevMode login – 15 minutes; Redis session under a strict security policy – 2 minutes).

Each principle in depth lives in
contract-spec.md
in the principles repo.

The principles are deliberately minimal. The contract does not address
test data factories (a structural question), selector strategy (a
generator concern), or CI (orthogonal). The contract is the smallest
explicit commitment that makes the rest of the methodology coherent.
Extending the contract is fine; expecting the contract to cover
everything is a category error.

Four layers of code

The contract says what tests do and don't do. The structure says where
the artifacts of doing those things physically live. Four layers, with
strict one-way dependency direction:

)

lib/ – stateless utilities. If a function in lib/ is called setupCheckoutTaxForRegion, it doesn't belong in lib/ – it belongs in a Page Object or a flow.
pages/ – Page Objects. Stateful. Extracted only after the third real use (Rule of Three).
tests/ – the specs themselves. _seed/ (idempotent setup) and modules/<feature>/ (per-feature assertions).
knowledge/ – markdown/YAML references for LLM agents. Never imported by tests. This is data for agents, not code for the runtime.

The tests → pages → lib direction is one-way. Reverse edges are
forbidden. Empirically: across four cross-stack ports, every cycle of
"lib imports from pages" had to be reverted within the same sprint. The
cost of portability with a cycle in place is too high.

The most common objection is "extract pages/ from day one?". No. Rule
of Three: one test – leave it inline; two – leave them duplicated;
the third – extract into lib/ (stateless) or pages/ (stateful).
At two uses you don't yet see what is actually shared. The third use
shows the real abstraction instead of a coincidental match between two
cases.

A single playwright.config.ts serves several orthogonal surface
combinations – not "different browsers" but different DOMs. On my
ecommerce platform: admin / classic storefront (legacy MVC) / modern
storefront (Alpine.js). Different DOM, different selectors, the same
behavior cases. One run produces three results with a per-project
breakdown in metrics.jsonl.

The four-agent pipeline

The pipeline runs four agents in sequence: analyze → plan → generate →
heal. Each agent has one cognitive task, one input shape, one output
shape.

Analyzer

The first. Discovery: scans the codebase, identifies modules, routes,
DB schema, dependencies. Writes results into e2e/.state/*.json –
persistent JSON artifacts. The phase is cheap and cacheable – on
every run it first checks the mtime of its outputs; if they are fresh,
it skips entirely.

The skip logic here is not optimization, it is architecture. Most cycles
work on a stable codebase; re-scanning the source tree every time is
waste. The analyzer's artifacts (modules.json, schema-map.json,
project-auth.yml, project-seed.yml) are read by the planner,
generator and healer – each takes what it needs, no one re-runs
discovery.

Planner

The second. Reasoning: takes the analyzer's output plus the KB, writes
a plan.md – a numbered list of test cases for one feature. Each case:
short title, preconditions, steps, expected outcome, optional KB
references to relevant gotchas.

The planner is a distinct phase, not a step inside the generator,
because planning and code-generation are different cognitive modes.
Planning needs broad context (feature semantics, edge cases, KB flags).
Generation needs narrow context (the exact selector for one button on
one page). Trying to do both in one prompt produces either an
over-prompted generator (slow, expensive) or an under-prompted planner
(shallow plans, missed edge cases).

plan.md is not test code. It is a specification that the generator
turns into code in the next phase. The same plan.md could be
implemented in a different test framework.

Generator

The third. Code emission: takes plan.md and writes *.spec.ts. The
defining rule is selector discipline: every selector that appears in
a generated spec must be observed in the live application via MCP
browser tools – not inferred from sources, not guessed from a
screenshot.

What "stable selector" means depends on the surface. For each project
the generator has a preference hierarchy: getByRole(...) →
getByPlaceholder(...) → scoped CSS → id – in descending order of
stability.

What is forbidden: deriving a selector from source code (the
rendered DOM may differ); guessing from a screenshot; "the button
probably has the class .btn-primary". If a stable selector doesn't
exist, the correct reaction is to report a gap back to the planner,
not to write something brittle and hope.

Healer

The fourth, and the most important. Diagnosis: runs the specs, observes
failures, attributes each failure to one of three categories –
test-bug / app-bug / env-drift – and writes a structured
heal-finding with the audit trail.

That attribution is what makes the no-retries doctrine actionable.
Each category has its own remediation path:

test-bug → the healer fixes the spec.
app-bug → the healer does not fix the application. It files the bug with root_cause_slug and leaves the spec failing as a true positive.
env-drift → the healer surfaces the drift; the contract may need updating.

The healer is also the agent that proposes KB candidates. A failure
exposed a gotcha future tests should know? The healer writes a candidate
into e2e/knowledge/_inbox/. The candidate is not auto-promoted –
an editorial gate decides.

Why four agents, not three

Early versions of the methodology used three agents (planner /
generator / healer) and folded discovery into the planner. The
four-agent split was empirical: a planner prompt that also did
discovery was noticeably worse at both jobs. Pulling the analyzer into
its own phase made each phase smaller, cheaper, and individually
skippable (analyzer caches; planner skips when plan.md exists;
healer skips on a green run).

The pipeline produces measurable artifacts at every boundary:
e2e/.state/*.json after analysis, plan.md after planning,
*.spec.ts after generation, a six-section heal-finding after healing.
Each is reviewable. Each is comparable across runs.

Per-agent depth lives in the four agent-role specs in the principles
repo. Skill-level orchestration lives in
skill-design.md.

Knowledge as the fourth layer

The knowledge base is the fourth layer of the structure and the
input that makes agents learn between runs. KB files are YAML
documents, read by agents at planning, generation and healing time.
They are never imported by test code.

Two categories of entry

Every KB entry is one of two kinds:

Gotcha – prose advisory for the agent. "When clicking a button inside a modal on the admin surface, wait for the loading-mask overlay to disappear". Gotchas are advisory; the agent reads them as context, not as an enforceable rule.
Lint pattern – a machine-checkable rule with a severity. "If a spec calls page.click() on .btn-primary without preceding waitForLoadingMask() – warning". Lint patterns are runnable as a static analysis pass (--phase lint).

An entry starts as a gotcha. Promotion to a lint pattern is downstream
and explicit, after the gotcha has fired enough times for the rule
to clearly generalize.

Project-local vs cross-stack

The KB has two homes. Project-local lives inside e2e/knowledge/ –
about this codebase: auth patterns, seed fixtures, business-domain
quirks. Cross-stack lives outside any single project (for example,
in ~/.claude/skills/e2e-kb/kb/) – about a technology: UI framework
patterns, Tailwind class quirks, admin framework selectors. Cross-stack
KB generalizes across every project that uses the same technology.

This split is what produces cross-project knowledge transfer. When
you start a new project on a familiar technology, cross-stack KB
applies on day one. Project-local KB starts empty and fills as the
codebase reveals its quirks.

`sources.yml` routing and `kb_by_app`

sources.yml describes which KBs apply to which surface:

universal:
  - project-auth.yml
  - project-seed.yml

by_surface:
  admin:
    - modules/admin.yml
    - platform/admin-framework.yml
  storefront:
    - modules/storefront.yml
    - platform/alpine-js.yml
    - platform/tailwind-css.yml

The planner reads sources.yml and loads only the KBs relevant to the
target.

For multi-app projects – one repository hosting two genuinely different
application stacks (a FastAPI backend and a Next.js admin in the same
work tree) – the pattern extends to kb_by_app:

kb_by_app:
  backend:
    - platform/fastapi.yml
    - platform/alpine-js.yml
    - platform/tailwind-css.yml
    - project-auth.yml
  admin-app:
    - platform/nextjs.yml
    - platform/shadcn-ui.yml
    - platform/tailwind-css.yml
    - project-auth.yml

Tests for the backend receive fastapi + alpine-js KB; tests for the
admin receive nextjs + shadcn-ui. No cross-contamination, no agents
loading irrelevant gotchas. The same tailwind-css.yml is reused in
both apps – cross-project KB reuse on a small scale.

Editorial promotion

The single most important rule of the KB layer:

Auto-promotion of healer candidates into the active KB is forbidden.

Auto-promotion optimizes recall at the expense of precision. The
resulting KB describes the system's errors, not what is true. The
agent then retrieves contradictory advice (every fix has become a
"principle"), and compounding flips sign – the saturation curve
moves downward instead of upward.

Promotion is editorial: the healer writes a candidate into _inbox/;
a reviewer asks two questions (does this generalize? is it not covered
by an existing entry?); only on two yeses does the candidate move
into the active KB. The editorial gate is what keeps the KB useful
as the project grows.

The healing loop and the saturation curve

The healer produces one artifact per run: a markdown file in
e2e/.state/heal-findings/, timestamped. Six sections, always in the
same order, even on green runs.

The six sections are not bureaucracy. They are an audit trail that
stays readable two months later:

A. Diagnosis matrix – a table: tests × projects. Pass / Fail / Skip / N/A in cells. The reviewer sees it first – "what failed and where" before any narrative.
B. Hypothesis on root causes – for each failure, a working theory. Each hypothesis names an attribution category.
C. Healing action + decision rationale – what the healer did. test-bug is fixed in the spec, app-bug is filed with a slug, env-drift is surfaced.
D. Verification checklist – how to confirm the fix worked. A checklist, not prose. This is what makes the audit trail closeable.
E. KB candidates – gotchas worth promoting (via the editorial gate).
F. Out-of-scope siblings – observations that surfaced during the run but are not the focus of this finding. Test-infra glitches, environment quirks, remarks worth follow-up but no action right now.

Section F matters separately: without it, observations either clutter
the main narrative (A–E) or get lost and rediscovered a month later
as "haven't we seen this already?".

Why this produces compounding

Every heal-finding's Section E feeds the _inbox/. The editorial gate
either promotes or rejects. Promoted entries become available to the
next run's planner and generator. The next run on the same surface
starts with a richer KB, and first_try_pass_rate rises.

That is the compounding mechanism. It depends on three preconditions:

Three-category attribution (the no-retries doctrine) – without it, failures become "flaky", and the healer has nothing structured to record.
Editorial promotion – without it, the KB becomes an error log, and the curve flips sign.
Per-run findings (the six-section discipline) – without them, the audit trail is missing, and the next reviewer can't follow the chain.

The saturation curve

Run 1: low pass rate. Every gotcha is new; the KB is empty for the
surface.

Runs 2–3: rate climbs steeply, as the first findings get promoted into
the KB. The agent now reads gotchas it discovered itself last time.

Runs 4+: rate plateaus. The KB has captured the surface's
idiosyncrasies; further runs only encounter rare new gotchas.

The plateau is saturation. The empirical signal that the
methodology has paid for itself on this surface. After saturation
the cost of a new test on the surface is dominated by defining the
case, not learning the surface.

Across my eight runs: on a mature module (third run on the same
surface) first_try_pass_rate reached ~95%. On a new surface of the
same platform – first run ~14%, second ~78%, third ~95%. The same
shape on each of six modules: low → climb → plateau. This isn't a
theoretical benefit – it is measured.

What the metrics don't track

Test execution time. Playwright's reporter handles that. metrics.jsonl is about the quality of generated tests, not their runtime.
Code coverage in the line-coverage sense. That is a different methodology (instrument, run, report).
Subjective quality. "Are these tests good?" is a review question, not a metric. The metric measures whether they pass.

The full metrics.jsonl schema with the additive evolution v1 → v2,
the definition of first_try_pass_rate, the root_cause_slug
discipline – all in
metric-design.md.

Stack-agnostic: porting to FastAPI in days, not weeks

The strongest objection to any "works for me" methodology is it works
only because you know your stack. I ported the methodology to a second
stack – not on a new machine and not for an article, but for my own
pet project: FastAPI + Alpine (backend) + Next.js (admin) in a single
work tree. The port took days, not weeks.

What carries over 1-to-1

The four agents (analyze / plan / generate / heal) – same prompts, same contract between phases.
The e2e-coverage skill – same orchestrator, same artifact on output (a metrics.jsonl line, a heal-finding).
The ENVIRONMENT.md pattern – 7 principles stayed. Some are trivially satisfied (no auth → principle 7 N/A), but the contract kept its shape.
STRUCTURE.md – four layers, Rule of Three, dependency direction.
No-retries doctrine – retries: 0 on the new stack.

What is rewritten

knowledge/ – local gotchas (FastAPI middleware quirks instead of legacy MVC). Cross-stack KB (alpine-js.yml, tailwind-css.yml) is reused without change.
lib/ – FastAPI auth helpers instead of platform CLI calls.
playwright.config.ts – different projects (backend + admin-app instead of admin + classic-storefront + modern-storefront).

What appeared new on the second stack

kb_by_app routing – a solution to the multi-app problem (one e2e/ serving two genuinely different app stacks). The pattern then back-ports onto the first stack if a multi-app scenario emerges.

Metrics on the second stack

First run on the new backend: first_try_pass_rate ~48.6%. Second –
~91.4%. Same two-to-three runs of the same surface, the same
compounding shape.

What matters: the second stack didn't "repeat" the first. It
showed that the shape of the curve is a property of the loop, not
of the task. Detect a deterministic validator (tests pass/fail,
build succeeds, types check), close the loop (executor → validator,
auto-revert on regression, KB grows only on validated new error
classes) – and compounding appears regardless of stack.

After the cross-platform port I have n=2 platforms plus n=8 runs
within the first. The KB saturation curve is not a Magento artifact.
It is a property of the pipeline.

Closing

The six principles from the meta article are not dogma. They are a set
of architectural commitments that will make an agent system compound,
if you accept them. This article showed what happens on a concrete
task – E2E testing – when you accept them in full.

What is useful here beyond "another E2E framework":

Structural framing of the flaky-test problem. Not "which runner to buy", but what conditions must be met for tests to exist as a signal rather than as noise. Those conditions are expressed in the seven contract principles.
Compounding proved through measurement. The KB saturation curve is not theoretical. It appears on two independent stacks, in the same shape. Single-run anecdotes really are almost useless for evaluating an architecture; the multi-run curve is a different story.
Editorial gates as load-bearing. Auto-promotion is the most obvious step that breaks compounding. That is counter-intuitive and worth surfacing explicitly.

If you want to apply the methodology – the principles repo with
granular specs and illustrative examples (against todomvc as a neutral
target): https://github.com/webmaster-ramos/e2e-llm-agents.

The canonical narrative with principles and architecture lives on the
site at /docs/e2e-llm-agents:
https://webmaster-ramos.com/docs/e2e-llm-agents.

The meta article with the six principles as an abstraction:
https://webmaster-ramos.com/blog/six-principles-agent-systems.

Six Principles for Agent Systems That Don't Hallucinate

Webmaster Ramos — Tue, 12 May 2026 07:06:27 +0000

Why this article exists

Agentic development with LLMs in 2026 is no longer an "interesting experiment". It is its own engineering discipline. By an "agent" I mean a program built on top of a language model that performs a structured task inside a product rather than merely replying to a user in chat: it reads code, makes decisions, writes files, calls external APIs, and returns a result. I join product teams where three to five such agents already work in parallel: code review bots, content classifiers, ticket routers, recommendation pipelines, internal documentation generators. A demo can be assembled in one evening. Production cannot.

The line between a demo-quality and a production-quality agent system is not where people usually look for it. The deciding factor is not the model, not the token budget, and not the quality of the prompts. The deciding factor is the architecture of the system in which the model operates – and that architecture does not come from "build your first agent" tutorials. It comes from failed attempts.

At that boundary, every agent system runs into the same three problems:

Hallucinations – the agent invents facts that sound plausible but do not match reality.
Non-reproducibility – the same prompt produces different results across runs, and errors cannot be debugged properly.
No way to accumulate knowledge – every run starts from zero, and the mistakes of one run do not help the next one.

Of those three, the first two are discussed by almost everyone writing about multi-agent systems in 2025 – roles, JSON contracts, system prompts. The third one – how an agent system becomes cheaper and more accurate with each subsequent run – is barely discussed, because that conversation requires metrics across multiple runs, not a single production anecdote.

These three problems are not properties of the model. They are solved at the architecture level of the system the model operates inside. Below are six design principles that address them.

The principles are universal. They work equally well for code review, refactoring tools, security audit pipelines, migration tools, documentation generators, customer support routing, content moderation, and data pipelines with LLM stages. Wherever you have multiple roles, a rules-heavy domain, and a need for reproducible output, these six layers apply.

I distilled them while building one specific system – LLM agents for E2E testing. That took a month and a half of part-time iteration and produced measurable results: eight runs, eleven production bugs found automatically, and a first-try pass rate that rose from 14% to 95% as the knowledge base – hereafter "KB" – became more saturated. Each principle below is paired with one concrete E2E example and one or two applications in other domains.

Those eight runs are not a generic "it works for us" claim. They are the first trendline. On a single run, any architecture can sound plausible. Across eight runs, you start to see which principles actually deliver ROI and which ones are overhead without return.

The six principles work together as layers. Remove one, and the whole stack collapses.

Principle 1: An explicit contract

What it is. A document that describes the rules the agent operates under and that do not change between runs. Not code, not config, but text. Usually Markdown with five to ten numbered principles, around 500–800 words (~3–5 KB). In my E2E version it is seven principles, 83 lines, about 600 words.

Why it works. Without an explicit contract, the agent makes arbitrary choices every time it encounters ambiguity. "Should the database be clean for each test, or should it stay dirty?" The agent picks one answer today, another next week, and you end up with incompatible tests. With a contract ("the DB is dirty by default; demo data is reused"), the answer is predefined.

E2E example. My ENVIRONMENT.md contains seven principles: the container as an external dependency, a dirty database, sequential execution, a health check, seed/assertion separation, a host runner plus MCP browser, and session caching. Each one is a short paragraph plus a brief rationale.

Non-E2E application. A security-audit agent gets SCOPE.md: what is in scope (src/ production code) and what is not (test fixtures, vendor/, deprecated code). Without that contract, the agent will report vulnerabilities in demo files and waste your time on false positives. A code-review agent gets STYLE.md with an explicit instruction: "code style is already formalised in .eslintrc; do not comment on formatting." A refactoring agent gets BOUNDARIES.md: which modules it may not touch and which public APIs it may not break.

What breaks without it. The agent acts on unstated assumptions, and in half the cases those assumptions will be the opposite of yours. Two weeks later, the team no longer understands why the agent behaves one way today and another way tomorrow. A month later, they stop trusting its output.

Principle 2: Role separation

What it is. A complex task equals multiple cognitive modes, and those modes cannot live inside one agent. You need separate roles with differentiated tools, context, and instructions.

Why it works. A single prompt cannot simultaneously demand "explore broadly" and "do not deviate from the plan." A single context cannot hold both architecture diagrams and specific code. A single toolset cannot be optimal both for browser automation and for editing text files.

E2E example. Four agents: e2e-analyzer (discovery), e2e-planner (strategy), e2e-generator (implementation), and e2e-healer (diagnostics). Each has its own MCP tools, its own context, and its own responsibility. The generator is not allowed to invent selectors; the healer is not allowed to expand coverage. Those constraints are what make the system predictable.

Non-E2E application. In a refactoring agent pipeline, code-mapper builds the dependency graph and writes dependencies.json; refactor-planner reads the graph and writes refactor-plan.md with numbered steps; refactor-applier applies each step; verifier runs tests and checks the result. In a code-review pipeline, static-scanner looks for obvious anti-patterns; context-reader loads related files; reviewer writes comments; summarizer aggregates them into one review message.

What breaks without it. A monolithic agent with one giant prompt initially looks "simpler" – one file, one entry point. Two weeks later the prompt is 800 lines of contradictory instructions, the context is bloated, and the output is worse than that of a simple script.

Principle 3: Persistent state between phases

What it is. Artifacts the agent writes to disk and that survive between runs. Not RAM, not in-memory state, but files with structured data that can be read by both humans and downstream agents.

Why it works. Discovery is an expensive phase. If you rescan the codebase from scratch every time, you pay for that in both context and time. But discovery changes slowly: the list of modules, the database schema, the routes. Do it once, save it, and later phases can read the result.

An additional benefit is that persistent state enables idempotent skip logic. If modules.json is still fresh (by mtime, the file modification time), the analyze phase is skipped automatically. The pipeline becomes cheap on repeated runs.

E2E example. The analyzer writes modules.json (modules, routes, dependencies) and schema-map.json (database schema). On the second run for the same module, the analyze phase takes zero seconds. Those files are also useful in their own right: a new team member can read schema-map.json and understand in fifteen minutes what would otherwise take a full day.

Non-E2E application. A migration tool uses mappings.json (old names → new names) and applied-steps.jsonl (what has already been done). .jsonl means JSON Lines: an append-only format with one JSON object per line. It is ideal for event logs: a new entry is just >> appended to the file, you do not need to parse the whole thing, and one corrupted line does not invalidate the rest. If a migration stops halfway through, the restart reads applied-steps.jsonl and continues from there. A customer-support pipeline can keep session-context.json for each conversation so a new request reads prior context instead of starting from zero. A documentation generator can rebuild module-graph.json only when the source files changed, speeding repeated runs up by an order of magnitude.

What breaks without it. Every run is expensive. The pipeline cannot be stopped and resumed. Artifacts live in one agent's head and disappear as soon as the context is cleared.

Principle 4: Knowledge as a separate layer

What it is. Domain knowledge – platform patterns, known constraints, gotchas you only discover in real-world use – lives in separate files that agents read but do not import into the main code. Curated Markdown or YAML, not an embedding vector store where texts are pre-translated into numeric representations and retrieved by similarity.

Why it works. Domain knowledge changes on a different rhythm than the code itself. A UI framework might update once a year; your code changes every week. If the knowledge is baked into the code, a framework upgrade becomes a migration. If it lives in a separate layer, you change one YAML file and everything else stays intact.

A curated KB is also deterministic. RAG chooses top-k documents by embedding similarity, and if an important paragraph misses the retrieval cut, the agent runs without it. A flat KB is either entirely present in context or it is not – and that is immediately visible.

E2E example. On my ecommerce project, the local KB is 12 Markdown files (admin, classic-storefront, modern-storefront), plus 9 YAML files in a global cross-stack KB (tailwind-css, alpine-js, fastapi, nextjs, and so on). When I ported the method to FastAPI + NextJS, tailwind-css.yml, alpine-js.yml, and mailpit.yml just worked on the new stack without modification. That is cross-project KB reuse: platform knowledge isolated into its own layer travels across projects.

This is a rare kind of evidence in the current multi-agent literature – almost every public case study shows one system on one stack. Portability is what confirms that the split between code, KB, and agents is not cosmetic but architectural: the KB layer behaves like a self-contained component.

Non-E2E application. A security-audit KB can cover CVE categories, OWASP patterns, and framework-specific gotchas (XSS in template engines, SQL injection in ORM bypasses). A customer-support KB can encode ticket types, escalation rules, and refund policies. A documentation generator KB can define documentation formats (JSDoc, RST, OpenAPI) and conventions for each language.

What breaks without it. Knowledge gets smeared across prompts and code. Every agent ends up with its own copy of the rules, and those copies drift apart over time. When the platform changes, there is no single place to update.

When RAG is actually needed

A flat KB stops working at one of three thresholds: around 200k tokens (too expensive to load in full), uncurated sources (code, tickets, logs), or history-driven retrieval (when the agent benefits from the top-k most similar prior cases). At those thresholds, the KB evolves into RAG – but that is a change of tool, not of methodology. The contract, role separation, and persistent state still remain.

Principle 5: Closed-loop learning (knowledge compounding)

What it is. Every failure or error is turned into a structured artifact – not "fixed a selector," but a completed template with diagnosis, hypotheses, action taken, verification, KB candidates, and out-of-scope items. Those artifacts then feed back into the KB, so the next agent run already sees them.

Why it works. Without a closed loop, every run rediscovers the same failures. With one, you get knowledge compounding. The KB grows by the same logic as compound interest: the system becomes cheaper and more accurate on every pass.

E2E example. The healer writes per-run files under heal-findings/<date>-<module>.md with six sections: A (diagnosis), B (hypotheses), C (action), D (verification), E (KB candidates), F (out-of-scope siblings). Section E is the promotion path into the KB. On my project, across eight runs, the KB grew by 67% (from 25 gotchas to 42), and first_try_pass_rate rose from 14% (a new module) to 95% (the third run of the same module). That is the KB saturation curve: same agents, same prompts, different feed.

Non-E2E application. In a code-review pipeline, each rejected agent comment becomes structured feedback ("false positive: the agent flagged X, but X is allowed in this module under line N of CONTRACT.md") and is then promoted into the KB, so the next run sees the rule. In a migration tool, each failed migration becomes a markdown report with the root cause, then a rule in migration-gotchas.yml, so the next migration does not repeat the mistake. In a security audit, each false positive becomes a rule in audit-exceptions.yml, improving signal-to-noise.

What breaks without it. Agents do not learn between runs. The tenth run is as expensive as the first. Every failure requires manual diagnosis from scratch.

Principle 6: Additive instrumentation

What it is. Metrics after each run are written to a file with an evolving schema: new fields are added, old fields stay. v1 records remain valid after v2 fields are introduced. No breaking changes, no migrations.

Why it works. Without quantitative feedback, "is it getting better?" is an unanswerable question. The feeling that "it seems faster now" is not data. With metrics.jsonl, you can actually see the trendline.

There is a second benefit: an additive schema lets you learn gradually which metrics matter. I did not know in advance that first_try_pass_rate would become a key metric; it only appeared on the third run, when I noticed that the number of healing iterations was a proxy for KB maturity. If the schema had been rigid, I would have needed a migration for older records. With an additive schema, I simply added the field and the old records stayed valid.

E2E example. metrics.jsonl v1 (the first two runs) contains timestamp, target, stack, phases, kb_updates, and volume. v2 (from the third run onward) adds first_try_pass_rate, real_app_bugs_found[], test_churn, kb_hits, patterns_added, and wall_clock_ms. The v1 records remained valid, which lets me query across all eight runs.

Non-E2E application. In an ML training pipeline, experiments.jsonl can record hyperparameters, dataset version, and metrics. In a refactoring tool, refactor-runs.jsonl can track the number of changed files, tests broken or restored, and review time. In customer support, tickets.jsonl can store time-to-first-response, escalation depth, and resolution type.

What breaks without it. You cannot say objectively whether the system is improving. Debates about whether it got better or worse get resolved by intuition instead of data. When a new agent introduces an unexpected regression, you do not see it until complaints accumulate.

What these principles give you together

Each principle on its own is a useful pattern. Together they produce a system with specific properties:

Accuracy. Contract + source reading + role separation cut down the space for improvisation. The agent works from ground truth – what is actually in the code – rather than guesses about how it might be organised.
Fewer hallucinations. Persistent state provides stable context; the KB provides deterministic rules; the closed loop catches hallucinations and prevents them from recurring.
Reproducibility. The same input artifact plus the same KB snapshot should produce the same output. Different results across runs are treated as a bug to investigate, not as "the nature of LLMs."
Knowledge accumulation. Closed-loop learning plus additive metrics turn every run into data. After ten runs, you know more about your system than after a hundred one-off GPT calls driven by a single prompt.
Portability. The same six principles work for E2E testing, code review, refactoring, security audit, and migration tools. Only the KB and helpers are platform-specific; the architecture is not.

What these principles do not give you

I would not present this as a silver bullet. The principles solve a specific class of problems – accuracy and reproducibility in multi-agent systems – and do not solve others.

They do not make the agent smarter. GPT does not turn into an expert just because you wrapped six layers around it. If the task requires creativity or deep understanding, the agent stays limited by the model.
They do not work well for very short tasks. The payback starts after three to five runs. If you only run the system once, the overhead is not justified.
They do not replace review. Closed-loop learning catches errors that the agent or the system itself already noticed. Errors nobody recognised as errors still stay in the code.
They require discipline. Six-section heal findings, an explicit contract, persistent state – all of that is work. If the team is not willing to maintain those artifacts, the method turns into dead weight.

What comes next

I am now applying these six principles to a third independent domain – knowledge work (planning, learning, content), not software development. This is a deliberate attempt to eliminate the method's software bias: the first two validations were in E2E testing, and it is still unclear which principles are code-specific and which are truly domain-agnostic.

If you are applying a similar architecture in another domain – or, conversely, if you found where it stops working – I would love to hear about it. I am especially interested in cases where a principle did not work. Those cases show the boundaries of the method more clearly than successful implementations do.

P.S. In parallel, I am writing a more technical deep-dive series on one concrete application of these principles – E2E testing: a month and a half of iteration, eight runs, a six-section healing protocol, and a breakdown of KB-saturation metrics. I am also preparing an open-source companion repo with a reference implementation of the six principles – framework, four agents, metrics schema, and skeleton KBs. Announcements for new articles and the repo launch go out on LinkedIn; the articles themselves are published on the blog.

One Nav, Two Stacks: A Microfrontend Between Magento and Laravel Without Replatforming

Webmaster Ramos — Thu, 30 Apr 2026 21:04:50 +0000

A working reference implementation on two production-grade stacks (Magento 2.4 + Laravel 11), with the host integration shape shown below and a server-rendered nav skeleton shipped on day one - not retrofitted after GSC panic.

TL;DR

Mid-market ecommerce rarely lives on one stack. The industry answer - "replatform everything onto one stack" - is a $100-500k, 6-12 month project most of them cannot afford.

I shipped a smaller answer on a real client stack: a 15-20 kb Preact microfrontend that mounts into both Magento 2.4 and Laravel 11 via a manifest. This is not a Module Federation hello-world - it is two real host integrations, around 120 lines of PHP on Magento and around 90 lines on Laravel, with one pnpm build and both sites updating in under a minute.

The opinionated part: microfrontends failed as a product architecture but work as a repair strategy. Greenfield product teams drown in coordination cost. Repair across inherited stacks is a different problem - and the pattern solves it cleanly, if you get the SEO contract right before shipping.

The proof point is deliberately concrete: a before/after crawler diff on identical URLs and user-agents. Not a modelled SEO score, not a Lighthouse proxy, but raw HTML facts - bytes, anchor counts, and whether the navigation exists in initial markup for non-rendering crawlers.

The problem nobody names out loud

A mid-market ecommerce group with multiple brands, accrued over years:

A Magento 2.4 storefront - catalogue, cart, checkout.
A Laravel 11 marketing site - brand story, awards programme, editorial.
A handful of single-purpose SPAs on top.

Each stack has its own header and footer. When marketing adds a top-level category, it ships in one stack in a week and in the other in three. When design changes the logo, it takes two sprints to roll out across everything.

The cost is not engineering hours. The cost is that the brand is visibly inconsistent to customers, the teams know it, and every cross-team sync about the nav takes an hour.

Why "just consolidate on one stack" is not the answer

The standard advice is a monorepo or a headless rewrite. Both are correct on paper and wrong in the field.

Monorepos assume teams that want to converge. Inherited teams - Magento folks who have been on that stack for seven years, a Laravel team that came with an acquisition - do not want to converge. They have skill investment, release cadences, and on-call rotations built around their stack. A monorepo migration is a political project before it is an engineering one, and most mid-market companies cannot push one through.

Headless replatforming is the same project in a different wrapper. Twelve-month runway, executive buy-in, and a new front end that has to outpace the rate at which the old ones break the business.

A shared nav microfrontend does not compete with monorepo architecturally. It competes with doing nothing - which is what the organisation was going to do for the next two years anyway.

Why repair is different from design

Spotify publicly rolled back its extreme squad autonomy. The failure mode is always the same: teams own product-level vertical slices, those slices compose into one surface, coordination cost explodes, UX inconsistency becomes a feature of the architecture.

That is a real lesson. It does not mean no microfrontend is ever right.

Repair is a different problem than design. You are not building the surface - the surface already exists, in two incompatible implementations. You are installing the narrowest possible shared layer that brings them back into visual alignment. The nav is exactly that narrow: no business logic, no routing, no data dependencies beyond a flat link tree.

Everything the microfrontend critique flags - duplicate runtime, fragmented UX ownership, coordination overhead - either does not apply to a 15 kb shell (runtime is negligible) or applies less than the status quo (UX is already fragmented; we are reducing coordination by centralising the decision).

Shell architecture: 15-20 kb, one build, one file

A Preact tree built with Vite's library mode into one IIFE script and one CSS file, both with content-hashed filenames. A manifest.json maps logical names to hashed URLs.

Preact over React - ~10 kb gzipped vs ~45 kb. Non-negotiable at a 15-20 kb budget.
IIFE over ES modules - works in Magento's RequireJS environment without extra config, and in any <script> tag on any stack.
cssCodeSplit: false - one file, one request, no FOUC.
Tailwind with a prefix - scoped classes, no collision with host CSS.
Content-hashed URLs via manifest - immutable caching. Hosts read the manifest at render time and emit <link href="/nav/shared-nav.abc123.css">.

pnpm build takes ~8 seconds. Hosts pick up new hashes within their cache TTL. One bugfix lands on both sites in ~1 minute.

Host integration: Magento 2.4

Around 120 lines of new PHP, three files:

Acme\Theme\Model\SharedNavManifest (~85 lines) - HTTP-fetches the manifest with Magento's cache backend, falls back to a non-hashed shared-nav.iife.js on fetch failure so the nav never disappears, only loses cache-busting.
Acme\Theme\ViewModel\SharedNavAssets (~26 lines) - the ViewModel that phtml templates talk to. CSS goes through a ViewModel rather than static <css> layout XML because the URL has a hash in it.
Acme\Theme\etc\frontend\di.xml (~7 lines) - wires the manifest URL through deploy config.

Two phtml partials - header and footer - emit the mount divs and asset tags. Included from default.xml, so every page type inherits the shared nav.

Host integration: Laravel 11

Around 90 lines. Smaller because the service container carries more weight.

App\Services\SharedNavManifest (~65 lines) - HTTP-fetches the manifest, caches via Cache::remember('shared_nav.manifest', 60, ...), logs and falls back to the unhashed bundle on fetch failure.
config/services.php - three lines exposing services.shared_nav.manifest_url as env-driven config.
Two Blade layouts - public and a secondary layout for older marketing pages - emit <link> and <script> tags from the manifest service.

The 60-second cache TTL controls how fast a pnpm build propagates - aggressive enough for release cadence, conservative enough that manifest fetches are one request per minute per worker.

Representative code shape (abridged)

The full production classes are client code, so I am not publishing them verbatim here. But the integration should not stay abstract either. This is the shape of the two host adapters - abridged to show the contract rather than every guardrail and framework detail.

A note on the manifest keys: Vite indexes manifest.json by entry source path and asset name - src/main.tsx and style.css in our build - not by the output filename. The host lookups use those keys; the unhashed filenames (shared-nav.iife.js, shared-nav.css) are only used as fallbacks when the manifest fetch fails.

Magento 2.4 - manifest service shape:

class SharedNavManifest
{
    public function getJsUrl(): string
    {
        return '/nav/' . ($this->manifest()['src/main.tsx']['file'] ?? $this->fallbackJs);
    }

    public function getCssUrl(): string
    {
        return '/nav/' . ($this->manifest()['style.css']['file'] ?? $this->fallbackCss);
    }

    // SSR fallback: fetched once from the shell, cached in Magento's cache
    // backend, and inlined into the mount div at render time.
    public function getHeaderHtml(): string
    {
        return $this->snapshotHtml('header.html');
    }

    public function getFooterHtml(): string
    {
        return $this->snapshotHtml('footer.html');
    }

    private function manifest(): array
    {
        // 1) read cached manifest
        // 2) on miss, fetch remote manifest URL
        // 3) cache parsed JSON
        // 4) on failure, log and fall back to unhashed asset names
    }

    private function snapshotHtml(string $key): string
    {
        // 1) read cached snapshot for $key
        // 2) on miss, fetch rendered HTML from the shell (e.g. /nav/header.html)
        // 3) cache body with a short TTL
        // 4) on failure, return '' so the shell still hydrates later
    }
}

Laravel 11 - manifest service shape:

class SharedNavManifest
{
    public function manifest(): array
    {
        return Cache::remember('shared_nav.manifest', 60, function () {
            // GET config('services.shared_nav.manifest_url'), parse JSON.
            // On failure, log and return [] so fallback filenames kick in.
        });
    }

    public function jsUrl(): string
    {
        return '/nav/' . ($this->manifest()['src/main.tsx']['file'] ?? 'shared-nav.iife.js');
    }

    public function cssUrl(): string
    {
        return '/nav/' . ($this->manifest()['style.css']['file'] ?? 'shared-nav.css');
    }

    public function headerHtml(): string
    {
        return $this->snapshotHtml('header.html');
    }

    public function footerHtml(): string
    {
        return $this->snapshotHtml('footer.html');
    }

    private function snapshotHtml(string $key): string
    {
        return Cache::remember("shared_nav.snapshot:{$key}", 60, function () use ($key) {
            // fetch rendered HTML from the shell (e.g. /nav/header.html)
            // return '' on failure so the shell still hydrates later
        });
    }
}

That is why the line counts matter. The host code is small enough to be reviewable, boring enough to be supportable, and explicit enough that another PHP team can own it without learning a front-end platform first.

The host <-> shell contract

The two sides agree on a tiny surface:

Host provides:

<link rel="stylesheet" href="{{ $nav->cssUrl() }}">
<div id="sa-header" style="min-height: 80px;">{!! $nav->headerHtml() !!}</div>
<!-- page body -->
<div id="sa-footer">{!! $nav->footerHtml() !!}</div>
<script defer src="{{ $nav->jsUrl() }}"></script>

Shell provides:

A nav-fallback.html emitted at build time, split into header and footer snippets the host inlines into the mount divs (the SSR fallback).
Client-side mount into #sa-header and #sa-footer that replaces the SSR snapshot with the interactive tree (dropdowns, mobile menu, state).
One CSS file, one JS file, no global pollution (IIFE scope).
No knowledge of Magento or Laravel. No runtime config, no feature flags.

Everything else - routing, authentication, cart state, checkout - stays on the host. The nav does not know the host exists. The host does not know the nav is Preact. That is the whole integration.

The min-height: 80px on the header mount is anti-CLS insurance - the slot reserves its space before hydration, so Core Web Vitals do not punish the deferred render.

The SEO question, answered honestly

This is the part every microfrontend post skips or hand-waves. I will not.

Also, this section is intentionally based on observable crawler facts, not modelled SEO metrics. I am not claiming a ranking uplift from a synthetic score. I am showing what a crawler can and cannot see in the initial HTML before and after the fallback ships.

Without a fallback, initial HTML is two empty divs:

<div id="sa-header" style="min-height: 80px;"></div>
<div id="sa-footer"></div>

Googlebot renders JavaScript (eventually) and sees the nav - with a delay measured in days. But GPTBot, ClaudeBot, and PerplexityBot do not render JavaScript. They see the empty divs. As far as AI search is concerned, the site has no nav.

I measured this before shipping the SSR fallback. Three pages, five user-agents, identical curl invocations. Same URLs, same crawl method, same parsing rule - only the fallback changed.

Before SSR fallback:

Metric	Homepage	/about	/portfolio
Bytes	35,050	35,050	97,521
`<a href>` total	12	12	12
Anchors from nav	0	0	0

Twelve anchors per page, none of them structural. Every page - no matter how deep - exposed the same twelve inline body links to a non-rendering crawler. Sitemap.xml covered URL discovery, but not the four things nav does beyond discovery:

Link equity - a multi-level nav is hundreds of internal links per page pointing at categories. Without it, category pages lose authority.
Crawl budget - Googlebot prioritises pages by incoming-link density. Sitemap-only pages get crawled less often.
Topic hierarchy - sitemap is flat. Nav signals semantic structure ("Shop -> Men -> Shoes").
AI assistant context - ChatGPT and Perplexity build mental models from HTML, often ignoring sitemaps. Without nav in HTML, AI knows your URLs but not your structure.

The three-level mitigation ladder:

<noscript> fallback with critical links inside the mount div (hours of work).
SSR skeleton - Vite emits a nav-fallback.html at build time; hosts inline it into the mount divs before hydration replaces it (a day or two).
Full SSR service - a Node process renders each nav request server-side (a week, plus a new production dependency).

Level 2 is the sweet spot for an ecommerce group this size. We shipped it before the first production release. Same curl invocations, four days later:

After SSR fallback:

Metric	Homepage	/about	/portfolio
Bytes	98,881	98,881	161,348
`<a href>` total	112	112	112
Anchors from nav (`#sa-header`)	31	31	31
Anchors from footer (`#sa-footer`)	69	69	69

All five user-agents received byte-identical HTML (the only per-request variance is the Laravel CSRF meta token). The nav and footer tree are in initial HTML - 100 additional anchors per page, constant across every page, visible to every crawler that can parse HTML.

That matters methodologically. A crawler can disagree with my interpretation of the SEO impact, but it cannot disagree with 35,050 -> 98,881 bytes or 12 -> 112 anchors under the same crawl conditions. This is a reusable audit method, not a one-off anecdote.

The gap closed on release day. No retroactive GSC panic, no "we measured a drop and here's how we fixed it" narrative. The honest framing is "we knew the risk, we closed it before shipping".

What this article proves today - and what it does not yet

This article proves three things:

The integration pattern is real on two production-grade PHP stacks.
The SEO risk is real if the shell ships with empty mount points only.
A Level 2 fallback closes that crawler-visibility gap on day one.

What it does not prove yet is a 90-day business outcome story. I do not have a "three months later, here are CrUX and GSC deltas" chart in this draft, because that would require waiting for the post-release window to mature. I would rather publish the implementation pattern and the crawler evidence honestly than pretend I have impact numbers I do not have yet.

That makes this a build-and-ship case study, not a finished growth narrative. When post-launch search-console and field-performance data are mature enough to be worth showing, they belong in a follow-up article.

What this pattern does not solve

Not overselling: the shared nav is the minimum viable shared surface - that is its strength and its ceiling.

Primary page content still diverges. Magento renders products; Laravel renders marketing copy.
Shared checkout - not solved. Checkout lives in Magento; marketing links into it via cookies on a common parent domain.
Shared authentication - not solved. Cookies, redirects, OAuth handshakes - all host-specific.
Shared search - could be built on top of the shell, but we did not. Search UX is coupled to Magento-native catalogue data.

A shared nav is not a distributed-front-end strategy. It is a band-aid across a healed fracture. If you need a distributed front end, you need a different architecture.

When this pattern fits

Short checklist. If you check fewer than three boxes, do something else.

You have two or more existing stacks with established teams you cannot realistically move.
There is no budget or appetite for a full front-end unification right now.
The primary pain is UX inconsistency, not performance or architectural debt.
Nobody on the executive side is willing to own a "unified portal" programme.
You need to be AI-agent ready - which means the nav must be in initial HTML, not only after JS runs.

If all five apply, the pattern pays for itself in weeks, not quarters.

What's next

The same shell is about to land on two more stacks in the same group - a greenfield Magento storefront rewrite and a full Laravel marketing rewrite. Both will consume the existing manifest.json unchanged. Zero additional shell work, the same integration footprint per host. That is the portability proof.

If the pattern looks like it might fit your stack, the interesting conversation is not "how do I build a shell" - Vite's library mode docs will get you there in an afternoon. The interesting conversation is the SEO contract and shipping Level 2 on day one instead of retrofitting it after Google Search Console punishes you.

For a mid-market team, that is usually the real decision framework:

Level 1: <noscript> links when the risk window is small and the nav is shallow.
Level 2: build-time SSR fallback when you need full crawler-visible structure without adding a Node service.
Level 3: full SSR service when the nav is dynamic enough that static fallback HTML becomes a maintenance problem.

That is where outside perspective usually helps, and where I spend most of my consulting time.

Everyone Says MCP Beats CLI. The AWS Benchmark Disagrees.

Webmaster Ramos — Tue, 21 Apr 2026 19:12:16 +0000

Full code, aggregated numbers (n=10 across 5 tasks and 5 transports), and a curated selection of 8 hand-picked runs live in the mcp-vs-cli-aws-benchmark repo. This article is a dense version of docs/findings.md from the same repo, rewritten for a reader who doesn't have an hour to study the test harness.

TL;DR

The question in the title is wrong. "MCP or CLI?" assumes they have the same use case and one of them is objectively better. In reality it's a trade-off between two currencies: engineering time vs. input tokens per run, and you need both numbers to decide.

I compared raw aws CLI against the official awslabs.aws-api-mcp-server on five read-only tasks against a real production AWS account. The model is Claude Sonnet 4.6, direct Anthropic API, my own minimal agent loop (no Claude Code and no claude-agent-sdk, to avoid poisoning the context). Ground truth is collected via boto3, verification is automatic. n=10 per (task, transport) cell.

Result: a well-designed CLI tool beats awslabs MCP by 43-60% on input tokens on every one of the five tasks, at equal success rate. But it takes half a day of engineering work per service.

If you run 200 agent invocations a day - put MCP in and forget about it. If you run 200 thousand - sit down and write your own tool wrapper following the checklist at the end of the article.

Where this whole debate comes from

Since February 2026, dev Twitter and dev.to have been flooded with posts carrying the same message: "MCP loses to CLI, here are the numbers". Titles like «Why CLI Tools Are Beating MCP for AI Agents», «MCP vs CLI: Benchmarking AI Agent Cost & Reliability», «Why CLI is the New MCP for AI Agents». They all cite the same Scalekit benchmark, which reported:

MCP is 10-32x more expensive than CLI on input tokens.
Reliability: CLI 100%, MCP 72% (the cause of all 28% of failures is TCP timeouts connecting to the GitHub Copilot MCP server).
Example: a simple "what language is this repo?" query - CLI 1,365 tokens, MCP 44,026 tokens.

The authors' explanation: schema dump. The GitHub Copilot MCP server dumps descriptions of all 43 of its tools into the model's context on startup, and 42 of them are unused in any given query.

The problem is that this benchmark is n=1 on a single service, with one kind of MCP server ("fat", per-resource). From that, people draw "MCP loses" conclusions - that's roughly like measuring internet speed on a single website and concluding "IPv6 is slower than IPv4". There is a useful signal, but no grounds for generalisation.

I decided to reproduce the comparison on a different service (AWS), with a larger n, and in a setting where the MCP server is not designed as a "fat" directory.

AWS has already done its homework

The first thing I found when I went to look at awslabs/mcp was not what I had expected. Following the Scalekit GitHub Copilot MCP analogy, I was expecting to see dozens of per-resource MCP servers: awslabs/ec2, awslabs/s3, awslabs/iam, each with their own 20-30 tools (describe_instances, run_instances, terminate_instances, modify_instance_attribute...). That would have been a clean schema dump in the context of a single task.

In reality, the main AWS MCP server - awslabs.aws-api-mcp-server - is built very differently. It exposes three tools:

call_aws - takes an aws CLI command string (or an array of up to 20 commands for batch mode) and runs it.
suggest_aws_commands - natural language to a list of candidate aws CLI commands. The authors explicitly mark it as FALLBACK.
get_execution_plan - multi-step plans, experimental, gated behind an environment variable.

By default two are published (without get_execution_plan). And there is a built-in READ_OPERATIONS_ONLY=true switch - you can tell the server "describe/list/get only" and it will cut everything else off at its own level.

This is an important engineering choice: AWS itself acknowledged the schema-dump problem and opted out of a fat MCP server in favour of a wrapper over the CLI living under the MCP protocol. Comparing such a wrapper against "raw CLI" is a far more honest experiment than repeating Scalekit on the GitHub MCP.

Methodology

The details (runner code, ground-truth script, whitelist) are in the repo. Here - compressed.

5 read-only tasks against a production-like AWS account:

ID	Category	Task	What it tests
`ec2_running`	simple	List running EC2 in `us-west-2`	One API call + filtering
`s3_bucket_policy`	edge	Bucket policy for a single bucket	Handling of an optional resource
`s3_bucket_regions`	chained	All S3 buckets + region of each	List + per-item lookup
`iam_admin_roles`	filter	IAM roles with `AdministratorAccess` policy	Pagination + content filtering
`ec2_cpu_last_hour`	chained	CloudWatch CPU over 60 min for running EC2	Composition + time windows

The correct answer for iam_admin_roles in my account is an empty list. A separate honesty test: will the model make up role names.

Model: Claude Sonnet 4.6, direct Anthropic API, my own minimal agent loop (~150 lines). Why not claude-agent-sdk or Claude Code - see the "methodology notes" section below, this choice cost me a day and a half.
Transports: CLI - subprocess.run(['aws', ...]) behind a whitelist. MCP - the mcp python lib, which boots awslabs.aws-api-mcp-server via uvx stdio and performs a real MCP handshake.
Safety: a dedicated IAM user mcp-benchmark with ReadOnlyAccess + a local command whitelist. Two lines of defense - in case the model tries to break something.
Verification: a boto3 script captures ground truth before the benchmark, a verifier compares the model's JSON response automatically.
n=10 per cell, median on the main metrics.

First attempt: CLI loses everywhere

Spoiler for anyone who won't read to the end: everything you are about to see - CLI failing two tasks, 60% success rate, a naive strategy with 36 tool calls - turned into the opposite result three days later: a CLI that beats MCP by 43-60% on tokens. But to get there I had to walk through five failed hypotheses and one bug in my own code. This part of the article is here for the detective story, not for the numbers. The numbers are at the end.

On the pilot run with three transports (plain cli, cli with an enriched description, mcp) the picture looked like a confirmation of the Scalekit narrative. On iam_admin_roles:

cli plain: 36 tool calls, 20k input tokens, 68 seconds. Strategy: list-roles + list-attached-role-policies for each of the 34 roles in the account.
mcp: 1 tool call, 5k input tokens, 4 seconds. One command: iam list-entities-for-policy --policy-arn ... --entity-filter Role.

The same model on the same prompt made a different command choice. On MCP - perfect; on CLI - the naive, linear-complexity path.

Even scarier was ec2_cpu_last_hour. CLI failed in 60% of cases: it hit the max_turns limit trying to guess the correct timestamp for CloudWatch get-metric-statistics. I looked at the logs and saw commands with --start-time 2025-05-16T..., --start-time 2025-07-14T... - the model clearly had no idea what year it was.

MCP in the same conditions made 3 calls, always with correct 2026 timestamps, 100% success.

This looked like a ready-made "CLI loses" article. Good thing I didn't stop there.

Five hypotheses, five ablation experiments

Before publishing results like that, I wanted to understand why. "MCP is smarter" is not an explanation, it's a description. Sonnet 4.6 has no way to know which transport it's using to talk to AWS: the agent loop is the same, the prompt is the same. Something structural in the MCP transport was making the model behave differently.

What follows is five controlled experiments. Each time I took the CLI transport and added one trait from the MCP world to test an isolated hypothesis.

Hypothesis 1: tool description length and structure. awslabs's call_aws description is ~3000 characters with examples and best practices. My aws_cli was ~500. I wrote tools_cli_rich.py with a description of the same length, including a direct hint: "For 'find roles attached to policy X', use iam list-entities-for-policy --policy-arn ... --entity-filter Role instead of listing every role and inspecting each one."

Result on iam_admin_roles: 37 tool calls, the same naive strategy. The model read the description (you can tell by the input tokens: they grew), but didn't follow it.

Hypothesis 2: the presence of a second "hinter" tool. Besides call_aws, awslabs exposes suggest_aws_commands, whose description includes an example: "List all IAM users who have AdministratorAccess policy". Maybe the mere presence of this description in context works as "scaffolding", even if the model never actually calls suggest_aws_commands itself?

I made tools_cli_with_fake_suggest.py: a second tool that returns an error when called, with a verbatim copy of awslabs's suggest_aws_commands description. Result: 35 tool calls, the same naive strategy. The model did not call the fake suggest_aws_commands (because the description says in black and white "use only when uncertain") - it just read it. And that didn't help.

Hypothesis 3: tool and parameter names. awslabs's tool is called call_aws with a cli_command parameter. Mine was aws_cli with a command parameter. Maybe "call_aws" semantically nudges the model towards "API-style" thinking, while "aws_cli" nudges it towards "shell-style"?

tools_cli_renamed.py: renamed everything, even added a max_results parameter for full parity. Result: 39 tool calls, naive strategy. This hypothesis was a miss too.

Hypothesis 4: MCP capabilities / prompts / resources. Maybe the MCP server passes something beyond the tool list to the model? The protocol has three other channels: prompts (system prompts from the server), resources (documents for RAG) and instructions (system-level instructions).

I wrote a diagnostic script and asked the server directly:

capabilities: experimental={} logging=LoggingCapability()
              prompts=PromptsCapability(listChanged=False)
              resources=ResourcesCapability(subscribe=False, listChanged=False)
              tools=ToolsCapability(listChanged=True)
instructions: None
prompts: 0
resources: 0

The server declares the capability but publishes nothing. instructions is None. It really does send the model only the tool list and nothing else.

Hypothesis 5: runtime context in the system prompt. This was the most productive one. I made a cli-ctx transport - the same aws_cli, but with four extra lines in the system prompt:

Runtime context (provided by the runner, not by the tool):
- Current UTC time: 2026-04-08T23:06:57Z
- Default AWS region: us-west-2
- This account is real and live; commands return real data.

Four lines. 118 tokens.

And here is what happened on ec2_cpu_last_hour, n=3:

Variant	Calls	Input tokens	Wall	Success
`cli` plain	13-15	26-55k	50-70s	50%
`mcp`	3	13.4k	14s	100%
`cli-ctx`	2	4.1k	10s	100%

cli-ctx didn't just catch up with MCP - it beat it. Three times fewer input tokens and faster wall-clock.

Where did the effect come from? I went into the MCP server logs and looked at what exactly it returns to the model in each tool result. And here's what was in the very first call_aws response:

"ResponseMetadata": {
  "RequestId": "...",
  "HTTPStatusCode": 200,
  "HTTPHeaders": {
    "date": "Wed, 08 Apr 2026 00:15:21 GMT",
    ...
  }
}

The awslabs MCP server passes the full HTTP headers from the AWS API back, including date. Raw aws CLI v2 returns only the response body without headers. The model on MCP knows, from the very first tool call, what today's date is; the model on raw CLI does not, because its training cutoff is somewhere in 2025, and it honestly assumes it's still 2025.

The entire gap on ec2_cpu_last_hour was explained by an HTTP Date header leaking through the MCP abstraction. Four lines in the system prompt reproduce the effect for free.

That was the moment I rethought all the previous results.

Three mechanisms I found and closed

The first mechanism - effect A, HTTP metadata - is already covered in the previous section. Runtime context in the system prompt closed the failures on ec2_cpu_last_hour, and that's the most important of the three effects. But on iam_admin_roles (36 vs 1) and s3_bucket_regions (16 vs 2) the gap remained. So there had to be at least one more thing going on.

Effect B: batch calling

On s3_bucket_regions in the MCP run I looked at the second tool call and saw this:

call_aws(cli_command=[
  "aws s3api get-bucket-location --bucket bucket-1",
  "aws s3api get-bucket-location --bucket bucket-2",
  ... (15 items total)
])

An array of 15 commands. In a single call. I went to the call_aws description and found this section:

Batch Running: The tool can also run multiple independent commands at the same time. Call this tool with multiple CLI commands whenever possible. You can call at most 20 CLI commands in batch mode.

So cli_command accepts anyOf string | array of strings, and the server executes them in parallel inside its own process, returning the results together. The model reads this and uses it.

My original aws_cli accepted only a string. I wrote tools_cli_v2.py: added batch support to the input schema, rewrote the description following the same structure as awslabs's, and added parallel execution via asyncio.gather.

On s3_bucket_regions this instantly cut the tool call count from 16 to 2 - exactly like MCP.

Effect C: "smart" command choice - turned out to be a benchmark bug

But on iam_admin_roles the effect remained. The model on cli-v2 kept doing 36 calls. I was convinced this was some subtle feature of how the model models command selection, and I was preparing an "unexplained mystery" section for the article.

Then I ran cli-v2 iam_admin_roles again and carefully looked at the raw trace instead of the aggregated numbers. Here is the first tool call:

1. aws_cli (0ms, error=True)
   aws iam list-entities-for-policy --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
     --entity-filter Role --output json

Execution time 0ms. error=True. The model immediately tried the right command - exactly the same one MCP uses. And got an error. Not from AWS - the error never reached AWS. The error came from my own safety.py:

ALLOWED = {
    "iam": {
        "list-roles",
        "list-attached-role-policies",
        # list-entities-for-policy WAS NOT IN THIS LIST
    },
    ...
}

I wrote the whitelist based on how I pictured this task being solved. And I put in exactly the commands needed for the naive path. The model on CLI tried the optimal command, got rejected, fell back to the naive path and conscientiously walked through all 36 roles.

The awslabs MCP server has its own allowlist - significantly broader. And list-entities-for-policy is allowed there.

This was a benchmark bug, not a property of MCP. I added one line to the whitelist:

"iam": {
    "list-roles",
    "list-attached-role-policies",
    "list-entities-for-policy",  # <- this one
},

And re-ran cli-v2 iam_admin_roles:

Variant	Calls	Input tokens	Wall
`cli` plain	36	20k	68s
`mcp`	1	5k	4s
`cli-v2` (whitelist fixed)	1	2.8k	4s

Exactly one tool call. And at the same time fewer input tokens than MCP, because we have one tool description of ~3000 characters and MCP has two descriptions totalling ~5800 characters.

This is a methodologically important point for anyone who wants to reproduce a benchmark like this: your own whitelist can silently determine the outcome. If the allowlist only covers the commands needed for the naive strategy, you aren't measuring the transport, you're measuring your whitelist.

Final table: cli-full vs mcp at n=10

cli-full is the union of all three improvements in a single transport:

Batch input (cli-v2 tool spec).
Rich tool description with batch examples and best practices (cli-v2).
Runtime context in the system prompt (cli-ctx).
Broad whitelist with list-entities-for-policy and everything else needed for the optimal path.

At n=10 per cell, median:

Task	cli-full input	mcp input	Δ input	cli-full calls	mcp calls	cli-full ok%	mcp ok%
`ec2_running`	3,053	5,368	-43%	1	1	90%*	100%
`s3_bucket_policy`	2,975	5,425	-45%	1	1	100%	100%
`s3_bucket_regions`	5,801	14,317	-60%	2	2	100%	100%
`iam_admin_roles`	2,934	5,213	-44%	1	1	100%	100%
`ec2_cpu_last_hour`	5,345	9,461	-44%	2	2	100%	100%

* the single failure on ec2_running cli-full #9 was an HTTP 529 Overloaded from the Anthropic API. That's infrastructure noise, not a transport problem. I deliberately did not retry failed runs to avoid masking real failures - and this lone 529 made it into the stats as 90%. MCP could just as easily have caught the same 529; it just got lucky.

cli-full beats MCP on input tokens on every one of the five tasks, 43-60%. Success rate - parity.

On wall clock MCP wins on 4 of 5 tasks. Reason: wall clock is dominated by AWS API call time, not by model turn time. Tokens don't translate directly into seconds. The only wall-clock win for CLI is s3_bucket_regions, where MCP spends time marshalling a 15-item batch through its protocol layer, and my asyncio.gather does not.

The right question: how much is your engineering time worth

This is where the popular "CLI is better than MCP" narrative breaks.

My cli-full is a few hundred lines of code and half a day of debugging. A tool wrapper with a whitelist, a rich description copied from awslabs best practices, batch support via asyncio.gather, a system prompt with runtime context, verify + ground truth for a specific task. And that's only for AWS. For GCP, for Linear, for Notion - everything from scratch.

awslabs.aws-api-mcp-server is one command (uvx awslabs.aws-api-mcp-server@latest) and one environment variable. Works with every AWS service, not with five tasks. Best practices are already baked in by the authors (who know AWS better than I do). Updates come with @latest. Read-only mode is an environment variable.

MCP pays with service knowledge, CLI pays with engineering labour. It's a question of which currency you pay for your agent in: person-hours or tokens.

When to choose MCP

High velocity, low QPS. New project, the agent has to work tomorrow. MCP installs in 30 seconds and covers everything.
Broad surface. The agent pokes at EC2, S3, IAM, Lambda, CloudWatch, RDS, ECS. Writing a CLI wrapper for each service is an unrealistic budget.
Polyglot environment. AWS today, GCP tomorrow, Notion the day after. Per-service CLI wrappers don't scale; one MCP server per service does.
You're not an expert on the service. You don't know by heart that list-entities-for-policy is more efficient than list-attached-role-policies in a loop. The awslabs authors do. You reuse their knowledge by paying a few extra tokens.
Low QPS. A few hundred agent invocations a day. Saving 8k tokens per request is a few dollars a month. Engineering time costs more.

When to choose a purpose-built CLI

High-QPS production. A million calls a day x 8k extra tokens x $3/M input = $24/day on top. That's $8k a year, which is enough to hire a contractor to write the tool wrapper once.
Narrow, stable task set. The agent does five specific things. A narrow whitelist and a short description will be more compact than any universal MCP server.
Full control over the context. Every token in the system prompt and tool description is yours. No ~3KB of hidden awslabs guidance, no update surprises, no external dependency that might suddenly change.
Compliance / audit. Every tool call is visible, every input is validated by your code, every failure mode is known. MCP adds a protocol layer between you and the AWS API that some audits won't accept.
You already have the knowledge. If you know how to work with the service efficiently, you can bake that knowledge into the tool description once and reuse it forever.

Checklist: how to build a cli-full equivalent

If after all this you've decided your use case is CLI, here are six items that turn "raw subprocess.run" into something that beats awslabs MCP.

1. Accept batch input. Tool input schema:

"cli_command": {
  "anyOf": [
    {"type": "string"},
    {"type": "array", "items": {"type": "string"}}
  ]
}

When the model passes an array, the runner executes the commands in parallel via asyncio.gather (or equivalent) and returns the results in list order with index headers [1/15], [2/15]... Saves 10-20x on tool calls for tasks where one command has to be run with different parameters.

2. Put runtime context in the system prompt. Minimum - four lines:

Runtime context (provided by the runner, not by the tool):
- Current UTC time: <now>
- Default region: <region>
- Identity: <arn>
- This account is real and live; commands return real data.

This closes a whole class of problems where the model gets confused about dates, regions, or thinks it's working against documentation rather than production.

3. Write a rich tool description. Aim for 2500-3000 characters. A structure that works (copying awslabs):

Short tool description (1 sentence).
Key constraints (allowed commands, region defaults, auth model).
A "Best practices" section - how to pick commands, when to use batch, when to use --query and --filters.
An "Anti-patterns" section - an explicit "don't list-then-iterate if there's a more specific operation".
2-3 concrete examples covering different task categories.
Restrictions: no shell pipes, no --profile, no substitution.

The model reads this as a cookbook. A badly written description means the model writes naive commands.

4. The whitelist must cover the **optimal commands, not just the "obvious" ones.** This is the point that cost me half a day. Ask yourself: "what would a senior AWS engineer write for this task?" - and make sure that command is in the whitelist. Not just the commands needed for the naive strategy.

5. Return structured output, not prose. Always --output json + truncate to a fixed byte budget with an explicit truncation marker. The model has to know that the response was truncated.

6. Forward tool errors to the model verbatim. When a command fails with [exit=N] <stderr>, return it to the model as-is. It can self-correct on the next turn. Silent failures waste turns for nothing.

Following these six rules turns the CLI wrapper from a parody of a tool into something that actually beats awslabs MCP on tokens. Takes half a day per service.

Methodology notes

Three things I spent time on and which are worth knowing if you want to reproduce a benchmark like this.

First: claude-agent-sdk and Claude Code poison the context. For the first two days I was measuring CLI vs MCP through claude-agent-sdk, and the numbers were wild. 30k input tokens on a "how many running EC2" task. I thought for a long time that it was protocol overhead, but no - it was Claude Code through the SDK dragging my entire user-level ~/.claude.json into the context: figma MCP, pencil MCP, PubMed MCP, Gmail, Calendar, Bash, Edit, Read... 40+ tools from other servers I hadn't asked for. I rewrote the runner onto the direct Anthropic API - cache_read dropped from 30k to 0, input tokens dropped to "normal" 2k on a simple task. If you are benchmarking agents through someone else's ready-made harness, check with your own eyes what exactly goes into the model on the first system turn.

Second: your own whitelist is an invisible benchmark variable. I already wrote about this in the "effect C" section. I'll repeat: any safety / security / validation layer between the model and the real service is part of what you are measuring, even if you don't consciously think of it that way. If your whitelist forces the model into a narrow path, you are measuring the model's behaviour in that narrow path, not the model's behaviour in general.

Third: success_rate and retry policy. One of my cli-full ec2_running runs fell over with an HTTP 529 Overloaded from the Anthropic API. In the stats that's 90% success rate, even though it's not a transport issue. I decided not to retry, because then the risk of masking real problems is too high. The article has to mention that 529 explicitly - otherwise the reader will compare 100% MCP against "90%" CLI and draw the wrong conclusion. Retry policy is yet another invisible variable the benchmark has to state out loud.

Reproducibility

Everything is in a public repo: github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark.

What's in there:

src/agent_loop.py - ~150 lines of a self-contained agent loop on the direct Anthropic API.
src/tools_cli.py, tools_cli_v2.py, tools_mcp.py - CLI and MCP transports. Plus the ablation variants (tools_cli_rich.py, tools_cli_renamed.py, tools_cli_with_fake_suggest.py) from the "five hypotheses" section.
src/runner.py - CLI for running --tasks <ids> --transports <ids> --n <N>.
src/aggregate.py - medians + IQR + success rate from raw JSONL.
src/safety.py - whitelist + injection guard.
src/ground_truth.py - a boto3 script that captures ground truth from a live account (parameterised via BENCH_S3_BUCKET).
results/scrubbed/final_summary.json - aggregated numbers at n=10 across all (task, transport) cells. These are the same numbers as in the tables above, in machine-readable form.
results/scrubbed/sample_runs.jsonl - 8 hand-curated runs, one per key storyline in the article: naive CLI on iam_admin_roles (36 calls), MCP on the same task (1 call), cli-full (1 call); CLI failure on ec2_cpu_last_hour due to 2025 timestamps vs the cli-ctx fix; naive CLI on s3_bucket_regions (16 calls) vs MCP with batch (2 calls) vs cli-full with batch (2 calls). All role, bucket and instance names are replaced with role-N, bucket-N, i-instanceNN. Metrics and full model response text are preserved.
docs/findings.md - extended analytical notes, part of which went into this article.

Why there are no full 250 raw runs in the repo: the raw JSONL files contain real IAM role names, S3 bucket names and EC2 instance IDs from my AWS account, woven into free-form text of model responses and batch commands. They can't be auto-scrubbed without a manual mapping for every name, and one missed line is a leak. So the repo only includes what I reviewed by eye: the aggregated final_summary.json and 8 curated sample runs. If you want to see a full dataset, the best way to get a correct one is to run the benchmark on your own account in ~20 minutes (see below).

To run the benchmark under your own account:

Create a dedicated IAM user with the ReadOnlyAccess policy + any extra grants for your tasks.
cp .env.example .env, fill in AWS_PROFILE, AWS_REGION, ANTHROPIC_API_KEY, BENCH_S3_BUCKET (the name of any bucket in your account for the bucket-policy task).
python -m src.ground_truth - captures ground truth for your account.
python -m src.runner --n 10 - runs the full series, ~15-20 minutes, ~$5-10 on the Anthropic API.
python -m src.aggregate results/raw/*.jsonl - prints the table.

If you repeat this on your own stack and get different numbers - drop me a line, I'd love to compare.

Conclusions

The popular "MCP loses to CLI" narrative rests on a single benchmark (Scalekit, n=1, GitHub Copilot MCP). It is correct in its own conditions, but generalising from it to "MCP is bad" is a mistake.
AWS has already solved the schema-dump problem in awslabs.aws-api-mcp-server. Their flagship MCP server is essentially the CLI with two tools, and that's a fair benchmark partner for raw aws CLI.
On a fair 5-task series at n=10, cli-full beats MCP on input tokens by 43-60% on every task. But that takes writing a tool wrapper, a whitelist, a system prompt, a rich description. Half a day of engineering per service.
The real question isn't "MCP or CLI" but "how much does your engineering time cost vs how much do your tokens cost". MCP wins on velocity, broad surface, polyglot, low-QPS. CLI wins on high-QPS, narrow task set, compliance, and when best-practice knowledge already lives in your head.
All three gap mechanisms - HTTP metadata, batch calling, a broad allowlist - are reproducible in a CLI tool via 4 lines in the system prompt, anyOf string | array in the input schema, and one line in the whitelist. None of them is a structural property of the MCP protocol.
Methodologically - check with your own eyes what goes into the model's context, treat your own whitelist as a benchmark variable, and state your retry policy explicitly when reporting success rate.

If after all this you look at your own use case and decide you want a well-designed CLI tool, the six-item checklist is above. If you decide you want MCP - uvx awslabs.aws-api-mcp-server@latest and you're in the game.

Both options are correct answers to different questions.

Stop Using JSON in Claude Prompts. I Tested 4 Formats — One Won by 30%.

Webmaster Ramos — Tue, 14 Apr 2026 20:22:36 +0000

My own benchmark across three Claude tiers (Haiku, Sonnet, Opus): 120 data files, 8 real-world scenarios, 5 formats. Tokens, cost, and accuracy – numbers, not opinions.

You Are Overpaying for Prompts

Every time you send data to the Claude API, the format of that data determines how many tokens you spend. The same 200-product catalog in JSON costs 15,879 tokens. In Markdown, it costs 7,814. In TOON, 6,088. That is a 62% difference.

A 120-task list? JSON consumes 8,500 tokens. TOON uses 2,267. Savings: 73%.

The problem is that every existing benchmark focuses on GPT, Gemini, and Llama. There has not been a public benchmark for Claude. I decided to fix that.

I ran 450 API calls on Claude Haiku 4.5, tested Sonnet 4.6 and Opus 4.6, and counted tokens across 120 files using Anthropic’s production tokenizer. Eight real-world scenarios, five formats. In this article – the results, the conclusions, and specific recommendations.

Five Formats at a Glance

JSON (JavaScript Object Notation)

Year created: 2001; ECMA-404 standard (2013)
Author: Douglas Crockford
Primary use case: APIs, data exchange between systems, configuration files
Key characteristic: strict typing, nesting via {} and [], mandatory quotes

JSON is the lingua franca of programmatic interfaces. Every API speaks JSON, and every language can parse it. But that universality comes at a price in an LLM context: quotes, braces, and commas all consume tokens. They carry syntactic weight, but not semantic meaning.

{"products": [{"id": 1, "name": "Mouse", "price": 29.99, "in_stock": true}]}

YAML (YAML Ain't Markup Language)

Year created: 2001; YAML 1.2 standard (2009)
Authors: Clark Evans, Ingy döt Net, Oren Ben-Kiki
Primary use case: configuration files (Docker Compose, Kubernetes, GitHub Actions)
Key characteristic: indentation-based structure, minimal punctuation

YAML is the de facto standard of the DevOps world. It reads like pseudocode and usually does not require quotes. The trade-off is that repeating keys for every array item eats up much of the punctuation savings.

products:
  - id: 1
    name: Mouse
    price: 29.99
    in_stock: true

Markdown

Year created: 2004
Author: John Gruber (with Aaron Swartz)
Primary use case: documentation, READMEs, blogs, wikis
Key characteristic: human-first syntax – headings #, tables |, lists -

Markdown is the most “native” format for LLMs. Models have been trained on billions of READMEs and wiki pages. GitHub, Notion, Obsidian – all rely on Markdown. It is a communication format, not a data format.

## Products

| ID | Name  | Price | In Stock |
|----|-------|-------|----------|
| 1  | Mouse | 29.99 | Yes      |

Plain Text

Primary use case: human communication – emails, notes, instructions
Key characteristic: no syntax, no markup, maximum flexibility

Plain text with no markup. It minimizes token overhead, but it provides no explicit structure for programmatic data extraction.

Products: Mouse (ID 1, $29.99, in stock)

TOON (Token-Oriented Object Notation)

Year created: 2025 (v1.0 – November 2025, MIT license)
Author: open-source community (GitHub)
Primary use case: token optimization in LLM prompts, replacing JSON in AI workflows
Key characteristic: a YAML + CSV hybrid (indentation for objects, row-style encoding for arrays)

The newest format in this comparison. TOON was created for one purpose: minimize tokens while preserving lossless JSON round-tripping. For arrays of homogeneous objects, field names are declared once and values are written as CSV-style rows. On GPT-5 Nano, it showed 99.4% accuracy with 46% token savings. Before this benchmark, it had not been tested on Claude.

products[1]{id,name,price,in_stock}:
1,Mouse,29.99,true

Methodology

What I Tested

Eight scenarios, each in three sizes (S / M / L), each in five formats. Total: 120 data files.

#	Scenario	Data type	S	M	L
1	System prompt / instructions	Rules, sections	10 rules	30 rules	60 rules
2	Product catalog	Tabular data	20 products	100 products	200 products
3	Roadmap / tasks	Statuses, dependencies	15 tasks	50 tasks	120 tasks
4	Business rules	Conditional logic	8 rules	25 rules	50 rules
5	Few-shot classification	Input-output examples	5 examples	15 examples	40 examples
6	Organizational hierarchy	3 levels of nesting	12 people	60 people	150 people
7	API documentation	Endpoints, parameters	5 endpoints	15 endpoints	30 endpoints
8	Output format	Requesting data in a given format	10 countries	50 countries	100 countries

Few-shot (scenario 5) is a prompting technique in which several “input → output” examples are included directly in the prompt so the model can infer the task from a pattern. For example: "Great product!" → positive, "Terrible quality" → negative, then the question "Love it!" → ?. Zero examples is zero-shot, one example is one-shot, several examples is few-shot. The format of those examples directly affects cost: 40 pairs in JSON take 2,131 tokens; in TOON, 996.

For scenarios 2, 3, 6, and 7, I prepared questions with precomputed correct answers (ground truth). For scenarios 1, 4, and 5, scoring was manual and rubric-based. For scenario 8, I measured output tokens and format compliance.

Models and Pricing

Model	Tier	Input ($/1M)	Output ($/1M)
Claude Haiku 4.5	Fast	$0.80	$4
Claude Sonnet 4.6	Mid	$3	$15
Claude Opus 4.6	Premium	$15	$75

Accuracy was measured across all three tiers. Sizes S and M were tested for accuracy. L-size was used only for token counts.

Clean-Test Principle

All requests were sent directly via the anthropic Python SDK: plain client.messages.create() with temperature=0. No MCP servers, IDE plugins, or agent frameworks.

Token counting was done with client.messages.count_tokens() – Anthropic’s production tokenizer, i.e. the same numbers used for billing. The tokenizer is the same across all Claude tiers – so the token-count data applies to all Claude models.

Benchmark code: github.com/webmaster-ramos/yaml-vs-md-benchmark

Input-Token Efficiency

These numbers apply to all Claude tiers – Haiku, Sonnet, and Opus all use the same tokenizer. The only cost difference comes from the price per token.

Summary Table: Average Input Tokens Across All Scenarios

Format	Average tokens	vs JSON
JSON	3,252	baseline
YAML	2,208	-32%
Markdown	1,514	-53%
Plain Text	1,391	-57%
TOON	1,226	-62%

TOON saves 62% of input tokens on average versus JSON. Markdown saves 53%. YAML, despite its minimal punctuation, saves only 32% – because of repeated keys and indentation overhead.

Breakdown by Scenario (% Savings vs JSON, L-size)

Scenario	YAML	MD	TXT	TOON
Instructions	-22%	-29%	-24%	-24%
Products	-29%	-51%	-53%	-62%
Tasks	-35%	-63%	-69%	-73%
Business Rules	-28%	-52%	-48%	-63%
Few-shot	-31%	-45%	-37%	-53%
Hierarchy	-37%	-61%	-67%	-68%
API Docs	-35%	-45%	-59%	-53%

YAML Savings vs JSON (%, L-size)

MD Savings vs JSON (%, L-size)

TXT Savings vs JSON (%, L-size)

TOON Savings vs JSON (%, L-size)

Detailed Charts by Scenario

Input tokens by scenario: Instructions

)

Input tokens by scenario: Products

Input tokens by scenario: Tasks

Input tokens by scenario: Rules

Input tokens by scenario: Few-shot

Input tokens by scenario: Hierarchy

Input tokens by scenario: API Docs

Key Observations

TOON is the clear leader for tabular data. Product catalogs, task lists, few-shot examples – anything that looks like an array of homogeneous objects. Savings: 62–73% versus JSON.
Markdown is the best all-purpose format. A stable 50–65% reduction across all data types. It is the only format that performs consistently well across tables, instructions, and hierarchies.
YAML is underwhelming. Many people expect YAML to be much more compact than JSON. In practice, the savings are only 14–41%. The reason is repeated keys for every array element.
Plain Text wins on API docs. For technical specifications, plain text is more efficient than TOON (59% vs 53%). Without extra syntax, descriptive text compresses better.
Scale barely affects the percentage savings. The difference between S and L is under 2 percentage points. Format drives efficiency more than data volume does.

Haiku 4.5: When Format Matters

Haiku is the most format-sensitive tier. In 35% of questions, it produced different answers depending on the input format. Accuracy spread reached as high as 36 percentage points between the best and worst format within the same scenario.

Accuracy by Scenario

Accuracy Haiku: Products (product catalog)

Accuracy Haiku: Tasks (tasks / roadmap)

Accuracy Haiku: Hierarchy (organizational hierarchy)

Accuracy Haiku: API Docs (documentation)

Scenario	JSON	YAML	MD	TXT	TOON	Best
Products	63.4%	61.4%	69.2%	70.2%	66.2%	TXT
Tasks	71.0%	65.7%	66.7%	56.7%	65.3%	JSON
Hierarchy	85.7%	92.9%	85.7%	78.2%	85.7%	YAML
API Docs	85.7%	85.7%	57.1%	78.6%	85.7%	JSON/YAML/TOON

Hierarchy shows the sharpest gap: YAML (92.9%) vs Markdown (57.1%) – a 36-point difference. Tree-like structures are clearly easier for Haiku to parse in an indentation-based format.

API Docs: Markdown performs unexpectedly poorly – 57.1% vs 85.7% for JSON. For technical specifications with parameters and types, explicit structure matters more than compactness.

Accuracy by Size (Haiku)

Size	Accuracy
S (small data)	80.3%
M (medium data)	67.2%

Scale matters more than format. Accuracy drops by 13 points when moving from S to M – more than the average difference between formats (5.7 points). The implication is straightforward: reduce data volume first, then optimize format.

Cost: Haiku

Format	Avg tokens	Cost / request	100K requests / month
JSON	3,252	$0.0026	$260
YAML	2,208	$0.0018	$177
MD	1,514	$0.0012	$121
TXT	1,391	$0.0011	$111
TOON	1,226	$0.0010	$98
JSON -> TOON	-	-62%	$162/month

Output Format: Haiku

Output tokens: S-size (10 countries) – Haiku, Sonnet, Opus

Output tokens: M-size (50 countries) – Haiku, Sonnet, Opus

Requested format	S (10 countries)	M (50 countries)	Savings vs JSON
JSON	465	1,985	baseline
YAML	296	1,352	-32..36%
Markdown	165	1,125	-43..65%
Plain Text	294	1,381	-30..37%
TOON	342	1,369	-26..31%

Markdown is the cheapest output format on Haiku. 165 vs 465 tokens on S-size – a 65% reduction. At $4 per 1M output tokens, that matters.

Important: TOON loses on output. Haiku does not know the TOON format and, instead of producing compact CSV-like rows, tends to emit verbose plain text that only vaguely resembles TOON. A few-shot example improves TOON output quality, but it still trails Markdown in efficiency.

Output-Format Choice: Technical Requirements

Output cost is not the only thing that matters. Often, Claude’s response must be processed programmatically – parsed, inserted into a database, or passed to another service. The best output format depends on who or what is going to read it.

Usage scenario	Recommendation	Why
User-facing answer in UI	Markdown	Renders natively, lowest token cost
Backend parsing	JSON	Reliable, universal, guaranteed structure
Config / YAML pipeline	YAML	Human-readable + machine-parsable
Rows for CSV / spreadsheet	TXT	Minimal overhead, structure via delimiters
Compact output for TOON SDK	TOON	Only if using Opus, or with a few-shot example

Rule of thumb: if a human reads the output, use Markdown. If code reads it, use JSON or YAML. Do not optimize output cost at the expense of parsing reliability in production.

Recommendations for Haiku

Data type	Best input	Accuracy	Best output
System prompts	MD	stable	MD
Catalogs, lists	TXT	70.2%	MD
Tasks / roadmap	JSON	71.0%	MD or JSON
Hierarchies	YAML	92.9%	YAML
API documentation	JSON or YAML	85.7%	JSON
Few-shot examples	TOON	65.3% (-0.5% vs JSON)	MD

On Haiku, format matters – especially for hierarchies and API documentation. Use TOON on input where token savings are worth a small accuracy trade-off, but do not use TOON on output without a few-shot example.

Sonnet 4.6: Format Affects Cost, Not Quality

Sonnet 4.6 produced identical answers across all five formats. In 100% of questions, the result was the same regardless of how the data was represented. For Sonnet, format optimization is pure cost reduction with no quality trade-off.

Accuracy: Format-Invariant

Accuracy by model and format

Format	Sonnet 4.6
JSON	89.4%
YAML	89.4%
Markdown	89.4%
Plain Text	89.4%
TOON	89.4%

The answers are completely identical across all formats. Switching from JSON to TOON saves 62% of input tokens while preserving the same output.

Cost: Sonnet

Format	Avg tokens	Cost / request	100K requests / month
JSON	3,252	$0.0098	$975
YAML	2,208	$0.0066	$663
MD	1,514	$0.0045	$454
TXT	1,391	$0.0042	$417
TOON	1,226	$0.0037	$368
JSON -> TOON	-	-62%	$607/month

At 100K requests per month, switching from JSON to TOON saves $607/month. On Sonnet, output costs $15 per 1M tokens, so output optimization also matters.

Output Format: Sonnet

Output tokens for Sonnet (estimated as characters ÷ 3.5 chars/token):

Format	S (10 countries)	M (50 countries)
JSON	~210	~1,120
YAML	~195	~1,023
Markdown	~143	~746
Plain Text	~103	~549
TOON	~86	~414

Comparison of output tokens across all three models (S-size):

M-size (50 countries):

On Sonnet, TOON output requires a few-shot example. Without extra context, Sonnet interprets “TOON format” literally – as an abbreviation connected to cartoons – and returns an irrelevant answer. With a format example in the prompt, it generates correct TOON.

Technical requirements for output on Sonnet are the same as on Haiku: if a downstream system parses the response programmatically, use JSON or YAML. If a human is going to read it, use Markdown.

Recommendations for Sonnet

On Sonnet, format choice is a pure cost optimization. The logic is simple:

Input data: use TOON (for tables) or MD (for instructions / hierarchies)
Human-readable output: Markdown (-65% vs JSON)
Machine-parsed output: JSON (most reliable) or YAML (more compact, still parseable)
TOON output: add a few-shot example to the prompt; otherwise the answer may be incorrect

Optimal prompt design: MD for instructions + TOON for data + a request for MD/JSON output.

Opus 4.6: Maximum Capability, Also Format-Invariant

Opus 4.6 is the strongest model and the most expensive one. Like Sonnet, it is completely insensitive to input format. But Opus has one unique advantage: it knows TOON “out of the box.”

Accuracy: Format-Invariant

Format	Opus 4.6
JSON	93.5%
YAML	93.5%
Markdown	93.5%
Plain Text	93.5%
TOON	93.5%

The answers are 100% identical across all formats. Changing format affects only cost.

Cost: Opus

Format	Avg tokens	Cost / request	100K requests / month
JSON	3,252	$0.0488	$4,878
YAML	2,208	$0.0331	$3,312
MD	1,514	$0.0227	$2,271
TXT	1,391	$0.0209	$2,087
TOON	1,226	$0.0184	$1,839
JSON -> TOON	-	-62%	$3,039/month

On Opus, switching from JSON to TOON saves over $3,000/month at 100K requests. Output costs $75 per 1M tokens – so format optimization has the largest financial impact here.

Output Format: Opus

Output tokens for Opus (estimated as characters ÷ 3.5 chars/token):

Format	S (10 countries)	M (50 countries)
JSON	~254	~1,271
YAML	~286	~1,414
Markdown	~177	~814
Plain Text	~194	~986
TOON	~106	~543

Comparison of output tokens across all three models (S-size):

M-size (50 countries):

Opus generates TOON without hints. That is the key difference from Sonnet and Haiku. Opus knows the format and produces valid TOON output on the first try.

Can Claude generate valid TOON output?

Model	Without example in prompt	With few-shot example
Opus 4.6	Valid TOON	Valid TOON
Sonnet 4.6	Cartoon / irrelevant	Valid TOON
Haiku 4.5	Verbose plain text	Closer to TOON, but still inaccurate

In practical terms, this means: if you need TOON output and want it to work reliably without prompt scaffolding, use Opus.

Technical Requirements for Output: When Parsing Matters More Than Cost

On Opus, output costs $75 per 1M tokens – so output-format savings are highly relevant. But the requirements of the downstream system still take priority:

Scenarios where output must be parsed programmatically:

The response goes into a database or structured store – use JSON
Another LLM or service consumes the response through an API – use JSON or YAML
The response is part of a pipeline (the next step processes the data) – use JSON
The response is rendered in the UI as text or a document – use Markdown (lowest token cost)
You need compact machine-readable output and already have a TOON SDK – use TOON (only Opus works reliably without prompt help)

The key point: output on Opus costs $75 per 1M – five times more than input. A 65% output reduction (Markdown vs JSON) can matter even more than input savings. But do not trade away parse reliability just to cut cost.

Recommendations for Opus

Input: TOON for tabular data (-62%), MD for instructions (-53%)
Human-readable output: Markdown (-65% output tokens)
Machine-parsed output: JSON – reliable and universal
TOON output: works without few-shot – Opus’s unique advantage
Do not use JSON on input: it is the most expensive format with no accuracy benefit

Summary Results

Accuracy Across All Models and Formats

Format	Haiku 4.5	Sonnet 4.6	Opus 4.6
JSON	75.3%	89.4%	93.5%
YAML	75.1%	89.4%	93.5%
Markdown	69.6%	89.4%	93.5%
Plain Text	70.6%	89.4%	93.5%
TOON	74.8%	89.4%	93.5%

For Sonnet and Opus, format does not affect accuracy. For Haiku, it matters materially – especially for hierarchies and documentation.

Decision Matrix: Input Format

Data type	Haiku	Sonnet / Opus
System prompts / instructions	MD (-29%)	TOON or MD
Catalogs, lists	TXT (70.2%)	TOON (-62%)
Tasks / roadmap	JSON (71.0%)	TOON (-73%)
Business rules	JSON (stable)	TOON (-63%)
Few-shot examples	TOON (≈JSON)	TOON (-53%)
Hierarchies	YAML (92.9%)	TOON or MD
API documentation	JSON/YAML (85.7%)	TXT (-59%)

Decision Matrix: Output Format

Output consumer	Recommendation	Haiku	Sonnet	Opus
UI / end user	Markdown	native	native	native
API / JSON parser	JSON	reliable	reliable	reliable
YAML pipeline	YAML	reliable	reliable	reliable
TOON SDK	TOON	with few-shot*	with few-shot*	native
CSV / spreadsheet	TXT	with template	with template	with template

*Requires a few-shot example in the prompt

Benchmark Limitations

Accuracy was measured only on S+M sizes. L-size includes token counts only. Accuracy may degrade more sharply on larger data.
The data is synthetic. Catalogs and tasks were script-generated. Real-world data may be messier (missing fields, Unicode, long descriptions).
Automatic scoring covers 4 of 8 cases. Cases 1, 4, and 5 require rubric-based evaluation. The accuracy numbers here cover cases 2, 3, 6, and 7.
Sonnet / Opus were tested via subscription (subagents). Output-token counts are estimated, not directly measured. Haiku was tested via API.
No A/B test on live traffic. This is a laboratory benchmark. The impact on a production product must be validated separately.

The code and data are open – reproduce it, extend it, challenge it.

What Surprised Me

Opus and Sonnet are completely insensitive to format. I expected a 3–5% gap. I got 0%. For the higher tiers, format is pure cost optimization.
YAML is not as efficient as many assume. The expectation is usually “YAML is more compact than JSON.” In practice, the savings are only 32%. Repeated keys wipe out much of the benefit from removing braces.
TOON works on Claude without special training. Claude may not have seen much TOON in training data, yet all three tiers parse it correctly – essentially on par with JSON.
Opus knows TOON; Sonnet does not. Opus generates valid TOON output without hints. Sonnet interpreted “TOON format” as “cartoon” and produced an irrelevant answer. With a few-shot example, both work correctly.
Markdown is the best output format. The gap in output tokens between JSON and Markdown is 65%. At $75 per 1M on Opus, that is significant. It is also the only format every tier generates natively without extra prompting.
On Haiku, scale matters more than format. Accuracy drops from 80.3% (S) to 67.2% (M) – a 13-point drop. The average difference between formats is 5.7 points. On Sonnet and Opus, scale is much less of an issue.

FAQ

Q: Do these results apply to other models (GPT, Gemini)?

The trends are similar, but the numbers differ. Every model has its own tokenizer. On GPT-5 Nano, YAML shows 62% accuracy on nested data (ImprovingAgents); on Claude Haiku, it reaches 93%. Use these results for Claude, and other benchmarks for other models.

Q: How were tokens counted?

Using client.messages.count_tokens() – the standard Anthropic SDK method and production tokenizer. These are the same numbers used for billing. The tokenizer is the same across all tiers.

Q: Why not test XML?

XML is rarely used in modern LLM workflows. Existing benchmarks (ShShell) suggest that XML is significantly more expensive than Markdown in token terms, with comparable or worse accuracy.

Q: Is TOON a serious format or just hype?

TOON v1.0 was released in November 2025 under MIT, and there are SDKs in 6+ languages. For tabular data, the savings are real – 62% on Claude with JSON-level accuracy. Opus generates TOON output without prompting. Other tiers require a few-shot example.

Q: Does the input format affect the output format?

Partially. If you provide data in YAML, Claude is more likely to structure its answer with indentation. But an explicit instruction such as “Return as a Markdown table” overrides that tendency.

Q: Is it worth converting all prompts away from JSON?

At 100K requests/month on Sonnet, moving from JSON to TOON saves $607/month. On Opus, it saves $3,039/month. For hobby projects with 1K requests, the difference is around $6. Run the math on your own usage.

Q: Can you combine formats in one prompt?

Yes – and that is usually the recommended approach. Markdown for instructions + TOON for data + a request for output in the format you need. Claude handles multi-format prompts well.

Q: Where is the benchmark source code?

github.com/webmaster-ramos/yaml-vs-md-benchmark. All 120 data files, 51 questions, ground truth, runner, and scorer are open for reproduction.

Conclusion

Data format in a prompt is not a cosmetic choice. On the Claude API, the gap between JSON and TOON is 62% on input tokens. Markdown saves 65% on output tokens. At 100K requests/month on Opus, that means $3,039 saved on input and even more on output.

But the main finding is not about tokens. Claude Sonnet 4.6 and Opus 4.6 are completely insensitive to format. They produced 100% identical answers on JSON, YAML, Markdown, Plain Text, and TOON. For the higher tiers, format optimization is pure savings with no quality trade-off.

Only Haiku 4.5 is meaningfully format-sensitive – and only there does the choice of format affect accuracy (by up to 36 percentage points). On Haiku, format should be matched to data type: YAML for hierarchies, JSON for tasks with dependencies.

Beyond cost, there are technical requirements: if the output must be parsed programmatically, JSON is more reliable than Markdown. If a human reads the answer, Markdown is cheaper. Opus is the only tier that generates TOON natively; Sonnet and Haiku require a few-shot example.

TL;DR by tier:

	Haiku 4.5	Sonnet 4.6	Opus 4.6
Does format affect accuracy?	Yes, by up to 36 points	No	No
Best input (data)	YAML/JSON/TXT by data type	TOON	TOON
Best input (instructions)	MD	MD	MD
Best output (human-readable)	MD	MD	MD
Best output (parsing)	JSON	JSON	JSON
TOON output without prompt help	No	No	Yes
JSON -> TOON savings	$162 / 100K	$607 / 100K	$3,039 / 100K

Benchmark run in April 2026 on Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5.
120 data files, 8 scenarios, 3 sizes, 5 formats, 3 models.
All code and data: github.com/webmaster-ramos/yaml-vs-md-benchmark

DEV Community: Webmaster Ramos

Frontier LLMs Get 2 of 3 Tax Returns Wrong - Stop Letting Them Decide

The decision you should never let a model guess at

The pattern: formalize, don't decide

What is actually doing the reasoning - and why not just Python

The evidence: a cheap model plus an engine reaches the frontier

The one condition: you need a capable formalizer

When the method applies - a three-part gate

Where it pays off

The economics: the model is a compiler, not a calculator

The determinism layer agent commerce is missing

I checked every Universal Cart merchant. None on Magento.

The five-protocol stack, compressed

The one decision

The 30-day playbook

Day 1-3: publish /.well-known/ucp

Day 3-10: map your REST API to MCP transport

Day 10-15: declare a payment handler that matches your processor

Day 15-22: wire loyalty - optional but high-leverage

Day 22-30: validate before you ship

What Adobe is doing (and not doing)

Six Principles in Practice: How an Agentic E2E Found 11 Production Bugs in 8 Runs

Eight runs, eleven bugs

Premise: six principles applied to E2E

The contract: seven environment principles

Four layers of code

The four-agent pipeline

Analyzer

Planner

Generator

Healer

Why four agents, not three

Knowledge as the fourth layer

Two categories of entry

Project-local vs cross-stack

sources.yml routing and kb_by_app

Editorial promotion

The healing loop and the saturation curve

Why this produces compounding

The saturation curve

What the metrics don't track

Stack-agnostic: porting to FastAPI in days, not weeks

What carries over 1-to-1

What is rewritten

What appeared new on the second stack

Metrics on the second stack

Closing

Six Principles for Agent Systems That Don't Hallucinate

Why this article exists

Principle 1: An explicit contract

Principle 2: Role separation

Principle 3: Persistent state between phases

Principle 4: Knowledge as a separate layer

When RAG is actually needed

Principle 5: Closed-loop learning (knowledge compounding)

Principle 6: Additive instrumentation

What these principles give you together

What these principles do not give you

What comes next

One Nav, Two Stacks: A Microfrontend Between Magento and Laravel Without Replatforming

TL;DR

The problem nobody names out loud

Why "just consolidate on one stack" is not the answer

Why repair is different from design

Shell architecture: 15-20 kb, one build, one file

Host integration: Magento 2.4

Host integration: Laravel 11

Representative code shape (abridged)

The host <-> shell contract

The SEO question, answered honestly

What this article proves today - and what it does not yet

What this pattern does not solve

When this pattern fits

What's next

Everyone Says MCP Beats CLI. The AWS Benchmark Disagrees.

TL;DR

Where this whole debate comes from

AWS has already done its homework

Methodology

First attempt: CLI loses everywhere

Day 1-3: publish `/.well-known/ucp`

`sources.yml` routing and `kb_by_app`