Validating a UCP manifest takes a second. Scoring it for agent-readiness takes another. Neither of those answers the harder question: when a real frontier agent — Claude or GPT or Gemini, picked by a user three weeks from now — walks up to your store with an ordinary shopping prompt, does it actually complete a checkout? Compared to the next implementation? Across the models people are actually using?
Today there's no shared way to find out. AI commerce has the same coordination problem ML had before MLPerf, web performance had before Lighthouse, and coding models had before HumanEval — and the cost of not solving it is the same: every claim a vendor makes about agent-readiness is currently unverifiable by anyone outside that vendor.
This post is about what we've been building to close that gap.
The pre-benchmark moment
Every category that grew up around AI has gone through a pre-benchmark moment.
Machine learning before MLPerf was a pile of vendor-flavoured numbers. NVIDIA reported one set of throughput claims, Google another, AMD a third — and none of it was directly comparable, because nobody was running the same workload, on the same input, on the same harness. MLPerf — submitted to, run by, and audited across the whole industry — fixed that. Buyers could finally compare. The category matured.
Web performance before Lighthouse was the same. "Fast website" was vibes. PageSpeed Insights gave one number, WebPageTest another, internal RUM dashboards a third. Lighthouse — graded, reproducible, open — fixed it. Today nobody ships a serious site without checking their score.
Coding models before HumanEval were even worse. Every lab benchmarked against its own preferred problems and reported its own preferred metrics. HumanEval, then MBPP, then SWE-bench, then LiveCodeBench, gave the field a shared evaluation surface. Comparisons stopped being marketing.
Agentic commerce is in exactly the place those categories were before their benchmarks landed. The standard has converged — UCP is the open spec the industry is building against, and the public directory tracks 4,500+ verified stores. Major retailers and platforms ship UCP implementations almost weekly. The recent tech council expansion brings in most of the rest. But there is still no neutral, reproducible way to evaluate how well any of those implementations actually work when a real frontier agent tries to shop them.
You can't get this from inside a vendor. Shopify cannot credibly benchmark Shopify stores. OpenAI cannot credibly benchmark OpenAI agents. Even when their numbers are honest, the methodology is theirs, the test conditions favour their stack, and nobody else can rerun it. It's the MLPerf coordination problem again, and it resolves the same way: a shared evaluation layer, run by a third party, that anyone can audit and reproduce.
Agentic commerce can't mature without that layer. We've built a first credible attempt at one.
What UCP Playground Evals does
UCP Playground Evals is a benchmark framework for agentic commerce. You define a multi-turn shopping conversation, pick the stores and the models you want to evaluate against it, and get back a structured comparison report — funnel matrix, per-session token and duration metrics, error classification, replayable session links, downloadable PDF.
The point isn't the report format. The point is the three properties underneath, because those determine whether a benchmark is worth trusting.
1. Standardised, multi-turn sequences
Agentic commerce is conversational, not single-prompt. A real shopping session looks like "Show me products under $60" → "Add both to my cart" → "Proceed to checkout", with full context carried across turns. That's the unit an eval has to operate on.
Each eval is a scripted sequence of turns. Every turn gets its own orchestrator round (up to 8 internal tool-calling sub-turns) and the full conversation history is preserved across the sequence — so the agent's choices on T2 are conditioned on what it actually saw on T1, the way real user behaviour conditions on real responses. Four collections ship today: Browse & Buy (4 turns, generic shopping journey), Multi-Item (3 turns, multi-product cart composition and checkout), Price Constrained (3 turns, budget-anchored reasoning across a single purchase), and Custom for user-defined sequences.
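To make the unit concrete, here is a minimal sketch of what a scripted sequence reduces to. The field names are illustrative assumptions, not the framework's published schema:

```python
# Illustrative only: field names are assumptions, not the real schema.
collection = {
    "name": "multi-item-checkout",
    "stores": ["oakywood.shop", "ugmonk.com"],
    "models": ["gemini-3.1-pro", "gemini-3-flash"],
    "turns": [                     # full history carries across turns
        "Show me products under $60",
        "Add both to my cart",
        "Proceed to checkout",
    ],
    "max_sub_turns": 8,            # tool-calling sub-turns per orchestrator round
}
```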
2. Cross-store comparability
The sequences are intentionally generic. Not "Find Nike Air Max 90 in size 10" but "Show me products under $60". That distinction is load-bearing: it's what makes the same test valid against any store running UCP, and it's what makes results from one store directly comparable to results from another. Without it, every benchmark is apples-to-oranges and nothing aggregates.
The eval runner discovers MCP endpoints automatically from each store's /.well-known/ucp manifest, so any UCP-conformant store works without per-store wiring — Shopify, WooCommerce, BigCommerce, Magento, PrestaShop, and Custom & Headless stacks all work the same way.
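A rough sketch of that discovery step. The `/.well-known/ucp` location is the standard one; the manifest key path below is our assumption for illustration:

```python
import json
import urllib.request

def discover_mcp_endpoint(store: str) -> str:
    """Resolve a store's MCP endpoint from its UCP manifest.

    The /.well-known/ucp location is standard; the key path
    used below is an assumption for illustration.
    """
    with urllib.request.urlopen(f"https://{store}/.well-known/ucp") as resp:
        manifest = json.load(resp)
    return manifest["mcp"]["endpoint"]  # assumed key path

# Works identically for any conformant stack:
# discover_mcp_endpoint("oakywood.shop")
```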
3. Multi-model coverage
The same sequence runs against any of 15 models currently wired up — every major lab is represented, including a reasoning-tuned subset:
| Model | Provider | Type |
|---|---|---|
| Claude Opus 4.6 | Anthropic | Frontier |
| Claude Sonnet 4.5 | Anthropic | Frontier |
| GPT-5.2 | OpenAI | Frontier |
| GPT-4o | OpenAI | Frontier |
| Gemini 3.1 Pro | Google | Frontier |
| Gemini 3 Flash | Google | Frontier |
| Gemini 2.5 Pro | Google | Frontier |
| Gemini 2.5 Flash | Google | Frontier |
| Grok 4 | xAI | Frontier |
| DeepSeek V3.2 | DeepSeek | Frontier |
| Llama 3.3 70B | Meta | Frontier |
| DeepSeek R1 | DeepSeek | Reasoning |
| QwQ 32B | Alibaba | Reasoning |
| Grok 3 Mini | xAI | Reasoning |
| o4-mini | OpenAI | Reasoning |
The model is part of the test matrix. Same store, different models, same sequence — directly comparable behaviour, with model-level differences surfaced rather than averaged away. Any two can also be compared side-by-side outside the eval framework, on the same workload.
The math is straightforward
stores × models × sequences = sessions. Two stores × two models × one sequence = four sessions. Each one is a full agent shopping run, captured end-to-end, replayable, and rolled up into the report.
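In code, the matrix is a plain cross product, each element one session:

```python
from itertools import product

stores = ["oakywood.shop", "ugmonk.com"]
models = ["gemini-3.1-pro", "gemini-3-flash"]
sequences = ["multi-item-checkout"]

# Every (store, model, sequence) triple is one full agent shopping run.
sessions = list(product(stores, models, sequences))
assert len(sessions) == 2 * 2 * 1  # four sessions
```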
Standardised, reproducible, vendor-neutral. The three properties that make a benchmark worth trusting. Everything else in the framework is built to defend those three.
What the framework actually surfaces
The clearest way to show what evals do is to walk through one. Below is a multi-item checkout report we ran across two stores and two Gemini models in March:
Download the full multi-item checkout report (PDF) →
Two-page report covering the funnel comparison matrix, per-session performance breakdown, evaluator configuration, auto-generated recommendations, and clickable session-replay IDs for every run.
Two stores (oakywood.shop, ugmonk.com). Two models (Gemini 3 Flash, Gemini 3.1 Pro). One sequence (multi-item checkout: search → add → checkout). Four sessions total. The headline numbers:
- 100% checkout rate across all four sessions
- 95,513 average tokens per session
- 48.3s average duration
- 0 errors across the matrix
That's the boring summary. The interesting parts are in the per-session table.
| Store | Model | Tokens | Duration | Turns | Cart value |
|---|---|---|---|---|---|
| oakywood.shop | Gemini 3.1 Pro | 85,614 | 93.4s | 7 | EUR 82.75 |
| oakywood.shop | Gemini 3 Flash | 154,294 | 34.7s | 12 | — |
| ugmonk.com | Gemini 3.1 Pro | 46,084 | 35.1s | 6 | USD 77.00 |
| ugmonk.com | Gemini 3 Flash | 96,058 | 29.9s | 11 | — |
Same sequence, same stores, two models. Gemini 3.1 Pro completes the run in fewer turns and roughly half the tokens of Flash on the same store, but its latency is meaningfully higher when the store itself is slower to respond. That isn't a fact you can extract from a vendor benchmark or a single-model demo. It only shows up when the same scripted run hits multiple models head-to-head, with both numbers landing in the same row.
The auto-generated recommendations point at where the real engineering work is, and they're grounded in the actual run data:
> Average token usage is 95,513 — above the 40K baseline. Product descriptions may be inflating context. Consider truncating descriptions in MCP responses.

> Average session duration is 48.3s — above the 15s target. Optimise MCP endpoint response times, especially initial search calls.
Those are concrete merchandising actions. They land because the evidence is right there in the per-session breakdown.
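The rollup is simple enough to reproduce by hand from the per-session table. A quick sketch; the 40K and 15s thresholds come straight from the report text:

```python
# Reproducing the headline rollup from the four per-session rows above.
sessions = [
    {"tokens": 85_614,  "duration_s": 93.4},  # oakywood.shop / Gemini 3.1 Pro
    {"tokens": 154_294, "duration_s": 34.7},  # oakywood.shop / Gemini 3 Flash
    {"tokens": 46_084,  "duration_s": 35.1},  # ugmonk.com    / Gemini 3.1 Pro
    {"tokens": 96_058,  "duration_s": 29.9},  # ugmonk.com    / Gemini 3 Flash
]

avg_tokens = sum(s["tokens"] for s in sessions) / len(sessions)        # 95,512.5 -> 95,513
avg_duration = sum(s["duration_s"] for s in sessions) / len(sessions)  # 48.275   -> 48.3s

# The recommendations fire off simple thresholds (40K token baseline,
# 15s duration target, per the report text).
assert avg_tokens > 40_000    # triggers the "truncate descriptions" recommendation
assert avg_duration > 15.0    # triggers the "optimise MCP response times" recommendation
```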
The deeper signal shows up across runs against richer stores. In a separate eval against a single shop, two models mapped the size "Medium" to different variant IDs — and neither is provably correct, because the store doesn't expose a human-readable size axis in its variant data. That isn't a bug in either model. It's a gap in how the store represents its product axes, and it only becomes visible when two models walk the same path. This is the kind of behavioural divergence between frontier models that evals surface — and that vendor-internal benchmarks can't credibly report.
The same run logged 6/6 prompt-injection resistance across every session, against benchmark prompts seeded in product descriptions and review fields. Useful by itself; more useful as a baseline that future runs can regress against.
What's on the evals roadmap
This is v1. A few things on the roadmap, in priority order.
More eval collections. The four built-in sequences cover the core shopping flow. The next batch is more diagnostic: single-item flow (the simplest path), variant selection accuracy (the size-label gap above, formalised), prompt-injection resistance (already running, becoming its own collection), escalation handling (requires_escalation compliance), attribution accuracy (UTM and referrer handling at checkout hand-off), return policy surfacing.
Public benchmark leaderboards. Same pattern as the UCP Score leaderboard — by-store and by-model rankings against the standard sequences, refreshed on schedule, indexed and shareable. The categories that matured around shared benchmarks (ML, web perf, coding models) all developed public leaderboards — and the leaderboards turned out to be most of the forcing function.
Headless API and CI/CD integration. Already shipped. The full automation surface:
- `POST /api/v1/collections` — create
- `POST /api/v1/collections/{id}/run` — trigger
- `GET /api/v1/collection-runs/{id}` — poll status + results
- `GET /api/v1/collection-runs/{id}/pdf` — download report
The first integration we expect anyone to ship is a deploy-time check: trigger an eval after every UCP manifest deploy, assert checkout_rate >= 80, errors.total == 0, avg_duration_ms < 30000, fail the build otherwise. Same shape as Lighthouse CI for web performance — a regression catch you bolt onto the pipeline rather than rediscover in production. Full developer documentation — authentication, rate limits, and a worked GitHub Actions example — lives at ucpchecker.com/developer-tools, alongside the rest of the public API surface.
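A minimal sketch of that gate, using the endpoints listed above. The response field names (`status`, `summary`, and the metric paths) are assumptions; the exact schema lives in the developer docs:

```python
# Sketch of a deploy-time eval gate. Response field names are assumptions.
import os
import sys
import time

import requests

API = "https://ucpplayground.com/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['UCP_API_KEY']}"}
COLLECTION_ID = os.environ["UCP_COLLECTION_ID"]

# Trigger a run against the already-configured collection.
run = requests.post(f"{API}/collections/{COLLECTION_ID}/run", headers=HEADERS).json()

# Poll until the run finishes.
while True:
    result = requests.get(f"{API}/collection-runs/{run['id']}", headers=HEADERS).json()
    if result["status"] in ("completed", "failed"):
        break
    time.sleep(10)

# The assertion shape from the post; exact JSON paths are assumptions.
ok = (
    result["summary"]["checkout_rate"] >= 80
    and result["summary"]["errors"]["total"] == 0
    and result["summary"]["avg_duration_ms"] < 30_000
)
sys.exit(0 if ok else 1)  # non-zero exit fails the build
```

Dropped into a GitHub Actions step after the deploy job, a regressed manifest fails the build before it reaches production.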
Scheduled runs and version tracking. Also shipped. Collections auto-increment versions when their config changes, runs snapshot the config they used, and a cron field on each collection lets you run the same eval on a regular cadence — same Monday-9am sequence every week, before-and-after comparisons whenever the underlying UCP implementation changes. This is how a benchmark becomes a tracking record instead of a one-shot demo.
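A sketch of that contract, with assumed field names:

```python
# Illustrative only: field names are assumptions, not the real schema.
collection = {
    "version": 3,
    "cron": "0 9 * * 1",  # same sequence, every Monday at 09:00
    "models": ["gemini-3.1-pro", "gemini-3-flash"],
}

def snapshot_run(collection: dict) -> dict:
    # Each run records the exact config it executed against, so a later
    # config change can't silently rewrite what an old result meant.
    return {"collection_version": collection["version"],
            "config": dict(collection)}
```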
Cloning and team scoping. Public collections can be cloned into any team workspace; quotas are scoped per team. The intent is community sharing — well-known sequences turning into shared, reusable yardsticks the way SWE-bench problem sets did for coding models.
How evals fit the broader development cycle
Evals don't sit alone. They're the runtime testing surface in a development loop that starts earlier in UCP Checker — manifest validation, agent-readiness scoring, capability coverage analysis. The web performance world solved the same shape with three tools used in sequence: Lighthouse to grade pages, PageSpeed Insights to drill into specific issues, synthetic monitoring to verify behaviour over time. UCP implementations follow the same arc: validate the manifest at /check, score it against agent-readiness criteria with the UCP Score, then run evals against it to see how it actually behaves when a real frontier agent shops it.
Each tool surfaces something different. Score tells you what's missing structurally — which discovery signals, which capabilities, which conformance rules. Check confirms the manifest validates after fixes land. Evals confirms the agent actually behaves correctly when it tries to complete a real flow. None is sufficient on its own; together they're the development feedback loop UCP needs. We've watched developers iterate across the whole thing in a single session — score the implementation, fix the gap server-side, re-check the manifest, then run an eval to confirm the agent now closes a checkout it couldn't before.
If you're starting from zero on a UCP implementation, the natural sequence is: get a Score first to see what's missing, fix the highest-impact issues, run a Check to confirm the manifest validates cleanly, then run Evals to confirm real agents complete the flows you care about. CI covers the long tail — automated scoring on each deploy, scheduled evals weekly, alerts when capabilities regress.
Methodology and verification
Three properties separate a credible benchmark from a marketing claim. UCP Playground Evals are designed around all three.
Every result links to a replayable session. Each eval session generates the same agent_sessions data the public Playground UI produces — full tool-call timeline, model responses, token-by-token event stream, every retrieved page. The session IDs in any report are clickable. Open one and you see exactly what the agent did, turn by turn, on which tool call, with which response. The sample report above lists four such IDs (e.g. 01KMJZM5MG2CA4QN5M983H19E1) and each resolves to a full replay at ucpplayground.com/sessions/{id}. This isn't a marketing claim; it's a verifiable test you can audit.
Every collection is versioned. When the configuration of a collection changes — turns added, models swapped, store list updated — the version increments and every run snapshots the config it ran against. Anyone questioning a result can reproduce the exact methodology used at that moment. The PDF report itself prints the collection version at the bottom of every page; the sample above is Collection v3. Versioning is what stops "we got better results" from quietly sliding into "we changed the test" — the same constraint MLPerf submission rules enforce on hardware vendors.
The methodology is open. The framework configuration shape is documented — the turns, the orchestrator loop, the stop conditions, the success metrics, the PDF schema. Anyone can build the same test, run it against any UCP store, and get back a directly comparable report. If we get a methodology choice wrong, the path to disagreement is technical, not promotional.
That's the credibility floor. Everything else in the product builds on it.
About UCP Checker and UCP Playground
UCP Checker is the independent validation and monitoring layer for the Universal Commerce Protocol. We crawl, validate, and grade every public UCP manifest on the open web, run the merchant directory and the UCP Score, publish the leaderboard and adoption stats, and ship developer tools — the validator, bulk checker, browser extension, public dataset, and a public REST API. The whole dataset is open, indexed, and ungated.
UCP Playground is the agent shopping layer that sits next to it — same data model, same /.well-known/ucp discovery, same replayable session format. UCP Playground Evals is the benchmark surface on top of that. Together they form the third-party scoreboard the ecosystem can build trust on top of — the SSL Labs and Lighthouse of agentic commerce, depending on which side you're looking from.
Try it
The interesting eval gaps are the ones nobody's tested yet. If a result surprises you — your own store, a competitor's, a model you assumed was a clear winner that turns out not to be — let us know.
Three concrete next steps:
- Run an eval against your own UCP store. Create a collection at ucpplayground.com/evals, pick a sequence, pick two models, run it. The four-session example above is the shape most first runs take.
- Read a public eval report. Sample reports are linked from the framework page. Each has clickable session IDs you can replay end-to-end.
- Wire it into CI. The developer tools page covers authentication, rate limits, and a GitHub Actions worked example. The assertion shape is the same one Lighthouse CI uses for web performance — `checkout_rate`, `errors.total`, and `avg_duration_ms` instead of LCP and TBT.