Benji Fisher

Posted on • Originally published at ucpchecker.com

UCP Playground at 1,000+ Agent Sessions: What 16 Models and 97 Real Stores Reveal About AI Shopping

Two and a half months ago we published Why We Built UCP Playground, which closed on 114 agent sessions and an honest acknowledgement that the dataset was thin — most models had single-digit sample sizes, store coverage was uneven, and the headline rates moved meaningfully with every new run. A month later we crossed a different threshold: the first fully autonomous AI agent purchase through UCP — a Gemini agent searching, adding to cart, linking identity, paying, and completing checkout at houseofparfum.nl with no human in the loop past the initial prompt.

Eighty days on from the first post, and roughly forty days after that autonomous purchase, the dataset is in a different shape:

  • Over 1,000 agent shopping sessions captured end-to-end with full tool-call timelines and replayable event streams
  • 16 frontier models — every major lab, plus a reasoning-tuned subset
  • 97 distinct UCP-enabled stores across Shopify, WooCommerce, BigCommerce, Magento, PrestaShop, and custom stacks
  • $96,032 of agent-driven cart value generated, primarily in USD with a long tail across EUR, GBP, INR, ILS, PKR
  • 80 days of run history since Feb 14, 2026

That's the reference dataset for this post. Eight findings emerge from it. Most of them survive scrutiny at the new sample size; one or two reverse the early-data narrative.

Finding 1 — Claude Sonnet 4.5 leads on aggregate checkout rate

With sample sizes now large enough to take seriously, the per-model checkout-rate leaderboard looks like this:

| Model | Share of dataset | Checkout rate | Avg tokens | Avg duration | Fail rate |
| --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4.5 | 20.7% | 50.8% | 71,195 | 38.1s | 17.2% |
| Llama 3.3 70B | 6.4% | 49.3% | 57,676 | 47.7s | 14.7% |
| DeepSeek V3.2 | 5.1% | 45.0% | 32,502 | 46.0s | 21.7% |
| Gemini 3 Flash | 12.5% | 44.6% | 46,520 | 21.8s | 15.5% |
| Grok 4 | 4.5% | 39.6% | 34,297 | 77.1s | 9.4% |
| Claude Opus 4.6 | 10.2% | 38.8% | 44,611 | 29.7s | 25.6% |
| Gemini 2.5 Flash | 9.9% | 36.8% | 32,394 | 11.8s | 23.1% |
| GPT-4o | 5.2% | 29.5% | 32,811 | 14.7s | 24.6% |
| Gemini 3.1 Pro | 7.9% | 29.0% | 30,971 | 48.7s | 28.0% |
| Gemini 2.5 Pro | 6.4% | 27.6% | 31,566 | 34.4s | 22.4% |
| GPT-5.2 | 4.7% | 23.6% | 30,585 | 37.4s | 27.3% |
| DeepSeek R1 | 1.4% | 17.6% | 35,360 | 61.4s | 29.4% |
| o4-mini | 1.4% | 12.5% | 64,055 | 38.1s | 37.5% |
| Grok 3 Mini | 1.7% | 10.0% | 58,386 | 55.6s | 35.0% |
| QwQ 32B | 2.0% | 0.0% | 25,525 | 63.9s | 50.0% |

Claude Sonnet 4.5 leads on aggregate checkout rate at 50.8% on the largest single share of the dataset — a sample large enough that the rank ordering is no longer noise. Llama 3.3 70B sits a fraction below at 49.3% on a smaller but still meaningful share. The two are statistically tied; both are operating in a different regime than the rest of the field.

The most interesting result on this table is GPT-5.2, which at 23.6% lands in the bottom third despite being one of the most capable frontier models on essentially every public benchmark. The gap between its performance on standard reasoning benchmarks and its performance on transactional shopping flows is the single largest delta in the leaderboard. We dig into why in the development notes below.

One caveat worth flagging up-front: GPT-5.2's 23.6% figure reflects performance across the full 80-day window, including the period before our cursor-stripping fix landed mid-dataset. Sessions after that fix show GPT-5.2 performing meaningfully more competitively. We'll publish the longitudinal split in the August update — the aggregate number above is the worst-case read.

Finding 2 — Reasoning-tuned models continue to underperform

The cohort of reasoning-tuned models (DeepSeek R1, o4-mini, Grok 3 Mini, QwQ 32B) sits unambiguously at the bottom of the leaderboard. Three of them are in the bottom four overall. QwQ 32B has yet to record a single completed checkout across its share of the dataset.

The pattern was visible in the original four-session sample report shipped with the eval-framework launch in April, and it has only sharpened as the dataset grew by two orders of magnitude. It holds across labs and across architectures: chain-of-thought variants, exploratory reasoning approaches, and distilled-from-frontier models all underperform on shopping flows relative to their non-reasoning counterparts from the same lab.

The working hypothesis remains: shopping requires fast tool-use rhythm, not deliberation. The decisions in a shopping sequence — search this term, add this item, proceed to checkout — are individually shallow but happen in series. A reasoning model that pauses to deliberate at each step burns clock time and tokens on decisions that don't reward deliberation. Combined with reasoning models' tendency to over-question their own outputs, the result is sessions that hit max_turns_exceeded before completing.
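To make the mechanics concrete, here is a minimal sketch of the turn-budget dynamic the hypothesis describes. Everything in it is an assumption for illustration: the function names, the budget of 20 turns, and the outcome strings are ours, not Playground internals.

```python
# Minimal sketch of a turn-budgeted shopping loop. Each decision is shallow,
# but the decisions happen in series against a fixed budget, so extra
# deliberation per step compounds across the whole session.
MAX_TURNS = 20  # assumed budget, for illustration

def run_session(agent, store) -> str:
    for _ in range(MAX_TURNS):
        # search / update cart / checkout: individually shallow choices
        action = agent.next_action(store.state())
        if store.apply(action) == "checkout_complete":
            return "checkout_reached"
    # a model that deliberates at every step exhausts the budget first
    return "max_turns_exceeded"
```

A fast non-reasoning model spends each turn on a tool call; a reasoning model spends some of those turns second-guessing the previous one, and the budget runs out before checkout.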

Worth noting what isn't in this hypothesis: reasoning models are not bad at commerce in general. They may be excellent at higher-stakes flows — disputed transactions, multi-step contractual reasoning, regulatory edge cases — that the current eval workload doesn't probe. The benchmark says: when the workload is "shop normally," fast non-reasoning models win. Other workloads will tell different stories.

Finding 3 — Speed and accuracy aren't correlated

Gemini 2.5 Flash finishes the average shopping session in 11.8 seconds — the only model in the field under 15s. Its checkout rate is 36.8% — middling. Claude Sonnet 4.5 takes 38.1s on average and lands a 50.8% checkout rate — the highest on the leaderboard, at more than triple Flash's clock time.

The trade-off is real in both directions. Latency-bound use cases (voice agents, mobile commerce, conversational checkout where the user is waiting in real time) effectively must use Gemini 2.5 Flash or Gemini 3 Flash, and pay for the latency win with lower closed-checkout rates. Throughput-bound use cases (batch agents, scheduled buying, autonomous shopping where wall-clock time is mostly hidden) should use Claude Sonnet 4.5 or Llama 3.3 70B and accept the latency cost for the conversion lift.

The naive intuition merchants reach for — "the better model is faster and more accurate" — doesn't survive contact with this data. The two axes are essentially independent within this corpus. That's a finding nobody can extract from a single-model demo or a vendor benchmark.

Finding 4 — The failure mode taxonomy is dominated by tool errors, not model refusals

Of the 256 failed sessions in the dataset, 91 carry a categorised error_type. They break down as:

| Error type | Sessions | % of categorised failures |
| --- | --- | --- |
| openrouter_error (provider-side) | 51 | 56% |
| model_refused | 22 | 24% |
| max_turns_exceeded | 18 | 20% |

The single-largest categorised failure mode is provider-side errors — the routing layer between the agent and the model returning a non-200 before the session can complete. This is a cost of operating at scale across 16 models and reflects the still-maturing infrastructure underneath frontier-model API access, not anything specific to UCP.

The second-largest, model refusals, is more interesting. Twenty-two refusals across the dataset is a refusal rate of roughly 2%. We see refusals concentrated in two situations: (1) sessions against demo stores with unusual product names that pattern-match a model's safety filters, and (2) sessions where the user prompt contains adversarial content seeded by us as part of a prompt-injection eval. We've recorded 6/6 prompt-injection resistance across the dedicated injection-eval runs to date, so the model_refused category is partly capturing models doing exactly what they should.

The third, max_turns_exceeded, is concentrated in the reasoning-model cohort and is the empirical signal for the over-deliberation pattern in Finding 2.

The remaining 165 failures don't carry a categorised error_type — typically these are sessions where the model abandoned the flow without raising an explicit error. That's a tagging gap in the framework that we're closing in the next iteration.
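The direction of that fix looks roughly like the sketch below: a fallback classifier that tags a session from its last observed event when no explicit error_type was raised. All names here are hypothetical; this is not the shipped implementation.

```python
# Hypothetical fallback tagging: sessions ending without an explicit
# error_type get classified from their last event instead of falling into
# the uncategorised bucket.
def classify_failure(session) -> str:
    if session.error_type:                # explicit tag already present
        return session.error_type
    if not session.events:
        return "empty_session"
    last = session.events[-1]
    if last.kind == "tool_call" and last.is_error:
        return "tool_error_unraised"      # tool failed, model stopped silently
    return "abandoned_flow"               # model walked away mid-flow
```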

Finding 5 — Store implementation explains most of the cross-store variance

The benchmark's most strategically important finding doesn't come from the per-model column. It comes from the per-store one.

Across the 97 stores in the dataset, the same model produces dramatically different outcomes. Between the most agent-friendly and least agent-friendly implementations at meaningful sample sizes, the checkout-rate spread exceeds 60 percentage points — wider than any model-versus-model gap on the leaderboard. No model in the field, at any sample size, produces a 60-point spread purely on its own merits. Almost all of that variance is store-side, and the run history across 1,000+ sessions makes the pattern hard to attribute to anything else.

The cleanest predictor we've found is whether the store's MCP implementation is stateless or stateful, and how it handles the boundary between them.

Stateless implementations treat every tool call as self-contained. Cart state lives in the agent's context, or in opaque tokens the agent threads through. Identity is established once and re-asserted on each call. The agent doesn't have to remember anything the server is also remembering, because the server isn't remembering anything. Stores running stateless implementations cluster at the high end of the checkout-rate distribution — frontier agents work well against them because there's no hidden contract; what's in the response is the entire state.

Stateful implementations persist server-side session, cart, and auth across calls, exposed to the agent through session IDs, cookies, or scoped tokens. When this works, it works well. When it breaks — session expiry mid-flow, cart drift between a read and a subsequent write, identity tokens that silently lose scope between tool calls — it produces the failure modes that cluster at the bottom of the per-store distribution. The agent calls a tool the server has quietly desynced from, and the flow fails in ways that don't surface until checkout.

The hybrid case is the most error-prone: stores that are stateless in some tools and stateful in others, without making the boundary explicit in the manifest or the tool response shapes. Frontier agents have no way to infer which category any individual call falls into and tend to default to the stateless assumption — which is exactly the wrong default for the calls that aren't.
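To make the contrast concrete, here are the two shapes side by side. The field names below are illustrative assumptions, not the UCP spec.

```python
# Stateless: the response IS the state. Nothing hidden server-side.
stateless_cart_response = {
    "cart": {
        "items": [{"variant_id": "v_123", "qty": 1, "price": "89.00"}],
        "subtotal": "89.00",
        "currency": "EUR",
    },
}

# Stateful: the response is a handle. The real cart lives server-side; if it
# expires or drifts between calls, the agent can't see that until checkout.
stateful_cart_response = {
    "session_id": "sess_8f2a91",  # server remembers; agent must thread this through
    "status": "ok",               # says nothing about what the cart now contains
}
```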

Beyond the state axis, direct testing surfaces a consistent set of secondary trip-wires: variant IDs without human-readable axis labels, description strings exceeding 8K tokens for a single product, tool responses embedding nested HTML in fields agents expect to be plain text, and cart endpoints returning success codes for failed mutations. None of these break UCP Score validation. All of them break agent flows.
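Most of these are catchable with a simple lint pass over a store's own tool responses. The sketch below is hypothetical: the field names and the crude four-characters-per-token estimate are assumptions, and none of this is part of UCP Score.

```python
# Hypothetical lint pass for the trip-wires above. Not a UCP Score check.
def lint_product_response(resp: dict) -> list[str]:
    warnings = []
    for variant in resp.get("variants", []):
        if not variant.get("axis_label"):   # bare variant IDs confuse agents
            warnings.append(f"variant {variant.get('id')}: no human-readable axis label")
    desc = resp.get("description", "")
    if len(desc) / 4 > 8_000:               # crude chars-to-tokens estimate
        warnings.append("description likely exceeds 8K tokens")
    if "<" in desc and "</" in desc:        # nested HTML in a plain-text field
        warnings.append("description contains HTML markup")
    return warnings
```

The fourth trip-wire, success codes on failed mutations, only shows up under live transacting, which is exactly why static validation misses it.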

These are merchant-side fixes, not model-side ones. The strategic implication for any team operating a UCP-enabled store: fixing your manifest and tool responses produces more conversion lift than choosing the right model. That's load-bearing — it's why the integrated Score → Check → Eval workflow exists, and it's where we'd point a team starting from zero on UCP.

Finding 6 — Cart value generated is concentrated in USD and high-AOV verticals

Of the 1,000+ sessions, 97 produced a non-zero cart value. The breakdown:

| Currency | Sessions | Total cart value | Avg cart value |
| --- | --- | --- | --- |
| USD | 85 | $95,647.23 | $1,125.26 |
| INR | 2 | ₹3,845.00 | ₹1,922.50 |
| PKR | 2 | ₨4,490.00 | ₨2,245.00 |
| EUR | 5 | €296.74 | €59.35 |
| ILS | 1 | ₪189.60 | ₪189.60 |
| GBP | 2 | £47.99 | £24.00 |

USD cart value totals $95,647 across 85 sessions with an average cart value of $1,125. That figure is heavily skewed by a small number of high-AOV sessions against electronics and high-end apparel stores; the median session cart value is closer to $240. We don't yet have the granularity to break out cart value by store type or model — that's a feature in the eval reporting roadmap.

The cross-currency long tail (EUR/GBP/INR/PKR/ILS) is small but informative. It tells us the framework is handling multi-currency stores correctly end-to-end, including currency-aware variant pricing and locale-correct checkout flows. Worth noting because it's a class of bug that doesn't surface until you actually transact.

Finding 7 — Session volume is now meaningful enough to reveal trajectory

Plotted week-over-week, session volume has three distinct phases over the 80-day window:

[Figure: UCP Playground weekly session volume, Feb 14 through Apr 27, 2026. Three phases: a small founding wave in mid-February, steady-state oscillation through March and mid-April, and a sharp late-April acceleration producing the largest single week of the dataset.]

Founding wave (mid-February). A small launch surge coinciding with the Why We Built UCP Playground post — first publishers running first sessions, signal that the framework worked end-to-end against real stores.

Steady state (March through mid-April). Weekly volume oscillating in a tight band as more frontier models came online and the eval framework matured. Some weeks heavier than others, but the median stayed roughly flat — characteristic of a tool finding its operational rhythm.

Acceleration (late April). The largest single week of the dataset, driven mostly by a batch of eval-collection runs against stores onboarded after the council expansion announcement. The line bends upward at the end of the window.

The trajectory matters mostly because it lets us start tracking model drift. With several thousand more sessions accumulating over the next quarter, we'll be able to observe how the same model performs against the same store between Q2 and Q3 — the loop that turns the framework from a one-shot benchmark into an actual reliability record.

Finding 8 — The 0.2% flawless-end-to-end rate has improved, slightly

The April State of Agentic Commerce report flagged that of 4,014 verified UCP stores, only 9 delivered a flawless end-to-end agent shopping experience. That's the 0.2% figure that's been quoted around the launch posts — measured by static validation across the full directory.

Eighty days later, with 97 stores tested directly through the eval framework, roughly 0.5–0.7% reach the same bar. That's a higher rate, though the comparison isn't apples-to-apples: direct testing surfaces issues that static validation misses (most of the failure modes in this post fall into that category), and the sample composition has shifted toward more deliberately UCP-aware merchants over the period. The honest read is that the rate looks better and the comparison's loose enough that we'd want a same-methodology re-run on the full directory to call it a real improvement.

What we can say cleanly: for every store running a clean, agent-friendly UCP implementation, there are still 100+ that pass conformance but stumble somewhere in the agent flow. The gap continues to be on the merchant side. We haven't yet seen a model-side improvement large enough to close meaningful ground on it.

Why Playground stays neutral

Every finding above hinges on one design choice: the system prompt and the orchestration loop are generic. Same for every model. Same for every store. No store-specific scaffolding, no model-specific workarounds. That's what makes the framework work as a testing environment.

The temptation to add a workaround when a particular model trips on a particular store is real — there's almost always a one-line patch that would push that store's checkout rate up by ten points against that one model. We don't ship those patches, on principle. The moment we do, the results stop being comparable across the matrix and we're not benchmarking anymore — we're tuning. Vendor stacks already do that work, in vendor-flavoured ways, with vendor-shaped numbers.

Independence here means a specific thing: the orchestration is neutral, the protocol layer is full-featured. Stores get the tools they declare. Identity linking works. Payment handlers pass through. Multi-turn context flows the way the spec defines. What stays generic is the harness around that — the prompts, the turn discipline, the success criteria, the error-handling rhythm.

The reason that design choice matters can be put in two sentences:

  • If a model doesn't follow the checkout flow, that's signal about the model.
  • If a store returns the wrong status, that's signal about the store.

Both signals are useful. Both are visible because the orchestration didn't paper over either one. Hiding either defeats the purpose of running the test.

It's expected, and good, that companies build their own internal infrastructure to evaluate agent behaviour against their own stores. Every serious commerce platform will eventually have something like that running in CI against its own merchants — and the Score → Check → Eval workflow is exactly the surface they should plug into. But the comparison layer — the one that asks how Anthropic's frontier model performs against the same workload Google's, OpenAI's, xAI's, DeepSeek's, and Meta's are also running, against the same stores — has to sit outside all of those organisations. Vendors can't credibly benchmark themselves; the platform layer has the same problem one level down. Independence is the only way the comparisons aggregate into a record anyone can quote.

That's the niche this layer occupies. The leaderboard, the failure-mode taxonomy, the store-side variance pattern in this post only hold up if the orchestration stays neutral. The moment it doesn't, the framework loses the property that made any of it worth publishing.

What we learned building this

The framework didn't ship in May the same shape it shipped in February. Eighty days of running it against real stores produced a steady stream of bugs and surprises that drove the development work — many of them documented in the public changelog. Five worth surfacing.

Cursor stripping unlocked GPT-5.2 search. Through February we had GPT-5.2 at a 0% search success rate on Shopify stores. The cause was a model-side tic: GPT-5.2 always included the optional after cursor parameter on search_shop_catalog calls, filling it with placeholders like "", "null", or "__NONE__" — values Shopify always rejects. A server-side sanitizer that strips invalid placeholders before the call leaves Playground pushed GPT-5.2's search success from 0% to 100% overnight. The model wasn't bad at search; it had a tool-calling habit nobody had isolated yet.
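The fix itself is conceptually tiny. In sketch form (the function shape is an assumption; the three placeholder values are the ones we observed):

```python
# Server-side cursor sanitizer: strip placeholder values from the optional
# `after` parameter before the tool call leaves Playground.
PLACEHOLDER_CURSORS = {"", "null", "__NONE__"}

def sanitize_tool_call(tool_name: str, args: dict) -> dict:
    if tool_name == "search_shop_catalog":
        after = args.get("after")
        if isinstance(after, str) and after.strip() in PLACEHOLDER_CURSORS:
            # drop the bogus cursor rather than trying to repair it
            args = {k: v for k, v in args.items() if k != "after"}
    return args
```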

Failed tool calls used to inflate conversion metrics. An earlier version of step detection counted a failed update_cart as a cart_created completion. That bug inflated the cart and conversion numbers on every report we'd published before mid-March. Fixed in 0.9.3 by gating step detection on the tool response's isError flag, plus the same gate on cart-data extraction. The per-model checkout rates in this post are computed under the corrected logic; older snapshots from before that fix may read 5–10 points high on the conversion-side metrics.
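In sketch form, the gate is one early return (the field name follows MCP's tool-result shape, which carries an isError flag; the helper names are ours):

```python
# Step detection gated on the tool result's isError flag: a failed
# update_cart no longer earns cart_created credit.
def record_step(tool_name: str, result: dict, session) -> None:
    if result.get("isError"):
        return                              # failed call counts for nothing
    if tool_name == "update_cart":
        session.mark_step("cart_created")
        session.extract_cart_data(result)   # cart extraction gated identically
```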

REST-only stores forced a transport rework. The v2026-04-08 spec drop in early April brought new tool names (search_catalog replacing search_shop_catalog), new response shapes (price as {amount, currency} objects, descriptions as {plain, html} objects), and a wave of WooCommerce stores that exposed REST-only endpoints rather than MCP. The 0.10.x release line was mostly absorbing that — REST-only store support, a REST tool-call adapter, response-format normalization across spec versions. Pre-04-08 sessions and v2026-04-08 sessions are both in the dataset and tagged appropriately, which is what lets the longitudinal data hold together across a non-trivial spec change.
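The normalization layer reduces to a small adapter per response type. The new shapes below are the ones the spec drop introduced, as described above; the function and the pre-04-08 field layout are assumptions.

```python
# Normalize pre-v2026-04-08 product responses to the new shapes: price as an
# {amount, currency} object, description as a {plain, html} object.
def normalize_product(product: dict, spec_version: str) -> dict:
    if spec_version >= "2026-04-08":        # ISO dates compare lexically
        return product                      # already in the new shapes
    return {
        **product,
        "price": {"amount": product["price"], "currency": product.get("currency")},
        "description": {"plain": product.get("description", ""), "html": None},
    }
```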

The GPay token wall built ECP. In a February session, Claude Sonnet 4.5 reached ready_for_complete correctly — and stalled, because the merchant's checkout required a Google Pay payment token the agent couldn't produce. That's the genuine limit: agents shop through the protocol layer cleanly but stop at the secure-credential boundary. The Embedded Commerce Protocol shipped in 0.8.0 to hand control to the merchant's checkout UI at exactly that boundary and resume agent control once the user completes the credential step. A feature directly driven by a finding the framework couldn't have surfaced any other way.

A Playground session became a spec proposal. A live test against houseofparfum.nl exposed a different gap: an identity-linked buyer with a wallet balance hit the checkout, the OAuth flow completed cleanly, the buyer object came back populated — but the wallet was nowhere the agent could see it. payment.instruments was empty, the only declared handler (dev.ucp.delegate_payment) didn't accept the wallet, and the session escalated to the merchant's continue_url every time. Authenticated checkout was provably blocked, by spec. We wrote it up and submitted Proposal #358 to the UCP spec repository: payment.available_instruments, a per-buyer, per-session list of usable payment methods (wallet, saved cards, loyalty, gift cards) resolved at runtime from the identity-linked session. The proposal was submitted by Benji Fisher (@appdrops) and co-authored with Almin Zolotic (@zologic) of UCPReady, who'd seen the same wall from the merchant side; it's now with the UCP technical council for review. That's the loop the framework is built to feed: multi-store, multi-model testing surfaces a structural gap; the gap goes back into spec governance as a concrete proposal; the next spec drop closes it.
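For a sense of what the proposal asks for, here is a hedged sketch of the shape. The real schema is the proposal's and the council's to settle; every field below is illustrative.

```python
# Illustrative only: a per-buyer, per-session list of usable instruments,
# resolved at runtime from the identity-linked session.
payment = {
    "available_instruments": [
        {"type": "wallet", "balance": {"amount": "42.50", "currency": "EUR"}},
        {"type": "saved_card", "brand": "visa", "last4": "4242"},
        {"type": "gift_card", "balance": {"amount": "25.00", "currency": "EUR"}},
    ],
}
```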

Methodology, briefly

Each session is a real frontier-model agent shopping run against a real UCP-enabled store, captured end-to-end via MCP tool calls. Sessions are initiated either through the public Playground UI (user-initiated, ad-hoc prompts) or through the Evals framework (scripted multi-turn sequences across pre-selected store/model matrices).

Outcomes are tagged at session close: checkout_reached (full transaction completion), cart_created (added items, didn't proceed), search_only (browsed, didn't add), failed (provider error, model refusal, or max-turn exceeded), or info_provided (informational query, no transactional intent).
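Expressed as a type, that outcome vocabulary is five tags (a sketch; the framework's internal representation may differ):

```python
from typing import Literal

# Session outcomes as tagged at session close.
SessionOutcome = Literal[
    "checkout_reached",   # full transaction completion
    "cart_created",       # added items, didn't proceed to checkout
    "search_only",        # browsed, didn't add
    "failed",             # provider error, model refusal, or max turns exceeded
    "info_provided",      # informational query, no transactional intent
]
```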

Every session has a clickable replay link keyed to its source ULID. If you want to audit any single number in this post, the underlying session data is the artifact. That's intentional — independent reproducibility is the point.

Try it

Three concrete next steps:

  • Run a benchmark against your own store. Create a collection at ucpplayground.com/evals, pick a sequence, pick two models, and compare your store's per-model performance against the aggregate above.
  • See where individual models stand. Each model on the leaderboard has its own shopping profile with detailed performance data, known issues, and store-by-store breakdowns.
  • Compare two models head-to-head. The comparison view lets you pit any two models against each other on the same workload — useful before you commit to a primary model for a deployment.

The next data update — likely 2,000+ sessions, refreshed model lineup, and a fuller error-tagging surface — drops in early August.
