DEV Community: Benji Fisher

How to Test Your UCP Implementation with AI Agents

Benji Fisher — Fri, 15 May 2026 09:11:04 +0000

You ship a UCP manifest. The validator returns green. The schema parses cleanly. Every required field is present, every URL resolves, every transport responds. You declare the work done and move on.

Three weeks later, you find out your store has been quietly failing every agent shopping session. The cart endpoint accepts adds but rejects checkouts. A specific variant ID throws a 400 on update_cart. The agent reaches ready_for_complete and stalls because your payment handler doesn't recognise the token format. None of these issues showed up in static validation. All of them block real users on agent-mediated flows.

This post is about how to actually test your UCP implementation — not as a schema document, but as a runtime surface that real frontier agents have to operate against. The short version: schema validation is necessary but not sufficient. The long version is the rest of this post.

What validators catch and what they miss

A UCP validator (including ours, the validator at ucpchecker.com/ucp-validator) checks structural things:

Manifest is valid JSON
Required fields are present (spec, services, signing_keys, etc.)
Declared spec version is one we recognise
Transport endpoints return non-error responses
Schema URLs resolve
Capability namespaces match the spec catalogue

Those are the things you can verify without actually running an agent flow against the store. They're table-stakes, and the UCP Score bakes them into the structural-conformance dimension of its grade.

What static validation doesn't catch:

Whether update_cart rejects valid variant IDs intermittently
Whether the cart endpoint's success response contains the line items it claims to contain
Whether the checkout flow surfaces the buyer-specific payment instruments your customer can actually use
Whether your search_catalog returns more than 8 KB of HTML in a description field that crashes Claude's tool-calling layer
Whether two different models pick the same variant ID for "Medium" against your product (the variant-data problem we cover separately)
Whether the agent can recover when one of your tool calls returns a 500 mid-flow

These are runtime properties. They only surface when you run an actual agent against an actual checkout. And they're where the gap between "store passes validation" and "agent can buy" lives. The April State of Agentic Commerce report sized that gap concretely: of 4,014 verified UCP stores, only 9 delivered a flawless end-to-end agent experience. A 0.2% flawless rate against a 98%+ conformance rate. The runtime gap is the gap.

The three-layer testing pyramid

The right way to test UCP is not "validator or no validator" — it's three layers, each catching a different class of problem, in increasing order of cost and fidelity:

Layer	Tool	Catches	Cost
1. Schema validation	`/ucp-validator`	Manifest parse errors, missing required fields, malformed URLs	seconds, free
2. Capability score	`/score`	Surface signals, declared capabilities, transport reachability, robots/sitemap hygiene	seconds, free
3. Live agent eval	UCP Playground	Variant resolution, cart/checkout shape, error recovery, multi-model behaviour, attribution flow	dollars per session, paid

Each layer feeds the next. If layer 1 fails, layer 2 has nothing to score. If layer 2 reports gaps, layer 3 will find them magnified in real agent runs. Skipping layers wastes layer 3's time on bugs the cheaper layers would have caught — that's the case for running them in order rather than going straight to live agents.

Most teams stop at layer 2. Stopping at layer 2 is what produces the 99.8%-conformant / 0.2%-flawless gap. A clean Score gets you to "the agent has a fair chance." A clean Score plus a clean eval gets you to "the agent reliably completes the flow you care about."

What live agent testing actually looks like

Layer 3 is where most readers are unfamiliar, so this section walks through what running an agent test against your own store actually involves.

The shape: you point a frontier agent (Claude, GPT, Gemini, Grok, Llama — whichever model you want to evaluate against) at your store's UCP manifest endpoint and give it a multi-turn shopping prompt. The agent does what an agent does — discovers your tools via the manifest, calls search_catalog against your products, evaluates the results, picks something, calls update_cart, navigates checkout. The framework records every tool call, every response, every model decision, the full token-by-token event stream.

At the end of the session you get a structured report:

Did the agent reach checkout_reached (full transaction completion)?
Or did it stop at cart_created, search_only, or failed?
How many tool calls did it make? How many succeeded? Which ones errored?
How many tokens did the model consume?
How long did the session take?
If the agent failed, why?

That's the data layer 1 and layer 2 can't produce. Schema validation tells you what your store says; agent eval tells you what an agent does with what your store says. They're answering different questions.

For most stores, the first eval session is uncomfortable. The agent picks the wrong variant. Or it adds something to the cart and then stalls because the response shape isn't quite what it expected. Or it reaches ready_for_complete and can't move forward because your payment-handler declaration doesn't match what the agent has been trained to handle. Each of those is a fix you can make, and each fix lifts your real conversion rate the next time an actual user-facing agent shops your store.

Why testing on one model isn't enough

A useful pattern from the Playground 1,000-session dataset: the same store gets meaningfully different outcomes across different models. A store that completes checkout 65% of the time on Claude Sonnet 4.5 might complete only 18% of the time on GPT-5.2 — the same UCP implementation, the same shopping prompt, just a different model.

That spread isn't because one model is "better." It's because each frontier model has its own quirks in how it handles tool calls, schemas, error responses, and ambiguous data. Models differ on:

How they handle empty arrays vs missing fields
Whether they follow up on a 4xx response or move on
How aggressively they retry failed tool calls
How they parse multi-line strings in description fields
Whether they pass through optional metadata fields verbatim

The real-world implication: your customers don't all use the same agent. Some use ChatGPT-routed flows; some use Anthropic's; some use Google AI Mode; some use a custom agent built on Llama. Testing against just one model means catching only the bugs that one model surfaces, while shipping silent failures to everyone using a different one. Multi-model coverage is what gets you from "this passes for our internal demo" to "this works for real customer traffic."

UCP Playground supports head-to-head testing across 15+ frontier models. The comparison view lets you run the same store against any two models on the same workload. We'd suggest at minimum testing against:

One Anthropic model (Claude Opus or Sonnet)
One OpenAI model (GPT-5.2 or GPT-4o)
One Google model (Gemini 3.1 Pro or 2.5 Flash)

Three models cover most of the deployed-agent universe. If any of the three behaves badly against your store, you have a real problem worth fixing before more traffic arrives.

Wiring tests into your deploy pipeline

Manual eval is fine for one-off audits. If you're shipping changes regularly, you want this in CI. The Playground exposes a headless API for exactly that:

POST /api/v1/collections          — define a test (sequence of prompts + models + stores)
POST /api/v1/collections/{id}/run — trigger the test
GET  /api/v1/collection-runs/{id} — poll status + results

The pattern most teams ship first: a deploy-time test that triggers an eval after every UCP-related code change, asserts on key metrics, and fails the build if any of them regress. A reasonable assertion shape:

# .github/workflows/ucp-eval.yml
- name: Run UCP eval
  run: |
    curl -X POST $PLAYGROUND_API/v1/collections/$COLLECTION_ID/run \\
      -H "Authorization: Bearer $PLAYGROUND_TOKEN"
    # Poll, then assert:
    # - checkout_rate >= 80
    # - errors.total == 0
    # - avg_duration_ms < 30000

Same shape as Lighthouse CI for web performance. A regression catch you bolt onto your pipeline rather than rediscover in production. The UCP Playground Evals launch post walks through the full pattern with a worked example.

The order to do this in

If you're starting from a fresh UCP implementation:

Run the validator against your manifest. Fix any structural errors. This is the cheapest layer; do it first.
Get a UCP Score for your domain. Aim for B+ (70+) before moving to live testing. Below that, you have surface-level gaps that'll dominate the eval results and waste your test budget.
Run a Playground eval against your store with two different frontier models on a single shopping sequence. Fix whatever fails. Common first-time failures: variant-data ambiguity, response-shape inconsistencies, tool argument validation.
Expand to three models once your single-model baseline works. Multi-model coverage is what catches the long-tail issues.
Wire the eval into CI once your implementation is stable. From this point on, every code change that touches UCP runs against real agents before it ships.

If you've already got a UCP implementation in production and are trying to figure out why agents aren't completing checkouts, skip step 2 and go straight to step 3. The eval will show you the specific failure mode, and you can backfill the score work later.

What good looks like

A store that's passed all three layers cleanly looks like this:

Validator: green
Score: A grade (85+) across Discovery, Conformance, and Capability Coverage
Eval: 80%+ checkout rate against Claude Sonnet 4.5, Gemini 3 Flash, and one other model of your choice; <5s average tool-call latency; zero categorised errors across at least 20 sessions

That's the bar. The State of Agentic Commerce is tracking how many stores hit that bar — currently fewer than 1% of verified stores. The work to get from 99% conformance to 1% bar-clearing is mostly testing work.

Try it

Validator (free, instant): ucpchecker.com/ucp-validator
Score (free, instant): ucpchecker.com/score
Live agent eval (paid per session): ucpplayground.com/evals
Multi-model comparison view: ucpplayground.com/models
CI-ready eval API: documented at ucpplayground.com

Schema validation is necessary. It is not sufficient. The agents your customers use will run real flows against your store, and the only way to know whether those flows succeed is to run them yourself first.

Test before they do.

The State of Agentic Commerce — May 2026

Benji Fisher — Thu, 14 May 2026 09:45:37 +0000

In April, the story was a platform pulling a lever: Shopify migrated its entire UCP fleet to v2026-04-08 in four days, BigCommerce showed up with three stores, and we said the question for May was which platform ships next — because every prior jump in the directory had been a step function caused by a platform-level deployment.

May's answer: none, and it didn't matter. No platform shipped a UCP wave this month. BigCommerce still has three verified stores. WooCommerce still has three. Salesforce Commerce Cloud still has none verified, though a custom build is reportedly in development. And the directory still grew ~32% — the same rate as April — because the baseline discovery rate stepped up. For the first time since we started this report, UCP grew on a slope instead of a staircase.

This is the fourth monthly state-of-the-ecosystem report from UCP Checker. Here's what the data says as of May 12, 2026.

The numbers

5,294 verified UCP stores (up from 4,014 in April, +32%)
5,892 total domains tracked
1,829 new merchants discovered this month; 775 this week alone
5,264 verified stores on the latest v2026-04-08 spec (99.4%)
5,235 verified stores at A grade on UCP Score (98.9%)

Three consecutive months of ~30% growth is a real curve now, not a launch artifact. But the shape changed. February was discovery (first 1,000 Shopify stores). March was expansion (crossed 3,000, first non-Shopify manifests). April was consolidation (the four-day Shopify spec migration). May is the first month where the headline growth came from neither a new platform nor a spec event — it came from crawler optimisations we shipped in early May. The stores were always out there; we just got faster at finding them.

That distinction matters for forecasting. If May's growth had been platform-driven, you'd model the next jump as "wait for SFCC." Since it's discovery-rate-driven, the model is different: the directory keeps filling at a steady clip until either we exhaust the discoverable Shopify long tail, or a platform finally ships a wave and the staircase resumes. Both will happen; the order is the open question.

Shopify's head start, four months in

Platform	Monitored	Verified	Verified %	Avg score (verified)	Avg manifest latency
Shopify	5,242	5,241	~100%	92.5	178 ms
Custom & Headless	642	45	7.0%	83.0	356 ms
WooCommerce	3	3	100%	92.3	1,023 ms
BigCommerce	3	3	100%	88.3	993 ms
Magento	1	1	100%	85.0	218 ms
PrestaShop	1	1	100%	84.0	548 ms

Shopify is 99% of the verified directory — unchanged from April. Every non-Shopify platform combined sums to 53 verified stores, the same as last month. The head start is still the dominant signal in the data, and the Custom & Headless cohort is the mirror image: 642 domains attempted UCP, only 45 got to verified (a 7% completion rate). When a platform hands you the boilerplate, you compound; when you build it yourself, most attempts stall before validation. That's a tooling gap, not a spec problem.

The more interesting movement came from two more platforms shipping UCP support — Bareconnect and Selly.io — both of which already have verified stores live in the directory today, not roadmap promises. The numbers are still small. How either platform is exposing UCP (default for every storefront, opt-in, or a paid tier) decides whether this stays a handful or turns into a wave — that detail we don't know yet. But it's the first new platform movement since the Shopify migration.

Two structural notes on the table. BigCommerce and WooCommerce manifests run ~1 second versus Shopify's 178 ms because they're served from the storefront origin rather than a CDN-cached endpoint — a meaningful handicap as agent response budgets tighten. And geographically the directory is still a US/.com story: 4,720 of 5,294 verified stores ship under generic TLDs; the largest attributable ccTLD cohorts are .uk (229), .au (120), and .ca (66); continental Europe is under 2% by ccTLD (a floor, not a true distribution).

Capability coverage: the ceiling, and the edges

Capability	Verified adopters
`dev.ucp.shopping.checkout`	5,269
`dev.ucp.shopping.fulfillment`	5,264
`dev.ucp.shopping.catalog.lookup`	5,257
`dev.ucp.shopping.catalog.search`	5,256
`dev.ucp.shopping.order`	5,256
`dev.ucp.shopping.discount`	5,253
`dev.ucp.shopping.cart`	5,249
— the cliff —
`dev.ucp.common.identity_linking`	6
`dev.ucp.shopping.buyer_consent`	3
`dev.ucp.shopping.checkout.embedded`	2
`dev.ucp.shopping.ap2_mandate`	1
`dev.ucp.shopping.payment`	0

Identical pattern to March and April: the seven core shopping capabilities ship together as a Shopify-side bundle (~5,250 adopters each), then an 800× cliff. Identity linking: 6. AP2 mandate — the primitive that makes an agentic transaction auditably user-authorised — still 1 (houseofparfum.nl, WooCommerce, scoring 100). Payment capability: still 0. Of 5,294 verified stores, 5,161 (>99%) sit at Tier 2, one is Tier 3, one is Tier 4. The deeper primitives aren't slow-adopting, they're not adopting yet. When demand for AP2 turns into pressure (regulators, payment networks, the working group's eventual requirements), this number moves fast — the way checkout did once Shopify bundled it. Until then, "UCP store" means "agent-shoppable," not "mandate-credentialed."

Where the movement was: the edges of the spec

The new signals in May's data sit at the edges of the spec rather than its core. The first is in the capability namespace itself: below the standard dev.ucp.* entries, a handful of non-standard, vendor-prefixed capabilities are now appearing on real verified manifests:

com.pwc.accelerator.loyalty.rewards — 2 stores. PwC's agentic-commerce accelerator (more below).
com.appointedd.schedule / .booking / .intent — 1 store. Appointment-scheduling primitives — booking-vertical UCP, not retail.
com.woocommerce.ai_storefront — 1 store. A WooCommerce-specific storefront extension.
sh.agentscore.identity — 1 store. An identity primitive from a third party.
com.agoragentic.x402.checkout — 1 store. A checkout extension referencing x402 (the HTTP-402 micropayment pattern).

None is adopted at scale yet — 1–2 stores each, almost certainly vendors' own test deployments — but it's the first month the namespace long tail has held anything other than Shopify defaults. It's the leading indicator of a UCP extension ecosystem: third parties shipping vertical capabilities (loyalty, booking, identity, micropayments) on top of the core spec, a more realistic near-term diversification path than "another commerce platform ships a wave."

The PwC entry is worth pulling out, because it isn't a platform — it's a consultancy. PwC has launched an agentic-commerce accelerator: a practice that stands up custom UCP-enabled storefronts for enterprise clients, with its own capability extensions (the com.pwc.accelerator.* namespace) layered on the core spec. That's a third adoption channel, distinct from "platform ships a wave" and "developer hand-builds" — call it consulting-led. It's slower per engagement, but each accelerator that standardises on UCP arrives with a portfolio of enterprise clients attached. PwC is the leading edge; Deloitte, EY, KPMG, Accenture, McKinsey, BCG, and the systems integrators (Capgemini, IBM, TCS, Infosys) all face the same build-it-once, deploy-to-many incentive.

Transports and payment handlers: the monoculture, and the experiments tier

Transport	Verified declarations
MCP	5,258
Embedded	5,243
REST	47
A2A	2

MCP and Embedded are universal because Shopify declares both. REST shows up on 47 stores — the non-Shopify hand-builds, REST being the natural fit for anyone implementing without an MCP server. A2A (Google's Agent2Agent transport, formally added in v2026-04-08) holds at two. Payment handlers tell the same monoculture story: 5,250 verified stores declare Google Pay and 5,241 declare Shopify Card — the same shared Shopify-managed handler IDs we flagged in February as a single point of failure. Everything else is a rounding error. The payment partner ecosystem (Stripe, Adyen, Visa, Mastercard, PayPal, Affirm, Splitit — all on the registry) is mature on paper; the live handler declarations are two Shopify-managed IDs and a handful of experiments.

The experiments are the part worth zooming in on, because the same small set of builders is populating the spec's newer transport, its newer handler shapes, and its newer capability namespaces simultaneously. Both A2A adopters are agent-native rather than retail: one is an agent-identity storefront running pure A2A with a cryptographically signed manifest (JWS / EdDSA) and two custom payment handlers on crypto rails — an mpp rail on Tempo mainnet and an x402 rail on Base; the other is an agent-to-agent service exposed across MCP + A2A + REST, selling a USDC-priced audit via a com.agoragentic.x402 handler plus a direct USDC receive address. Both ship the custom capability namespaces flagged in the capability section above (sh.agentscore.identity, com.agoragentic.x402.checkout).

Separately, payment processors are starting to run dev UCP endpoints with fully custom handler integrations — their own handler IDs, their own init / verify protocol shapes, declared at v2026-04-08 over REST against real merchants, iterating against the Checker as they build. Still dev, not live, but for the first time the gap between the partner roster and the live handler declarations has something in it that's neither Shopify-default nor mock fixture — and it's coming from processors with the scale to move real merchant bases. Two data points in each direction don't make a trend, but the pattern is coherent: the spec's newer surfaces (A2A transport, custom handler shapes, third-party namespaces) are populated by a small set of builders doing novel work in parallel, while the core carries volume. That's the shape of a protocol leaving its launch phase.

How agents actually perform

The numbers above tell you which stores have UCP. This section is which stores work when an agent shops them. UCP Playground Evals passed 1,000 recorded agent sessions this month — and it's well past that now: a thousand-plus end-to-end agent shopping runs across 105 unique stores and 16 frontier models, totalling ~57M tokens, ~12 hours of cumulative agent runtime, and roughly $119,000 in aggregate cart value.

Outcomes: where the agent stops

Outcome	Sessions	Share
`checkout_reached`	475	37.9%
`search_only` (browsed, didn't cart)	344	27.4%
`failed` (provider error, refusal, max turns)	261	20.8%
`cart_created` (carted, didn't proceed)	172	13.7%

62% of sessions end without a completed checkout — and that ratio has stayed stable as the dataset grew, which is itself the finding. As we add models and stores, the shape of failure doesn't change: agents find products fine (search works nearly everywhere), build carts often, then ~14% of sessions stall at a cart that won't convert and ~21% fail outright (about half of those are variant-shape problems — the agent picks a variant ID the cart rejects and flails until it hits the turn limit). We dug into exactly that this month in UCP Variant Data: The #1 Reason Agent Checkouts Fail — the single largest categorisable cause of the gap between "has a manifest" and "agent can buy from it," and almost entirely fixable in the merchant's variant data without touching any tooling.

Model leaderboard

Checkout-conversion rate by model, from the UCP Playground model leaderboard — sessions where the agent reached a checkout URL ÷ total sessions for that model (the live leaderboard breaks out search, cart, and speed too):

Model	Sessions	Checkout %	Avg session	Vendor
Claude Sonnet 4.5	256	52.0%	~38 s	Anthropic
Llama 3.3 70B	75	49.3%	~48 s	Meta
DeepSeek V3.2	60	45.0%	~46 s	DeepSeek
Gemini 3 Flash	174	42.0%	~21 s	Google
Grok 4	53	39.6%	~77 s	xAI
Claude Opus 4.6	123	39.0%	~30 s	Anthropic
Gemini 2.5 Flash	125	36.0%	~12 s	Google
GPT-4o	63	31.7%	~15 s	OpenAI
Gemini 3.1 Pro	96	29.2%	~48 s	Google
Gemini 2.5 Pro	79	27.8%	~34 s	Google
GPT-5.2	63	20.6%	~36 s	OpenAI
DeepSeek R1	19	15.8%	~60 s	DeepSeek
o4-mini	21	14.3%	~42 s	OpenAI
Grok 3 Mini	21	9.5%	~57 s	xAI
QwQ 32B	25	0%	~61 s	Alibaba

Three things hold from April, plus one shift:

Search works everywhere. Checkout completion is the next frontier. Every model that runs to completion finds products. Checkout conversion ranges from 0% to 52% — a 50-point spread across the field, which is exactly where the work-to-do sits. The best model in the field completes checkout about half the time today; the headroom from there is the frontier the next quarter gets to push.

Reasoning-tuned models still underperform. QwQ 32B: 0% across 25 sessions. Grok 3 Mini: 9.5%. o4-mini: 14.3%. DeepSeek R1: 15.8%. Models that burn tokens on deliberation struggle with the fast, sequential, low-ambiguity tool-calling that shopping requires. Shopping rewards decisive, not thoughtful — true in April, true with 3× the data. (GPT-5.2 also lands below the median at 20.6%.)

Speed and success are decoupled. Gemini 2.5 Flash finishes a session in ~12 seconds; Grok 4 takes ~77. Their checkout rates are 36% and 40% — basically a wash. Being fast doesn't make you good at this; being slow doesn't either. The Claude models sit mid-pack on speed (~30–38 s) and top on conversion, which is the combination that actually matters when the agent is spending someone's money.

The shift: in April we reported DeepSeek V3.2 leading the composite shopping score. With ~3× the sessions, Claude Sonnet 4.5 is now clearly out front on checkout completion — 52% over 256 sessions, by far the largest sample — with Meta's Llama 3.3 70B the surprise second. Treat any single month's ranking as provisional until the eval dataset gets to the point — soon — where it stops being indicative and becomes authoritative.

The reliability gap, one more time

We've made this the editorial spine of every one of these reports, and the May data doesn't let us retire it. 98.9% of verified stores carry an A on UCP Score (5,235 of 5,294; the rest are 57 B's and two C's). By conformance, the directory is in excellent shape. But conformance isn't end-to-end agent-readiness, and that's the gap UCP Score doesn't grade.

A clean schema doesn't tell you whether the cart endpoint accepts the variant the agent picked, whether response-time budgets hold under load, whether payment-handler tokenisation completes inside the agent's timeout window, or whether the checkout URL drops the agent into an auth loop a browser would have handled with cookies. UCP Playground is the test harness developers use to exercise that second layer — replay sessions, probe edge cases, see exactly where an agent trips. By design it surfaces failure modes, not steady-state performance; treating Playground completion rates as a consumer-shopping success metric mis-reads the tool. But the categories of failure it surfaces — variant mismatch, slow tokenisation, malformed cart responses, checkout redirect loops — are real, and they're what separate an A-graded manifest from a store an agent can reliably transact against in production.

That's the gap we'd point a platform team at — and it isn't a percentage, it's a posture. The protocol's first phase, call it the first four months, was about getting the schema right, and the ecosystem did that. The next phase is the unglamorous second-order work: error recovery, schema robustness, response-time SLAs, variant-data hygiene, the long tail of edge cases that separate "manifest valid" from "agent transacts without anything tripping it up." That work is happening — the Playground sessions above are senior engineers doing exactly it. The open question is whether the posture spreads from the engineering teams already running this loop to the long tail of merchants still on bundled defaults. That's where the next quarter's competitive distance gets built.

The demand side: AI traffic is converting

For four months this report has focused on supply — which stores have UCP, what capabilities they declare, the shape of their manifests, what agents do against them in testing. On May 11 Shopify published its first real demand-side dataset, and the numbers reframe the urgency of everything above.

Across Shopify storefronts in Q1 2026, by Shopify's analysis:

AI-referred orders grew nearly 13× year-over-year. Referral sessions from AI chatbots (ChatGPT, Perplexity, Gemini, Copilot, Claude, Grok) grew more than 8× YoY.
AI-referred sessions convert at ~50% higher rates than organic search when they start on product pages.
Average order value is 14% higher for AI-referred than for organic-search orders.
More than half of AI-referred sessions start on a product detail page, vs ~20% for organic — "journey compression," the buyer arrives ready to buy because the AI did the research first.
AI-referred conversion outperforms organic SEO in 23 of 25 merchant categories.

Caveat: this is Shopify's analysis of Shopify storefronts with undisclosed methodology, so treat the precise numbers as Shopify-published rather than independently verified. But the direction is the story: agentic commerce isn't theoretical traffic any more. It's converting at premium rates, in volume, growing fast — and that's the demand signal that explains why every TC member is racing to ship at the productisation layer right now. Shopify Field CTO Sandy Jeong framed the operational work in three buckets: data readiness (machine-readable catalog with structured attributes), channel infrastructure (direct API syndication to AI platforms), and organisational alignment (a named DRI, not a committee). The teams that get those three right capture the 13× curve; the teams that don't watch it route around them.

Spec and ecosystem

Attribution landed in core. On May 5 the Technical Council merged a top-level attribution field into cart, checkout, catalog, and order operations — campaign IDs, click identifiers (gclid, fbclid, ttclid), source/medium markers, as an open string-keyed map. It's the first time advertising-and-measurement infrastructure has landed in UCP core, and the trajectory implication is the story: a protocol that carries attribution context is a protocol being built for commercial-scale deployment, not just technical demos.

The council expanded — and the regional question got sharper. Amazon, Meta, Microsoft, Salesforce, and Stripe joined the Technical Council at the end of April — a governance signal as much as an adoption one (none of the five has shipped a UCP store wave yet), but a notable one: the steering group now includes the company building the leading proprietary alternative (Amazon's "Buy for Me") and the company behind the leading rival protocol (Stripe, ACP). Convergence pressure, formalised.

Two German commerce trade publications picked up the expansion within a day of each other and used our breakdown of the 16-seat composition as a primary source: Exciting Commerce on April 27 (which drove the European enterprise retail audience UCP Alerts was built for), and Shoptechblog the next day. Both lead with the same regional point — "Keine Rolle spielen weiter europäische und asiatische Unternehmen" ("European and Asian companies continue to play no role") — and Shoptechblog adds the analytical layer: the new members sent senior engineers and architects rather than C-suite executives (implementation work, not press); each company's participation reads as defensive; and the real contest isn't the standardised protocol but the layers above it — ranking, paid placement, customer ownership. Which is exactly why attribution-in-core is more than plumbing: it's the first of those upper layers getting wired into the spec itself.

Two TC members shipped at the productisation layer. The contest moving up the stack got two concrete examples this month. On May 5 Google expanded UCP-powered checkout out of AI Mode into the main shopping section of standard Search results, with Wayfair the first live retailer on the new surface — a "Buy" button on listings inside Google Search itself, Google Pay tokenisation, checkout completing without leaving the page. Zero-click search results just became zero-click purchases. The two-track adoption story we drew in February has its first major convergence event.

Google's UCP-powered checkout flow on Wayfair: AI Mode query → product page with Buy button → Google Pay review → order complete. Source: Google.

Shopify, separately, started rolling out an Agentic Storefronts dashboard in merchant admin this week (live docs) — surfaces ChatGPT / Microsoft Copilot / AI Mode traffic, offers an "Allow Shopify to manage for me" toggle that auto-generates the AI-readability files (llms.txt, llms-full.txt, agents.md) for stores that opt in. The dashboard is protocol-agnostic: it covers ChatGPT (ACP), Copilot, and UCP-powered Search inside one admin view. UCP is one of the protocols Shopify is now monetising on the agentic-readiness layer, not the whole product. For Shopify it's the natural next step after the v2026-04-08 fleet migration; for everyone else watching the head start, it's the answer to what the next phase of it looks like.

Shopify Agentic Storefronts in merchant admin — ChatGPT / Microsoft Copilot / Shop Channel split, "Allow Shopify to manage for me" toggle, agentic-readiness checklist.

A potential spec gap, still being validated. In the variant-data guide we noted that v2026-04-08 makes variant.options[] optional even on products where product.options[] is non-empty and there are multiple variants — meaning two fully spec-compliant manifests can produce identical-looking payloads where one is unambiguous and the other is agent-unresolvable. The candidate fix would be a conditional MUST ("when product.options is non-empty and variants.length > 1, every variant MUST populate options[]"). It's a working hypothesis from one analysis, not a filed proposal — we want to sweep more of the live dataset for real-world incidence and check the edge cases (single-variant simple products, productGroup behaviour, platforms that already populate options by default) before raising it formally. If the pattern holds, it's a candidate for a future minor release.

No v2026-05. v2026-04-08 remains current. On the cadence so far, the next minor release more likely lands late summer (a notional v2026-08), probably bundling AP2 mandate refinements, schema corrections shaken out by running validators against thousands of real stores, and whatever the council formalises over the next two months. On the partner side: the registry now lists 61 merchants, 11 agents, and 8 extensions; the payment-handler roster (Adyen, Amex, Mastercard, Stripe, Visa, Checkout.com, Affirm, Splitit, PayPal) is unchanged and still almost entirely unrepresented in live manifest declarations.

What we shipped — and what developers are doing with it

UCP Variant Data: The #1 Reason Agent Checkouts Fail — the five variant-data anti-patterns, what clean variant data looks like, and the spec gap that lets compliant stores still be broken.
How to Test Your UCP Implementation — the three-layer validation workflow: static audit, live agent test, continuous monitoring.
UCP Score is doing exactly what it was built to do. This is the one we're proudest of this quarter. The Score turns "is my manifest agent-ready?" into a concrete, category-by-category checklist — and developers are using it that way: we've watched a failing manifest climb to an A grade in the space of a few hours, the developer iterating against the score breakdown between checks. That's the loop it was designed for, and it's now the loop it runs.
UCP Playground got sharper as a development tool. Two halves of the same loop: the agent-inspection tooling — replay any session, see the exact tool call where an agent tripped — and the runtime shopping evals, now past 1,000 recorded sessions and more than 12 hours of cumulative agent runtime against real stores. Together they take the build → test → fix cycle for an agent-ready storefront down from a sprint to an afternoon. Every improvement that got us there is in the changelog.
Crawler throughput — we roughly tripled the hourly crawl rate in early May (and added per-IP and global throttles to the expensive public routes so the directory stays fast under load). That's what moved the discovery curve this month.

What to watch in June

Second adopters at every edge. May produced first adopters across multiple novel patterns — non-Shopify platforms shipping UCP (Bareconnect, Selly.io), a consultancy-built accelerator (PwC), non-default payment-handler integrations (the processors in dev), AP2 mandate (still one), third-party capability namespaces (each at 1–2 stores). The diagnostic for June is whether any doubles up. Each is a distinct watch item; the meta-question is the same: did May's first adopters survive contact with month two?

Google's next live partner on main Search. Wayfair is first up on Google's UCP-checkout expansion into standard Search results. The other co-developing TC retailers — Etsy, Target, Walmart — are the next-most-likely to follow. The cadence of those rollouts is the diagnostic for how fast Google is willing to push agent-completed transactions onto its highest-traffic surface.

The platform-level integration question. SFCC, Adobe Commerce, Wix, Squarespace — any of them shipping a platform-level UCP integration is still the single highest-impact possible event, and still hasn't happened. The one-platform structure is four months old.

Whether the eval leaderboard holds its shape. Claude Sonnet 4.5 leads checkout completion on the largest sample; Llama 3.3 70B is the surprise second. Another month of sessions either confirms that or reshuffles it.

Sources

All data is from the UCP Checker crawler (re-checks every tracked domain at least every 24 hours) and UCP Playground's eval sessions, as of May 12, 2026. The verified-merchant dataset is published monthly on Hugging Face under CC-BY 4.0; the same data, a public REST API, the bulk checker, and the rest of our developer tools are all ungated.

Browse the directory: ucpchecker.com/directory
Track adoption live: ucpchecker.com/stats
Run a UCP Score: ucpchecker.com/score
Model + store leaderboard: ucpplayground.com/evals
Public dataset, REST API & developer tools: ucpchecker.com/developer-tools
Previous report: State of Agentic Commerce — April 2026

External coverage cited in this report:

Jochen Krisch, "Amazon schließt sich Googles Universal Commerce Protocol an," Exciting Commerce, April 27, 2026
Roman Zenner, "Agentic Commerce: Das UCP Council wächst," Shoptechblog, April 28, 2026
Google expands UCP Checkout to main Search shopping results, Search Engine Land, May 2026
New tech and tools for retailers to succeed in an agentic shopping era, Google blog (Ads & Commerce)
Shopify Agentic commerce developer docs — Agentic Storefronts, llms.txt, llms-full.txt, agents.md reference
What Shopify checks for agentic readiness, WISLR Research
Kyle Risley, "AI-referred shoppers convert better and spend more (2026)", Shopify Enterprise Blog, May 11, 2026

UCP Variant Data: The #1 Reason Agent Checkouts Fail

Benji Fisher — Wed, 13 May 2026 11:12:54 +0000

A user asks an AI shopping agent for "a medium grey t-shirt." The agent finds the product. It picks a variant. It adds it to the cart. The merchant rejects the cart. The agent retries with a different variant. The merchant rejects that one too. The session ends in cart_created without a checkout — the user's $40 purchase quietly disappears, and nobody on the merchant side ever sees the failure.

This pattern is the single largest source of agent checkout failures we see across the 4,500+ verified UCP stores in the directory. More than schema invalidity, more than tool errors, more than payment-handler problems. Variant mismatch — the agent and the merchant disagreeing on which SKU corresponds to "Medium" — is responsible for a meaningful fraction of the gap between "store has a UCP manifest" and "agent can actually buy from it."

The good news: it's almost entirely fixable on the merchant side, in your variant data structure, without changing any tooling. This post walks through the failure pattern, the five most common variant data anti-patterns we observe, and what clean variant data looks like in practice.

Anatomy of a variant mismatch

Here's the cleanest way to see the failure:

Two frontier agents — call them Agent A and Agent B — get the same prompt against the same store: "Add a medium grey t-shirt to my cart." Both agents call search_catalog, both get the same product back, both see three variants:

{
  "variants": [
    {"id": "var_5571", "options": [{"name": "Color", "label": "Grey"}, {"name": "Size", "label": "Small"}]},
    {"id": "var_5572", "options": [{"name": "Color", "label": "Grey"}, {"name": "Size", "label": "Medium"}]},
    {"id": "var_5573", "options": [{"name": "Color", "label": "Grey"}, {"name": "Size", "label": "Large"}]}
  ]
}

Agent A picks var_5572. Agent B picks var_5572. Both add to cart. Both succeed. Clean data, predictable behaviour. Each variant declares its options as an array of {name, label} pairs — the spec's selected_option shape — so the agent matches "medium" against the Size axis unambiguously.

Now the broken version. Same prompt, same product, but the variant data looks like this:

{
  "variants": [
    {"id": "var_5571", "options": [{"name": "Color", "label": "Grey"}, {"name": "Size", "label": "S"}]},
    {"id": "var_5572", "options": [{"name": "Color", "label": "Grey"}, {"name": "Size", "label": "M"}]},
    {"id": "var_5573", "options": [{"name": "Color", "label": "Grey"}, {"name": "Size", "label": "L"}]},
    {"id": "var_5574", "options": [{"name": "Color", "label": "Grey"}, {"name": "Size", "label": "Medium / Regular Fit"}]},
    {"id": "var_5575", "options": [{"name": "Color", "label": "Grey"}, {"name": "Size", "label": "Medium / Slim Fit"}]}
  ]
}

Agent A picks var_5572 (interpreting "M" as the canonical "Medium"). Agent B picks var_5574 (interpreting "Medium / Regular Fit" as the more explicit match). Neither is wrong. The user said "medium" and both interpretations are defensible. But because the variant data conflates two different axes — size and fit — into a single Size label, the two agents diverge, and the user's experience depends on which model they're using. The spec form makes the bug obvious: Fit should be its own selected_option, not crammed into the Size label.

Worse: many real implementations don't even include the option labels. They expose only opaque variant IDs:

{
  "variants": [
    {"id": "var_5571"},
    {"id": "var_5572"},
    {"id": "var_5573"}
  ]
}

Now the agent has no way to know which variant corresponds to "Medium" at all. It guesses. Sometimes it guesses right. Often it doesn't. That's how checkout sessions end up in cart_created without ever reaching checkout_reached.

Why this is the #1 failure mode

Across the Playground session dataset, roughly 62% of sessions end without a completed checkout. The breakdown is informative:

Outcome	Share
`checkout_reached`	38%
`search_only` (browsed, didn't add)	27%
`failed` (provider error, model refusal, max turns)	22%
`cart_created` (added, didn't proceed)	13%

The cart_created cohort — sessions where the agent successfully picked something but couldn't finish — is the variant-mismatch signal. The agent had enough information to add to cart but the cart contents weren't valid for checkout. That's the structural shape of "wrong variant picked."

Roughly half of the categorised failed sessions are also variant-shape problems — the agent picked a variant ID that the cart endpoint rejects, retried with another, hit max_turns_exceeded while flailing through the variant list. Add those in and variant-related failures account for somewhere around a fifth of all sessions, which is more than any other categorisable failure mode.

The thing that makes this pattern so consistent: clean variant data is not part of UCP Score or schema validation. A store can pass UCP Score at A grade and still emit variant data that breaks every agent in the field. The validator looks at whether the manifest parses; it doesn't look at whether the variants are agent-resolvable. That gap is exactly why this post exists.

A spec gap that compounds the problem

Even when a store is fully UCP-compliant, the protocol leaves room for ambiguity. The 2026-04-08 schema makes variant.options[] optional — including on products where product.options[] is non-empty and there are multiple variants. So a payload like {"options": [{"name": "Size", "values": [{"label": "Small"}, {"label": "Medium"}]}], "variants": [{"id": "var_a"}, {"id": "var_b"}]} is technically valid but agent-unresolvable: nothing links var_a to "Small" rather than "Medium." Two consumers looking at this payload can defensibly pick different variants for the same prompt.

A conditional MUST in the spec — "when product.options is non-empty and variants.length > 1, every variant MUST populate options[]" — would close this cleanly. Until that lands, agent-resolvability is on the merchant rather than the protocol.

The five variant anti-patterns

In rough order of frequency observed across the dataset:

1. Opaque variant IDs with no option metadata

The shape from the third example above — variants exposed only as var_5572, no options, no attributes, no human-readable axis. Agents have no way to map a user's "Medium" to a specific ID. They either guess or pick the first variant, both of which produce wrong outcomes routinely.

Fix: every variant must carry the axis values that distinguish it from siblings, in the spec's selected_option array form: "options": [{"name": "Size", "label": "Medium"}, {"name": "Color", "label": "Grey"}]. The name field tells the agent which axis the value belongs to; label is what gets matched against the user's request. Whatever the product's options page shows to a human shopper — size, colour, material, fit — the variant data should expose programmatically with one selected_option entry per axis.

The corollary: descriptive attributes that aren't selection axes belong in metadata, not product.options[]. A one-variant simple product with "Color: Gray" should expose Gray as metadata.attributes, not as a single-value product.option — otherwise consumer UIs render a one-button picker that looks selectable but isn't. The split: product.options[] is for axes the buyer chooses across; metadata is for descriptive properties of the (only) variant.

2. Conflated axes in a single string

The shape from the second example — "Medium / Regular Fit" as a single option value where size and fit are two separate user choices. Agents can parse this, but inconsistently across models, because the conflation is ambiguous. Different models split the string differently, and the variant they end up picking depends on which side of the slash they prioritise.

Fix: each variant attribute lives in its own field. Don't compose. If your product has size + fit as two axes, the variant data should look like:

{
  "id": "var_5574",
  "options": [
    {"name": "Size", "label": "Medium"},
    {"name": "Fit", "label": "Regular"}
  ]
}

Two clean axes, two unambiguous values, no string parsing required. Agents pick consistently. The array-of-selected_option form is the shape UCP 2026-04-08 defines for variant.options — see selected_option.json.

3. Inconsistent labelling between sibling variants

Not all variants on the same product use the same option vocabulary. One says "M", another says "Medium", another says "med". We see this on stores that have grown organically — different teams added variants over different years, naming conventions drifted, the inconsistency is invisible to the merchandising team because the storefront UI hides it.

Fix: one canonical label per axis value, applied consistently across every variant on every product. If "Medium" is the canonical label, every Medium variant uses exactly "Medium". No "M", no "med", no "Medium " (trailing space). Agents reason by string match; consistency is what makes the match reliable.

4. Missing or inconsistent stock / availability flags

A variant exists in the catalogue but is sold out, and the variant data doesn't say so. The agent picks it, the cart accepts the add, the checkout endpoint rejects it. The agent doesn't know to retry with a different variant — it had no signal that the variant was unavailable.

Fix: every variant declares its availability object — {"available": true, "status": "in_stock"} is the spec shape, with well-known status values in_stock, backorder, preorder, out_of_stock, and discontinued. Agents skip unavailable variants if you tell them to, and status gives them enough signal to decide whether to wait, substitute, or surface an out-of-stock message to the user.

5. Declared axes that variants don't honor

product.options[] declares the selectable axes; variants[] is the universe of actual purchasable combinations. When the cardinality of declared axes doesn't match what variants actually carry — e.g., product.options declares Color × Size = 9 combinations but only 3 color-only variants exist — agents try to satisfy a Size selection that no variant honors. Strict consumers return null and refuse to add; lenient consumers guess and pick wrong.

Fix: keep product.options[] and variants[] in sync. Either every declared axis combination has a corresponding variant, or the axis shouldn't be in product.options[]. If sizes aren't actually configurable for this product, drop Size from the axes; don't leave it dangling.

What clean variant data looks like

Here's the shape that resolves cleanly across every frontier model we test:

{
  "id": "prod_42",
  "title": "Heavyweight Crew Tee",
  "description": "Heavyweight cotton crew-neck tee.",
  "price_range": {
    "min": {"amount": 4500, "currency": "USD"},
    "max": {"amount": 4500, "currency": "USD"}
  },
  "options": [
    {"name": "Color", "values": [{"label": "Charcoal"}]},
    {"name": "Size",  "values": [{"label": "Small"}, {"label": "Medium"}]},
    {"name": "Fit",   "values": [{"label": "Regular"}]}
  ],
  "variants": [
    {
      "id": "var_5571",
      "title": "Charcoal / Small / Regular",
      "description": "Heavyweight crew tee, charcoal, size small, regular fit.",
      "price": {"amount": 4500, "currency": "USD"},
      "availability": {"available": true, "status": "in_stock"},
      "options": [
        {"name": "Color", "label": "Charcoal"},
        {"name": "Size",  "label": "Small"},
        {"name": "Fit",   "label": "Regular"}
      ]
    },
    {
      "id": "var_5572",
      "title": "Charcoal / Medium / Regular",
      "description": "Heavyweight crew tee, charcoal, size medium, regular fit.",
      "price": {"amount": 4500, "currency": "USD"},
      "availability": {"available": true, "status": "in_stock"},
      "options": [
        {"name": "Color", "label": "Charcoal"},
        {"name": "Size",  "label": "Medium"},
        {"name": "Fit",   "label": "Regular"}
      ]
    }
  ]
}

Four spec fields make this work:

product.options at the product level — declares the axes (Color, Size, Fit) and their valid values as an array of product_option {name, values: [{label}]}. Agents know upfront how many dimensions a variant occupies and what values are valid on each axis.
variant.options as an array of selected_option {name, label} — each axis has its own entry, no string parsing, no conflation. The name matches the product-level axis; the label matches the user's request.
variant.availability with available and status — agents skip unavailable variants without trial-and-error, and status (in_stock, backorder, preorder, out_of_stock, discontinued) gives them enough signal to wait, substitute, or surface the right message.
Required scaffolding — id, title, description, and price on every variant, and id, title, description, price_range, variants on the product. These aren't "nice to have"; they're the schema's required fields. Variants missing any of them won't validate.

Bonus stability: when present, option_value.id and selected_option.id give stable identifiers that survive label drift. If your platform supports it (most do — Shopify uses GIDs, WooCommerce uses pa_* taxonomy slugs), populate id alongside label and consumers can match on the stable key when labels change.

Stores running variant data in this shape resolve user prompts to specific variants reliably across every model we've benchmarked. The pattern isn't novel — it's the same shape Shopify uses internally, the same shape WooCommerce variations use when properly structured, the same shape every traditional e-commerce platform ends up at after enough years of evolution. UCP just exposes it programmatically to the agent layer.

How to validate your variant data

Three layers, in order:

1. Static audit. Run your store through UCPChecker. The validator surfaces variants with missing options data, conflated axes, inconsistent labels across sibling variants, and missing availability flags. None of this is part of strict UCP-spec conformance, but our methodology flags variant-quality issues as part of the Capability Coverage score because they materially affect whether agents can transact.

2. Live agent test. Run a multi-model agent session against your store via UCP Playground. The framework exercises the full search → variant-pick → cart → checkout flow against frontier agents across 15+ models. If your variant data is ambiguous, you'll see different models pick different variants for the same prompt — the exact pattern we walk through in the Playground 1,000-sessions analysis.

3. Continuous monitoring. Variant data changes over time as you add products and SKUs. Set up UCP Alerts so you get notified when a variant audit starts surfacing new issues — typically a sign that a recent merchandising change introduced inconsistent labelling at scale.

The order matters. Static audit catches the easy cases (missing fields, schema-shaped problems) cheaply. Live agent test catches the cases where the schema is fine but agents disagree (the conflated-axis cases, the inconsistent-label cases). Monitoring catches drift over time. Skipping any of the three leaves a class of variant problems undetected.

What to fix in your store, and how to verify it

If you're a merchant reading this and your store is running on Shopify, WooCommerce, BigCommerce, Magento, or PrestaShop, the variant data structure is mostly determined by your platform's defaults. The platform-specific fixes are documented in the platform guides — but the meta-pattern is the same:

Audit at ucpchecker.com/check — get a list of variant-data issues
Fix the most common one first (usually missing options metadata)
Test at ucpplayground.com with two different models against the same product, asking for the same variant
Verify that both models pick the same variant ID consistently
Monitor weekly — variant drift is the most common reason a store's UCP Score regresses

Variant data is a back-office data-quality problem dressed up as an agentic commerce problem. The fix is mostly editorial — get your axis labels consistent, expose your option values structurally, mark sold-out variants as such. None of this is technically hard. It's the kind of work that adds up to "agents can buy from your store" rather than "agents try to buy from your store and quietly fail."

If you fix one thing on the agent-readiness side this quarter, fix variant data. The conversion lift is bigger than any other single change you can make.

One thing worth naming: consumer tools that silently paper over variant data problems (substring matching, positional guessing, falling back to variants[0]) make this worse, not better. They hide the failure mode from merchants who would otherwise see it and fix the data. Faithful rendering — null when the match is ambiguous, errors when the data is inconsistent — is what produces correct merchant behaviour. If your variant data only works in some agents, that's a signal the data is the problem, not the agent.

Try it

Audit your variants now: ucpchecker.com/check — flags variant issues alongside the rest of the UCP Score
Test variant resolution with real agents: ucpplayground.com — run two models against your store on the same prompt, see if they pick the same variant
Read the broader failure-mode taxonomy: Common UCP Errors and How to Fix Them
Track ecosystem-wide variant adoption: State of Agentic Commerce — April 2026

The UCP Technical Council Just Shipped Attribution into Core. Here's What That Means.

Benji Fisher — Wed, 06 May 2026 07:43:57 +0000

On May 5, 2026, the UCP Technical Council merged PR #391 into the spec's main branch — adding a top-level attribution field to cart, checkout, catalog, and order operations. The field carries platform-emitted referral and conversion-event context: campaign IDs, click identifiers (gclid, fbclid, ttclid), source/medium markers. Open string-keyed map. Universal across requests; not gated by capability negotiation.

As UCP matures, attribution landing in core was always going to happen. Agentic commerce can't operate as commercial infrastructure without a path for advertising and measurement context to flow alongside the transactional data — and the longer that gap stayed open, the more pressure would have built for vendors to ship incompatible parallel solutions. The merge isn't the surprising part. The interesting part is the specific shape of what shipped, and what its presence in core tells us about where the spec is heading.

Two things to dig into: the technical detail of the field itself, and the trajectory implication of advertising and measurement infrastructure landing in UCP core for the first time.

What shipped

The attribution field is structurally simple. From Grigorik's own example in the PR:

{
  "attribution": {
    "campaign_id": "18234567890",
    "campaign_source": "google",
    "campaign_medium": "cpc",
    "campaign_name": "spring_2026",
    "gclid": "EAIaIQobChMI..."
  }
}

No prescribed schema beyond "string-keyed object." Platforms populate it with whatever conventions they already use — GA4 campaign parameters, click identifiers, custom tracking keys. Businesses receive the data and process per their own analytics needs. UCP itself does not prescribe attribution windows, models, or assignment logic. The protocol carries the data; attribution math happens downstream.

The field appears in three roles across the request lifecycle:

Operation	Role	Direction
`catalog` (search, lookup)	Platform-emitted input	Platform → merchant
`cart`	Platform-emitted input	Platform → merchant
`checkout`	Platform-emitted input	Platform → merchant
`order`	Business-emitted snapshot	Merchant → platform

The asymmetry matters. On catalog/cart/checkout, the platform writes attribution as it would write a UTM string into a browser URL — referral context flowing forward. On order, the business preserves the originating attribution as a snapshot — closing the loop between agent-mediated conversion and the platform that produced it.

Grigorik's framing in the PR is the cleanest one-line summary of intent: the field "carries the same parameters platforms communicate via URL query parameters in browser-based flows, in the same flat key-value form." Attribution in agent-mediated commerce is the agent counterpart of UTM strings. Same parameters, same model, different transport layer.

Thirteen files changed. The core addition is source/schemas/shopping/types/attribution.json — the new type definition. Schemas for cart, catalog_lookup, catalog_search, checkout, and order all gain the field as an optional property. Specification docs across cart, catalog, checkout, order, and the overview were updated to describe the field's purpose and semantics.

The architectural decision: core field, not extension

The substantively interesting part of this PR is not what got added. It's how it got added.

PR #391 was Grigorik's alternative proposal to PR #295, which James Andersen had opened earlier proposing an event_context extension. Both proposals tried to solve the same problem — give platforms a way to pass referral/attribution data through to merchants in agent flows — but with very different architectural shapes:

#295 (Andersen, Meta): Attribution as a structured extension. Capability-negotiated. Validated against a defined schema. Standardised vocabulary across platforms.
#391 (Grigorik, Shopify): Attribution as a top-level core field. Open key-value map. No capability negotiation. Each platform uses its own conventions.

Andersen formally approved Grigorik's alternative — "thanks for finding a better home for attribution data than the original proposal" — and the rearchitecture went on to merge through TC discussion. That cross-vendor pattern (one TC member proposes; another offers a structurally different alternative; the original proposer endorses it) is the dynamic that produces robust standards rather than fragmented vendor extensions.

The PR discussion pivots on which architectural shape this kind of data deserves. Amit Handa wrote the canonical comment on May 3 establishing the decision framework — worth quoting because it'll likely be cited as governance precedent in future spec discussions:

Criterion	Use a UCP Extension	Use Optional Flat Key-Value Pairs
Impact on Behavior	Changes state or execution of the operation	Purely informational
Data Stability	Stable, standardized vocabulary	Volatile, platform-specific, rapidly evolving
Capability Negotiation	Requires mutual agreement + active parent capability	Best-effort, consumed at-will, no gating
Schema Validation	Strict — transaction integrity matters	Flexible — validation happens downstream
Multi-Platform Scale	Data normalization across diverse platforms	Low friction; normalization burden on receiver
Typical Examples	`discount`, `fulfillment`	`attribution`, referral tracking, session tags

Attribution falls cleanly on the right side of every row. Marketing identifiers (gclid, fbclid, ttclid) are volatile and platform-specific — every adtech vendor invents their own; standardising them in the spec would be obsolete the moment a new platform launches. Attribution doesn't change protocol behaviour — it's read-only context that some downstream pipeline cares about, with no transactional consequence. There's nothing for a merchant to negotiate; either you record it or you don't.

The merged PR locks this decision in. Future contributors proposing similar volatile, informational, platform-specific data structures now have a precedent: the spec prefers flat optional key-value pairs over structured extensions for non-state-changing context. That's a piece of governance documentation as much as a feature merge, and Handa's table will be the reference for it.

The trajectory implication

UCP up to this point has been protocol mechanics. How agents discover stores. How they shop. How they pay. How they identify users. How they handle returns. The mechanics are necessary, but they don't directly produce commercial value for the ecosystem participants. A merchant with a perfectly conformant UCP implementation but no attribution can't measure agent-driven conversions, can't optimise marketing spend, can't close the loop between platform investment and merchant outcomes.

attribution closes that loop. With the field in core, the entire adtech infrastructure that powers current ecommerce extends naturally into agent-mediated commerce. Platforms attribute conversions to specific campaigns. Click identifiers persist across the agent flow. Businesses run their existing analytics pipelines on agent-driven traffic with no special handling. The bridge that makes UCP commercially usable for marketing teams — not just engineering teams — now exists in the core spec.

The trajectory implication is the part worth sitting with: UCP is evolving from protocol mechanics into commercial infrastructure. Each subsequent spec addition probably bridges another piece of existing commerce infrastructure into the agent layer. Loyalty programs. Customer data platforms. Marketing automation triggers. Inventory hooks. Each one makes UCP more complete as commercial infrastructure rather than just protocol mechanics.

The architectural-precedent decision in #391 makes that trajectory more efficient. Future contributors proposing similar bridges (attribution-adjacent measurement primitives, marketing identifiers, session metadata) now have a clear template: flat key-value pairs into core, governance precedent already established. The spec doesn't need to relitigate the core-vs-extension decision every time a volatile, informational primitive comes up.

What it means in practice

For merchants: your UCP implementation should accept the attribution field on incoming cart, checkout, and catalog requests, preserve it through to order records, and surface it through your analytics pipeline. The lift is small — it's a string-keyed JSON object on existing endpoints — but missing it means agent-driven conversions arrive at your analytics with no source attribution, which means your marketing team can't measure the channel.

For platform vendors (Shopify, WooCommerce, BigCommerce, Magento, and others): rolling attribution support into the next platform-side compatibility release is now table-stakes work. The stores running on your stack will need to accept and preserve attribution by the time the next published spec version makes this part of conformance.

For agent platforms (those of us building or testing agents that shop UCP stores): pass platform-emitted attribution forward into every cart/checkout/catalog request. The data is informational, not state-changing — your agent doesn't need to do anything with it beyond passing it through. The merchant decides what to do with it on the receive side.

For evaluators (us): the UCP Score will incorporate attribution-acceptance and attribution-preservation conformance in its next release. A store that accepts attribution on cart/checkout/catalog and threads it through to order records will score higher than one that drops it. The methodology page will reflect the rule update when the next score-version drops.

Timing: in core today, in the published spec next

One important distinction worth making explicit. PR #391 merged into the spec's main branch — not into a currently-published spec version. The latest released spec is v2026-04-08, which does not include attribution. The field lands for conformance purposes in whatever the next published spec version ships (no fixed cadence; expected in the next few months). Until then, attribution sits in the working draft on main — implementers can adopt it ahead of the release if they want, but it's not yet part of conformance for the published spec.

That distinction shapes how we're rolling out support across our tools:

UCP Playground will adopt attribution support when the next spec version drops — agents will pass platform attribution through to merchants.
The UCP Score will incorporate attribution-acceptance and attribution-preservation rules in the score release that aligns with the next published spec.
The validator will support the new field as soon as the next spec ships, and the bulk checker will surface attribution conformance per-merchant after that.

The architectural certainty is already here — the schema is locked, the field is documented, the design pattern is settled. The spec drop is the conformance trigger, not the design moment. Implementers who start work today against the working draft are operating against a known target.

Where to read more

The PR itself: #391 on Universal-Commerce-Protocol/ucp
The merge commit: 76a3539
The new schema type: source/schemas/shopping/types/attribution.json
Updated authoring guidance: docs/documentation/schema-authoring.md

About UCP Checker

If you're building on UCP and want to know whether your store is ready for the next spec version: run a check. If you're tracking the spec's evolution professionally: subscribe to our weekly digest — we cover spec changes like this one within a week of merge.

UCP Playground at 1,000+ Agent Sessions: What 16 Models and 97 Real Stores Reveal About AI Shopping

Benji Fisher — Tue, 05 May 2026 09:11:37 +0000

Two and a half months ago we published Why We Built UCP Playground, which closed on 114 agent sessions and an honest acknowledgement that the dataset was thin — most models had single-digit sample sizes, store coverage was uneven, and the headline rates moved meaningfully with every new run. A month later we crossed a different threshold: the first fully autonomous AI agent purchase through UCP — a Gemini agent searching, adding to cart, linking identity, paying, and completing checkout at houseofparfum.nl without a human past the initial prompt.

Eighty days on from the first post, and roughly forty days after that autonomous purchase, the dataset is in a different shape:

Over 1,000 agent shopping sessions captured end-to-end with full tool-call timelines and replayable event streams
16 frontier models — every major lab, plus a reasoning-tuned subset
97 distinct UCP-enabled stores across Shopify, WooCommerce, BigCommerce, Magento, PrestaShop, and custom stacks
$96,032 of agent-driven cart value generated, primarily in USD with a long tail across EUR, GBP, INR, ILS, PKR
80 days of run history since Feb 14, 2026

That's the reference dataset for this post. Eight findings emerge from it. Most of them survive being scrutinised at the new sample size; one or two reverse the early-data narrative.

Finding 1 — Claude Sonnet 4.5 leads on aggregate checkout rate

With sample sizes now large enough to take seriously, the per-model checkout-rate leaderboard looks like this:

Model	Share of dataset	Checkout rate	Avg tokens	Avg duration	Fail rate
Claude Sonnet 4.5	20.7%	50.8%	71,195	38.1s	17.2%
Llama 3.3 70B	6.4%	49.3%	57,676	47.7s	14.7%
DeepSeek V3.2	5.1%	45.0%	32,502	46.0s	21.7%
Gemini 3 Flash	12.5%	44.6%	46,520	21.8s	15.5%
Grok 4	4.5%	39.6%	34,297	77.1s	9.4%
Claude Opus 4.6	10.2%	38.8%	44,611	29.7s	25.6%
Gemini 2.5 Flash	9.9%	36.8%	32,394	11.8s	23.1%
GPT-4o	5.2%	29.5%	32,811	14.7s	24.6%
Gemini 3.1 Pro	7.9%	29.0%	30,971	48.7s	28.0%
Gemini 2.5 Pro	6.4%	27.6%	31,566	34.4s	22.4%
GPT-5.2	4.7%	23.6%	30,585	37.4s	27.3%
DeepSeek R1	1.4%	17.6%	35,360	61.4s	29.4%
o4-mini	1.4%	12.5%	64,055	38.1s	37.5%
Grok 3 Mini	1.7%	10.0%	58,386	55.6s	35.0%
QwQ 32B	2.0%	0.0%	25,525	63.9s	50.0%

Claude Sonnet 4.5 leads on aggregate checkout rate at 50.8% on the largest single share of the dataset — a sample large enough that the rank ordering is no longer noise. Llama 3.3 70B sits a fraction below at 49.3% on a smaller but still meaningful share. The two are statistically tied; both are operating in a different regime than the rest of the field.

The most interesting result on this table is GPT-5.2, which at 23.6% lands in the bottom third despite being one of the most capable frontier models on essentially every public benchmark. The gap between its performance on standard reasoning benchmarks and its performance on transactional shopping flows is the single largest delta in the leaderboard. We dig into why in the development notes below.

One caveat worth flagging up-front: GPT-5.2's 23.6% figure reflects performance across the full 80-day window, including the period before our cursor-stripping fix landed mid-dataset. Sessions after that fix show GPT-5.2 performing meaningfully more competitively. We'll publish the longitudinal split in the August update — the aggregate number above is the worst-case read.

Finding 2 — Reasoning-tuned models continue to underperform

The cohort of reasoning-tuned models (DeepSeek R1, o4-mini, Grok 3 Mini, QwQ 32B) sits unambiguously at the bottom of the leaderboard. Three of them are in the bottom four overall. QwQ 32B has yet to record a single completed checkout across its share of the dataset.

The pattern was visible in the original four-session sample report shipped with the eval-framework launch in April; it has only sharpened as the dataset grew two orders of magnitude. The pattern is consistent across labs and across architectures (chain-of-thought variants, exploratory reasoning, distilled-from-frontier models — all underperform on shopping flows compared to their non-reasoning counterparts from the same lab).

The working hypothesis remains: shopping requires fast tool-use rhythm, not deliberation. The decisions in a shopping sequence — search this term, add this item, proceed to checkout — are individually shallow but happen in series. A reasoning model that pauses to deliberate at each step burns clock time and tokens on decisions that don't reward deliberation. Combined with reasoning models' tendency to over-question their own outputs, the result is sessions that hit max_turns_exceeded before completing.

Worth noting what isn't in this hypothesis: reasoning models are not bad at commerce in general. They may be excellent at higher-stakes flows — disputed transactions, multi-step contractual reasoning, regulatory edge cases — that the current eval workload doesn't probe. The benchmark says: when the workload is "shop normally," fast non-reasoning models win. Other workloads will tell different stories.

Finding 3 — Speed and accuracy aren't correlated

Gemini 2.5 Flash finishes the average shopping session in 11.8 seconds — the only model in the field under 15s. Its checkout rate is 36.8% — middling. Claude Sonnet 4.5 takes 38.1s on average and lands a 50.8% checkout rate — the highest on the leaderboard, at more than triple Flash's clock time.

Two real surfaces: latency-bound use cases (voice agents, mobile commerce, conversational checkout where the user is waiting in real time) effectively must use Gemini 2.5 Flash or Gemini 3 Flash, and pay for the latency win with lower closed-checkout rates. Throughput-bound use cases (batch agents, scheduled buying, autonomous shopping where wall-clock time is mostly hidden) should use Claude Sonnet 4.5 or Llama 3.3 70B and accept the latency cost for the conversion lift.

The naive intuition merchants reach for — "the better model is faster and more accurate" — doesn't survive contact with this data. The two axes are essentially independent within this corpus. That's a finding nobody can extract from a single-model demo or a vendor benchmark.

Finding 4 — The failure mode taxonomy is dominated by tool errors, not model refusals

Across the 256 failed sessions in the dataset, the categorised error taxonomy is:

Error type	Sessions	% of categorised failures
`openrouter_error` (provider-side)	51	56%
`model_refused`	22	24%
`max_turns_exceeded`	18	20%

The single-largest categorised failure mode is provider-side errors — the routing layer between the agent and the model returning a non-200 before the session can complete. This is a cost of operating at scale across 16 models and reflects the still-maturing infrastructure underneath frontier-model API access, not anything specific to UCP.

The second-largest, model refusals, is more interesting. Twenty-two refusals across the dataset is a refusal rate of roughly 2%. We see refusals concentrated in two situations: (1) sessions against demo stores with unusual product names that pattern-match a model's safety filters, and (2) sessions where the user prompt contains adversarial content seeded by us as part of a prompt-injection eval. We've recorded 6/6 prompt-injection resistance across the dedicated injection-eval runs to date, so the model_refused category is partly capturing models doing exactly what they should.

The third, max_turns_exceeded, is concentrated in the reasoning-model cohort and is the empirical signal for the over-deliberation pattern in Finding 2.

The remaining 165 failures don't carry a categorised error_type — typically these are sessions where the model abandoned the flow without raising an explicit error. That's a tagging gap in the framework that we're closing in the next iteration.

Finding 5 — Store implementation explains most of the cross-store variance

The benchmark's most strategically important finding doesn't come from the per-model column. It comes from the per-store one.

Across the 97 stores in the dataset, the same model produces dramatically different outcomes. Between the most agent-friendly and least agent-friendly implementations at meaningful sample sizes, the checkout-rate spread exceeds 60 percentage points — wider than any model-versus-model gap on the leaderboard. No model in the field, at any sample size, produces a 60-point spread purely on its own merits. Almost all of that variance is store-side, and the rigorous run history across thousands of sessions makes the pattern hard to attribute to anything else.

The cleanest predictor we've found is whether the store's MCP implementation is stateless or stateful, and how it handles the boundary between them.

Stateless implementations treat every tool call as self-contained. Cart state lives in the agent's context, or in opaque tokens the agent threads through. Identity is established once and re-asserted on each call. The agent doesn't have to remember anything the server is also remembering, because the server isn't remembering anything. Stores running stateless implementations cluster at the high end of the checkout-rate distribution — frontier agents work well against them because there's no hidden contract; what's in the response is the entire state.

Stateful implementations persist server-side session, cart, and auth across calls, exposed to the agent through session IDs, cookies, or scoped tokens. When this works, it works well. When it breaks — session expiry mid-flow, cart drift between a read and a subsequent write, identity tokens that silently lose scope between tool calls — it produces the failure modes that cluster at the bottom of the per-store distribution. The agent calls a tool the server has quietly desynced from, and the flow fails in ways that don't surface until checkout.

The hybrid case is the most error-prone: stores that are stateless in some tools and stateful in others, without making the boundary explicit in the manifest or the tool response shapes. Frontier agents have no way to infer which category any individual call falls into and tend to default to the stateless assumption — which is exactly the wrong default for the calls that aren't.

Beyond the state axis, the rigorous testing surfaces a consistent set of secondary trip-wires: variant IDs without human-readable axis labels, description strings exceeding 8K tokens for a single product, tool responses including nested HTML in fields agents expect to be plain text, cart endpoints returning success codes for failed mutations. None of these break UCP Score validation. All of them break agent flows.

These are merchant-side fixes, not model-side ones. The strategic implication for any team operating a UCP-enabled store: fixing your manifest and tool responses produces more conversion lift than choosing the right model. That's load-bearing — it's why the integrated Score → Check → Eval workflow exists, and it's where we'd point a team starting from zero on UCP.

Finding 6 — Cart value generated is concentrated in USD and high-AOV verticals

Of the 1,000+ sessions, 96 produced a non-zero cart value. The breakdown:

Currency	Sessions	Total cart value	Avg cart value
USD	85	$95,647.23	$1,125.26
INR	2	₹3,845.00	₹1,922.50
PKR	2	₨4,490.00	₨2,245.00
EUR	5	€296.74	€59.35
ILS	1	₪189.60	₪189.60
GBP	2	£47.99	£24.00

USD cart value totals $95,647 across 85 sessions with an average cart value of $1,125. That figure is heavily skewed by a small number of high-AOV sessions against electronics and high-end apparel stores; the median session cart value is closer to $240. We don't yet have the granularity to break out cart value by store type or model — that's a feature in the eval reporting roadmap.

The cross-currency long tail (EUR/GBP/INR/PKR/ILS) is small but informative. It tells us the framework is handling multi-currency stores correctly end-to-end, including currency-aware variant pricing and locale-correct checkout flows. Worth noting because it's a class of bug that doesn't surface until you actually transact.

Finding 7 — Session volume is now meaningful enough to reveal trajectory

Plotted week-over-week, session volume has three distinct phases over the 80-day window:

UCP Playground weekly session volume, mid-February through late April 2026Trend line showing three phases: a small founding wave in mid-February, a steady-state oscillation through March and mid-April, and a sharp acceleration in late April that produces the largest single week of the dataset.Feb 14Apr 27

Founding wave (mid-February). A small launch surge coinciding with the Why We Built UCP Playground post — first publishers running first sessions, signal that the framework worked end-to-end against real stores.

Steady state (March through mid-April). Weekly volume oscillating in a tight band as more frontier models came online and the eval framework matured. Some weeks heavier than others, but the median stayed roughly flat — characteristic of a tool finding its operational rhythm.

Acceleration (late April). The largest single week of the dataset, driven mostly by a batch of eval-collection runs against stores onboarded after the council expansion announcement. The line bends upward at the end of the window.

The trajectory matters mostly because it lets us start tracking model drift. With several thousand more sessions accumulating over the next quarter, we'll be able to observe how the same model performs against the same store between Q2 and Q3 — the loop that turns the framework from a one-shot benchmark into an actual reliability record.

Finding 8 — The 0.2% flawless-end-to-end rate has improved, slightly

The April State of Agentic Commerce report flagged that of 4,014 verified UCP stores, only 9 delivered a flawless end-to-end agent shopping experience. That's the 0.2% figure that's been quoted around the launch posts — measured by static validation across the full directory.

Eighty days later, with 97 stores tested directly through the eval framework, roughly 0.5–0.7% reach the same bar. That's a higher rate, though the comparison isn't apples-to-apples: direct testing surfaces issues that static validation misses (most of the failure modes in this post fall into that category), and the sample composition has shifted toward more deliberately UCP-aware merchants over the period. The honest read is that the rate looks better and the comparison's loose enough that we'd want a same-methodology re-run on the full directory to call it a real improvement.

What we can say cleanly: for every store running a clean, agent-friendly UCP implementation, there are still 100+ that pass conformance but stumble somewhere in the agent flow. The gap continues to be on the merchant side. We haven't yet seen a model-side improvement large enough to close meaningful ground on it.

Why Playground stays neutral

Every finding above hinges on one design choice: the system prompt and the orchestration loop are generic. Same for every model. Same for every store. No store-specific scaffolding, no model-specific workarounds. That's what makes the framework work as a testing environment.

The temptation to add a workaround when a particular model trips on a particular store is real — there's almost always a one-line patch that would push that store's checkout rate up by ten points against that one model. We don't ship those patches, on principle. The moment we do, the results stop being comparable across the matrix and we're not benchmarking anymore — we're tuning. Vendor stacks already do that work, in vendor-flavoured ways, with vendor-shaped numbers.

Independence here means a specific thing: the orchestration is neutral, the protocol layer is full-featured. Stores get the tools they declare. Identity linking works. Payment handlers pass through. Multi-turn context flows the way the spec defines. What stays generic is the harness around that — the prompts, the turn discipline, the success criteria, the error-handling rhythm.

The reason that design choice matters can be put in two sentences:

If a model doesn't follow the checkout flow, that's signal about the model.
If a store returns the wrong status, that's signal about the store.

Both signals are useful. Both are visible because the orchestration didn't paper over either one. Hiding either defeats the purpose of running the test.

Companies building their own internal infrastructure to evaluate agent behaviour against their own stores is expected, and good. Every serious commerce platform will eventually have something like that running in CI against its own merchants — and the Score → Check → Eval workflow is exactly the surface they should plug into. But the comparison layer — the one that asks how Anthropic's frontier model performs against the same workload Google's, OpenAI's, xAI's, DeepSeek's, and Meta's are also running, against the same stores — has to sit outside all of those organisations. Vendors can't credibly benchmark themselves; the platform layer has the same problem one level down. Independence is the only way the comparisons aggregate into a record anyone can quote.

That's the niche this layer occupies. The leaderboard, the failure-mode taxonomy, the store-side variance pattern in this post only hold up if the orchestration stays neutral. The moment it doesn't, the framework loses the property that made any of it worth publishing.

What we learned building this

The framework didn't ship in May the same shape it shipped in February. Eighty days of running it against real stores produced a steady stream of bugs and surprises that drove the development work — many of them documented in the public changelog. Five worth surfacing.

Cursor stripping unlocked GPT-5.2 search. Through February we had GPT-5.2 at a 0% search success rate on Shopify stores. The cause was a model-side tic: GPT-5.2 always included the optional after cursor parameter on search_shop_catalog calls, filling it with placeholders like "", "null", or "__NONE__" — values Shopify always rejects. A server-side sanitizer that strips invalid placeholders before the call leaves Playground pushed GPT-5.2's search success from 0% to 100% overnight. The model wasn't bad at search; it had a tool-calling habit nobody had isolated yet.

Failed tool calls used to inflate conversion metrics. An earlier version of step detection counted a failed update_cart as a cart_created completion. That bug inflated the cart and conversion numbers on every report we'd published before mid-March. Fixed in 0.9.3 by gating step detection on the tool response's isError flag, plus the same gate on cart-data extraction. The per-model checkout rates in this post are computed under the corrected logic; older snapshots from before that fix may read 5–10 points high on the conversion-side metrics.

REST-only stores forced a transport rework. The v2026-04-08 spec drop in early April brought new tool names (search_catalog replacing search_shop_catalog), new response shapes (price as {amount, currency} objects, descriptions as {plain, html} objects), and a wave of WooCommerce stores that exposed REST-only endpoints rather than MCP. The 0.10.x release line was mostly absorbing that — REST-only store support, a REST tool-call adapter, response-format normalization across spec versions. Pre-04-08 sessions and v2026-04-08 sessions are both in the dataset and tagged appropriately, which is what lets the longitudinal data hold together across a non-trivial spec change.

The GPay token wall built ECP. In a February session, Claude Sonnet 4.5 reached ready_for_complete correctly — and stalled, because the merchant's checkout required a Google Pay payment token the agent couldn't produce. That's the genuine limit: agents shop through the protocol layer cleanly but stop at the secure-credential boundary. The Embedded Commerce Protocol shipped in 0.8.0 to hand control to the merchant's checkout UI at exactly that boundary and resume agent control once the user completes the credential step. A feature directly driven by a finding the framework couldn't have surfaced any other way.

A Playground session became a spec proposal. A live test against houseofparfum.nl exposed a different gap: an identity-linked buyer with a wallet balance hit the checkout, the OAuth flow completed cleanly, the buyer object came back populated — but the wallet was nowhere the agent could see it. payment.instruments was empty, the only declared handler (dev.ucp.delegate_payment) didn't accept the wallet, and the session escalated to the merchant's continue_url every time. Authenticated checkout was provably blocked, by spec. We wrote it up and submitted Proposal #358 to the UCP spec repository — payment.available_instruments, a per-buyer per-session list of usable payment methods (wallet, saved cards, loyalty, gift cards) resolved at runtime from the identity-linked session. Submitted by Benji Fisher (@appdrops) and co-authored with Almin Zolotic (@zologic) of UCPReady, who'd seen the same wall from the merchant side. Currently submitted to the UCP technical council for review. That's the loop the framework is built to feed: multi-store, multi-model testing surfaces a structural gap; the gap goes back into spec governance as a concrete proposal; the next spec drop closes it.

Methodology, briefly

Each session is a real frontier-model agent shopping run against a real UCP-enabled store, captured end-to-end via MCP tool calls. Sessions are initiated either through the public Playground UI (user-initiated, ad-hoc prompts) or through the Evals framework (scripted multi-turn sequences across pre-selected store/model matrices).

Outcomes are tagged at session close: checkout_reached (full transaction completion), cart_created (added items, didn't proceed), search_only (browsed, didn't add), failed (provider error, model refusal, or max-turn exceeded), or info_provided (informational query, no transactional intent).

Every session has a clickable replay link in its source ULID. If you want to audit any single number in this post, the underlying session data is the artifact. That's intentional — independent reproducibility is the point.

Try it

Three concrete next steps:

Run a benchmark against your own store. Create a collection at ucpplayground.com/evals, pick a sequence, pick two models, and compare your store's per-model performance against the aggregate above.
See where individual models stand. Each model on the leaderboard has its own shopping profile with detailed performance data, known issues, and store-by-store breakdowns.
Compare two models head-to-head. The comparison view lets you pit any two models against each other on the same workload — useful before you commit to a primary model for a deployment.

The next data update — likely 2,000+ sessions, refreshed model lineup, and a fuller error-tagging surface — drops in early August.

UCP Requirements: What Your Store Needs Before Going Live

Benji Fisher — Mon, 04 May 2026 12:23:16 +0000

What do you need for UCP? There are two levels of UCP readiness. The first is the minimum viable manifest — the bare requirements to pass validation and appear in the UCP directory. The second is the agent-ready setup — what it actually takes for an AI agent to browse, cart, and check out at your store without friction.

Think of this as your UCP checklist — the minimum requirements plus the recommended prerequisites that separate stores agents can find from stores agents can actually shop. Most guides only cover the first level. This one covers both, grounded in data from 4,024 verified merchants and hundreds of agent testing sessions.

Minimum requirements (pass validation)

These are the fields required to produce a valid UCP manifest on the current v2026-04-08 spec:

1. A JSON file at /.well-known/ucp

The manifest must be publicly accessible at https://yourdomain.com/.well-known/ucp, served with Content-Type: application/json, and reachable without authentication.

Platform notes:

Shopify: handled automatically
WooCommerce: manual publish via plugin or custom route
BigCommerce: manual, served from storefront origin
Magento: manual, typically via custom module

Full publishing guide with code examples: /.well-known/ucp developer reference.

2. ucp.version (required)

A string identifying which spec version the manifest is written against. Current latest: "2026-04-08".

99.4% of verified stores are on this version. If you're starting fresh, use it. If you're on an older version, the spec update post walks through the migration.

3. ucp.services (required)

At least one service entry declaring a transport (mcp, rest, a2a, or embedded) and an endpoint URL. This tells agents where to send requests.

MCP is the dominant transport — ~100% of verified stores declare it. If you're building from scratch, start with MCP. See the transport comparison for the tradeoffs.

4. ucp.payment_handlers (required)

A map of payment handler namespaces. Can be an empty object {} if your store uses checkout-link redirects instead of tokenized payments (common on WooCommerce).

If you declare handlers, use reverse-domain namespaces like com.stripe.card or dev.shopify.card. See the payment handlers directory for examples.

5. signing_keys (required, at root level)

An array of JWK objects at the document root (not nested inside ucp). An empty array [] is valid if you're not signing payloads yet, but the key must be present.

This field moved from ucp.signing_keys to the root in v2026-04-08 — the most common validation warning we see is stores that still nest it.

Recommended setup (agent-ready)

Passing validation gets you into the directory. The requirements below determine whether agents can actually shop your store — the difference between a B+ grade and an A grade in our benchmarks.

6. Capabilities declaration

The ucp.capabilities field is optional per spec but strongly recommended. Without it, agents know your store exists but not what it can do.

Declare every capability you support:

checkout — 99.5% adoption across verified stores
cart — 99.1% adoption
catalog-search — required for product discovery
identity-linking — 3 stores, massive first-mover opportunity
payment — 0 stores, the frontier

Full list: capability registry.

7. Clean variant data

Variant mismatches are the #1 failure mode in agent shopping sessions. Every variant needs a stable ID, a clear name, and consistent representation across discovery and checkout. This is the single highest-impact fix you can make.

8. Responsive MCP endpoint

Latency matters. The average Shopify store responds in ~130ms. BigCommerce stores average ~890ms. Agents have timeout budgets — if your endpoint is slow, sessions drop silently. Target under 500ms for tool responses.

9. robots.txt allowing AI crawlers

Make sure /.well-known/ucp is explicitly allowed in your robots.txt. Some WAFs and CDN configurations block well-known paths by default. Check the common errors guide for the fix.

10. Supported_versions for backward compatibility

Declare supported_versions in your manifest listing both the current and previous spec version. This lets agents that haven't migrated yet still find a valid endpoint:

"supported_versions": {
    "2026-04-08": "https://yourstore.com/.well-known/ucp",
    "2026-01-23": "https://yourstore.com/.well-known/ucp/2026-01-23"
}

The UCP readiness checklist

Requirement	Required?	% of stores that have it
Manifest at /.well-known/ucp	Yes	100% (by definition)
ucp.version	Yes	100%
ucp.services with transport + endpoint	Yes	100%
ucp.payment_handlers	Yes	100%
signing_keys at root	Yes	~97% (rest have it nested)
ucp.capabilities	Recommended	~99% (Shopify default)
Clean variant data	Recommended	Unknown (runtime issue)
Latency < 500ms	Recommended	~95% (Shopify), ~30% (others)
robots.txt allows /.well-known/ucp	Recommended	~99%
supported_versions	Recommended	~70%

Validate your setup

Not sure if you pass? Start with Is My Store UCP Ready? — it walks through the full diagnostic in 60 seconds. Or jump straight to the tool:

Run a live check on your domain — it tests every requirement above in seconds. For runtime issues (variant mismatches, checkout failures), test with real agents in Playground. For ongoing monitoring, set up alerts.

Once you're verified, make sure your listing on UCP Registry is accurate — that's what agents see when deciding which stores to route customers to. And if you're a developer building agents rather than stores, the Build an Agent quickstart covers the other side of the equation.

Check your store now at UCPChecker.com. See how you compare: side-by-side store comparison. Platform guides: Shopify · WooCommerce · BigCommerce · Magento

AI Commerce Needs MLPerf — and Here's an Early Attempt

Benji Fisher — Fri, 01 May 2026 12:07:45 +0000

Validating a UCP manifest takes a second. Scoring it for agent-readiness takes another. Neither of those answers the harder question: when a real frontier agent — Claude or GPT or Gemini, picked by a user three weeks from now — walks up to your store with an ordinary shopping prompt, does it actually complete a checkout? Compared to the next implementation? Across the models people are actually using?

Today there's no shared way to find out. AI commerce has the same coordination problem ML had before MLPerf, web performance had before Lighthouse, and coding models had before HumanEval — and the cost of not solving it is the same: every claim a vendor makes about agent-readiness is currently unverifiable by anyone outside that vendor.

This post is about what we've been building to close that gap.

The pre-benchmark moment

Every category that grew up around AI has gone through a pre-benchmark moment.

Machine learning before MLPerf was a pile of vendor-flavoured numbers. NVIDIA reported one set of throughput claims, Google another, AMD a third — and none of it was directly comparable, because nobody was running the same workload, on the same input, on the same harness. MLPerf — submitted to, run by, and audited across the whole industry — fixed that. Buyers could finally compare. The category matured.

Web performance before Lighthouse was the same. "Fast website" was vibes. PageSpeed Insights gave one number, WebPageTest another, internal RUM dashboards a third. Lighthouse — graded, reproducible, open — fixed it. Today nobody ships a serious site without checking their score.

Coding models before HumanEval were even worse. Every lab benchmarked against its own preferred problems and reported its own preferred metrics. HumanEval, then MBPP, then SWE-bench, then LiveCodeBench, gave the field a shared evaluation surface. Comparisons stopped being marketing.

Agentic commerce is in exactly the place those categories were before their benchmarks landed. The standard has converged — UCP is the open spec the industry is building against, and the public directory tracks 4,500+ verified stores. Major retailers and platforms ship UCP implementations almost weekly. The recent tech council expansion brings in most of the rest. But there is still no neutral, reproducible way to evaluate how well any of those implementations actually work when a real frontier agent tries to shop them.

You can't get this from inside a vendor. Shopify cannot credibly benchmark Shopify stores. OpenAI cannot credibly benchmark OpenAI agents. Even when their numbers are honest, the methodology is theirs, the test conditions favour their stack, and nobody else can rerun it. AI commerce has the same coordination problem ML had before MLPerf, and it solves the same way: a shared evaluation layer, run by a third party, that anyone can audit and reproduce.

Agentic commerce can't mature without that layer. We've built a first credible attempt at one.

What UCP Playground Evals does

UCP Playground Evals is a benchmark framework for agentic commerce. You define a multi-turn shopping conversation, pick the stores and the models you want to evaluate against it, and get back a structured comparison report — funnel matrix, per-session token and duration metrics, error classification, replayable session links, downloadable PDF.

The point isn't the report format. The point is the three properties underneath, because those determine whether a benchmark is worth trusting.

1. Standardised, multi-turn sequences

Agentic commerce is conversational, not single-prompt. A real shopping session looks like "Show me products under $60" → "Add both to my cart" → "Proceed to checkout", with full context carried across turns. That's the unit an eval has to operate on.

Each eval is a scripted sequence of turns. Every turn gets its own orchestrator round (up to 8 internal tool-calling sub-turns) and the full conversation history is preserved across the sequence — so the agent's choices on T2 are conditioned on what it actually saw on T1, the way real user behaviour conditions on real responses. Four collections ship today: Browse & Buy (4 turns, generic shopping journey), Multi-Item (3 turns, multi-product cart composition and checkout), Price Constrained (3 turns, budget-anchored reasoning across a single purchase), and Custom for user-defined sequences.

2. Cross-store comparability

The sequences are intentionally generic. Not "Find Nike Air Max 90 in size 10" but "Show me products under $60". That distinction is load-bearing: it's what makes the same test valid against any store running UCP, and it's what makes results from one store directly comparable to results from another. Without it, every benchmark is apples-to-oranges and nothing aggregates.

The eval runner discovers MCP endpoints automatically from each store's /.well-known/ucp manifest, so any UCP-conformant store works without per-store wiring — Shopify, WooCommerce, BigCommerce, Magento, PrestaShop, and Custom & Headless stacks all work the same way.

3. Multi-model coverage

The same sequence runs against any of 15 frontier models currently wired up — every major lab, plus a reasoning-tuned subset:

Model	Provider	Type
Claude Opus 4.6	Anthropic	Frontier
Claude Sonnet 4.5	Anthropic	Frontier
GPT-5.2	OpenAI	Frontier
GPT-4o	OpenAI	Frontier
Gemini 3.1 Pro	Google	Frontier
Gemini 3 Flash	Google	Frontier
Gemini 2.5 Pro	Google	Frontier
Gemini 2.5 Flash	Google	Frontier
Grok 4	xAI	Frontier
DeepSeek V3.2	DeepSeek	Frontier
Llama 3.3 70B	Meta	Frontier
DeepSeek R1	DeepSeek	Reasoning
QwQ 32B	Alibaba	Reasoning
Grok 3 Mini	xAI	Reasoning
o4-mini	OpenAI	Reasoning

The model is part of the test matrix. Same store, different models, same sequence — directly comparable behaviour, with model-level differences surfaced rather than averaged away. Any two can also be compared side-by-side outside the eval framework, on the same workload.

The math is straightforward

stores × models × sequences = sessions. Two stores × two models × one sequence = four sessions. Each one is a full agent shopping run, captured end-to-end, replayable, and rolled up into the report.

Standardised, reproducible, vendor-neutral. The three properties that make a benchmark worth trusting. Everything else in the framework is built to defend those three.

What the framework actually surfaces

The clearest way to show what evals do is to walk through one. Below is a multi-item checkout report we ran across two stores and two Gemini models in March:

Download the full multi-item checkout report (PDF) →

Two-page report covering the funnel comparison matrix, per-session performance breakdown, evaluator configuration, auto-generated recommendations, and clickable session-replay IDs for every run.

Two stores (oakywood.shop, ugmonk.com). Two models (Gemini 3 Flash, Gemini 3.1 Pro). One sequence (multi-item checkout: search → add → checkout). Four sessions total. The headline numbers:

100% checkout rate across all four sessions
95,513 average tokens per session
48.3s average duration
0 errors across the matrix

That's the boring summary. The interesting parts are in the per-session table.

Store	Model	Tokens	Duration	Turns	Cart value
oakywood.shop	Gemini 3.1 Pro	85,614	93.4s	7	EUR 82.75
oakywood.shop	Gemini 3 Flash	154,294	34.7s	12	—
ugmonk.com	Gemini 3.1 Pro	46,084	35.1s	6	USD 77.00
ugmonk.com	Gemini 3 Flash	96,058	29.9s	11	—

Same sequence, same stores, two models. Gemini 3.1 Pro completes the run in fewer turns and roughly half the tokens of Flash on the same store, but its latency is meaningfully higher when the store itself is slower to respond. That isn't a fact you can extract from a vendor benchmark or a single-model demo. It only shows up when the same scripted run hits multiple models head-to-head, with both numbers landing in the same row.

The auto-generated recommendations point at where the real engineering work is, and they're grounded in the actual run data:

Average token usage is 95,513 — above the 40K baseline. Product descriptions may be inflating context. Consider truncating descriptions in MCP responses.

Average session duration is 48.3s — above the 15s target. Optimise MCP endpoint response times, especially initial search calls.

Those are concrete merchandising actions. They land because the evidence is right there in the per-session breakdown.

The deeper signal shows up across runs against richer stores. In a separate eval against a single shop, two models picked different variant IDs for "Medium" — one mapped Medium to one variant ID, the other to a different one, and neither is provably correct because the store doesn't expose a human-readable size axis in its variant data. That isn't a bug in either model. It's a gap in how the store represents its product axes, and it only becomes visible when two models walk the same path. This is the kind of behavioural divergence between frontier models that evals surface — and that vendor-internal benchmarks can't credibly report.

The same run logged 6/6 prompt-injection resistance across every session, against benchmark prompts seeded in product descriptions and review fields. Useful by itself; more useful as a baseline that future runs can regress against.

What's on the evals roadmap

This is v1. A few things on the roadmap, in priority order.

More eval collections. The four built-in sequences cover the core shopping flow. The next batch is more diagnostic: single-item flow (the simplest path), variant selection accuracy (the size-label gap above, formalised), prompt-injection resistance (already running, becoming its own collection), escalation handling (requires_escalation compliance), attribution accuracy (UTM and referrer handling at checkout hand-off), return policy surfacing.

Public benchmark leaderboards. Same pattern as the UCP Score leaderboard — by-store and by-model rankings against the standard sequences, refreshed on schedule, indexed and shareable. The categories that matured around shared benchmarks (ML, web perf, coding models) all developed public leaderboards — and the leaderboards turned out to be most of the forcing function.

Headless API and CI/CD integration. Already shipped. The full automation surface:

POST /api/v1/collections          — create
POST /api/v1/collections/{id}/run — trigger
GET  /api/v1/collection-runs/{id} — poll status + results
GET  /api/v1/collection-runs/{id}/pdf — download report

The first integration we expect anyone to ship is a deploy-time check: trigger an eval after every UCP manifest deploy, assert checkout_rate >= 80, errors.total == 0, avg_duration_ms < 30000, fail the build otherwise. Same shape as Lighthouse CI for web performance — a regression catch you bolt onto the pipeline rather than rediscover in production. Full developer documentation — authentication, rate limits, and a worked GitHub Actions example — lives at ucpchecker.com/developer-tools, alongside the rest of the public API surface.

Scheduled runs and version tracking. Also shipped. Collections auto-increment versions when their config changes, runs snapshot the config they used, and a cron field on each collection lets you run the same eval on a regular cadence — same Monday-9am sequence every week, before-and-after comparisons whenever the underlying UCP implementation changes. This is how a benchmark becomes a tracking record instead of a one-shot demo.

Cloning and team scoping. Public collections can be cloned into any team workspace; quotas are scoped per team. The intent is community sharing — well-known sequences turning into shared, reusable yardsticks the way SWE-bench problem sets did for coding models.

How evals fit the broader development cycle

Evals don't sit alone. They're the runtime testing surface in a development loop that starts earlier in UCP Checker — manifest validation, agent-readiness scoring, capability coverage analysis. The web performance world solved the same shape with three tools used in sequence: Lighthouse to grade pages, PageSpeed Insights to drill into specific issues, synthetic monitoring to verify behaviour over time. UCP implementations follow the same arc: validate the manifest at /check, score it against agent-readiness criteria with the UCP Score, then run evals against it to see how it actually behaves when a real frontier agent shops it.

Each tool surfaces something different. Score tells you what's missing structurally — which discovery signals, which capabilities, which conformance rules. Check confirms the manifest validates after fixes land. Evals confirms the agent actually behaves correctly when it tries to complete a real flow. None is sufficient on its own; together they're the development feedback loop UCP needs. We've watched developers iterate across the whole thing in a single session — score the implementation, fix the gap server-side, re-check the manifest, then run an eval to confirm the agent now closes a checkout it couldn't before.

If you're starting from zero on a UCP implementation, the natural sequence is: get a Score first to see what's missing, fix the highest-impact issues, run a Check to confirm the manifest validates cleanly, then run Evals to confirm real agents complete the flows you care about. CI covers the long tail — automated scoring on each deploy, scheduled evals weekly, alerts when capabilities regress.

Methodology and verification

Three properties separate a credible benchmark from a marketing claim. UCP Playground Evals are designed around all three.

Every result links to a replayable session. Each eval session generates the same agent_sessions data the public Playground UI produces — full tool-call timeline, model responses, token-by-token event stream, every retrieved page. The session IDs in any report are clickable. Open one and you see exactly what the agent did, turn by turn, on which tool call, with which response. The sample report above lists four such IDs (e.g. 01KMJZM5MG2CA4QN5M983H19E1) and each resolves to a full replay at ucpplayground.com/sessions/{id}. This isn't a marketing claim; it's a verifiable test you can audit.

Every collection is versioned. When the configuration of a collection changes — turns added, models swapped, store list updated — the version increments and every run snapshots the config it ran against. Anyone questioning a result can reproduce the exact methodology used at that moment. The PDF report itself prints the collection version at the bottom of every page; the sample above is Collection v3. Versioning is what stops "we got better results" from quietly sliding into "we changed the test" — the same constraint MLPerf submission rules enforce on hardware vendors.

The methodology is open. The framework configuration shape is documented — the turns, the orchestrator loop, the stop conditions, the success metrics, the PDF schema. Anyone can build the same test, run it against any UCP store, and get back a directly comparable report. If we get a methodology choice wrong, the path to disagreement is technical, not promotional.

That's the credibility floor. Everything else in the product builds on it.

About UCP Checker and UCP Playground

UCP Checker is the independent validation and monitoring layer for the Universal Commerce Protocol. We crawl, validate, and grade every public UCP manifest in the open web, run the merchant directory and the UCP Score, publish the leaderboard and adoption stats, and ship developer tools — the validator, bulk checker, browser extension, public dataset, and a public REST API. The whole dataset is open, indexed, and ungated.

UCP Playground is the agent shopping layer that sits next to it — same data model, same /.well-known/ucp discovery, same replayable session format. UCP Playground Evals is the benchmark surface on top of that. Together they form the third-party scoreboard the ecosystem can build trust on top of — the SSL Labs and Lighthouse of agentic commerce, depending on which side you're looking from.

Try it

The interesting eval gaps are the ones nobody's tested yet. If a result surprises you — your own store, a competitor's, a model you assumed was a clear winner that turns out not to be — let us know.

Three concrete next steps:

Run an eval against your own UCP store. Create a collection at ucpplayground.com/evals, pick a sequence, pick two models, run it. The four-session example above is the shape most first runs take.
Read a public eval report. Sample reports are linked from the framework page. Each has clickable session IDs you can replay end-to-end.
Wire it into CI. The developer tools page covers authentication, rate limits, and a GitHub Actions worked example. The assertion shape is the same one Lighthouse CI uses for web performance — checkout_rate, errors.total, avg_duration_ms instead of LCP and TBT.

Is My Store UCP Ready? How to Check in 60 Seconds

Benji Fisher — Thu, 30 Apr 2026 10:25:51 +0000

The short answer: enter your domain here and you'll know in under 60 seconds. This UCP ready check runs the same validation that AI agents use to decide whether your store is worth shopping.

The longer answer — what "UCP ready" actually means, why it matters, and what to do about the result — is what this post covers.

What UCP readiness means

A store is "UCP ready" when it publishes a valid manifest at /.well-known/ucp that AI shopping agents can discover, parse, and act on. That's the technical definition.

In practice, there are three levels:

Level 1: Verified

Your manifest exists, returns valid JSON, and passes schema validation against the current v2026-04-08 spec. You appear in the UCP directory. Agents can find you.

As of this month, 4,024 stores are at this level.

Level 2: Agent-functional

Agents can actually shop your store — not just discover it. Your MCP endpoint responds, your product data is clean, your checkout flow completes without errors. You score B+ or higher on the Playground leaderboard.

422 stores are at this level. The gap between "verified" and "agent-functional" is where most common errors live.

Level 3: Optimized

Agents complete purchases reliably across multiple models. Your variant data is clean, your latency is low, your capabilities go beyond the defaults. You score A. Only 9 stores are here today.

The UCP requirements checklist breaks down exactly what each level requires.

How to check your store

Step 1: Run the checker

Go to UCPChecker.com/check and enter your domain. When you check your UCP status, the checker will:

Fetch /.well-known/ucp from your domain
Validate the JSON against the current spec
Check your robots.txt for AI bot policies
Inventory your declared capabilities, transports, and payment handlers
Verify your UCP compliance and report every error and warning with specific error codes

The whole process takes about 1 second. You'll get a full diagnostic report on your status page.

Step 2: Read the result

Verified (green) — your manifest is valid. You're in the directory. Agents can find you. Check the warnings section for things to improve.

Invalid (amber) — your manifest exists but fails validation. The diagnostic panel shows exactly which fields are wrong or missing. Most invalid manifests are one fix away from passing — usually a missing required field or a misplaced signing_keys.

Not Detected (grey) — no manifest found at /.well-known/ucp. Your store isn't UCP ready yet. See the requirements post for what to publish.

Blocked (orange) — your robots.txt or firewall is preventing access to the manifest. The diagnostic will tell you whether it's a robots.txt rule or an HTTP-level block.

Step 3: Fix what's broken

The checker tells you what is wrong. Here's where to go for how to fix it:

Platform-specific guides: Shopify · WooCommerce · BigCommerce · Magento
Manifest reference: /.well-known/ucp developer guide
Error-by-error fixes: Common UCP errors
Spec changes: v2026-04-08 update

Step 4: Test with real agents

Schema validation tells you if your manifest is syntactically correct. It tells you nothing about whether an agent can actually buy something from your store. For that, you need UCP Playground — it runs real AI agent sessions against your store and shows you exactly where the flow breaks.

The agent testing data shows that the most common runtime failure is variant mismatches — clean product data matters more than perfect schema.

Step 5: Monitor

Your UCP endpoint is a live API. Platform updates, catalog changes, and CDN reconfigurations can break it silently. Set up UCP Alerts to get emailed the moment your status changes — before agents notice.

How you compare

Once you're verified, see how your store stacks up:

Compare side-by-side with a competitor or partner store — capabilities, transports, payment handlers, latency.
Browse your platform — see all verified Shopify, WooCommerce, BigCommerce, or Magento stores ranked by capability depth.
Check the leaderboard — stores graded A through F on real agent shopping performance.

Why this matters now

UCP adoption is accelerating. 1,400+ new merchants were discovered in April alone. Shopify migrated its entire fleet to the latest spec in four days. BigCommerce, WooCommerce, and Magento stores are appearing every week.

Am I UCP ready? The question isn't whether your store will need UCP. It's whether you'll be ready when agents start shopping — and they already are.

Before you check, it helps to understand the building blocks: capabilities define what your store can do for agents, payment handlers define how agents pay, transports define how agents connect, and product discovery is the flow agents actually run when they shop.

Make sure your listing on UCP Registry is accurate once you're verified — that's how agents find you in the first place.

Check your store now →

Build your own agent: developer quickstart. Understand the protocol stack: MCP vs UCP vs AP2. Monthly ecosystem data: State of Agentic Commerce.

Introducing the UCP Score: A 0–100 Agent-Readiness Grade for Every UCP Store

Benji Fisher — Wed, 29 Apr 2026 09:41:44 +0000

After every status check on UCPChecker, the same follow-up question lands in our inbox: "OK, my manifest is verified. But is it actually any good?"

That question comes from everywhere. Engineering leads who shipped a manifest last quarter and want to know if it would actually carry an agent through checkout. Platform teams pitching agent-readiness to merchants who need a number, not a status pill. Analysts trying to chart "how Shopify compares to WooCommerce" and finding that "verified" tells them next to nothing. Developers picking which UCP store to integrate with first. AI agent builders deciding whose endpoints to feature in demo flows. Store owners benchmarking against direct competitors before a quarterly review.

None of these audiences really care that a manifest exists. They care about how good it is. Whether it has the surface signals that keep AI shopping agents finding it. Whether the declared transports actually respond when you call them. Whether the spec and schema URLs in the manifest resolve, or quietly 404 the moment a strict agent tries to validate the response shape. The interesting answer is always graded.

Until today, the only way to answer that question on UCPChecker was to read every line of the validator output and squint. So we built the thing people were already trying to do manually.

Get a UCP Score for any domain at ucpchecker.com/score →

What the UCP Score is

A 0–100 composite grade that measures how agent-ready any UCP store actually is. Not "does the manifest exist" — that's the status page. How well does it work for agents.

The score maps to a single letter grade you can share, embed, or watch over time. Bands are deliberately calibrated to match Lighthouse and SSL Labs — A is meant to be hard to earn:

A (85–100) — Agent-ready. Valid manifest, strong discovery, broad capability coverage.
B (70–84) — Solid. Minor gaps or one weak category, agents can still transact.
C (50–69) — Partial. Manifest works but missing capabilities or surface signals.
D (30–49) — Weak. Manifest reachable but invalid or near-empty.
F (0–29) — Failing. Blocked, unreachable, or no manifest detected.

Every score breaks down into three weighted categories so you can see exactly where the points come from:

Agent Discovery (30%) — Can agents find and reach you? HTTPS, reachability, agent-friendly robots.txt, plus the surface signals that keep you in the conversation: /llms.txt, sitemap.xml, Open Graph tags, Organization JSON-LD, mobile viewport meta.
UCP Conformance (40%) — Does the manifest validate against the spec? Validity is 3× weighted in this category — an invalid manifest cannot score above ~50 here, regardless of how good the surface polish is.
Capability Coverage (30%) — What can an agent actually do at your store? Declared transports (REST/MCP/A2A), checkout, payment handlers, and breadth of capabilities. When functional probes run, declared transport endpoints that don't actually respond drag this score down.

The composite is a straight weighted average: Discovery × 0.30 + Conformance × 0.40 + Capabilities × 0.30. No tricks, no hidden weights. The full ruleset is documented in our methodology.

What you actually get

Every score URL is a live page at /score/{your-domain}, indexed and shareable. Open one and you don't just see a number:

Top priorities — The three highest-impact issues we found, ranked by impact × effort. Start here.
Impact vs Effort matrix — Quick Wins / Strategic / Incremental / Consider Later quadrants so you can plan a sprint instead of staring at a wall of warnings.
Recommendations with copy-paste fixes — Every flagged issue surfaces a snippet you can drop straight into your manifest, robots.txt, sitemap, or HTML <head>. Hit "Show fix", copy, paste, redeploy, re-check.
Platform-aware percentile — "You're at p72 latency vs the median Shopify store." Because comparing your latency against the whole directory is meaningless when half of it runs on a fundamentally different infrastructure profile.
Full check breakdown — Every signal we evaluate, grouped by category, with a "why it matters" paragraph alongside each check. No black boxes.
Save this report — We re-run the full check weekly and email you only when something material changes. Score drops, capability regresses, status flips. Free, no marketing, unsubscribe anytime.

The page is ungated. No signup, no paywall, no "create an account to see the breakdown." We're indexing every score — just like SSL Labs grades and PageSpeed scores. Public scores create a baseline and pressure for the ecosystem to improve, in the same way SSL grades did for HTTPS adoption.

Why we built it

The honest answer: verified-or-not is the wrong question now.

When the UCP spec first landed in January (v2026-01-11), finding a verified store at all was novel. The bar was "did anyone publish a manifest." The status page was the right product for that moment, and it still is for the discovery layer.

The directory has 4,500+ verified domains today. Verified isn't novel. The interesting question shifted to "how well does this thing actually work for agents," and nobody had a good answer to that — including us.

When we ran a deeper analysis for our April State of Agentic Commerce report, the gap was stark: out of 4,014 verified UCP stores, only 9 delivered a flawless end-to-end agent experience. A 0.2% flawless rate. The other 99.8% had a manifest published — they just didn't actually work as well as that manifest suggested. That gap between "verified" and "actually works" is the central infrastructure problem in agentic commerce today. The UCP Score makes that gap visible, measurable, and addressable.

There's a clear analogue: PageSpeed before Lighthouse. Pre-Lighthouse, web performance optimisation was vibes. People knew slow sites were bad and fast sites were good but couldn't quantify "how slow" or "compared to what." Lighthouse gave them three things — a graded score, a category breakdown, and copy-paste optimisations — and the field changed overnight. Nobody ships a serious site today without checking their Lighthouse score first.

The agentic commerce ecosystem is at exactly that pre-Lighthouse moment. There's no shared yardstick for agent-readiness. Stores have no way to tell whether the integration they shipped last month is competitive. Platform teams have no way to back up "our merchants are more agent-ready" with a number. AI agent builders have no way to filter "show me the stores most likely to actually complete a transaction."

The UCP Score is meant to be that yardstick. Lighthouse for agentic commerce.

How we built it (the short version)

Three signal sources, one composite:

Static analysis — The same manifest validator that powers /check and /ucp-validator. Validity, version format, signing keys, payment handlers — every spec rule turned into a check row.
Surface signals — Five public files and meta tags fetched in parallel: /llms.txt, /sitemap.xml, Open Graph, Organization JSON-LD, viewport. Presence + content captured (with a content hash for change detection on llms.txt so we can spot when a brand updates their LLM brief).
Functional probes (opt-in) — Two probe families. Transport probes hit each declared transport endpoint with a benign request (MCP gets a tools/list, REST/A2A get a GET). URL resolution probes fetch every spec and schema URL declared in the manifest. Probes only run on user-triggered checks — not on the 24h cron sweep, because hammering 4,500 merchants daily with a dozen extra HTTP requests each isn't neighbourly.

Each signal feeds one category sub-score (0–100), and the composite is the weighted average. Recommendations join error codes against a fix library so every flagged issue surfaces a copy-paste snippet — the same pattern Lighthouse uses for its audit list. The whole pipeline runs on the same 24h cycle as the rest of the directory; checks you trigger manually run the full probe stack.

If you want the deep version, the methodology page walks through every category, every check, every grade band, and the "what we don't score" list.

What you can do with it

A few workflows the score unlocks immediately:

Pre-merge gate — Add a check in your CI that fails the build if your /score/{domain} drops below B. Same pattern as Lighthouse CI. The score URL is stable and the JSON breakdown lands in the API soon.
Platform comparison — The /platforms page now shows average UCP Score by platform — Shopify vs WooCommerce vs BigCommerce vs Magento at a glance. Useful both for picking a stack and for benchmarking the one you're on.
Leaderboard — The leaderboard is now ranked by UCP Score with sortable columns for each sub-score. Filter by platform to see the top stores on your stack.
Monitoring — Save any report against your email. We re-run it weekly and alert you on regressions. Score drops, capability disappears, status flips — one email, free, no marketing.
Competitive benchmarking — Run Allbirds vs Casper and see grades side by side. The compare page picks up score data automatically.

What's next

This is v1. A few things already on the roadmap:

Score history & sparkline — Save a report and you'll see your score trend over time. We're tracking every check in our history table from day one, so the data exists; the visual lands shortly.
Score API — GET /api/v1/score/{domain} returning the full breakdown as JSON. The data feed is already public; the score endpoint is the same data behind a stable contract.
Spec-version-aware scoring weights — As new UCP spec versions land with new emphasis, scoring rules for each version live in config and absorb cleanly. Already version-aware for validation; widening to scoring weights too.

We've also taken pains to make the system absorb future spec releases without a rewrite. Static check copy lives in config, not hardcoded; new error codes plug into the recommendations engine via a single config entry. The next spec drop should land as a configuration change, not a refactor.

About UCP Checker

UCP Checker is the independent validation and monitoring layer for the Universal Commerce Protocol. We crawl, validate, and grade every public UCP manifest in the open web, run the public merchant directory, publish the leaderboard and adoption stats, and ship developer tools — the validator, the bulk checker, the browser extension, and now the UCP Score. Everything is free, indexed, and ungated; the dataset is published openly under CC-BY 4.0. Think of us as the SSL Labs of agentic commerce — the third-party scoreboard the ecosystem can build trust on top of.

Try it

Pick any domain. Type it into ucpchecker.com/score and you'll have a graded report in under a second. If you find a score that surprised you — yours or a competitor's — let us know. The interesting score gaps are the ones nobody's looked at yet.

Get a score: ucpchecker.com/score
See the leaderboard: ucpchecker.com/leaderboard
How it's calculated: ucpchecker.com/methodology
Compare two stores: ucpchecker.com/compare
Track adoption live: ucpchecker.com/stats
Get notified on changes: ucpchecker.com/alerts

UCP Tech Council Expands: What the Meeting Minutes Tell Us About Where the Protocol Is Heading

Benji Fisher — Sun, 26 Apr 2026 21:37:00 +0000

On Friday just gone, five of the largest technology companies in the world quietly joined the governing body of the Universal Commerce Protocol. No press release. No blog post. Just a commit to MAINTAINERS.md in the spec repository.

Amazon. Meta. Microsoft. Salesforce. Stripe. All now have seats on the UCP Tech Council — the body that reviews, debates, and approves every change to the protocol that AI shopping agents use to buy things.

We know this because we read the meeting minutes. Every week, the TC meets to debate spec changes, vote on PRs, and argue about how agent commerce should work. Most people in the industry don't read these minutes. We do — and what they reveal about where UCP is heading is more interesting than any announcement.

This is what the minutes tell us.

The expansion: who joined and why it matters

The Tech Council grew from roughly 12 seats to 16 members across 8 companies:

Company	Representatives	Role
Google	4 seats	Founding sponsor, spec steward
Shopify	4 seats (incl. 2 new)	Largest platform implementer
Amazon	Greg Smith (new)	The world's largest online retailer
Meta	James Andersen (new)	Social commerce, Instagram Shopping
Microsoft	Patrick Jordan (new)	Copilot, enterprise commerce
Stripe	Prasad Wangikar (new)	Payment infrastructure
Salesforce	Scot DeDeo (new)	Commerce Cloud, enterprise retail
Etsy	Imran Hoosain	Marketplace commerce
Target	Maxime Najim	Enterprise retail
Wayfair	Naga Malepati	Furniture/home goods

This isn't ceremonial. The TC has binding authority over spec changes — every PR that ships in a UCP release has been reviewed and voted on by this group. When Amazon and Stripe join that table, it changes what gets prioritised, what gets debated, and ultimately what the protocol becomes.

The meeting minutes from March 13 first mentioned the election process: seats rotating every six months, with growing partner interest. By March 27, six nominations had been received. The final review was scheduled for April 10. The MAINTAINERS.md update landed April 24.

The new members are already contributing. James Andersen (Meta) submitted PR #367 on April 17 — a documentation PR clarifying network token usage and PCI scope in card credentials. Patrick Jordan (Microsoft) contributed documentation accuracy fixes the same day. These aren't advisory seats. They're engineering seats.

What the meeting minutes actually say

We reviewed the six TC meetings from March 6 through April 17. Here's what's being debated, decided, and built — translated for a merchant audience.

Identity linking is the top priority — and it's hard

The single most discussed topic across all six meetings is identity linking — how an agent knows who the customer is across sessions, stores, and platforms.

The April 17 minutes show an active debate about OAuth 2.0 scope design: nested scopes vs flat scopes vs config maps. The TC favoured flat. PR #354 implements OAuth 2.0 as the foundation for identity linking with capability-driven scopes.

Why this matters for merchants: Identity linking is the missing piece that would let an agent complete a purchase without a checkout-page handoff. Right now, agents can browse and cart — but paying requires redirecting the customer to a human checkout flow. Identity linking + payment handlers would close that loop. Until then, agents rely on the transport layer to reach the store and the manifest endpoint for discovery. Our April state-of-commerce report showed only 3 stores out of 4,024 currently declare identity linking capability. The spec work happening now is what will eventually bring that number up.

Loyalty is being trimmed to ship faster

The TC has been debating loyalty schemas since March. PR #340 implements a loyalty extension for the checkout capability. The April 10 minutes note that the extension is being "trimmed to baseline use cases" — a pragmatic decision to ship something that works for simple loyalty programs now, rather than waiting for a comprehensive solution that handles every edge case.

Why this matters: If your store has a loyalty or rewards program, the spec is building the infrastructure for agents to verify loyalty status and redeem points as part of the checkout flow. This is early — don't build against it yet — but understand that it's coming and it's being shaped by people at Google, Shopify, Etsy, and Target who run real loyalty programs.

Local commerce is on the roadmap

The April 3 minutes list Q2 priorities. Among them: local commerce. PR #375 proposes store-based local inventory and fulfilment options — the infrastructure an agent would need to answer "is this product available at a store near me?"

This is Target and Wayfair territory. Both have TC seats. Both have store networks. The fact that local commerce is a Q2 priority with retail representation on the council suggests it's not theoretical.

Returns are "incredibly complicated"

The April 17 minutes include the most honest assessment we've seen in any spec discussion: returns are acknowledged as an "incredibly complicated domain." This is refreshing. Most protocol specs pretend returns are simple. UCP's TC is saying out loud that they're not, and that getting them right will take time.

PR #257 from the February cycle introduced a returns extension. It's still in review. The complexity is in modelling return windows, refund methods, partial returns, and eligibility rules — all of which vary by merchant, product, and jurisdiction.

Why this matters: Don't expect agent-managed returns in 2026. But understand that the protocol is building toward it, and the merchants who implement return policies as structured data (not just PDF links) will be ahead when it ships.

The spec itself just shipped its biggest release ever

v2026-04-08 landed with 60+ merged PRs — the largest release since the protocol launched. Key additions:

Cart capability — basket building for agents, a prerequisite for multi-item flows
Catalog search + lookup — formalised product discovery as a spec capability
Request/response signing — cryptographic integrity for agent-store communication
Error handling overhaul — first-class errors, business logic error types
Eligibility claims — for loyalty, membership, and verification-gated pricing
Discount extension to cart — discounts now apply pre-checkout, not just at checkout
Risk signals — authorization and abuse metadata for fraud prevention

Our crawler showed Shopify migrating its entire fleet to v2026-04-08 in four days. 99.4% of verified stores are now on the latest spec.

What this means for you

If you're a merchant

The governance expansion doesn't change what you need to do today. Your UCP requirements are the same: valid manifest, declared capabilities, clean variant data. Check your store, fix any common errors, compare against competitors, and set up alerts so you know if anything breaks.

What it does change is the timeline and the confidence. When Amazon, Microsoft, and Salesforce have engineering seats on the governing body, the protocol is not going away. If you've been waiting for a signal that UCP is "real enough" to invest in — five of the ten largest technology companies joining the TC in a single commit is that signal.

If you're a platform

If you run Shopify, you're covered — platform-level UCP support is mature. If you run BigCommerce, WooCommerce, Magento, or a custom stack, watch the identity linking and loyalty PRs. These are the capabilities that will differentiate agent-ready platforms from agent-compatible ones in H2 2026.

Salesforce Commerce Cloud now has a seat at the table. If you're on SFCC, this is the clearest signal yet that platform-level UCP support is coming. Our April report noted that we've already seen SFCC engineering work in progress.

If you're building agents

The Build an Agent quickstart still works — the protocol surface you're building against is stable. But start tracking the identity linking PRs. When that capability ships, the agent flow goes from "browse + cart + redirect to checkout" to "browse + cart + pay" — end-to-end autonomous purchasing. That's the step change.

Check the store leaderboard to find the highest-performing targets, understand how product discovery works, and test your agent against real stores in UCP Playground and use UCP Registry for production discovery. Both will surface the new capabilities as they ship.

The reading list

For anyone who wants to follow the protocol's evolution themselves:

Meeting minutes: github.com/Universal-Commerce-Protocol/meeting-minutes
Spec repo: github.com/Universal-Commerce-Protocol/ucp
v2026-04-08 release notes: github.com/Universal-Commerce-Protocol/ucp/releases/tag/v2026-04-08
MAINTAINERS.md: github.com/Universal-Commerce-Protocol/ucp/blob/main/MAINTAINERS.md
Active PRs: github.com/Universal-Commerce-Protocol/ucp/pulls

We'll continue monitoring the spec, the TC minutes, and the 4,500+ merchants building on the protocol. If any of the Q2 priorities (identity, loyalty, local commerce) ship in spec form, we'll cover them in the May state-of-commerce report.

Check your store's UCP status at UCPChecker.com. Browse verified stores at UCPRegistry.com. Test agent performance at UCPPlayground.com. Read the full protocol stack: MCP vs UCP vs AP2.

Agentic Commerce Optimization: What 4,491 Merchants Reveal About UCP Readiness

Benji Fisher — Thu, 23 Apr 2026 11:53:42 +0000

Agentic Commerce Optimization: What 4,491 Merchants Reveal About UCP Readiness

Every UCP technical guide tells you how to get UCP ready. We decided to measure who actually is.

Since UCP launched, UCP Checker has tracked 4,491 merchants — 4,024 of which are verified and actively serving UCP endpoints. We maintain the largest UCP index of live merchant implementations, and the data tells a story that no theoretical guide can. We've run over 1k agent testing sessions in UCP Playground, consumed 43 million tokens doing it, and watched real AI agents attempt to browse, cart, and buy products across every major ecommerce platform. The result isn't a theoretical framework for agentic commerce optimization. It's a field report.

And the field looks very different from what the guides tell you.

What "Agentic Commerce Optimization" Actually Means When You Have Data

The term "agentic commerce optimization" — or ACO — has entered the SEO lexicon as a catch-all for making your store ready for AI-powered shopping agents. Most of the early writing treats it like a checklist: add Schema.org markup, update your Merchant Center feed, structure your product data. That advice isn't wrong. It's just incomplete, because it's built on assumptions about how agents will behave rather than observations of how they actually do.

ACO, measured empirically, is the practice of optimizing your ecommerce stack for the specific patterns that AI agents exhibit when they interact with UCP endpoints. Those patterns are surprising. Agents don't browse the way humans do. They don't use carts the way humans do. And the failure modes that block them from completing purchases are not the ones you'd predict from reading the spec alone.

The data we've collected across 4,024 verified UCP merchants tells a concrete story about what matters, what doesn't, and where the real optimization opportunities are hiding.

The Real State of UCP Readiness

Let's start with what's working. Of the 4,024 verified merchants in UCP Registry — the open UCP directory where agents discover merchants — capability adoption breaks down like this:

Checkout: 4,003 merchants (99.5%)
Cart: 3,987 merchants (99.1%)
Product discovery: Near-universal
Identity: 3 merchants
Payment: 0 merchants

Read those last two numbers again. Three merchants support identity. Zero support native payment. This is the defining feature of UCP's current state: the bottom of the funnel is wide open, but the capabilities that would make agentic commerce truly autonomous — knowing who the customer is and processing payment without a handoff — are functionally nonexistent.

The spec migration numbers are more encouraging. When the v2026-04-08 specification dropped, 3,994 out of 4,022 tracked merchants had migrated within four days. That's a 99.3% adoption rate in under a week, which speaks to the platform-driven nature of UCP rollout. Most merchants aren't manually implementing UCP. Their platform is doing it for them, and the platforms shipped the update fast.

Platform-by-Platform Reality

The theoretical guides will tell you that UCP readiness is about your structured data and feed configuration. In practice, it's mostly about which platform you're on. Here's what we've seen across the major players.

Shopify: The Default Winner

Shopify accounts for roughly 74% of identified platforms in our dataset (898 of the platform-identified merchants). This dominance isn't because Shopify merchants are more proactive about UCP — it's because Shopify rolled out UCP support at the platform level, giving every store baseline compliance automatically.

Out of the box, a Shopify store gets functional product discovery, cart, and checkout endpoints. The Schema.org markup is handled. The Merchant Center feed attributes are populated. For the average merchant, getting UCP ready on Shopify means verifying that your product data is clean rather than building anything from scratch.

The downside: Shopify's one-size-fits-all approach means limited customization of UCP behavior. If you need to implement conversational commerce attributes like substitution logic or compatibility data, you're working within Shopify's constraints. But for baseline agentic commerce readiness, nothing else comes close to the out-of-the-box experience.

WooCommerce: Flexible but Inconsistent

WooCommerce stores show the widest variance in UCP readiness. The open-source model means implementation quality depends entirely on which plugins a merchant has installed and how they've configured their stack. We've seen WooCommerce stores with excellent structured data and smooth agent interactions right next to stores where basic product attributes are missing or malformed.

The flexibility is a genuine advantage for merchants who want to implement advanced ACO features — conversational attributes, detailed return policies, rich product relationships. But the inconsistency is a problem for agents, which need predictable data structures to operate reliably. If you're on WooCommerce and serious about agentic commerce optimization, an audit of your specific UCP endpoint output is essential, not optional. Run your store through UCP Checker and see what an agent actually encounters.

BigCommerce: Strong APIs, Broken Images

BigCommerce has a genuine technical advantage in its API architecture. The platform's API-first design translates well to UCP's endpoint model, and the stores we've tracked generally produce clean, well-structured UCP responses.

But there's a specific, persistent issue: BigCommerce's S3-hosted image URLs break agent image parsing. This is a real failure mode we've observed in Playground sessions. When an agent can't parse product images, it loses a significant input signal for product matching and variant selection. For a platform that otherwise has strong UCP fundamentals, this is an unfortunate gap — and one that BigCommerce merchants should pressure their platform to fix. For now, it's worth investigating whether your image delivery pipeline produces URLs that agents can reliably consume. Our BigCommerce guide walks through the specifics.

Magento (Adobe Commerce): Enterprise Muscle, Enterprise Complexity

Magento implementations tend to be enterprise-grade, which means the UCP output is thorough but the setup complexity is high. These stores generally have rich product data, detailed catalog structures, and the kind of attribute depth that agents love. But the implementation burden falls more heavily on the merchant's development team compared to Shopify or BigCommerce, where the platform handles the heavy lifting.

If you're on Magento and aren't UCP ready yet, expect a meaningful engineering investment. If you have started, you're probably in good shape — the platform's data model maps well to what UCP expects, especially for multi-variant products and complex catalog hierarchies. See our Magento guide for implementation specifics.

What Agents Actually Do (vs. What Guides Tell You to Optimize For)

Here's where our data diverges most sharply from the advisory content circulating about UCP preparation.

Agents Skip the Cart

The conventional model of ecommerce — browse, add to cart, review cart, checkout — doesn't describe how AI agents behave. In our Playground data, we've recorded 395 checkout operations versus just 104 cart operations. Agents are going direct to checkout nearly four times more often than they're using the cart.

This has major implications for agentic commerce optimization. If you've invested heavily in cart-level features — upsells, cross-sells, minimum order messaging, cart-based promotions — agents are likely bypassing all of it. The checkout endpoint is where the action happens. Your optimization effort should weight accordingly — compare your store against competitors to see where you stand: make sure checkout handles single-product and multi-product flows cleanly, with clear variant specification and unambiguous pricing.

Variant Mismatches Are the Top Failure Mode

Cart variant mismatches remain the most common reason agent sessions fail to complete a purchase. An agent selects a product, identifies the desired variant (size, color, configuration), and submits a cart or checkout request with a variant ID that doesn't match what the endpoint expects. The session stalls or errors out.

This isn't an agent intelligence problem — it's a data clarity problem. Stores with clean, unambiguous variant structures and consistent ID schemes see dramatically higher agent completion rates. Stores with complex variant matrices, inconsistent naming, or variant IDs that change between API responses create confusion that even the best models struggle to resolve.

If you do one thing for ACO today: audit your variant data. Make sure every variant has a stable identifier, a clear human-readable name, and consistent representation across your discovery and checkout endpoints.

Token Consumption Tells You Where Agents Struggle

We've consumed 43 million tokens over 1,000 Playground sessions. The per-session cost varies dramatically based on store complexity and model choice, but a telling pattern emerges in checkout flows: completing a purchase takes approximately 55,000 tokens with the best-performing models.

That number is a proxy for friction. A 55K-token checkout means the agent is making multiple round-trips, parsing product data, resolving variants, handling errors, and re-trying. Stores that produce clean, predictable UCP responses see lower token counts — which directly translates to faster agent interactions and lower cost for the platforms running these agents at scale.

Model Performance Varies Significantly

Not all AI models handle UCP interactions equally. Claude Sonnet 4.5 leads our Playground leaderboard with 205 sessions, and the checkout completion rate across all sessions sits at 41%. That might sound low, but consider what it represents: four out of ten fully autonomous purchase attempts succeed end-to-end, without any human intervention, across a diverse set of merchants with varying UCP implementation quality.

The model performance gap matters for merchants because it signals where your UCP implementation has rough edges. If top-tier models struggle with your checkout flow, every agent will struggle. Testing your store in UCP Playground with multiple models gives you a direct read on where your implementation creates unnecessary friction.

The Capabilities Gap That Will Define Winners

Go back to those adoption numbers: identity at 3 merchants, payment at 0. These aren't just gaps — they're the entire frontier of competitive differentiation in agentic commerce.

Right now, every UCP checkout ends with a handoff. The agent gets the customer to the point of purchase, then drops them into a traditional checkout flow to enter their identity and payment information. That handoff is where conversion dies. Every redirect, every form field, every authentication step is a chance for the customer to abandon.

The merchants who figure out identity and payment first — who let an agent complete a purchase end-to-end without a handoff — will have a structural conversion advantage that no amount of Schema.org optimization can match. This is where UCP's roadmap points: loyalty integration, post-purchase management, multi-vertical capabilities. But the foundation is identity and payment.

We don't yet know what the winning implementation pattern looks like for these capabilities. The spec supports them, but the ecosystem hasn't built them. This is the space to watch, and the space where early investment will pay disproportionate returns.

An Optimization Checklist Grounded in Data

Most ACO checklists are derived from the spec. This one is derived from watching >1,000 agent sessions succeed and fail across 4,024 merchants. Here's what actually moves the needle, ranked by observed impact:

1. Fix your variant data first. Stable IDs, clear names, consistent representation across endpoints. This is the single highest-impact fix based on our failure-mode analysis.

2. Optimize for direct-to-checkout flows. Agents skip the cart. Make sure your checkout endpoint handles product selection, variant specification, and pricing in a single clean interaction.

3. Audit your product images. If you're on BigCommerce or any platform using CDN-hosted images with complex URL structures, verify that agents can parse your image URLs. Broken image parsing degrades product matching accuracy.

4. Migrate to the latest spec version immediately. The v2026-04-08 migration happened in four days across the ecosystem. If you're still on an older version, you're already behind 99.3% of verified merchants.

5. Test with actual agents, not just validators. Schema validation tells you if your markup is syntactically correct. It tells you nothing about whether an agent can actually complete a purchase. Run your store through UCPPlayground and watch what happens.

6. Validate your full UCP endpoint output. Use UCPChecker to see exactly what your store exposes to agents — capabilities, product data, structured attributes — and where the gaps are.

7. Clean up your Merchant Center feed. Return policies, product identifiers, and the native commerce attributes that feed into UCP discovery. This is table-stakes, but our data confirms that stores with complete feed data see higher agent engagement in discovery flows.

8. Start thinking about identity and payment. You won't implement these today — almost nobody has. But understanding the spec's identity and payment capabilities now positions you — our April ecosystem report tracks adoption monthly to move fast when the ecosystem catches up. The jump from 0 to first-mover will be worth more than incremental improvements to discovery or checkout.

9. Monitor your platform's UCP updates. If you're on Shopify, WooCommerce, BigCommerce, or Magento, your platform is doing most of the UCP work. Stay current with their releases — set up domain alerts to get notified when your store's status changes. Platform-level updates drove 99.3% spec migration in four days — the single most effective "optimization" most merchants can do is simply keeping their platform current.

10. Get listed in the UCP directory. UCPRegistry is the open UCP index where agents discover merchants. Your listing is what agents see when deciding which merchants to route a customer to. Make sure you're listed, your data is accurate, and your capabilities are competitive with peers in your vertical.

The Bottom Line

Agentic commerce optimization isn't a theoretical exercise anymore. UCP ecommerce is live, it's measurable, and it's growing fast. Our UCP index tracks 4,024 verified merchants serving UCP endpoints today. AI agents are completing purchases 41% of the time. The gap between being UCP ready and being UCP optimized is measurable in variant data quality, checkout flow design, and capabilities adoption.

The merchants who treat ACO as a data problem — not just a markup problem — are the ones who'll convert when agents come shopping. And agents are already shopping. We've got 43 million tokens of proof.

Check if your store is UCP ready at UCPChecker.com. Browse the UCP directory at UCPRegistry. Test agent interactions in UCPPlayground. Platform-specific implementation guides: Shopify · WooCommerce · BigCommerce · Magento.

The State of Agentic Commerce — April 2026

Benji Fisher — Sat, 18 Apr 2026 09:48:53 +0000

In March, we crossed 3,000 verified stores and started seeing the first non-Shopify platforms in the directory. We said the next question was whether UCP would remain a Shopify story or become a real multi-platform standard.

April answered that. We crossed 4,000 verified stores, Shopify migrated its entire fleet to the new v2026-04-08 spec in a four-day window, BigCommerce entered the directory with its first three stores, and WooCommerce and Magento integrations started appearing from independent developers. The ecosystem grew 33% in one month while simultaneously upgrading the protocol underneath.

This is the third monthly state-of-the-ecosystem report from UCP Checker. Here's what the data says.

The numbers

As of April 17, 2026:

4,014 verified UCP stores (up from ~3,000 in March, +33%)
4,481 total domains tracked
47,154 total checks run
1,436 new merchants discovered this month
866 new merchants this week alone
3,988 stores on the latest v2026-04-08 spec (99.4%)

The growth curve is worth examining. February was discovery: we scanned our first thousand Shopify stores and found UCP everywhere on the platform. March was expansion: we broadened the crawler, crossed 3,000, and started seeing non-Shopify manifests for the first time. April is consolidation: the store count grew 33%, but the more significant movement was the spec migration and the first signs of platform diversification.

The weekly run rate matters here. At 866 new merchants discovered this week alone, the ecosystem is adding roughly 125 stores per day. But the growth isn't organic in the way a consumer product grows — it comes in waves, driven by platform-level deployments. When Shopify flips a switch, hundreds of stores appear overnight. When BigCommerce ships UCP, three appear. The question for May isn't "how many stores" but "which platforms ship next" — because each platform deployment is a step function, not a slope.

The Shopify spec migration

This is the story of the month. Between April 13 and April 17, Shopify migrated nearly its entire UCP fleet from v2026-01-23 to v2026-04-08.

On April 13, our crawler showed 2 stores on the new spec. By April 17: 3,988. That's 3,986 stores upgraded in roughly four days — a coordinated platform-level migration, not individual merchants updating their manifests.

The v2026-04-08 spec introduced three breaking changes:

signing_keys moved from nested to root level. Previously at ucp.signing_keys, now at the document root alongside ucp. This is the structural change that required a manifest rewrite, not just a version bump.
Business profile distinction. The spec now formally separates business profiles (individual store manifests at /.well-known/ucp) from platform profiles, with different requirements for spec and schema fields on services and capabilities. Business profiles are lighter — spec and schema are optional.
a2a transport formally added. Google's Agent2Agent Protocol is now a recognised transport alongside REST, MCP, and Embedded, though adoption is effectively zero in the wild.

The migration means 99.4% of the verified directory is now on the latest spec. Only 26 stores remain on older versions: 19 on v2026-01-11, 6 on v2026-01-23, and 1 on v2026-01-14. These are almost entirely non-Shopify stores that need to upgrade manually.

For the full spec breakdown, see our v2026-04-08 spec announcement and the spec versions page.

Beyond Shopify: platform diversification accelerates

Shopify still dominates at 3,982 of 4,014 verified stores (99.2%). But the other 32 verified stores tell a more interesting story — these are developers who chose to publish a UCP manifest without a platform-level integration doing it for them.

BigCommerce entered the directory with its first three verified stores: untilgone.com, touchupdirect.com, and midwoodflowershop.com. All three are on v2026-04-08 with checkout and cart capabilities declared. Notably, their average manifest latency (~890ms) is significantly higher than Shopify's (~130ms) — BigCommerce manifests are served from the storefront origin rather than a CDN-cached endpoint. Platform-level latency differences like this will matter as agent response budgets tighten.

WooCommerce now has 3 verified stores, up from zero in March. These are hand-built integrations — WooCommerce doesn't have native UCP support, so each merchant published their manifest manually. We fixed a validation bug this month that was incorrectly rejecting WooCommerce manifests with payment_handlers: [] (valid for stores using checkout-link redirect flows).

Magento has 1 verified store. Custom/headless stacks account for 25 verified stores — the most architecturally diverse group, including our own ucpchecker.com manifest.

Salesforce Commerce Cloud has zero verified stores in the directory today. But industry signals suggest SFCC is exploring UCP support at the platform level — not as a one-off client integration, but as a feature that would ship to all Commerce Cloud merchants. If it follows the Shopify pattern — a single platform-level deployment bringing thousands of enterprise storefronts (Puma, Ralph Lauren, Under Armour, Adidas) into the ecosystem in one wave — the directory composition would shift significantly. SFCC is natively REST-based, so a REST-first UCP transport would be the natural fit, compared to Shopify's MCP-first approach. We're watching this closely.

The full platform breakdown is live on our new /platforms page.

How agents actually perform

The numbers above tell you which stores have UCP. This section tells you which stores work when an AI agent actually tries to shop them — and which models do it best.

Store benchmarks

Playground benchmarks grade stores A through F on end-to-end agent shopping performance:

Grade	Count	What it means
A	9	Agent completes the full flow flawlessly
B+	422	Works with minor issues — the largest cohort
B	222	Cart succeeds, checkout has friction
C+ / C	225	Discovery and browse work, deeper flow breaks
D	16	Significant failures across the flow
F	289	Manifest validates but the agent can't complete any step

The B+ tier at 422 stores is the most important number here. These stores are close — an agent can reliably discover, search, and cart them, but checkout friction (slow responses, variant mismatches, payment handler quirks) stops the flow short. The path from B+ to A is usually a single fix. The 289 F-grade stores are the other end: technically verified but functionally broken when an agent actually tries to shop them.

Model leaderboard

UCP Playground now supports 15 frontier LLMs from 7 vendors, tested against 76 unique stores, generating over $114,000 in aggregate cart value. The model leaderboard scores every model on search, cart completion, and checkout conversion:

Model	Shopping Score	Checkout %	Search %	Vendor
DeepSeek V3.2	63	53.1%	85.7%	DeepSeek
Gemini 3 Flash	59	51.4%	90.3%	Google
Grok 4	59	42.0%	92.0%	xAI
Claude Opus 4.6	52	41.9%	80.0%	Anthropic
Claude Sonnet 4.5	50	54.6%	86.8%	Anthropic

And the speed rankings — because latency is the other dimension that matters:

Model	Avg Session	Vendor
Gemini 2.5 Flash	~12s	Google
GPT-4o	~14s	OpenAI
Gemini 3 Flash	~17s	Google
Claude Opus 4.6	~31s	Anthropic
Grok 4	~76s	xAI

Three takeaways

DeepSeek V3.2 leads the leaderboard. An open-weight model tops the composite shopping score at 63 — ahead of every Anthropic, Google, and OpenAI model. The agentic commerce stack is genuinely model-agnostic in practice, not just in spec language.

Search works everywhere. Checkout is the bottleneck. Every model scores above 70% on product search. But checkout conversion drops to 13–56% depending on the model. The gap between "can find products" and "can actually buy them" is the reliability frontier for the ecosystem. This is where the work is.

Reasoning models underperform. QwQ 32B (0% checkout), o4-mini (16.7%), Grok 3 Mini (13.3%), and DeepSeek R1 (21.4%) all score below 40. Models optimised for chain-of-thought reasoning burn tokens on deliberation and struggle to execute the simple, sequential tool-call patterns shopping requires. The best shopping agents are fast and decisive, not thoughtful.

Full model profiles are on the Playground models page.

The reliability gap: verified is not ready

This is the editorial point we want to make clearly, because the headline number (4,014 verified stores) obscures the more important one: 9 stores score A.

Four thousand stores have valid UCP manifests. Nine of them deliver a flawless end-to-end agent shopping experience. That's a 0.2% flawless rate. The gap between "technically verified" and "actually shoppable by an AI agent without friction" is the central infrastructure problem for agentic commerce in 2026.

The B+ tier — 422 stores — is where the leverage is. These stores work most of the time. An agent can discover them, search their catalog, build a cart, and usually reach a checkout URL. But "usually" isn't good enough when the agent is spending someone's money. The failures at B+ level are specific and fixable:

Cart variant mismatches — the agent selects a size/colour variant that doesn't match the store's internal variant ID scheme. The cart call succeeds but adds the wrong item.
Payment handler timeouts — the tokenization step takes longer than the agent's timeout window, and the session drops silently.
Stale product data — the catalog returns products that are out of stock by the time the agent tries to cart them. No error — just an empty cart.
Checkout redirect loops — the checkout URL the store returns sends the agent into an authentication loop that a human browser would handle with cookies but an MCP client can't.

Each of these is a single-fix problem for the store operator. But at scale, across 422 stores, the aggregate effect is that agents fail more often than they succeed at the final step. The ecosystem doesn't need more stores. It needs the stores it has to work more reliably. That's the infrastructure investment that will actually unlock agent commerce at scale — and it's where we're focusing our tooling work for May.

Capability coverage: the ceiling hasn't moved

Across 4,014 verified stores:

Capability	Coverage	Stores
Checkout	99.6%	3,996
Cart	99.3%	3,985
Identity linking	0.07%	3
Payment	0%	0

Same pattern as March. Checkout and cart are effectively universal because Shopify ships them by default. The advanced capabilities — identity, loyalty, payment — haven't moved. The gap between "technically verified" and "deeply agent-ready" is still the story. Until more stores declare capabilities beyond the Shopify defaults, the ecosystem depth chart stays flat.

The broader ecosystem

April was quieter on the announcements front than March — which saw Splitit, PayPal, and Google all making public UCP commitments in a single week. But the signals that matter in April are structural, not press-release-shaped.

Shopify's fleet-wide spec migration is itself an ecosystem signal. It demonstrates that a major platform can coordinate a breaking spec upgrade across thousands of stores in days, not months. Every other platform considering UCP adoption now has a reference point for what a managed migration looks like. The v2026-04-08 changes (signing_keys relocation, business profile distinction) were non-trivial — and Shopify shipped them to its entire fleet without a single store going offline. That's the kind of platform engineering confidence that accelerates the next platform's decision to build UCP support.

The endorsed partner roster continues to grow. Adyen, American Express, Mastercard, Stripe, Visa, Checkout.com, Affirm, Splitit, and PayPal are all publicly committed to the protocol's payment layer. For any platform evaluating UCP, the payment handler ecosystem is no longer a gap — it's arguably the most mature part of the stack.

The model ecosystem is widening faster than the store ecosystem. In February, we tested 3 models. In March, 8. In April, 16 — from 7 vendors across the US, China, and Europe. The number of AI models that can speak MCP and execute a UCP shopping flow is growing faster than the number of stores that can serve one. This suggests the bottleneck is shifting from "agents that can shop" to "stores that can be shopped reliably" — which circles back to the reliability gap above.

What we shipped

Heavy shipping month on the tooling side:

Side-by-side store comparison — compare any two stores head-to-head on metrics, capabilities, transports, and payment handlers. Embeddable via iframe for blog posts and docs.
Platform pages — live landing pages for Shopify, BigCommerce, WooCommerce, Magento, and Custom. Leaderboards, capability coverage, and transport adoption — auto-populates as stores verify.
/.well-known/ucp developer guide — field reference, minimal examples, publishing guides for Nginx/Cloudflare/Node, the six most common validation mistakes.
Product discovery guide — the MCP tool call sequence agents use to find and buy products. Live demo, discovery-ready stores, three-way CTA to Playground + Registry + Rails.
Build an Agent quickstart — from zero to a working agent in 30 minutes. Copy-paste code in Python and TypeScript.
Spec validation fixes — accepted the payment.handlers nested format (WooCommerce), downgraded empty payment_handlers: [] from hard fail to warning, upgraded our own manifest to v2026-04-08.

What to watch in May

Salesforce Commerce Cloud. First platform-level deployment from the enterprise tier would be the most significant ecosystem event since Shopify's initial rollout. We'll catch any SFCC store that publishes on the next crawl.

The B+ → A path. 422 stores are one fix away from flawless agent shopping. We're building tooling to surface the specific issue per store so operators can action it.

Non-Shopify growth rate. 32 non-Shopify stores this month vs ~15 last month. If this doubles again in May, UCP stops being a "Shopify project" and becomes a genuine multi-platform standard.

AP2 / A2A adoption. Zero stores declare either protocol. The v2026-04-08 spec formally added a2a as a transport. First adopter will be notable.

Sources

All data comes from the UCP Checker crawler, which re-checks every tracked domain at least every 24 hours. The raw verified-merchant dataset is published monthly on Hugging Face under CC-BY 4.0.

Browse the directory: ucpchecker.com/directory
Track adoption live: ucpchecker.com/stats
Compare two stores: ucpchecker.com/compare
Platform breakdown: ucpchecker.com/platforms
Build your own agent: ucpchecker.com/agents