Postman didn't become essential by testing APIs. It became essential by showing developers what was actually happening between their code and the world.
We've been building something similar for agentic commerce.
UCP Playground lets you point an AI agent at any store that supports the Universal Commerce Protocol, and watch it shop — search products, build a cart, reach checkout — across MCP, REST, and Embedded transports. Every tool call is logged. Every JSON-RPC message is traceable. Every session is replayable.
We've now recorded 180 sessions across 11 LLMs and 20 live stores. Not synthetic benchmarks — real agent-to-store conversations over real MCP connections, hitting real catalogs, with real checkout URLs coming back.
The data tells a story about where agentic commerce actually is, and what developers building on UCP need to know right now.
Why "Postman for UCP" isn't just an analogy
Before Postman, you'd write a curl command, squint at the response, and hope for the best. Postman gave you the full request/response lifecycle in one view — headers, body, status code, timing — with the ability to save, share, and replay.
UCP Playground does the same thing for the agent-to-store interface. Point it at a domain. It reads the store's /.well-known/ucp manifest, connects to the MCP server, and opens a chat interface where an AI agent shops for real.
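The discovery step can be sketched in a few lines of Python. The /.well-known/ucp path is from the article; the manifest field names here (transports, type, endpoint) are illustrative assumptions, not taken from the UCP spec.

```python
def manifest_url(domain: str) -> str:
    """Build the UCP discovery URL for a store domain."""
    return f"https://{domain}/.well-known/ucp"

def pick_transport(manifest: dict, preferred: str = "mcp"):
    """Pick a transport entry from a discovered manifest.

    The 'transports' / 'type' / 'endpoint' field names are assumptions
    for this sketch, not the spec's actual shape.
    """
    for transport in manifest.get("transports", []):
        if transport.get("type") == preferred:
            return transport
    return None
```

In practice you'd fetch manifest_url(domain), parse the JSON, and hand the chosen endpoint to an MCP client.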
Type "find me running shoes in black, size 10" and watch the agent call search_shop_catalog, parse structured product data, render product cards, and ask which one you want. Say "add the first one" and it resolves the variant ID, calls update_cart, and hands back a checkout URL.
The sidebar gives you the Postman-style observability layer: MCP endpoint status, schema quality grades (A through F per tool), funnel progress, token usage, and full JSON-RPC message traces.
And just like Postman lets you test the same endpoint with different parameters, UCP Playground lets you run the same prompt against up to five models simultaneously — Claude, Gemini, GPT-4o, Llama, side by side, same store, same query — and compare the results.
That comparison capability is where things get interesting.
Three transports, one protocol
UCP is transport-agnostic by design. The spec defines how stores advertise capabilities through service discovery at /.well-known/ucp, and a single store can declare multiple transports — each with its own endpoint and schema.
Across the 20 stores we tested, we found three distinct stacks in the wild:
MCP (JSON-RPC) — What Shopify ships. The agent connects to a JSON-RPC server, discovers available tools via schema introspection, and makes real-time tool calls. Five tools: search_shop_catalog, get_product_details, update_cart, get_cart, search_shop_policies_and_faqs. Product IDs are Shopify GIDs like gid://shopify/Product/6881317257296.
REST — The familiar HTTP API pattern. Some merchants expose a REST API alongside or instead of MCP. Simpler ID schemes (integers like "54068"), but often a richer tool surface. The WooCommerce stores we tested via UCPReady exposed 9 tools — full checkout lifecycle management (ucp_create_checkout, ucp_update_checkout, ucp_complete_checkout, ucp_cancel_checkout), plus order management and webhook registration.
Embedded — The newest transport and the one that solves the payment wall. When an agent reaches checkout but can't produce a payment credential (because it's an LLM, not a browser), the merchant declares an embedded transport. The Playground opens the merchant's checkout UI in a secure iframe, handles the ECP handshake over postMessage, and lets the human complete payment while the agent orchestrates the cart.
The schema fragmentation across these stacks is real. Same operation, three different tool signatures:
// Searching products
Shopify MCP: search_shop_catalog({query: "shoes", context: "..."})
UCPReady: ucp_list_products({search: "shoes", in_stock: true})
Custom: list_products({query: "shoes"})
// Adding to cart
Shopify MCP: update_cart({add_items: [{product_variant_id: "gid://shopify/ProductVariant/123", quantity: 1}]})
UCPReady: ucp_create_checkout({line_items: [{item: {id: "54068"}, quantity: 1}]})
This is the interoperability testing problem that UCP was designed to solve. UCP Playground makes it visible — run the same agent flow against all three stacks and see where it breaks.
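One way to see the fragmentation concretely: a thin adapter that dispatches a single logical "search" to whichever signature the stack exposes. The three tool signatures are taken from the comparison above; the store dict shape (a stack name plus a call function) is an assumption for this sketch.

```python
def search_products(store: dict, query: str):
    """Dispatch one logical product search to the stack's tool signature.

    'store' is assumed to hold a stack label and a call(tool, args)
    function that performs the underlying tool call.
    """
    stack = store["stack"]
    if stack == "shopify_mcp":
        return store["call"]("search_shop_catalog", {"query": query, "context": ""})
    if stack == "ucpready":
        return store["call"]("ucp_list_products", {"search": query, "in_stock": True})
    # Fallback: the custom REST shape from the comparison above.
    return store["call"]("list_products", {"query": query})
```

An agent framework either carries an adapter like this or relies on the model to read each store's schema at runtime — and the session data below shows how unevenly models handle the latter.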
Almin Zolotic, creator of UCPReady — the first WooCommerce UCP plugin — put it well after shipping his integration:
"Building for the agentic web without a tool like UCP Playground is similar to building for the visual web without a browser. It provided the high-fidelity feedback loop needed to move UCPReady from a spec-compliant implementation to a production-ready, agent-shoppable WooCommerce store. Seeing the first autonomous purchase appear in WooCommerce was a defining milestone."
What 180 sessions reveal about model performance
We tested 11 models. Five had enough volume (20+ sessions) to draw real conclusions.
The checkout leaderboard
Llama 3.3 70B reached checkout at 3x the rate of GPT-4o — with the lowest failure rate of any high-volume model.
But speed tells a different story. Gemini 2.5 Flash completes a turn in 1.3 seconds. Llama takes 5.8 seconds. The fastest model converts at half the rate of the most accurate one.
Three models with smaller sample sizes — DeepSeek V3-2 (3 sessions), Gemini 3 Flash (4), and Grok 4 (3) — all hit 100% checkout. Small samples, but they're next on the testing roadmap.
Why Llama wins: the details step
The funnel data explains the gap:
Llama calls get_product_details to resolve variant IDs 2x more often than the other models before attempting to add to cart. It doesn't guess at variant IDs — it looks them up. And its cart-to-checkout conversion is essentially 100%.
The models that skip the details step are guessing at variant structures, hitting type errors, and falling off the funnel. This is something you'd never catch from a single test run — you only see it when you compare flows across models against the same store. The Postman Collections equivalent: run the same sequence, vary the environment, diff the results.
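The winning sequence can be sketched directly: resolve the variant with a details call, then cart the resolved ID. The tool names and the add_items payload shape are from the Shopify MCP examples earlier in the post; the details-response field names (variants, id) are assumptions for this sketch.

```python
def add_to_cart(call_tool, product_id: str, quantity: int = 1):
    """Resolve a concrete variant ID before carting instead of guessing it.

    call_tool(name, args) stands in for an MCP tool call. The
    'variants'/'id' response fields are assumed, not from the spec.
    """
    details = call_tool("get_product_details", {"product_id": product_id})
    variant_id = details["variants"][0]["id"]  # look it up, don't guess
    return call_tool("update_cart", {
        "add_items": [{"product_variant_id": variant_id, "quantity": quantity}],
    })
```

The models that fall off the funnel are effectively skipping the first call and inventing the variant_id argument.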
The cost question
When a session does reach checkout, how many tokens does it take?
Llama converts more often, but at the highest token cost per success. Claude burns the most tokens overall. DeepSeek reaches checkout at less than half the token cost of any other model — but with only 3 sessions, that needs validation at scale.
For developers choosing a model for production agentic commerce: it's not just whether a model reaches checkout — it's how much each checkout costs.
5 seconds vs 11 minutes: same store, same protocol
The most telling comparison in the dataset came from a single store running all three UCP transports.
The fast session: Gemini 2.5 Flash, a simple prompt — "Buy me a vichy cream." The agent searched, found 5 results, the user picked one, the agent created a checkout. 4 turns. 5 seconds. 12,877 tokens. Clean, fast, done.
The slow session: Claude Sonnet 4.5, same store, a more exploratory conversation. Over the course of the session, the agent hit four distinct issues that a developer would want to catch:
A timeout on first request — cURL error 28, 15 seconds with 0 bytes received. The agent recovered by retrying with a smaller query, but the first impression was a dead endpoint.
A pagination blind spot — the agent searched for the most expensive product but never paginated past the first 100 results, missing higher-priced items entirely. The limit parameter in the tool schema was capped at 100 — and the agent treated the first page as the full catalog.
A missing route — when the agent tried to update an existing checkout session, the MCP server returned a 404. The ucp_update_checkout endpoint hadn't been registered. The agent worked around it by creating a fresh checkout.
An out-of-stock business logic gap — the server accepted an out-of-stock item into a checkout, then returned a warning after the fact rather than rejecting it upfront.
That session ran for 41 turns, 681 seconds, and 1.6 million tokens — a 130x token difference from the fast session on the same endpoint.
None of these issues would show up in a manifest scan. They only surface when an agent actually tries to shop. And with session replay, a developer can pinpoint each one in the trace instead of reading server logs — the same way you'd debug a failing Postman request by inspecting the response body and timing.
Store instructions: the hidden prompt engineering layer
One of the most interesting patterns we found isn't in the agent — it's in the store's MCP responses.
When Shopify's MCP server returns tool results, it injects instructions fields — stage-specific prompts embedded in the response payload. There are three, one per shopping stage:
Search instructions tell the agent how to present results — render markdown links, mention available filters, paginate.
Details instructions are minimal — render the title as a link, pay attention to the selected variant.
Cart instructions are where it gets fascinating:
Ask if the customer has found everything they need... help them complete their cart with any additional items they might need... check if they have any discount codes or gift cards...
This is the store coaching the agent through a structured checkout funnel. In one test session, Claude asked "would you like to add anything else, like running socks for example?" — a cross-sell that wasn't in any system prompt. It came from the store's update_cart response instructions.
We tested compliance on Claude Sonnet 4.5 against Allbirds: 8 out of 9 instructions followed. It rendered markdown links, resolved variants correctly, suggested additional items, and provided the checkout URL as a clickable link. The only misses: it didn't mention available filters during search and skipped the "special instructions" prompt at checkout.
The takeaway for anyone building an MCP server for commerce: ship behavioral instructions with your tool responses. The store isn't just serving data — it's doing per-tool prompt engineering at the response level. And it works.
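On the server side, the pattern is simple: return your data plus a per-stage instructions field in every tool result. This sketch mirrors the pattern described above; the exact wording and the field name are illustrative, not Shopify's actual payload.

```python
def update_cart_response(cart: dict) -> dict:
    """Return cart data plus behavioral instructions for the calling agent.

    The 'instructions' wording here is illustrative, modeled on the
    cart-stage coaching described in the post.
    """
    return {
        "cart": cart,
        "instructions": (
            "Ask if the customer has found everything they need. "
            "Suggest complementary items they might want to add. "
            "Check if they have any discount codes or gift cards."
        ),
    }
```

The agent never sees this as a system prompt — it arrives inside the tool result, stage by stage, which is exactly why it steers behavior so reliably.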
The error taxonomy
37 tool call errors across 168 calls. Here's the pattern:
Timeout — cURL 28 — shopify merchant policy search, 15s with 0 bytes received. Slow or unresponsive endpoint.
Internal Error — MCP -32603 — shopify merchant. Server-side exception, no error body returned.
Auth Failed — MCP -32000 — 2x merchant sites. Endpoint requires auth the agent doesn't have.
Method Not Found — MCP -32601 — dev site tools/call. Deployment or routing issue.
Invalid Type — MCP -32602 — line_items: "22" instead of array. LLM passed wrong type to tool.
Route Not Found — REST 404 — merchant site ucp_update_checkout. Endpoint not registered.
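The taxonomy above maps cleanly to a small classifier. The JSON-RPC error codes are standard; the shape of the raw error dict (timeout, status, code keys) is an assumption for this sketch.

```python
# Standard JSON-RPC error codes observed in the sessions above.
JSONRPC_CATEGORIES = {
    -32603: "Internal Error",
    -32000: "Auth Failed",
    -32601: "Method Not Found",
    -32602: "Invalid Type",
}

def classify_error(err: dict) -> str:
    """Map a raw tool-call error to the taxonomy above.

    The err dict keys (timeout/status/code) are assumed for the sketch.
    """
    if err.get("timeout"):
        return "Timeout"
    if err.get("status") == 404:
        return "Route Not Found"
    return JSONRPC_CATEGORIES.get(err.get("code"), "Unknown")
```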
The most common failure point: search_shop_catalog fails 23% of the time. Policy search fails 40%. Cart operations are the most reliable, failing only 7% of the time.
The type validation error (-32602) is the only one that's the model's fault rather than the store's. The LLM passed a string "22" where the schema expected an array of line item objects. Better schema descriptions and proper required field annotations in the tool definition would prevent it.
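A schema that would have caught that error looks something like this. The line_items shape is based on the UCPReady example earlier in the post; the exact property names are assumptions, and the runtime check is a minimal sketch, not a full JSON Schema validator.

```python
# Illustrative JSON Schema fragment for a line_items parameter: an array
# of objects, each with an item reference and a quantity.
LINE_ITEMS_SCHEMA = {
    "type": "array",
    "description": "Items to include in the checkout; each entry is an "
                   "object, not a bare ID or count.",
    "items": {
        "type": "object",
        "required": ["item", "quantity"],
        "properties": {
            "item": {
                "type": "object",
                "required": ["id"],
                "properties": {"id": {"type": "string"}},
            },
            "quantity": {"type": "integer", "minimum": 1},
        },
    },
}

def looks_like_line_items(value) -> bool:
    """Minimal top-level shape check: rejects the bare string '22'
    that triggered the -32602 error described above."""
    return isinstance(value, list) and all(isinstance(v, dict) for v in value)
```

With items and required annotations like these in the tool definition, the model has enough structure to construct the payload instead of guessing.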
Every one of these errors shows up in the Playground's session trace — the MCP error code, the request payload, the response (or lack thereof), and the timing. The same debug workflow you'd use in Postman when an API returns something unexpected.
One store, five models, 100% checkout
Everlane is the gold standard in this dataset. Seven sessions across five models. Every one reached checkout. Zero errors.
Same prompt — "what mens backpacks do you have?" — across all five:
Same store. Same prompt. Same outcome. Same turn count. 5x latency difference between fastest and slowest. The MCP implementation is clean, fast (452ms average endpoint response), and returns consistent schemas. Every model just works.
This is the controlled test that matters — one variable at a time. It's the Postman equivalent of saving a collection, switching environments, and running it against each one. If your store doesn't perform like this across models, the Playground will show you why.
What to build on
Based on 180 sessions, here's where the leverage is for different audiences:
If you're building agents: The get_product_details → update_cart sequence is the critical path. Models that resolve variant IDs before carting convert at 2-3x the rate of those that don't. If your agent skips the details step, that's your optimization target.
If you're building an MCP server: Schema quality is the biggest predictor of agent success. Tools with clear descriptions, proper required fields, and well-defined items schemas succeed. Tools with properties: [] or missing descriptions fail. We built a schema quality scorer (A through F) because this pattern was so consistent.
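A grader along these lines can be sketched from the signals just listed. This is an illustrative approximation, not the Playground's actual scorer: it awards one point per signal and maps the total to a letter.

```python
def grade_tool_schema(schema: dict) -> str:
    """Rough A-F grade from the signals called out above.

    Illustrative only: one point each for a tool description, non-empty
    properties, required fields, per-property descriptions, and items
    schemas on array properties.
    """
    props = schema.get("properties") or {}
    score = 0
    if schema.get("description"):
        score += 1
    if props:
        score += 1
    if schema.get("required"):
        score += 1
    if props and all(p.get("description") for p in props.values()):
        score += 1
    if all("items" in p for p in props.values() if p.get("type") == "array"):
        score += 1
    return {5: "A", 4: "B", 3: "C", 2: "D"}.get(score, "F")
```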
If you're a store owner: Store instructions in tool responses are an underrated superpower. Stores that inject behavioral hints — how to present products, when to use which tool, how to handle variants — get dramatically better agent behavior. The best-performing stores in our dataset all do this.
If you care about checkout: The payment wall is real. Google's UCP-powered checkout is live but limited to select US stores. Outside of that, independent developers like Zolotic are delivering the most complete end-to-end flows — implementing embedded checkout where the merchant's UI handles payment in an iframe while the agent orchestrates the cart.
Try it
UCP Playground — Point it at any UCP-ready domain. Watch an agent shop. Replay any session.
UCP Checker — Check any store's UCP manifest and agent-readiness.
All session data in this post is from real agent interactions recorded between February 14–23, 2026. No sessions were staged or simulated.
I built UCP Checker because I wanted to understand how ready the open web actually was for agentic commerce. Scanning manifests answered part of that question. But the real answer only comes when you watch an agent try to shop — and see where it succeeds, where it breaks, and why.
That's what UCP Playground does. It's the observability layer between AI agents and store APIs — the same way Postman became the observability layer between developers and REST endpoints. Except instead of testing GET /users, you're testing whether Claude can buy someone a pair of shoes.
If you're building on UCP — whether that's an MCP server, a Shopify app, a WooCommerce plugin, or an agent framework — I'd genuinely love to hear what you're seeing. What's working? What's broken? What should we test next?
Drop a question in the comments or reach out at ucpchecker.com. I read everything.




