Postman didn't become essential by testing APIs. It became essential by showing developers what was actually happening between their code and the world.
We've been building something similar for agentic commerce.
UCP Playground lets you point an AI agent at any store that supports the Universal Commerce Protocol, and watch it shop — search products, build a cart, reach checkout — across MCP, REST, and Embedded transports. Every tool call is logged. Every JSON-RPC message is traceable. Every session is replayable.
We've now recorded 180 sessions across 11 LLMs and 20 live stores. Not synthetic benchmarks — real agent-to-store conversations over real MCP connections, hitting real catalogs, with real checkout URLs coming back.
The data tells a story about where agentic commerce actually is, and what developers building on UCP need to know right now.
Why "Postman for UCP" isn't just an analogy
Before Postman, you'd write a curl command, squint at the response, and hope for the best. Postman gave you the full request/response lifecycle in one view — headers, body, status code, timing — with the ability to save, share, and replay.
UCP Playground does the same thing for the agent-to-store interface. Point it at a domain. It reads the store's /.well-known/ucp manifest, connects to the MCP server, and opens a chat interface where an AI agent shops for real.
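The discovery step can be sketched in a few lines of Python. The /.well-known/ucp path is from the article; the manifest field names here (transports, type, endpoint) are illustrative assumptions, not taken from the UCP spec.

```python
def manifest_url(domain: str) -> str:
    """Build the UCP discovery URL for a store domain."""
    return f"https://{domain}/.well-known/ucp"

def pick_transport(manifest: dict, preferred: str = "mcp"):
    """Pick a transport entry from a discovered manifest.

    The 'transports' / 'type' / 'endpoint' field names are assumptions
    for this sketch, not the spec's actual shape.
    """
    for transport in manifest.get("transports", []):
        if transport.get("type") == preferred:
            return transport
    return None
```

In practice you'd fetch manifest_url(domain), parse the JSON, and hand the chosen endpoint to an MCP client.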
Type "find me running shoes in black, size 10" and watch the agent call search_shop_catalog, parse structured product data, render product cards, and ask which one you want. Say "add the first one" and it resolves the variant ID, calls update_cart, and hands back a checkout URL.
The sidebar gives you the Postman-style observability layer: MCP endpoint status, schema quality grades (A through F per tool), funnel progress, token usage, and full JSON-RPC message traces.
And just like Postman lets you test the same endpoint with different parameters, UCP Playground lets you run the same prompt against up to five models simultaneously — Claude, Gemini, GPT-4o, Llama, side by side, same store, same query — and compare the results.
That comparison capability is where things get interesting.
Three transports, one protocol
UCP is transport-agnostic by design. The spec defines how stores advertise capabilities through service discovery at /.well-known/ucp, and a single store can declare multiple transports — each with its own endpoint and schema.
Across the 20 stores we tested, we found three distinct stacks in the wild:
MCP (JSON-RPC) — What Shopify ships. The agent connects to a JSON-RPC server, discovers available tools via schema introspection, and makes real-time tool calls. Five tools: search_shop_catalog, get_product_details, update_cart, get_cart, search_shop_policies_and_faqs. Product IDs are Shopify GIDs like gid://shopify/Product/6881317257296.
REST — The familiar HTTP API pattern. Some merchants expose a REST API alongside or instead of MCP. Simpler ID schemes (integers like "54068"), but often a richer tool surface. The WooCommerce stores we tested via UCPReady exposed 9 tools — full checkout lifecycle management (ucp_create_checkout, ucp_update_checkout, ucp_complete_checkout, ucp_cancel_checkout), plus order management and webhook registration.
Embedded — The newest transport and the one that solves the payment wall. When an agent reaches checkout but can't produce a payment credential (because it's an LLM, not a browser), the merchant declares an embedded transport. The Playground opens the merchant's checkout UI in a secure iframe, handles the ECP handshake over postMessage, and lets the human complete payment while the agent orchestrates the cart.
The schema fragmentation across these stacks is real. Same operation, three different tool signatures:
// Searching products
Shopify MCP: search_shop_catalog({query: "shoes", context: "..."})
UCPReady: ucp_list_products({search: "shoes", in_stock: true})
Custom: list_products({query: "shoes"})
// Adding to cart
Shopify MCP: update_cart({add_items: [{product_variant_id: "gid://shopify/ProductVariant/123", quantity: 1}]})
UCPReady: ucp_create_checkout({line_items: [{item: {id: "54068"}, quantity: 1}]})
This is the interoperability testing problem that UCP was designed to solve. UCP Playground makes it visible — run the same agent flow against all three stacks and see where it breaks.
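One way to see the fragmentation concretely: a thin adapter that dispatches a single logical "search" to whichever signature the stack exposes. The three tool signatures are taken from the comparison above; the store dict shape (a stack name plus a call function) is an assumption for this sketch.

```python
def search_products(store: dict, query: str):
    """Dispatch one logical product search to the stack's tool signature.

    'store' is assumed to hold a stack label and a call(tool, args)
    function that performs the underlying tool call.
    """
    stack = store["stack"]
    if stack == "shopify_mcp":
        return store["call"]("search_shop_catalog", {"query": query, "context": ""})
    if stack == "ucpready":
        return store["call"]("ucp_list_products", {"search": query, "in_stock": True})
    # Fallback: the custom REST shape from the comparison above.
    return store["call"]("list_products", {"query": query})
```

An agent framework either carries an adapter like this or relies on the model to read each store's schema at runtime — and the session data below shows how unevenly models handle the latter.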
Almin Zolotic, creator of UCPReady — the first WooCommerce UCP plugin — put it well after shipping his integration:
"Building for the agentic web without a tool like UCP Playground is similar to building for the visual web without a browser. It provided the high-fidelity feedback loop needed to move UCPReady from a spec-compliant implementation to a production-ready, agent-shoppable WooCommerce store. Seeing the first autonomous purchase appear in WooCommerce was a defining milestone."
What 180 sessions reveal about model performance
We tested 11 models. Five had enough volume (20+ sessions) to draw real conclusions.
The checkout leaderboard
Llama 3.3 70B reached checkout at 3x the rate of GPT-4o — with the lowest failure rate of any high-volume model.
But speed tells a different story. Gemini 2.5 Flash completes a turn in 1.3 seconds. Llama takes 5.8 seconds. The fastest model converts at half the rate of the most accurate one.
Three models with smaller sample sizes — DeepSeek V3-2 (3 sessions), Gemini 3 Flash (4), and Grok 4 (3) — all hit 100% checkout. Small samples, but they're next on the testing roadmap.
Why Llama wins: the details step
The funnel data explains the gap:
Llama calls get_product_details to resolve variant IDs 2x more often than the other models before attempting to add to cart. It doesn't guess at variant IDs — it looks them up. And its cart-to-checkout conversion is essentially 100%.
The models that skip the details step are guessing at variant structures, hitting type errors, and falling off the funnel. This is something you'd never catch from a single test run — you only see it when you compare flows across models against the same store. The Postman Collections equivalent: run the same sequence, vary the environment, diff the results.
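The winning sequence can be sketched directly: resolve the variant with a details call, then cart the resolved ID. The tool names and the add_items payload shape are from the Shopify MCP examples earlier in the post; the details-response field names (variants, id) are assumptions for this sketch.

```python
def add_to_cart(call_tool, product_id: str, quantity: int = 1):
    """Resolve a concrete variant ID before carting instead of guessing it.

    call_tool(name, args) stands in for an MCP tool call. The
    'variants'/'id' response fields are assumed, not from the spec.
    """
    details = call_tool("get_product_details", {"product_id": product_id})
    variant_id = details["variants"][0]["id"]  # look it up, don't guess
    return call_tool("update_cart", {
        "add_items": [{"product_variant_id": variant_id, "quantity": quantity}],
    })
```

The models that fall off the funnel are effectively skipping the first call and inventing the variant_id argument.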
The cost question
When a session does reach checkout, how many tokens does it take?
Llama converts more often, but at the highest token cost per success. Claude burns the most tokens overall. DeepSeek reaches checkout at less than half the token cost of any other model — but with only 3 sessions, that needs validation at scale.
For developers choosing a model for production agentic commerce: it's not just whether a model reaches checkout — it's how much each checkout costs.
5 seconds vs 11 minutes: same store, same protocol
The most telling comparison in the dataset came from a single store running all three UCP transports.
The fast session: Gemini 2.5 Flash, a simple prompt — "Buy me a vichy cream." The agent searched, found 5 results, the user picked one, the agent created a checkout. 4 turns. 5 seconds. 12,877 tokens. Clean, fast, done.
The slow session: Claude Sonnet 4.5, same store, a more exploratory conversation. Over the course of the session, the agent hit four distinct issues that a developer would want to catch:
A timeout on first request — cURL error 28, 15 seconds with 0 bytes received. The agent recovered by retrying with a smaller query, but the first impression was a dead endpoint.
A pagination blind spot — the agent searched for the most expensive product but never paginated past the first 100 results, missing higher-priced items entirely. The limit parameter in the tool schema was capped at 100 — and the agent treated the first page as the full catalog.
A missing route — when the agent tried to update an existing checkout session, the MCP server returned a 404. The ucp_update_checkout endpoint hadn't been registered. The agent worked around it by creating a fresh checkout.
An out-of-stock business logic gap — the server accepted an out-of-stock item into a checkout, then returned a warning after the fact rather than rejecting it upfront.
That session ran for 41 turns, 681 seconds, and 1.6 million tokens — a 130x token difference from the fast session on the same endpoint.
None of these issues would show up in a manifest scan. They only surface when an agent actually tries to shop. And with session replay, a developer can pinpoint each one in the trace instead of reading server logs — the same way you'd debug a failing Postman request by inspecting the response body and timing.
Store instructions: the hidden prompt engineering layer
One of the most interesting patterns we found isn't in the agent — it's in the store's MCP responses.
When Shopify's MCP server returns tool results, it injects instructions fields — stage-specific prompts embedded in the response payload. There are three, one per shopping stage:
Search instructions tell the agent how to present results — render markdown links, mention available filters, paginate.
Details instructions are minimal — render the title as a link, pay attention to the selected variant.
Cart instructions are where it gets fascinating:
Ask if the customer has found everything they need... help them complete their cart with any additional items they might need... check if they have any discount codes or gift cards...
This is the store coaching the agent through a structured checkout funnel. In one test session, Claude asked "would you like to add anything else, like running socks for example?" — a cross-sell that wasn't in any system prompt. It came from the store's update_cart response instructions.
We tested compliance on Claude Sonnet 4.5 against Allbirds: 8 out of 9 instructions followed. It rendered markdown links, resolved variants correctly, suggested additional items, and provided the checkout URL as a clickable link. The only misses: it didn't mention available filters during search and skipped the "special instructions" prompt at checkout.
The takeaway for anyone building an MCP server for commerce: ship behavioral instructions with your tool responses. The store isn't just serving data — it's doing per-tool prompt engineering at the response level. And it works.
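On the server side, the pattern is simple: return your data plus a per-stage instructions field in every tool result. This sketch mirrors the pattern described above; the exact wording and the field name are illustrative, not Shopify's actual payload.

```python
def update_cart_response(cart: dict) -> dict:
    """Return cart data plus behavioral instructions for the calling agent.

    The 'instructions' wording here is illustrative, modeled on the
    cart-stage coaching described in the post.
    """
    return {
        "cart": cart,
        "instructions": (
            "Ask if the customer has found everything they need. "
            "Suggest complementary items they might want to add. "
            "Check if they have any discount codes or gift cards."
        ),
    }
```

The agent never sees this as a system prompt — it arrives inside the tool result, stage by stage, which is exactly why it steers behavior so reliably.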
The error taxonomy
37 tool call errors across 168 calls. Here's the pattern:
Timeout — cURL 28 — shopify merchant policy search, 15s with 0 bytes received. Slow or unresponsive endpoint.
Internal Error — MCP -32603 — shopify merchant. Server-side exception, no error body returned.
Auth Failed — MCP -32000 — 2x merchant sites. Endpoint requires auth the agent doesn't have.
Method Not Found — MCP -32601 — dev site tools/call. Deployment or routing issue.
Invalid Type — MCP -32602 — line_items: "22" instead of array. LLM passed wrong type to tool.
Route Not Found — REST 404 — merchant site ucp_update_checkout. Endpoint not registered.
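The taxonomy above maps cleanly to a small classifier. The JSON-RPC error codes are standard; the shape of the raw error dict (timeout, status, code keys) is an assumption for this sketch.

```python
# Standard JSON-RPC error codes observed in the sessions above.
JSONRPC_CATEGORIES = {
    -32603: "Internal Error",
    -32000: "Auth Failed",
    -32601: "Method Not Found",
    -32602: "Invalid Type",
}

def classify_error(err: dict) -> str:
    """Map a raw tool-call error to the taxonomy above.

    The err dict keys (timeout/status/code) are assumed for the sketch.
    """
    if err.get("timeout"):
        return "Timeout"
    if err.get("status") == 404:
        return "Route Not Found"
    return JSONRPC_CATEGORIES.get(err.get("code"), "Unknown")
```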
The most common failure point: search_shop_catalog fails 23% of the time. Policy search fails 40%. Cart operations are the most reliable, failing only 7% of the time.
The type validation error (-32602) is the only one that's the model's fault rather than the store's. The LLM passed a string "22" where the schema expected an array of line item objects. Better schema descriptions and proper required field annotations in the tool definition would prevent it.
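A schema that would have caught that error looks something like this. The line_items shape is based on the UCPReady example earlier in the post; the exact property names are assumptions, and the runtime check is a minimal sketch, not a full JSON Schema validator.

```python
# Illustrative JSON Schema fragment for a line_items parameter: an array
# of objects, each with an item reference and a quantity.
LINE_ITEMS_SCHEMA = {
    "type": "array",
    "description": "Items to include in the checkout; each entry is an "
                   "object, not a bare ID or count.",
    "items": {
        "type": "object",
        "required": ["item", "quantity"],
        "properties": {
            "item": {
                "type": "object",
                "required": ["id"],
                "properties": {"id": {"type": "string"}},
            },
            "quantity": {"type": "integer", "minimum": 1},
        },
    },
}

def looks_like_line_items(value) -> bool:
    """Minimal top-level shape check: rejects the bare string '22'
    that triggered the -32602 error described above."""
    return isinstance(value, list) and all(isinstance(v, dict) for v in value)
```

With items and required annotations like these in the tool definition, the model has enough structure to construct the payload instead of guessing.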
Every one of these errors shows up in the Playground's session trace — the MCP error code, the request payload, the response (or lack thereof), and the timing. The same debug workflow you'd use in Postman when an API returns something unexpected.
One store, five models, 100% checkout
Everlane is the gold standard in this dataset. Seven sessions across five models. Every one reached checkout. Zero errors.
Same prompt — "what mens backpacks do you have?" — across all five:
Same store. Same prompt. Same outcome. Same turn count. 5x latency difference between fastest and slowest. The MCP implementation is clean, fast (452ms average endpoint response), and returns consistent schemas. Every model just works.
This is the controlled test that matters — one variable at a time. It's the Postman equivalent of saving a collection, switching environments, and running it against each one. If your store doesn't perform like this across models, the Playground will show you why.
What to build on
Based on 180 sessions, here's where the leverage is for different audiences:
If you're building agents: The get_product_details → update_cart sequence is the critical path. Models that resolve variant IDs before carting convert at 2-3x the rate of those that don't. If your agent skips the details step, that's your optimization target.
If you're building an MCP server: Schema quality is the biggest predictor of agent success. Tools with clear descriptions, proper required fields, and well-defined items schemas succeed. Tools with properties: [] or missing descriptions fail. We built a schema quality scorer (A through F) because this pattern was so consistent.
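A grader along these lines can be sketched from the signals just listed. This is an illustrative approximation, not the Playground's actual scorer: it awards one point per signal and maps the total to a letter.

```python
def grade_tool_schema(schema: dict) -> str:
    """Rough A-F grade from the signals called out above.

    Illustrative only: one point each for a tool description, non-empty
    properties, required fields, per-property descriptions, and items
    schemas on array properties.
    """
    props = schema.get("properties") or {}
    score = 0
    if schema.get("description"):
        score += 1
    if props:
        score += 1
    if schema.get("required"):
        score += 1
    if props and all(p.get("description") for p in props.values()):
        score += 1
    if all("items" in p for p in props.values() if p.get("type") == "array"):
        score += 1
    return {5: "A", 4: "B", 3: "C", 2: "D"}.get(score, "F")
```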
If you're a store owner: Store instructions in tool responses are an underrated superpower. Stores that inject behavioral hints — how to present products, when to use which tool, how to handle variants — get dramatically better agent behavior. The best-performing stores in our dataset all do this.
If you care about checkout: The payment wall is real. Google's UCP-powered checkout is live but limited to select US stores. Outside of that, independent developers like Zolotic are delivering the most complete end-to-end flows — implementing embedded checkout where the merchant's UI handles payment in an iframe while the agent orchestrates the cart.
Try it
UCP Playground — Point it at any UCP-ready domain. Watch an agent shop. Replay any session.
UCP Checker — Check any store's UCP manifest and agent-readiness.
All session data in this post is from real agent interactions recorded between February 14–23, 2026. No sessions were staged or simulated.
I built UCP Checker because I wanted to understand how ready the open web actually was for agentic commerce. Scanning manifests answered part of that question. But the real answer only comes when you watch an agent try to shop — and see where it succeeds, where it breaks, and why.
That's what UCP Playground does. It's the observability layer between AI agents and store APIs — the same way Postman became the observability layer between developers and REST endpoints. Except instead of testing GET /users, you're testing whether Claude can buy someone a pair of shoes.
If you're building on UCP — whether that's an MCP server, a Shopify app, a WooCommerce plugin, or an agent framework — I'd genuinely love to hear what you're seeing. What's working? What's broken? What should we test next?
Drop a question in the comments or reach out at ucpchecker.com. I read everything.




