Discussion on: We Ran 180 AI Agent Shopping Sessions Across 11 Models and 20 Stores. Here's What We Found

Replies for: Really like the “Postman for UCP” framing - debugging agentic workflows without a high-fidelity trace is basically guesswork. One thing I’ve seen ...

Thanks Mahima, schema drift is exactly what we're seeing. The same "add to cart" intent looks completely different across stacks: product_variant_id (Shopify), item.id (UCPReady), productId (custom). Type enforcement varies too: one endpoint expects a string ID, another expects an integer, another wants an array of line item objects. We had a -32602 error (JSON-RPC "Invalid params") in the dataset where the model passed line_items: "22" instead of an array. That's a schema description problem, not a model problem.
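To make the line_items case concrete, here is a rough sketch of the kind of check that produces a -32602. The schema, its description text, and the check_params helper are made up for illustration, not taken from any real store's endpoint. The point is that the description on the field is the only hint the model gets about the expected shape.

```python
from typing import Any, Optional

INVALID_PARAMS = -32602  # JSON-RPC 2.0 "Invalid params"

# Hypothetical input schema for an add-to-cart tool (illustration only).
ADD_TO_CART_SCHEMA = {
    "type": "object",
    "required": ["line_items"],
    "properties": {
        "line_items": {
            "type": "array",
            # An explicit description like this is what keeps a model
            # from sending a bare ID such as "22".
            "description": "Array of line item objects, not a bare ID.",
        }
    },
}

# Maps the JSON Schema type names used above to Python types.
_JSON_TYPES = {"string": str, "integer": int, "array": list, "object": dict}


def check_params(schema: dict, params: dict) -> Optional[dict]:
    """Return a JSON-RPC style error dict, or None if params look valid.

    Only checks required fields and top-level types; a real validator
    would also recurse into nested items.
    """
    for name in schema.get("required", []):
        if name not in params:
            return {"code": INVALID_PARAMS,
                    "message": f"missing required field '{name}'"}
    for name, spec in schema.get("properties", {}).items():
        if name in params and not isinstance(params[name], _JSON_TYPES[spec["type"]]):
            got = type(params[name]).__name__
            return {"code": INVALID_PARAMS,
                    "message": f"'{name}' must be {spec['type']}, got {got}"}
    return None


bad = check_params(ADD_TO_CART_SCHEMA, {"line_items": "22"})
good = check_params(ADD_TO_CART_SCHEMA, {"line_items": [{"id": "22", "quantity": 1}]})
```

Here bad carries the -32602 code and good is None. The model's intent was the same in both calls; only the shape differed.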

This is actually something we're tracking on the UCP Checker side too. We monitor domains continuously and get alerts when things drift — both at the static manifest level (the /.well-known/ucp declaration) and at runtime (tool schemas, response shapes, error handling). It happens more often than you'd expect, and it's rarely intentional.
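For a feel of what runtime drift detection can look like, here is a minimal sketch that diffs two snapshots of a store's tool schemas. The snapshot shape (tool name to parameter name to type) and the diff_tool_schemas function are assumptions for the example, not how UCP Checker actually stores or compares things.

```python
def diff_tool_schemas(old: dict, new: dict) -> list:
    """Compare two snapshots shaped like {tool: {param: type_name}}.

    Returns human-readable change descriptions, sorted by tool and param.
    """
    changes = []
    for tool in sorted(old.keys() | new.keys()):
        if tool not in new:
            changes.append(f"tool removed: {tool}")
            continue
        if tool not in old:
            changes.append(f"tool added: {tool}")
            continue
        old_params, new_params = old[tool], new[tool]
        for param in sorted(old_params.keys() | new_params.keys()):
            before, after = old_params.get(param), new_params.get(param)
            if before == after:
                continue
            if before is None:
                changes.append(f"{tool}.{param}: added as {after}")
            elif after is None:
                changes.append(f"{tool}.{param}: removed")
            else:
                changes.append(f"{tool}.{param}: {before} -> {after}")
    return changes


# Hypothetical snapshots taken on two different days.
yesterday = {"add_to_cart": {"line_items": "array"}}
today = {"add_to_cart": {"line_items": "string", "currency": "string"}}
drift = diff_tool_schemas(yesterday, today)
```

With those two snapshots, drift reports the new currency parameter and the line_items type change from array to string, which is exactly the kind of silent change that breaks agents.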

"Strict schema adherence vs goal completion" is a great framing — we're already tracking tool call error rates per store in the Playground, so normalizing that into a conformance score per MCP endpoint is a natural next step. The session replay data is all there to build it from.
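One possible shape for that score, as a sketch only: treat each tool call as an (endpoint, error_code) pair and score an endpoint by the share of calls that were not rejected with -32602. The record format, the endpoint URLs, and the choice to count only -32602 as a schema failure are all assumptions here, not what the Playground computes today.

```python
from collections import defaultdict

INVALID_PARAMS = -32602  # JSON-RPC 2.0 "Invalid params"


def conformance_scores(calls):
    """Score each endpoint as 1 - (schema rejections / total tool calls).

    `calls` is an iterable of (endpoint, error_code) pairs, where
    error_code is None for a successful call.
    """
    totals = defaultdict(int)
    rejections = defaultdict(int)
    for endpoint, error_code in calls:
        totals[endpoint] += 1
        if error_code == INVALID_PARAMS:
            rejections[endpoint] += 1
    return {ep: 1 - rejections[ep] / totals[ep] for ep in totals}


# Hypothetical tool calls pulled from session replays.
calls = [
    ("store-a.example/mcp", None),
    ("store-a.example/mcp", INVALID_PARAMS),
    ("store-b.example/mcp", None),
]
scores = conformance_scores(calls)
```

In this toy data store-a scores 0.5 and store-b scores 1.0. Keeping this separate from a goal completion rate is what lets you tell "the agent got there anyway" apart from "the schema was easy to follow".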

Appreciate you reading it properly — the replayable trace is the core of the whole thing.