Benji Fisher

Originally published at ucpchecker.com

Introducing the UCP Score: A 0–100 Agent-Readiness Grade for Every UCP Store

After every status check on UCPChecker, the same follow-up question lands in our inbox: "OK, my manifest is verified. But is it actually any good?"

That question comes from everywhere. Engineering leads who shipped a manifest last quarter and want to know if it would actually carry an agent through checkout. Platform teams pitching agent-readiness to merchants who need a number, not a status pill. Analysts trying to chart "how Shopify compares to WooCommerce" and finding that "verified" tells them next to nothing. Developers picking which UCP store to integrate with first. AI agent builders deciding whose endpoints to feature in demo flows. Store owners benchmarking against direct competitors before a quarterly review.

None of these audiences really care that a manifest exists. They care about how good it is. Whether it has the surface signals that keep it discoverable to AI shopping agents. Whether the declared transports actually respond when you call them. Whether the spec and schema URLs in the manifest resolve, or quietly 404 the moment a strict agent tries to validate the response shape. The interesting answer is always graded.

Until today, the only way to answer that question on UCPChecker was to read every line of the validator output and squint. So we built the thing people were already trying to do manually.

Get a UCP Score for any domain at ucpchecker.com/score →

What the UCP Score is

A 0–100 composite grade that measures how agent-ready any UCP store actually is. Not "does the manifest exist" — that's the status page. How well does it work for agents?

The score maps to a single letter grade you can share, embed, or watch over time. Bands are deliberately calibrated to match Lighthouse and SSL Labs — A is meant to be hard to earn:

  • A (85–100) — Agent-ready. Valid manifest, strong discovery, broad capability coverage.
  • B (70–84) — Solid. Minor gaps or one weak category, agents can still transact.
  • C (50–69) — Partial. Manifest works but missing capabilities or surface signals.
  • D (30–49) — Weak. Manifest reachable but invalid or near-empty.
  • F (0–29) — Failing. Blocked, unreachable, or no manifest detected.

Every score breaks down into three weighted categories so you can see exactly where the points come from:

  • Agent Discovery (30%) — Can agents find and reach you? HTTPS, reachability, agent-friendly robots.txt, plus the surface signals that keep you in the conversation: /llms.txt, sitemap.xml, Open Graph tags, Organization JSON-LD, mobile viewport meta.
  • UCP Conformance (40%) — Does the manifest validate against the spec? Validity is 3× weighted in this category — an invalid manifest cannot score above ~50 here, regardless of how good the surface polish is.
  • Capability Coverage (30%) — What can an agent actually do at your store? Declared transports (REST/MCP/A2A), checkout, payment handlers, and breadth of capabilities. When functional probes run, declared transport endpoints that don't actually respond drag this score down.

The composite is a straight weighted average: Discovery × 0.30 + Conformance × 0.40 + Capabilities × 0.30. No tricks, no hidden weights. The full ruleset is documented in our methodology.
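
In code, the arithmetic is roughly this. An illustrative sketch only, not the production scorer; the rounding step is an assumption:

```typescript
// Three 0-100 category sub-scores, the published weights, and the grade bands above.
interface CategoryScores {
  discovery: number;    // Agent Discovery
  conformance: number;  // UCP Conformance
  capabilities: number; // Capability Coverage
}

function compositeScore(s: CategoryScores): number {
  const raw = s.discovery * 0.30 + s.conformance * 0.40 + s.capabilities * 0.30;
  return Math.round(raw); // rounding behaviour is assumed, not documented
}

function letterGrade(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 85) return "A"; // Agent-ready
  if (score >= 70) return "B"; // Solid
  if (score >= 50) return "C"; // Partial
  if (score >= 30) return "D"; // Weak
  return "F";                  // Failing
}

// Example: strong conformance but thin capability coverage still lands a B.
console.log(letterGrade(compositeScore({ discovery: 80, conformance: 95, capabilities: 55 }))); // 79 -> "B"
```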

What you actually get

Every score URL is a live page at /score/{your-domain}, indexed and shareable. Open one and you don't just see a number:

  • Top priorities — The three highest-impact issues we found, ranked by impact × effort. Start here.
  • Impact vs Effort matrix — Quick Wins / Strategic / Incremental / Consider Later quadrants so you can plan a sprint instead of staring at a wall of warnings.
  • Recommendations with copy-paste fixes — Every flagged issue surfaces a snippet you can drop straight into your manifest, robots.txt, sitemap, or HTML <head>. Hit "Show fix", copy, paste, redeploy, re-check.
  • Platform-aware percentile — "You're at p72 latency vs the median Shopify store." Because comparing your latency against the whole directory is meaningless when half of it runs on a fundamentally different infrastructure profile (the cohort-percentile idea is sketched after this list).
  • Full check breakdown — Every signal we evaluate, grouped by category, with a "why it matters" paragraph alongside each check. No black boxes.
  • Save this report — We re-run the full check weekly and email you only when something material changes. Score drops, capability regresses, status flips. Free, no marketing, unsubscribe anytime.
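
One plausible reading of the platform-aware percentile, as a sketch. The function name and the convention (the share of same-platform stores you beat) are assumptions, not the production code:

```typescript
// Percentile of a store's latency within its platform cohort rather than the
// whole directory. Lower latency is better, so the percentile counts how many
// same-platform stores this store is faster than.
function latencyPercentile(mineMs: number, cohortMs: number[]): number {
  if (cohortMs.length === 0) return 0;
  const faster = cohortMs.filter((peerMs) => mineMs < peerMs).length;
  return Math.round((faster / cohortMs.length) * 100);
}

// Being faster than 72% of the Shopify cohort would surface as p72 on the report.
```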

The page is ungated. No signup, no paywall, no "create an account to see the breakdown." We're indexing every score — just like SSL Labs grades and PageSpeed scores. Public scores create a baseline and pressure for the ecosystem to improve, in the same way SSL grades did for HTTPS adoption.

Why we built it

The honest answer: verified-or-not is the wrong question now.

When the UCP spec first landed in January (v2026-01-11), finding a verified store at all was novel. The bar was "did anyone publish a manifest." The status page was the right product for that moment, and it still is for the discovery layer.

The directory has 4,500+ verified domains today. Verified isn't novel. The interesting question shifted to "how well does this thing actually work for agents," and nobody had a good answer to that — including us.

When we ran a deeper analysis for our April State of Agentic Commerce report, the gap was stark: out of 4,014 verified UCP stores, only 9 delivered a flawless end-to-end agent experience. A 0.2% flawless rate. The other 99.8% had a manifest published — they just didn't actually work as well as that manifest suggested. That gap between "verified" and "actually works" is the central infrastructure problem in agentic commerce today. The UCP Score makes that gap visible, measurable, and addressable.

There's a clear analogue: PageSpeed before Lighthouse. Pre-Lighthouse, web performance optimisation was vibes. People knew slow sites were bad and fast sites were good but couldn't quantify "how slow" or "compared to what." Lighthouse gave them three things — a graded score, a category breakdown, and copy-paste optimisations — and the field changed overnight. Nobody ships a serious site today without checking their Lighthouse score first.

The agentic commerce ecosystem is at exactly that pre-Lighthouse moment. There's no shared yardstick for agent-readiness. Stores have no way to tell whether the integration they shipped last month is competitive. Platform teams have no way to back up "our merchants are more agent-ready" with a number. AI agent builders have no way to filter "show me the stores most likely to actually complete a transaction."

The UCP Score is meant to be that yardstick. Lighthouse for agentic commerce.

How we built it (the short version)

Three signal sources, one composite:

  1. Static analysis — The same manifest validator that powers /check and /ucp-validator. Validity, version format, signing keys, payment handlers — every spec rule turned into a check row.
  2. Surface signals — Five public files and meta tags fetched in parallel: /llms.txt, /sitemap.xml, Open Graph, Organization JSON-LD, viewport. Presence + content captured (with a content hash for change detection on llms.txt so we can spot when a brand updates their LLM brief).
  3. Functional probes (opt-in) — Two probe families. Transport probes hit each declared transport endpoint with a benign request (MCP gets a tools/list, REST/A2A get a GET). URL resolution probes fetch every spec and schema URL declared in the manifest. Probes only run on user-triggered checks — not on the 24h cron sweep, because hammering 4,500 merchants daily with a dozen extra HTTP requests each isn't neighbourly. A sketch of what a transport probe looks like follows this list.
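
Here is roughly what a transport probe looks like. An illustrative sketch, not the production prober; the timeout, the headers, and treating any 2xx as a pass are all assumptions:

```typescript
// A declared transport and a benign probe against it.
type Transport = { type: "mcp" | "rest" | "a2a"; endpoint: string };

async function probeTransport(t: Transport, timeoutMs = 5000): Promise<boolean> {
  const signal = AbortSignal.timeout(timeoutMs);
  try {
    if (t.type === "mcp") {
      // Benign MCP request: a JSON-RPC tools/list with no arguments.
      const res = await fetch(t.endpoint, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({ jsonrpc: "2.0", id: 1, method: "tools/list" }),
        signal,
      });
      return res.ok;
    }
    // REST and A2A transports just need to answer a plain GET.
    const res = await fetch(t.endpoint, { method: "GET", signal });
    return res.ok;
  } catch {
    return false; // timeouts, DNS failures, and TLS errors all count as "did not respond"
  }
}
```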

Each signal feeds one category sub-score (0–100), and the composite is the weighted average. Recommendations join error codes against a fix library so every flagged issue surfaces a copy-paste snippet — the same pattern Lighthouse uses for its audit list. The whole pipeline runs on the same 24h cycle as the rest of the directory; checks you trigger manually run the full probe stack.
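
The join itself is conceptually a lookup. A sketch with invented error codes, field names, and snippet text:

```typescript
// Validator error codes map onto a fix library, so every flagged issue ships
// with a copy-paste snippet. The entry below is illustrative, not a real code.
interface Fix {
  title: string;
  why: string;      // the "why it matters" paragraph shown on the report
  snippet: string;  // the copy-paste fix behind "Show fix"
}

const FIX_LIBRARY: Record<string, Fix> = {
  MISSING_VIEWPORT_META: {
    title: "Add a mobile viewport meta tag",
    why: "Agents that render pages fall back to mobile layouts, and the viewport tag is one of the surface signals we score.",
    snippet: '<meta name="viewport" content="width=device-width, initial-scale=1">',
  },
  // A new error code plugs in here as a single entry; the engine doesn't change.
};

function recommend(errorCodes: string[]): Fix[] {
  return errorCodes
    .map((code) => FIX_LIBRARY[code])
    .filter((fix): fix is Fix => fix !== undefined);
}
```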

If you want the deep version, the methodology page walks through every category, every check, every grade band, and the "what we don't score" list.

What you can do with it

A few workflows the score unlocks immediately:

  • Pre-merge gate — Add a check in your CI that fails the build if your /score/{domain} drops below B. Same pattern as Lighthouse CI. The score URL is stable and the JSON breakdown lands in the API soon. A sketch of what that gate could look like follows this list.
  • Platform comparison — The /platforms page now shows average UCP Score by platform — Shopify vs WooCommerce vs BigCommerce vs Magento at a glance. Useful both for picking a stack and for benchmarking the one you're on.
  • Leaderboard — The leaderboard is now ranked by UCP Score with sortable columns for each sub-score. Filter by platform to see the top stores on your stack.
  • Monitoring — Save any report against your email. We re-run it weekly and alert you on regressions. Score drops, capability disappears, status flips — one email, free, no marketing.
  • Competitive benchmarking — Run Allbirds vs Casper and see grades side by side. The compare page picks up score data automatically.
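
For the pre-merge gate, something like this works once the score endpoint ships. A sketch that assumes the upcoming GET /api/v1/score/{domain} returns JSON with a numeric score and a letter grade; the response shape isn't final yet:

```typescript
// ci-ucp-gate.ts: fail the build if the UCP Score drops below a threshold.
const DOMAIN = process.env.UCP_DOMAIN ?? "example-store.com";
const MIN_SCORE = 70; // B and above passes

async function main() {
  const res = await fetch(`https://ucpchecker.com/api/v1/score/${DOMAIN}`);
  if (!res.ok) throw new Error(`Score fetch failed: ${res.status}`);
  // Assumed response shape; check the API docs once the endpoint is live.
  const { score, grade } = (await res.json()) as { score: number; grade: string };
  console.log(`UCP Score for ${DOMAIN}: ${score} (${grade})`);
  if (score < MIN_SCORE) {
    console.error(`UCP Score ${score} is below the gate of ${MIN_SCORE}`);
    process.exit(1);
  }
}

main().catch((err) => { console.error(err); process.exit(1); });
```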

What's next

This is v1. A few things already on the roadmap:

  • Score history & sparkline — Save a report and you'll see your score trend over time. We're tracking every check in our history table from day one, so the data exists; the visual lands shortly.
  • Score API — GET /api/v1/score/{domain} returning the full breakdown as JSON. The data feed is already public; the score endpoint is the same data behind a stable contract.
  • Spec-version-aware scoring weights — As new UCP spec versions land with different emphases, the scoring rules for each version live in config, so a new version can be absorbed cleanly. Validation is already version-aware; scoring weights are next.

We've also taken pains to make the system absorb future spec releases without a rewrite. Static check copy lives in config, not hardcoded; new error codes plug into the recommendations engine via a single config entry. The next spec drop should land as a configuration change, not a refactor.
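
A sketch of what that config-first shape could look like. The version key, field names, and check copy are all illustrative:

```typescript
// Scoring rules keyed by spec version: a new UCP release becomes a new entry
// (plus new fix-library entries), not a refactor.
interface ScoringConfig {
  weights: { discovery: number; conformance: number; capabilities: number };
  checks: Record<string, { label: string; why: string }>; // static check copy
}

const SCORING_BY_SPEC_VERSION: Record<string, ScoringConfig> = {
  "2026-01-11": {
    weights: { discovery: 0.3, conformance: 0.4, capabilities: 0.3 },
    checks: {
      SIGNING_KEY_PRESENT: {
        label: "Signing key declared",
        why: "Illustrative copy; the real check text lives in the methodology.",
      },
    },
  },
  // A future spec version that shifts emphasis gets its own entry here.
};
```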

About UCP Checker

UCP Checker is the independent validation and monitoring layer for the Universal Commerce Protocol. We crawl, validate, and grade every public UCP manifest in the open web, run the public merchant directory, publish the leaderboard and adoption stats, and ship developer tools — the validator, the bulk checker, the browser extension, and now the UCP Score. Everything is free, indexed, and ungated; the dataset is published openly under CC-BY 4.0. Think of us as the SSL Labs of agentic commerce — the third-party scoreboard the ecosystem can build trust on top of.

Try it

Pick any domain. Type it into ucpchecker.com/score and you'll have a graded report in under a second. If you find a score that surprised you — yours or a competitor's — let us know. The interesting score gaps are the ones nobody's looked at yet.

Top comments (2)

toshihiro shishido

Agent-readiness as a score is interesting framing. The challenge is mapping it back to a revenue metric — most stores I've seen care about RPS more than abstract readiness. Curious how UCP correlates with RPS in your dataset, or whether it leads/lags traffic instead.

Benji Fisher

You're calling this exactly right and the RevenueScope thesis is the same skepticism I'd apply to any readiness metric — does it predict outcome or just track activity. The honest answer is the score is currently activity-tracking. Outcome data doesn't exist at meaningful scale yet because agentic transaction volume is small. Tech Council expansion this month, Microsoft mandating UCP in Copilot Checkout, Stripe rolling out Agentic Commerce Suite — the demand side is just turning on, and revenue isn't flowing at the volume needed to correlate score with RPS.

What we can show today is upstream: agent task completion rate across 1000+ UCP Playground sessions. High-scoring stores carry agents through search → cart → checkout reliably. Low-scoring stores break — wrong schemas, broken transports, missing capabilities. That's the leading indicator of conversion when volume arrives. The bet is the same shape Lighthouse made for web performance, or SOC 2 made for enterprise sales — measure activity that maps to a credible theory of outcome, before the outcome data exists at scale, so the metric is in place when volume catches up.

Where I think your work and mine intersect eventually: when agentic traffic is meaningful, merchants will want "which UCP implementations drive the best agent-attributed RPS?" That question needs both upstream data (which stores convert reliably) and your shape of attribution methodology.

Genuinely curious how RevenueScope handles the cold start when a new ad or channel doesn't have enough revenue signal to attribute reliably yet.