MCP Server Scoring Methodology: How Rhumb Evaluates API Agent-Nativeness
You're building a multi-provider MCP server. You've found 5 APIs that do what you need. How do you know which one is actually safe to call unsupervised?
Most developers guess. They pick the one with the best documentation or the most GitHub stars. They wire it in. Then at 3am, their agent hits a 500 with no error context, retries blindly, and creates duplicate transactions.
This is why we built the Agent-Native (AN) Score.
What Makes an MCP Server Reliable?
When we talk about "API compatibility for MCP," we're really asking two questions:
- Can the agent reliably get work done? (Execution, 70% weight)
- Can the agent even get started? (Access, 30% weight)
The first question is almost entirely ignored by existing tools. Most "API readiness" checks scan a website's metadata or robots.txt rules. Useful for AI crawlers. Useless for MCP servers.
An MCP server doesn't read your landing page. It calls your endpoints. It needs error codes it can parse. Rate limit headers it can understand. Idempotency it can rely on. A sandbox it can test in.
The 20 Dimensions
Rhumb scores APIs across 20 specific dimensions grouped into these two buckets:
Execution (70% of what matters)
Error Handling (signal quality)
- Does the API return structured JSON errors, or HTML 500s?
- Does the error include a machine-readable code (e.g., `error_code: "rate_limited"`)?
- Can the agent infer what went wrong without a human reading the response?
Schema Stability (reliability)
- Do response objects stay backward-compatible between versions?
- Do new fields get added without breaking existing parsers?
- Are deprecations communicated in advance?
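On the consumer side, schema stability is something you can also defend against. A sketch of a tolerant parser that survives additive changes (the `parse_customer` helper and field names are hypothetical):

```python
def parse_customer(payload: dict) -> dict:
    """Read only the fields the agent depends on and ignore unknown ones,
    so a new API version that adds fields doesn't break the parser."""
    return {
        "id": payload["id"],            # required: fail loudly if missing
        "email": payload.get("email"),  # optional: tolerate absence
        # Any extra fields a newer API version adds are simply ignored.
    }

# A response from a newer API version with an extra field still parses.
print(parse_customer({"id": "cus_1", "email": "a@example.com", "new_field": 1}))
```

This only protects against additive changes; removed or renamed fields are exactly the breaking changes the dimension above penalizes.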
Idempotency (safety)
- Can you safely retry a failed request without creating duplicates?
- Does the API support idempotency keys? Nonce tracking?
- Or will a retry on a payment API create two charges?
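The key property is that the retry loop generates the idempotency key once, before the first attempt, and reuses it. A minimal sketch (the `charge_with_retry` helper and its `api_call` parameter are illustrative, not a real payment SDK):

```python
import uuid

def charge_with_retry(api_call, amount_cents: int, max_attempts: int = 3):
    """Retry a payment call safely: the same idempotency key is sent on
    every attempt, so the server can deduplicate repeated requests."""
    idempotency_key = str(uuid.uuid4())  # generated once, NOT per attempt
    last_error = None
    for _ in range(max_attempts):
        try:
            return api_call(amount_cents, idempotency_key=idempotency_key)
        except ConnectionError as err:
            # Transient failure: safe to retry because the key is unchanged.
            last_error = err
    raise last_error
```

If the first request actually reached the server before the connection dropped, the retry with the same key returns the existing charge instead of creating a second one. Without server-side idempotency support, this exact loop is what double-charges customers at 3am.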
Latency Consistency (predictability)
- Are response times stable enough for agents to set reasonable timeouts?
- Or do you see 500ms responses followed by 5s timeouts on the same endpoint?
- Can you build a circuit breaker that doesn't give up too early?
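A sketch of the circuit-breaker question, assuming a simple consecutive-failure policy (real implementations add time windows and half-open probes; the threshold is exactly the number you can't tune when latency is erratic):

```python
class CircuitBreaker:
    """Trip open after `threshold` consecutive failures.

    If an API's p99 latency swings between 500ms and 5s, timeouts fire
    on healthy requests, the failure count climbs, and the breaker
    opens too early -- or you widen the threshold and it never protects you.
    """

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        # Any success resets the streak; a failure extends it.
        self.failures = 0 if success else self.failures + 1
```

Stable latency lets you set a tight timeout and a small threshold; erratic latency forces a choice between false trips and no protection.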
Rate Limit Transparency (observability)
- Does the API tell the agent how many requests it has left?
- Does it send `X-RateLimit-Remaining` and `Retry-After` headers?
- Or does it just return 429 with no guidance on when to retry?
Access Readiness (30% of what matters)
Signup Friction (onboarding cost)
- Email verification? ✓
- Phone number? ⚠️ (agents can't receive SMS)
- CAPTCHA? ❌ (automation dies)
Authentication Complexity (setup time)
- API key in a header? ✓ (minutes to set up)
- OAuth? ⚠️ (requires a browser redirect; agents need a special handler)
- Multi-factor authentication? ❌ (agents can't complete a phone-based challenge)
How We Score
Each dimension gets 0–10 points based on how well the API supports unsupervised agent use.
- 9–10: Native to agents. No friction, no surprises.
- 7–8: Agents can use it reliably with minor configuration.
- 5–6: Usable but expect workarounds.
- 3–4: Significant barriers. Not recommended without human fallback.
- 0–2: Fundamentally incompatible.
Tier System
- L4 Native (8.0–10.0): Built for agents. Minimal friction, reliable execution.
- L3 Fluent (6.0–7.9): Agents can use it reliably with minor configuration.
- L2 Developing (4.0–5.9): Usable with workarounds. Expect friction points.
- L1 Emerging (0.0–3.9): Significant barriers. Not recommended for unsupervised use.
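Putting the two buckets and the tiers together, the score rolls up as follows. A simplified sketch assuming a straight 70/30 weighted average; the published scores (e.g., Stripe's 8.1 from execution 9.0 and access 6.6) suggest the real rubric applies further adjustments, so treat these numbers as illustrative:

```python
def an_score(execution: float, access: float) -> float:
    """Combine the two bucket scores with the 70/30 weighting."""
    return round(0.7 * execution + 0.3 * access, 2)

def tier(score: float) -> str:
    """Map a 0-10 score onto the four tiers."""
    if score >= 8.0:
        return "L4 Native"
    if score >= 6.0:
        return "L3 Fluent"
    if score >= 4.0:
        return "L2 Developing"
    return "L1 Emerging"

print(an_score(9.0, 6.6), tier(an_score(9.0, 6.6)))  # 8.28 L4 Native
```

Note how the 70% execution weight dominates: an API can have painful onboarding and still land in L4 if it executes flawlessly, but weak execution caps the score no matter how easy signup is.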
Real Examples
Stripe: 8.1 (L4 Native)
- Execution: 9.0
- Access: 6.6
Stripe has idempotency keys, structured errors with machine-readable codes, versioned webhooks, documented rate limit headers, and an official Agent Toolkit.
PayPal: 4.9 (L2 Developing)
- Execution: 5.9
- Access: 3.7
PayPal's OAuth2-only authentication is a blocker for unsupervised setup. Sandbox signup requires CAPTCHA. It's a 2024 payment API with 2004-era onboarding friction.
Five Questions Before Your Agent Calls Any API
What happens when the request fails? Do you get structured JSON with an error code, or a 500 with an HTML error page?
Can the agent create credentials without a human? If signup requires CAPTCHA, your agent can't self-provision.
Are rate limits explicit and machine-readable? Good APIs return `X-RateLimit-Remaining` and `Retry-After` headers.
Does the API version its responses? Breaking changes in response schemas are the #1 cause of silent agent failures.
Is there a sandbox that doesn't require production credentials?
Why This Matters for MCP Servers
An MCP server is only as reliable as the APIs it calls.
If you're building with:
- Resend (L3 Fluent, 7.2)
- Stripe (L4 Native, 8.1)
- Supabase (L3 Fluent, 7.8)
You get a system that works reliably at 3am. All three handle errors gracefully, provide rate limit headers, and support idempotency.
But swap Stripe for PayPal, and the whole system degrades. Your payment handler can't set up credentials without a human. Your retry logic might create duplicates.
One weak link breaks the whole chain. The AN Score helps you find weak links before integration.
Originally published at rhumb.dev/blog/mcp-server-scoring-methodology
Learn more at https://rhumb.dev