MCP Server Scoring Methodology: How Rhumb Evaluates API Agent-Nativeness
You're building a multi-provider MCP server. You've found 5 APIs that do what you need. How do you know which one is actually safe to call unsupervised?
Most developers guess. They pick the one with the best documentation or the most GitHub stars. They wire it in. Then at 3am, their agent hits a 500 with no error context, retries blindly, and creates duplicate transactions.
This is why we built the Agent-Native (AN) Score.
What Makes an MCP Server Reliable?
When we talk about "API compatibility for MCP," we're really asking two questions:
- Can the agent reliably get work done? (Execution, 70% weight)
- Can the agent even get started? (Access, 30% weight)
The first question is almost entirely ignored by existing tools. Most "API readiness" checks scan a website's metadata or robots.txt rules. Useful for AI crawlers. Useless for MCP servers.
An MCP server doesn't read your landing page. It calls your endpoints. It needs error codes it can parse. Rate limit headers it can understand. Idempotency it can rely on. A sandbox it can test in.
The 20 Dimensions
Rhumb scores APIs across 20 specific dimensions grouped into these two buckets:
Execution (70% of what matters)
Error Handling (signal quality)
- Does the API return structured JSON errors, or HTML 500s?
- Does the error include a machine-readable code (e.g., `error_code: "rate_limited"`)?
- Can the agent infer what went wrong without a human reading the response?
Schema Stability (reliability)
- Do response objects stay backward-compatible between versions?
- Do new fields get added without breaking existing parsers?
- Are deprecations communicated in advance?
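On the consumer side, schema stability is something you can also defend against. A sketch of a tolerant parser that survives additive changes (the `parse_customer` helper and field names are hypothetical):

```python
def parse_customer(payload: dict) -> dict:
    """Read only the fields the agent depends on and ignore unknown ones,
    so a new API version that adds fields doesn't break the parser."""
    return {
        "id": payload["id"],            # required: fail loudly if missing
        "email": payload.get("email"),  # optional: tolerate absence
        # Any extra fields a newer API version adds are simply ignored.
    }

# A response from a newer API version with an extra field still parses.
print(parse_customer({"id": "cus_1", "email": "a@example.com", "new_field": 1}))
```

This only protects against additive changes; removed or renamed fields are exactly the breaking changes the dimension above penalizes.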
Idempotency (safety)
- Can you safely retry a failed request without creating duplicates?
- Does the API support idempotency keys? Nonce tracking?
- Or will a retry on a payment API create two charges?
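The key property is that the retry loop generates the idempotency key once, before the first attempt, and reuses it. A minimal sketch (the `charge_with_retry` helper and its `api_call` parameter are illustrative, not a real payment SDK):

```python
import uuid

def charge_with_retry(api_call, amount_cents: int, max_attempts: int = 3):
    """Retry a payment call safely: the same idempotency key is sent on
    every attempt, so the server can deduplicate repeated requests."""
    idempotency_key = str(uuid.uuid4())  # generated once, NOT per attempt
    last_error = None
    for _ in range(max_attempts):
        try:
            return api_call(amount_cents, idempotency_key=idempotency_key)
        except ConnectionError as err:
            # Transient failure: safe to retry because the key is unchanged.
            last_error = err
    raise last_error
```

If the first request actually reached the server before the connection dropped, the retry with the same key returns the existing charge instead of creating a second one. Without server-side idempotency support, this exact loop is what double-charges customers at 3am.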
Latency Consistency (predictability)
- Are response times stable enough for agents to set reasonable timeouts?
- Or do you see 500ms responses followed by 5s timeouts on the same endpoint?
- Can you build a circuit breaker that doesn't give up too early?
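A sketch of the circuit-breaker question, assuming a simple consecutive-failure policy (real implementations add time windows and half-open probes; the threshold is exactly the number you can't tune when latency is erratic):

```python
class CircuitBreaker:
    """Trip open after `threshold` consecutive failures.

    If an API's p99 latency swings between 500ms and 5s, timeouts fire
    on healthy requests, the failure count climbs, and the breaker
    opens too early -- or you widen the threshold and it never protects you.
    """

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        # Any success resets the streak; a failure extends it.
        self.failures = 0 if success else self.failures + 1
```

Stable latency lets you set a tight timeout and a small threshold; erratic latency forces a choice between false trips and no protection.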
Rate Limit Transparency (observability)
- Does the API tell the agent how many requests it has left?
- Does it send `X-RateLimit-Remaining` and `Retry-After` headers?
- Or does it just return 429 with no guidance on when to retry?
Access Readiness (30% of what matters)
Signup Friction (onboarding cost)
- Email verification? ✓
- Phone number? ⚠️ (agents can't receive SMS)
- CAPTCHA? ❌ (automation dies)
Authentication Complexity (setup time)
- API key in a header? ✓ (minutes to set up)
- OAuth? ⚠️ (requires a browser redirect; agents need a special handler)
- Multi-factor authentication? ❌ (agents can't complete a phone-based challenge)
How We Score
Each dimension gets 0–10 points based on how well the API supports unsupervised agent use.
- 9–10: Native to agents. No friction, no surprises.
- 7–8: Agents can use it reliably with minor configuration.
- 5–6: Usable but expect workarounds.
- 3–4: Significant barriers. Not recommended without human fallback.
- 0–2: Fundamentally incompatible.
Tier System
- L4 Native (8.0–10.0): Built for agents. Minimal friction, reliable execution.
- L3 Fluent (6.0–7.9): Agents can use it reliably with minor configuration.
- L2 Developing (4.0–5.9): Usable with workarounds. Expect friction points.
- L1 Emerging (0.0–3.9): Significant barriers. Not recommended for unsupervised use.
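Putting the two buckets and the tiers together, the score rolls up as follows. A simplified sketch assuming a straight 70/30 weighted average; the published scores (e.g., Stripe's 8.1 from execution 9.0 and access 6.6) suggest the real rubric applies further adjustments, so treat these numbers as illustrative:

```python
def an_score(execution: float, access: float) -> float:
    """Combine the two bucket scores with the 70/30 weighting."""
    return round(0.7 * execution + 0.3 * access, 2)

def tier(score: float) -> str:
    """Map a 0-10 score onto the four tiers."""
    if score >= 8.0:
        return "L4 Native"
    if score >= 6.0:
        return "L3 Fluent"
    if score >= 4.0:
        return "L2 Developing"
    return "L1 Emerging"

print(an_score(9.0, 6.6), tier(an_score(9.0, 6.6)))  # 8.28 L4 Native
```

Note how the 70% execution weight dominates: an API can have painful onboarding and still land in L4 if it executes flawlessly, but weak execution caps the score no matter how easy signup is.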
Real Examples
Stripe: 8.1 (L4 Native)
- Execution: 9.0
- Access: 6.6
Stripe has idempotency keys, structured errors with machine-readable codes, versioned webhooks, documented rate limit headers, and an official Agent Toolkit.
PayPal: 4.9 (L2 Developing)
- Execution: 5.9
- Access: 3.7
PayPal's OAuth2-only authentication is a blocker for unsupervised setup. Sandbox signup requires CAPTCHA. It's a 2024 payment API with 2004-era onboarding friction.
Five Questions Before Your Agent Calls Any API
What happens when the request fails? Do you get structured JSON with an error code, or a 500 with an HTML error page?
Can the agent create credentials without a human? If signup requires CAPTCHA, your agent can't self-provision.
Are rate limits explicit and machine-readable? Good APIs return `X-RateLimit-Remaining` and `Retry-After` headers.
Does the API version its responses? Breaking changes in response schemas are the #1 cause of silent agent failures.
Is there a sandbox that doesn't require production credentials?
Why This Matters for MCP Servers
An MCP server is only as reliable as the APIs it calls.
If you're building with:
- Resend (L3 Fluent, 7.2)
- Stripe (L4 Native, 8.1)
- Supabase (L3 Fluent, 7.8)
You get a system that works reliably at 3am. All three handle errors gracefully, provide rate limit headers, and support idempotency.
But swap Stripe for PayPal, and the whole system degrades. Your payment handler can't set up credentials without a human. Your retry logic might create duplicates.
One weak link breaks the whole chain. The AN Score helps you find weak links before integration.
Originally published at rhumb.dev/blog/mcp-server-scoring-methodology
Learn more at https://rhumb.dev