DEV Community

Rhumb

Posted on • Originally published at rhumb.dev

How to Evaluate APIs for AI Agents: A 20-Dimension Framework

Most people are asking the wrong question.

When they say an API or tool is "agent-ready," they often mean the website is easy for AI systems to crawl or cite. That matters for discoverability, but it tells you very little about what actually happens when an autonomous agent tries to use the API in production.

The real question is simpler and harsher:

Will this API still work when my agent calls it at 3am with no human supervision?

That depends on operational details like auth friction, retry safety, error quality, schema stability, and sandbox support — not just website metadata.

The wrong question everyone's asking

Search for "agent compatibility scoring" and you'll find tools that scan websites for AI crawlability — whether your site has llms.txt, structured data, or robots.txt rules for GPTBot. That's useful if you're optimizing a marketing page for ChatGPT citations.

But if you're building an AI agent that needs to use an API — send an email, process a payment, query a database — website crawlability tells you nothing. Your agent doesn't read your landing page. It calls your endpoints.

What actually matters: execution vs. access

After scoring 645+ developer APIs across 86 categories, we've found that agent compatibility comes down to two axes:

Execution (70% of what matters)

Can the agent reliably get work done through this API?

  • Error handling: Does the API return structured, parseable errors? Or vague 500s that leave the agent guessing?
  • Schema stability: Do response shapes change between versions without warning?
  • Idempotency: Can the agent safely retry a failed request without creating duplicates?
  • Latency consistency: Are response times predictable enough for timeout management?
  • Rate limit transparency: Does the API tell the agent how long to wait, or just reject requests?
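The retry-safety and rate-limit points above can be sketched in a few lines. This is a minimal illustration, not a production HTTP client: `send` stands in for whatever transport your agent uses, and the `Idempotency-Key` header name follows a common convention (Stripe uses it; other APIs name it differently).

```python
import time
import uuid

def call_with_retries(send, payload, max_attempts=3):
    """Retry a write safely: one idempotency key for all attempts,
    and honor the server's Retry-After hint on 429s.

    `send` is any callable taking (payload, headers) and returning
    (status_code, response_headers, body) -- a stand-in for a real client.
    """
    headers = {"Idempotency-Key": str(uuid.uuid4())}  # same key on every retry
    for attempt in range(max_attempts):
        status, resp_headers, body = send(payload, headers)
        if status < 500 and status != 429:
            return status, body
        # Prefer the server's explicit wait hint over guessing.
        wait = float(resp_headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return status, body
```

Because the same idempotency key is reused across attempts, a retry after a timeout cannot create a duplicate charge or email; the server deduplicates on the key.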

Access Readiness (30% of what matters)

Can the agent even get started?

  • Signup friction: Does creating credentials require email verification, phone numbers, or CAPTCHAs?
  • Authentication complexity: API key in a header? Or a multi-step OAuth dance requiring a browser?
  • Documentation quality: Can the agent (or the developer configuring it) understand the API from docs alone?
  • Sandbox availability: Is there a test environment that doesn't require production credentials?
  • Rate limits: Are free-tier limits high enough for development and testing?

We weight execution at 70% because access friction is a one-time cost — you solve it during setup. Execution reliability is an ongoing cost that compounds every time the agent makes a call.

The AN Score: quantifying agent-nativeness

The Agent-Native (AN) Score evaluates each API across 20 specific dimensions on these two axes, producing a score from 0 to 10:

| Tier | Score | What it means |
| --- | --- | --- |
| L4 Native | 8.0-10.0 | Built for agents. Minimal friction, reliable execution, structured everything. |
| L3 Fluent | 6.0-7.9 | Agents can use this reliably with minor configuration. |
| L2 Developing | 4.0-5.9 | Usable with workarounds. Expect friction points. |
| L1 Emerging | 0.0-3.9 | Significant barriers. Not recommended for unsupervised agent use. |
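The tier cutoffs reduce to a trivial lookup, which an agent could use to gate tool selection at runtime:

```python
def an_tier(score: float) -> str:
    """Map a 0-10 AN Score to its tier label, using the thresholds above."""
    if score >= 8.0:
        return "L4 Native"
    if score >= 6.0:
        return "L3 Fluent"
    if score >= 4.0:
        return "L2 Developing"
    return "L1 Emerging"
```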

Real example: payments

Stripe scores 8.1 (L4 Native): execution score 9.0, access readiness 6.6. It has idempotency keys, structured errors, versioned webhooks, and an official agent toolkit. The access readiness score is lower because restricted API keys can silently scope-limit results.

PayPal scores 4.9 (L2 Developing): execution score 5.9, access readiness 3.7. OAuth2 is the only auth method. Sandbox requires CAPTCHA verification. The moment your agent needs to click "I am not a robot," the automation dies.

The gap between 8.1 and 4.9 isn't marginal. It's the difference between an agent that processes payments at 3am and one that pages a human.

Five questions before your agent calls any API

You don't need a formal scoring framework to make better tool selections:

1. What happens when the request fails?
Check the API's error responses. Do you get structured JSON with an error code and suggested fix? Or a generic 500 with an HTML error page?
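A quick way to see why this matters: an agent can branch on a structured JSON error, but an HTML error page is a dead end. The `{"error": {"code", "message"}}` shape below is an assumed convention for illustration; real APIs vary in how they nest error details.

```python
import json

def classify_failure(status, content_type, body):
    """Decide what an agent can do with a failed response.
    Structured JSON errors are actionable; opaque HTML 500s are not."""
    if content_type.startswith("application/json"):
        err = json.loads(body).get("error", {})
        return {
            "retryable": status in (429, 503),
            "code": err.get("code", "unknown"),
            "hint": err.get("message", ""),
        }
    # An HTML error page gives the agent nothing to branch on.
    return {"retryable": False, "code": "opaque", "hint": ""}
```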

2. Can the agent create credentials without a human?
If signup requires email verification, a phone number, or a CAPTCHA, your agent can't self-provision.

3. Are rate limits explicit and machine-readable?
Good APIs return X-RateLimit-Remaining and Retry-After headers. Bad APIs just return 429 with no guidance.
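Here's a sketch of what those headers buy an agent: a pacing decision it can make before a 429, not just after. This assumes `X-RateLimit-Reset` holds seconds until the window resets; many APIs send an epoch timestamp instead, so adjust for the API you're calling.

```python
def next_delay(headers: dict, default_delay: float = 1.0) -> float:
    """Compute how long to wait before the next call from rate-limit headers.
    Retry-After wins once the limit is hit; otherwise spread the remaining
    request budget over the time left in the window."""
    if "Retry-After" in headers:
        return float(headers["Retry-After"])
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_in = float(headers.get("X-RateLimit-Reset", 0))
    if remaining <= 0:
        return max(reset_in, default_delay)
    # Pace proactively so the agent never slams into the limit.
    return reset_in / remaining if reset_in else 0.0
```

With a bad API that returns only a bare 429, none of these inputs exist and the agent is left guessing with blind exponential backoff.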

4. Does the API version its responses?
Breaking changes in response schemas are the #1 cause of silent agent failures.
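One cheap defense is validating response shape at the boundary, so schema drift fails loudly instead of silently. This is a minimal hand-rolled sketch; a real project would reach for a schema library such as Pydantic or JSON Schema.

```python
def check_shape(resp: dict, required: dict) -> list:
    """Return a list of schema violations instead of letting the agent
    silently read missing or renamed fields. `required` maps field -> type."""
    problems = []
    for field, typ in required.items():
        if field not in resp:
            problems.append(f"missing field: {field}")
        elif not isinstance(resp[field], typ):
            problems.append(
                f"wrong type for {field}: {type(resp[field]).__name__}"
            )
    return problems
```

If a provider renames `amount` to `amount_cents` in a new version, this check surfaces the break at the first call rather than three steps later in the agent's plan.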

5. Is there a sandbox that doesn't require production credentials?
If the sandbox requires the same onboarding friction as production, development iteration time explodes.

Website agent-readiness vs. API agent-nativeness

Most tools calling themselves "agent readiness scanners" evaluate websites for AI chatbot crawlability. They check llms.txt, robots.txt, structured data, and content formatting.

That's a different problem:

| What's measured | Website agent readiness | API agent-nativeness (AN Score) |
| --- | --- | --- |
| Target audience | AI search engines | AI agents calling APIs |
| Key metrics | llms.txt, robots.txt, Schema.org | Error handling, auth friction, idempotency |
| Score meaning | "Can AI find your content?" | "Can AI use your service?" |

Both matter. But if you're building agents, API agent-nativeness determines whether your system works unsupervised.

How to use this

Building an agent that calls external APIs?

  • Check the AN Score leaderboard for any service you're considering
  • Prefer L3+ services for critical paths; use L2 only with fallback logic
  • Run npx rhumb-mcp in your agent to get scores at decision time

Evaluating a new API without a score?
Use the five questions above. If an API fails on questions 1 or 2, it's likely L1-L2 regardless of other strengths.

API provider wanting to improve?
The 20 dimensions are published and transparent. Most impactful improvements: structured errors, API key auth (not just OAuth), and explicit rate limit headers.


Based on scoring 645+ APIs across 86 categories. Full methodology and scores at rhumb.dev. MCP tooling is open source: github.com/supertrained/rhumb.
