There's been a lot of noise lately in the community about MCPs being overhyped: they take too much context, they can be replaced with a spec, CLIs are more effective, and so on. But none of those claims came with proof, so I decided to measure it.
I used a benchmark harness that runs an AI coding agent against the same API task six different ways, captures every tool call through hooks, classifies each one, and compares the results. I ran it against two completely different APIs, 36 total runs, and the data tells a clear story.
The Setup
The tasks are simple. For the first API: convert a dataset to another representation and return the result. For the second: generate a large PNG and save it to disk. Each task runs through six different interfaces:
- no-context — zero guidance, just the task
- openapi-spec — the full OpenAPI YAML spec
- openapi-mcp — the API exposed as an MCP tool via FastMCP
- generated-python — a hand-crafted Python client library
- vibe-cli — a minimal argparse CLI wrapping the API
- pypi-sdk — told to use the official SDK from PyPI
Each scenario runs the agent in headless mode with --max-turns 10. Agent hooks capture every tool call as JSONL telemetry: what tool was used, what the input was, whether it succeeded, and a regex classifier tags each call by interface type and error category. Three iterations per scenario, per API. No cherry-picking.
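To make the classification step concrete, here's a minimal sketch of how a regex tagger over the JSONL telemetry might look. The patterns and record fields here are my assumptions for illustration, not the harness's actual code:

```python
import re

# Hypothetical patterns; the real harness's regexes and JSONL fields may differ.
INTERFACE_PATTERNS = [
    (re.compile(r"^mcp__"), "mcp"),
    (re.compile(r"\bcurl\b"), "http"),
    (re.compile(r"\bpython3?\b|\.py\b"), "python"),
]
ERROR_PATTERNS = [
    (re.compile(r"\b(401|403)\b|unauthorized", re.IGNORECASE), "auth"),
    (re.compile(r"curl: \(\d+\)"), "transport"),
]

def classify(record: dict) -> dict:
    """Tag one JSONL tool-call record with an interface type and,
    for failed calls, an error category."""
    text = f"{record.get('tool', '')} {record.get('input', '')}"
    interface = next(
        (tag for pattern, tag in INTERFACE_PATTERNS if pattern.search(text)),
        "other",
    )
    error = None
    if not record.get("success", True):
        output = record.get("output", "")
        error = next(
            (tag for pattern, tag in ERROR_PATTERNS if pattern.search(output)),
            "unknown",
        )
    return {**record, "interface": interface, "error_category": error}
```

The useful property of this shape is that every record stays a flat dict, so the per-scenario aggregates in the tables below reduce to a group-by over two added fields.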
The Numbers
Conversion API
| Scenario | Success | Avg Turns | Avg Cost | vs MCP |
|---|---|---|---|---|
| openapi-mcp | 3/3 | 2.0 | $0.03 | 1.0x |
| vibe-cli | 3/3 | 3.0 | $0.06 | 1.9x |
| pypi-sdk | 3/3 | 4.0 | $0.07 | 2.4x |
| generated-python | 3/3 | 4.3 | $0.11 | 3.7x |
| no-context | 3/3 | 6.3 | $0.12 | 4.0x |
| openapi-spec | 2/3 | 8.3 | $0.16 | 5.6x |
Image API
| Scenario | Success | Avg Turns | Avg Cost | vs MCP |
|---|---|---|---|---|
| openapi-mcp | 3/3 | 2.0 | $0.03 | 1.0x |
| no-context | 3/3 | 2.0 | $0.04 | 1.3x |
| vibe-cli | 3/3 | 4.0 | $0.07 | 2.2x |
| openapi-spec | 3/3 | 3.7 | $0.07 | 2.3x |
| generated-python | 3/3 | 6.3 | $0.14 | 4.7x |
| pypi-sdk | 2/3 | 9.7 | $0.21 | 7.1x |
MCP wins both benchmarks. 100% success rate, 2 turns every time, perfectly deterministic across all iterations. Everything else is 2x to 7x more expensive.
Why MCP Wins
Looking at the raw telemetry makes it obvious. Here's what happens when an agent tries to call an HTTP API with no context:
```
Bash  curl -s "https://api.example.com/v1/resource?q=1600%20..."
Bash  echo "Token set: ${API_TOKEN:+yes}"
Bash  curl -s "https://api.example.com/v1/resource?q=1600+..."
Bash  curl -s --get "https://api.example.com/v1/resource..."
Bash  TOKEN="$API_TOKEN" && curl -sv "https://api.example.com/..."
Bash  printenv API_TOKEN | wc -c
Bash  printenv API_TOKEN | cat -A | head -1
Bash  TOKEN=$(printenv API_TOKEN) && curl -s "https://api..."
```
Eight tool calls. The agent is building URLs by hand, fighting shell expansion of the access token, trying different encoding schemes for spaces and commas, debugging why the token isn't being passed correctly. It gets there eventually, but it burns turns figuring out the plumbing.
Here's MCP:
```
mcp__conversion_api__convert_dataset input=dataset.json format=csv
```
One call. Done. The agent doesn't construct URLs, doesn't handle auth, doesn't encode parameters, doesn't parse response formats. It calls a typed function with structured arguments and gets structured data back.
MCP eliminates every source of friction: URL construction, authentication handling, parameter encoding, API version discovery, response parsing. The agent goes straight from intent to result.
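For concreteness, here's a stdlib sketch of the plumbing a tool like that folds away. The endpoint and token variable are placeholders taken from the logs above, not the real API:

```python
import os
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.example.com/v1"  # placeholder endpoint from the logs

def build_request(q: str) -> urllib.request.Request:
    """One place for everything the agent fought with by hand:
    parameter encoding, URL construction, and the auth header."""
    url = f"{API_BASE}/resource?{urlencode({'q': q})}"
    request = urllib.request.Request(url)
    # The token never touches the shell, so there's nothing to expand or quote.
    request.add_header("Authorization", f"Bearer {os.environ.get('API_TOKEN', '')}")
    return request
```

An MCP tool is, roughly, this function plus a typed schema the agent can see up front, which is why the encoding and token debugging from the transcript never happen.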
But MCP Isn't the Whole Story
Here's where I want to push back against both sides of the debate. Yes, MCP dominates in a clean greenfield setup. But clean greenfield isn't where most work happens.
CLIs compose. A CLI like convert-dataset --input dataset.json pipes naturally into other tools. An agent can chain commands or redirect output to a file. MCP tools return structured data into the conversation context. That data has to go somewhere, and when you're chaining multiple operations, it starts bloating the context window. The vibe-cli scenario finished second on the Conversion API and a close third on the Image API because the agent reads the script once, runs it, and the output stays in the terminal where it belongs.
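A sketch of what the vibe-cli scenario might look like. The command name and flags come from the example above; the conversion body is a stand-in for the real API call:

```python
"""Minimal argparse CLI in the spirit of the vibe-cli scenario.
The conversion logic is a local stand-in for the real API call."""
import argparse
import json
import sys

def convert(path: str, fmt: str) -> str:
    """Load a JSON list of records and render it in the target format."""
    with open(path) as f:
        rows = json.load(f)
    if fmt == "csv":
        header = ",".join(rows[0])
        body = "\n".join(",".join(str(r[k]) for k in rows[0]) for r in rows)
        return header + "\n" + body
    raise SystemExit(f"unsupported format: {fmt}")

def main(argv=None):
    parser = argparse.ArgumentParser(prog="convert-dataset")
    parser.add_argument("--input", required=True)
    parser.add_argument("--format", default="csv")
    args = parser.parse_args(argv)
    sys.stdout.write(convert(args.input, args.format))

# In a real script: if __name__ == "__main__": main()
```

Because the result goes to stdout, `convert-dataset --input dataset.json > out.csv` composes with pipes and redirection instead of landing in the conversation context.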
CLIs evolve with your project. This is the angle that matters most. When you're actively developing, your CLI is a living artifact. You add a flag, the agent discovers it, uses it. The feedback loop is immediate. An MCP server is more of a fixed contract; you define the tool interface upfront and the agent consumes it as-is. That rigidity is a feature in production but a constraint during development.
The "no-context" result for the Image API is telling. The Image API is simple: a single URL with a path parameter, and the agent nailed it in 2 turns with zero guidance. For simple APIs, MCP doesn't add much because the agent's built-in knowledge is already sufficient. The value of MCP scales with API complexity.
OpenAPI specs can hurt more than they help. This surprised me. Giving the agent a full OpenAPI YAML for the Conversion API actually produced the worst results of any scenario: 2/3 success rate, 5.6x the cost of MCP. The agent spent turns reading the spec, then still struggled with the same curl/token issues. The spec added information without reducing ambiguity.
What I'd Actually Recommend
After running 36 experiments and staring at the telemetry, my mental model is this:
Use MCP for stable, well-defined APIs. If you have an API that doesn't change often and you want deterministic, minimal-cost agent interactions, wrap it in MCP.
Use CLIs for APIs you're actively building. If the interface is still evolving, a CLI gives you a faster iteration loop.
Don't bother with generated client libraries for agent consumption. The generated-python scenario was consistently one of the most expensive: 3.7x MCP's cost on the Conversion API and 4.7x on the Image API.
Don't give agents raw OpenAPI specs for complex APIs. Either wrap the API in MCP (which encodes the spec into a typed tool) or write a CLI (which encodes it into flags).