We Benchmarked 12 MCP Servers — Here's What We Found
The Model Context Protocol (MCP) ecosystem has exploded — over 10,000 servers on the official registry, 97 million monthly SDK downloads. But which MCP servers are actually good?
We built agent-eval, an open-source evaluation framework, and used it to benchmark 12 popular MCP servers across 5 dimensions: Capability, Reliability, Efficiency, Safety, and Developer Experience.
Here's what we found.
Methodology
For each server, we:
- Connected via stdio transport and discovered all available tools
- Used Claude to auto-generate test tasks based on each tool's schema
- Executed every task multiple times to measure reliability
- Scored output quality using LLM-as-judge (Claude Sonnet 4)
- Measured latency, success rate, and safety (prompt injection resistance)
All evaluation code is open source. You can reproduce these results yourself:
npx @agenthunter/eval init
npx @agenthunter/eval run
Rankings
| Rank | Server | Category | Score | Capability | Reliability | Efficiency | Safety |
|---|---|---|---|---|---|---|---|
| 1 🥇 | context7 | Search | 89 | 83 | 100 | 87 | 100 |
| 2 🥈 | mcp-fetch | Web | 86 | 73 | 90 | 99 | 100 |
| 3 🥉 | mcp-memory | Memory | 82 | 63 | 93 | 100 | 89 |
| 4 | notion-mcp | Productivity | 82 | 55 | 97 | 98 | 100 |
| 5 | mcp-datetime | Utilities | 81 | 70 | 73 | 100 | 100 |
| 6 | mcp-everything | Reference | 75 | 66 | 74 | 78 | 97 |
| 7 | mcp-sequential-thinking | Reasoning | 71 | 15 | 100 | 100 | 100 |
| 8 | mcp-filesystem | Filesystem | 68 | 73 | 14 | 100 | 100 |
| 9 | playwright-mcp | Browser | 68 | 62 | 30 | 100 | 100 |
| 10 | mcp-sqlite | Database | 63 | 63 | 10 | 100 | 100 |
| 11 | mcp-git | DevTools | 55 | 40 | 4 | 100 | 98 |
| 12 | mcp-puppeteer | Browser | 47 | 51 | 0 | 50 | 100 |
Key Findings
1. Reliability varies wildly
Of 12 servers tested, 5 achieved 80%+ reliability. However, 5 server(s) fell below 50%: mcp-filesystem (14%), playwright-mcp (30%), mcp-sqlite (10%), mcp-git (4%), mcp-puppeteer (0%). Low reliability usually means the server crashes, times out, or returns errors for valid inputs.
2. Efficiency is generally excellent
Average latency across all servers was 491ms. 9/12 servers scored 90+ on efficiency, meaning sub-second response times. MCP's stdio transport is inherently fast since there's no network overhead.
3. Safety scores reveal gaps
9/12 servers scored a perfect 100 on safety.
Individual Results
context7
- Category: Search
- Score: 89/100
- Tools discovered: 2
- Tasks generated: 4
- Success rate: 100%
- Avg latency: 1756ms
- Breakdown: Cap 83 | Rel 100 | Eff 87 | Safe 100 | DX 70
mcp-fetch
- Category: Web
- Score: 86/100
- Tools discovered: 5
- Tasks generated: 10
- Success rate: 90%
- Avg latency: 640ms
- Breakdown: Cap 73 | Rel 90 | Eff 99 | Safe 100 | DX 70
mcp-memory
- Category: Memory
- Score: 82/100
- Tools discovered: 9
- Tasks generated: 27
- Success rate: 93%
- Avg latency: 1ms
- Breakdown: Cap 63 | Rel 93 | Eff 100 | Safe 89 | DX 70
notion-mcp
- Category: Productivity
- Score: 82/100
- Tools discovered: 22
- Tasks generated: 44
- Success rate: 97%
- Avg latency: 643ms
- Breakdown: Cap 55 | Rel 97 | Eff 98 | Safe 100 | DX 70
mcp-datetime
- Category: Utilities
- Score: 81/100
- Tools discovered: 10
- Tasks generated: 30
- Success rate: 73%
- Avg latency: 2ms
- Breakdown: Cap 70 | Rel 73 | Eff 100 | Safe 100 | DX 70
mcp-everything
- Category: Reference
- Score: 75/100
- Tools discovered: 13
- Tasks generated: 39
- Success rate: 74%
- Avg latency: 2621ms
- Breakdown: Cap 66 | Rel 74 | Eff 78 | Safe 97 | DX 70
mcp-sequential-thinking
- Category: Reasoning
- Score: 71/100
- Tools discovered: 1
- Tasks generated: 3
- Success rate: 100%
- Avg latency: 1ms
- Breakdown: Cap 15 | Rel 100 | Eff 100 | Safe 100 | DX 70
mcp-filesystem
- Category: Filesystem
- Score: 68/100
- Tools discovered: 14
- Tasks generated: 28
- Success rate: 14%
- Avg latency: 1ms
- Breakdown: Cap 73 | Rel 14 | Eff 100 | Safe 100 | DX 70
playwright-mcp
- Category: Browser
- Score: 68/100
- Tools discovered: 10
- Tasks generated: 20
- Success rate: 30%
- Avg latency: 212ms
- Breakdown: Cap 62 | Rel 30 | Eff 100 | Safe 100 | DX 70
mcp-sqlite
- Category: Database
- Score: 63/100
- Tools discovered: 5
- Tasks generated: 10
- Success rate: 10%
- Avg latency: 1ms
- Breakdown: Cap 63 | Rel 10 | Eff 100 | Safe 100 | DX 70
mcp-git
- Category: DevTools
- Score: 55/100
- Tools discovered: 15
- Tasks generated: 45
- Success rate: 4%
- Avg latency: 18ms
- Breakdown: Cap 40 | Rel 4 | Eff 100 | Safe 98 | DX 70
mcp-puppeteer
- Category: Browser
- Score: 47/100
- Tools discovered: 7
- Tasks generated: 14
- Success rate: 0%
- Avg latency: 0ms
- Breakdown: Cap 51 | Rel 0 | Eff 50 | Safe 100 | DX 70
How Scores Are Calculated
| Dimension | Weight | What we measure |
|---|---|---|
| Capability | 30% | Task completion rate + output quality (LLM-as-judge) |
| Reliability | 25% | Success rate across multiple runs |
| Efficiency | 20% | Response latency (sub-500ms = 100, >10s = 0) |
| Safety | 15% | Prompt injection resistance, scope violations |
| Dev Experience | 10% | Documentation quality, error messages, schema clarity |
Overall Score = weighted average of all dimensions, scaled to 0-100.
Reproduce These Results
git clone https://github.com/OrrisTech/agent-eval
cd agent-eval
bun install
bun run --filter agent-eval build
# Evaluate a single server
echo 'agent:
name: "mcp-memory"
protocol: mcp
endpoint: "npx -y @modelcontextprotocol/server-memory"
capabilities: ["memory"]
eval:
runs: 3' > agent-eval.yaml
ANTHROPIC_API_KEY=your-key npx @agenthunter/eval run
What's Next
We're expanding to evaluate A2A agents and REST API agents. If you'd like your MCP server benchmarked, open an issue or submit a PR to our server list.
Evaluations run on 2026-04-15 using agent-eval v0.3.1. Scores may vary between runs due to LLM non-determinism. Full raw data available in the results directory.
Top comments (0)