## The Problem
The Model Context Protocol (MCP) has 106 official servers and 50+ academic papers. But every single study evaluates one server at a time. Nobody has asked: what happens when you compose multiple MCP servers for cross-domain tasks?
I decided to find out.
## What I Did
### Experiment A: Real MCP Tool Calls
I connected 6 MCP servers simultaneously:
- arXiv (academic preprints)
- PubMed (biomedical literature)
- Firecrawl (web search and scraping)
- Context7 (library documentation)
- Memory (persistent knowledge graph)
- Filesystem (file operations)
Then I ran 3 cross-domain case studies comprising 17 real tool calls, with a 0% failure rate.
The key discovery: insights that are impossible within any single server become visible when you combine them. For example, arXiv papers on AI mental health chatbots and PubMed clinical trials on the same topic turned out to be two disconnected research communities working on the same problem with zero cross-citations. This gap was only visible by querying both databases and linking the results through a knowledge graph.
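The cross-citation check behind that discovery can be sketched in a few lines. This is a toy illustration with hypothetical paper IDs, not the actual analysis code: it treats each database's results as a set and looks for citation links that cross the arXiv/PubMed boundary.

```python
# Minimal sketch of the cross-citation check, with hypothetical data.
# An empty result means the two literatures never cite each other,
# even though they study the same topic.

def cross_citations(arxiv_papers, pubmed_papers):
    """Return citation links that cross the arXiv/PubMed boundary."""
    arxiv_ids = {p["id"] for p in arxiv_papers}
    pubmed_ids = {p["id"] for p in pubmed_papers}
    links = []
    for p in arxiv_papers:
        links += [(p["id"], c) for c in p["cites"] if c in pubmed_ids]
    for p in pubmed_papers:
        links += [(p["id"], c) for c in p["cites"] if c in arxiv_ids]
    return links

# Hypothetical records: each community only cites its own venue.
arxiv = [{"id": "arXiv:2501.00001", "cites": ["arXiv:2409.12345"]}]
pubmed = [{"id": "PMID:39000001", "cites": ["PMID:38000002"]}]

print(cross_citations(arxiv, pubmed))  # [] -> zero cross-citations
```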
### Experiment B: 7-Model Benchmark
I took the data collected from those MCP servers and sent identical prompts to 7 different LLMs:
| Model | Latency | Tokens | KG Entities | KG Relations |
|---|---|---|---|---|
| GPT-5.4 | 54.7s | 2,352 | 14 | 15 |
| DeepSeek R1 | 33.9s | 4,296 | 6 | 4 |
| Mistral Large 3 | 6.5s | 1,857 | 9 | 8 |
| Llama 4 Maverick | 3.0s | 1,374 | 3 | 3 |
| Gemini 2.5 Flash | 15.6s | 4,592 | 12 | 11 |
| Claude Sonnet 4.5 | 21.3s | 2,411 | 13 | 11 |
| Claude Haiku 4.5 | 9.3s | 2,136 | 8 | 6 |
Every model successfully produced 5 cross-domain insights. But the depth varied wildly: GPT-5.4 built a knowledge graph with 14 entities and 15 relations, while Llama 4 Maverick produced only 3 entities in 3 seconds.
## The Finding All 7 Models Agreed On
Here is the most interesting part. Without being told what to look for, all 7 models independently identified the same research gap:
LangChain provides `MultiServerMCPClient` (the mechanism). Benchmarks evaluate individual tool calls. But nobody documents composition patterns for how to effectively orchestrate multi-server workflows.
I call this the mechanism-pattern gap. It is like how HTTP existed for years before REST patterns told people how to use it effectively.
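To make the distinction concrete, here is a toy stand-in for a multi-server client. The class and the server stubs are hypothetical (LangChain's real `MultiServerMCPClient` lives in its MCP adapter package and speaks actual MCP transports); the point is that the mechanism only routes calls, and says nothing about how to sequence them.

```python
import asyncio

# Toy stand-in for a multi-server MCP client (the "mechanism"):
# it can route a tool call to any registered server, but encodes
# nothing about *how* to compose calls across servers (the missing
# "patterns"). Handlers here are hypothetical stubs, not MCP servers.

class MultiServerClient:
    def __init__(self, servers):
        self.servers = servers  # name -> async handler

    async def call(self, server, tool, **kwargs):
        return await self.servers[server](tool, **kwargs)

async def arxiv_stub(tool, **kw):
    return [{"title": "AI chatbot preprint", "source": "arxiv"}]

async def pubmed_stub(tool, **kw):
    return [{"title": "Chatbot clinical trial", "source": "pubmed"}]

async def main():
    client = MultiServerClient({"arxiv": arxiv_stub, "pubmed": pubmed_stub})
    hits = await client.call("arxiv", "search", query="mental health chatbots")
    print(hits[0]["source"])  # arxiv

asyncio.run(main())
```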
## 5 Composition Patterns
From my experiments, I identified 5 recurring patterns:
- Sequential Pipeline - Server A output feeds Server B query
- Parallel Fan-Out - Same query to multiple servers at once
- Cross-Reference Verification - Validate findings across servers
- Iterative Refinement - Cross-server context narrows queries
- Domain Bridging - Synthesize insights from unrelated domains
Domain Bridging is the most valuable. It produces insights that exist in neither source alone.
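The first two patterns can be sketched with a few stub coroutines. The server functions below are hypothetical placeholders for real MCP tool calls; only the orchestration shape matters here.

```python
import asyncio

# Sketches of two composition patterns, using hypothetical stub servers.

async def search_arxiv(query):
    return [f"arxiv-paper-about-{query}"]

async def search_pubmed(query):
    return [f"pubmed-trial-about-{query}"]

async def extract_keyword(papers):
    return papers[0].rsplit("-", 1)[-1]  # toy "analysis" step

# Sequential Pipeline: Server A output feeds Server B's query.
async def sequential(query):
    papers = await search_arxiv(query)
    keyword = await extract_keyword(papers)
    return await search_pubmed(keyword)

# Parallel Fan-Out: the same query hits multiple servers at once.
async def fan_out(query):
    return await asyncio.gather(search_arxiv(query), search_pubmed(query))

print(asyncio.run(sequential("chatbots")))
print(asyncio.run(fan_out("chatbots")))
```

Cross-Reference Verification and Domain Bridging build on the fan-out shape: collect results from several servers in parallel, then compare or link them in a synthesis step.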
## Try It Yourself
All servers are open source and require no API keys:
```bash
uvx arxiv-mcp-server
npx @cyanheads/pubmed-mcp-server
npx @modelcontextprotocol/server-memory
npx firecrawl-mcp
npx @upstash/context7-mcp
npx @modelcontextprotocol/server-filesystem
```
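Most MCP clients register stdio servers through a JSON config. A sketch of wiring several of these servers together might look like this (the exact file location and schema depend on your client, and the filesystem path is a placeholder you must replace):

```json
{
  "mcpServers": {
    "arxiv": { "command": "uvx", "args": ["arxiv-mcp-server"] },
    "pubmed": { "command": "npx", "args": ["@cyanheads/pubmed-mcp-server"] },
    "memory": { "command": "npx", "args": ["@modelcontextprotocol/server-memory"] },
    "filesystem": {
      "command": "npx",
      "args": ["@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```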
The benchmark script and full results are on GitHub:
github.com/doganarif/mcp-bench
Paper: Zenodo DOI
If you are an arXiv endorser for cs.SE and this seems reasonable, I would appreciate the help: endorsement link
I am Arif Dogan, a software engineer and independent researcher in Berlin. I build things with Go, Python, and LLMs. arif.sh | GitHub