I Benchmarked 7 LLMs on Cross-Domain MCP Orchestration. All 7 Found the Same Gap.

The Problem

The Model Context Protocol (MCP) has 106 official servers and 50+ academic papers. But every single study evaluates one server at a time. Nobody has asked: what happens when you compose multiple MCP servers for cross-domain tasks?

I decided to find out.

What I Did

Experiment A: Real MCP Tool Calls

I connected 6 MCP servers simultaneously:

  • arXiv (academic preprints)
  • PubMed (biomedical literature)
  • Firecrawl (web search and scraping)
  • Context7 (library documentation)
  • Memory (persistent knowledge graph)
  • Filesystem (file operations)

Then I ran 3 cross-domain case studies totaling 17 real tool calls, with a 0% failure rate.

The key discovery: insights that are impossible within any single server become visible when you combine them. For example, arXiv papers on AI mental health chatbots and PubMed clinical trials on the same topic turned out to be two disconnected research communities working on the same problem with zero cross-citations. This gap was only visible by querying both databases and linking the results through a knowledge graph.
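The gap check itself is simple enough to sketch. The snippet below uses mock records in place of live arXiv/PubMed MCP responses; the record fields, IDs, and function name are illustrative, not the actual server schemas:

```python
# Minimal sketch of the cross-citation check. Mock data stands in for
# live arXiv/PubMed MCP results; field names here are illustrative.

def cross_citations_between(corpus_a, corpus_b):
    """Return cross-citation pairs between two corpora (empty = gap)."""
    ids_a = {p["id"] for p in corpus_a}
    ids_b = {p["id"] for p in corpus_b}
    # A cross-citation is any paper in one corpus citing a paper in the other.
    return [
        (p["id"], r) for p in corpus_a for r in p["refs"] if r in ids_b
    ] + [
        (p["id"], r) for p in corpus_b for r in p["refs"] if r in ids_a
    ]

arxiv_hits = [  # mock preprints on "AI mental health chatbots"
    {"id": "arXiv:2401.00001", "refs": ["arXiv:2305.11111"]},
    {"id": "arXiv:2402.00002", "refs": []},
]
pubmed_hits = [  # mock clinical-trial records on the same topic
    {"id": "PMID:38000001", "refs": ["PMID:37000009"]},
]

cross_citations = cross_citations_between(arxiv_hits, pubmed_hits)
print(len(cross_citations))  # 0: two communities, zero cross-references
```

In the real experiment the two result sets came from separate MCP servers and were linked through the Memory server's knowledge graph; the logic is the same, just over live data.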

Experiment B: 7-Model Benchmark

I took the data collected from those MCP servers and sent identical prompts to 7 different LLMs:

| Model | Latency | Tokens | KG Entities | KG Relations |
| --- | ---: | ---: | ---: | ---: |
| GPT-5.4 | 54.7s | 2,352 | 14 | 15 |
| DeepSeek R1 | 33.9s | 4,296 | 6 | 4 |
| Mistral Large 3 | 6.5s | 1,857 | 9 | 8 |
| Llama 4 Maverick | 3.0s | 1,374 | 3 | 3 |
| Gemini 2.5 Flash | 15.6s | 4,592 | 12 | 11 |
| Claude Sonnet 4.5 | 21.3s | 2,411 | 13 | 11 |
| Claude Haiku 4.5 | 9.3s | 2,136 | 8 | 6 |

Every model successfully produced 5 cross-domain insights. But the depth varied wildly: GPT-5.4 built a knowledge graph with 14 entities and 15 relations, while Llama 4 Maverick produced only 3 entities in 3 seconds.
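Since the table reports both latency and graph size, a rough depth-per-second figure falls out directly. This snippet (plain Python, numbers copied from the table above; the metric itself is just an illustration, not part of the benchmark) computes it:

```python
# Entities-per-second for each model, derived from the benchmark table.
results = {  # model: (latency in seconds, KG entities)
    "GPT-5.4":           (54.7, 14),
    "DeepSeek R1":       (33.9, 6),
    "Mistral Large 3":   (6.5, 9),
    "Llama 4 Maverick":  (3.0, 3),
    "Gemini 2.5 Flash":  (15.6, 12),
    "Claude Sonnet 4.5": (21.3, 13),
    "Claude Haiku 4.5":  (9.3, 8),
}

throughput = {
    model: round(entities / latency, 2)
    for model, (latency, entities) in results.items()
}

for model, eps in sorted(throughput.items(), key=lambda kv: -kv[1]):
    print(f"{model:18s} {eps:5.2f} entities/s")
```

Raw latency and graph depth tell different stories, which is why the table reports both rather than a single score.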

The Finding All 7 Models Agreed On

Here is the most interesting part. Without being told what to look for, all 7 models independently identified the same research gap:

LangChain provides MultiServerMCPClient (the mechanism). Benchmarks evaluate individual tool calls. But nobody documents composition patterns: how to orchestrate multi-server workflows effectively.

I call this the mechanism-pattern gap. It is like how HTTP existed for years before REST patterns told people how to use it effectively.
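The mechanism half of that gap is small enough to show. Here is a hedged sketch of wiring the six servers from Experiment A into one MultiServerMCPClient (from the `langchain-mcp-adapters` package; the config schema shown is my reading of that library and may differ by version, so check its docs):

```python
# Sketch of the "mechanism": one client fronting all six MCP servers.
# Server commands match the list in this post; config keys follow the
# langchain-mcp-adapters stdio transport (assumed, verify per version).
SERVERS = {
    "arxiv":      {"command": "uvx", "args": ["arxiv-mcp-server"], "transport": "stdio"},
    "pubmed":     {"command": "npx", "args": ["@cyanheads/pubmed-mcp-server"], "transport": "stdio"},
    "memory":     {"command": "npx", "args": ["@modelcontextprotocol/server-memory"], "transport": "stdio"},
    "firecrawl":  {"command": "npx", "args": ["firecrawl-mcp"], "transport": "stdio"},
    "context7":   {"command": "npx", "args": ["@upstash/context7-mcp"], "transport": "stdio"},
    "filesystem": {"command": "npx", "args": ["@modelcontextprotocol/server-filesystem", "."], "transport": "stdio"},
}

async def load_all_tools():
    # Requires: pip install langchain-mcp-adapters (import kept local so
    # the config above stays usable without the dependency installed).
    from langchain_mcp_adapters.client import MultiServerMCPClient
    client = MultiServerMCPClient(SERVERS)
    return await client.get_tools()  # one flat tool list across all 6 servers
```

That flat tool list is exactly where the mechanism stops and the missing pattern documentation begins: nothing tells the model which tools to compose, in what order, or why.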

5 Composition Patterns

From my experiments, I identified 5 recurring patterns:

  1. Sequential Pipeline - Server A output feeds Server B query
  2. Parallel Fan-Out - Same query to multiple servers at once
  3. Cross-Reference Verification - Validate findings across servers
  4. Iterative Refinement - Cross-server context narrows queries
  5. Domain Bridging - Synthesize insights from unrelated domains

Domain Bridging is the most valuable. It produces insights that exist in neither source alone.
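To make the first two patterns concrete, here is a minimal sketch with plain callables standing in for MCP tool calls; the function names and result shapes are illustrative, not a real server API:

```python
# Patterns 1 and 2, with plain functions standing in for MCP tool calls.
from concurrent.futures import ThreadPoolExecutor

def sequential_pipeline(search_arxiv, search_pubmed, topic):
    """Pattern 1: Server A's output feeds Server B's query."""
    papers = search_arxiv(topic)
    # Reuse the top preprint's title as a more specific PubMed query.
    refined_query = papers[0]["title"]
    return search_pubmed(refined_query)

def parallel_fan_out(servers, query):
    """Pattern 2: same query to multiple servers at once."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in servers.items()}
        return {name: f.result() for name, f in futures.items()}

# Mock servers for demonstration.
def fake_arxiv(q):
    return [{"title": f"survey of {q}"}]

def fake_pubmed(q):
    return [{"title": f"trial: {q}"}]

print(sequential_pipeline(fake_arxiv, fake_pubmed, "AI chatbots"))
print(parallel_fan_out({"arxiv": fake_arxiv, "pubmed": fake_pubmed}, "AI chatbots"))
```

The remaining three patterns are compositions of these two: verification and refinement are pipelines with a comparison step, and domain bridging is a fan-out followed by synthesis across the results.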

Try It Yourself

All servers are open source and require no API keys:

```shell
uvx arxiv-mcp-server
npx @cyanheads/pubmed-mcp-server
npx @modelcontextprotocol/server-memory
npx firecrawl-mcp
npx @upstash/context7-mcp
npx @modelcontextprotocol/server-filesystem
```

The benchmark script and full results are on GitHub:

github.com/doganarif/mcp-bench

Paper: Zenodo DOI

If you are an arXiv endorser for cs.SE and this seems reasonable, I would appreciate the help: endorsement link


I am Arif Dogan, a software engineer and independent researcher in Berlin. I build things with Go, Python, and LLMs. arif.sh | GitHub
