I Benchmarked 7 LLMs on Cross-Domain MCP Orchestration. All 7 Found the Same Gap.

The Problem

The Model Context Protocol (MCP) has 106 official servers and 50+ academic papers. But every single study evaluates one server at a time. Nobody has asked: what happens when you compose multiple MCP servers for cross-domain tasks?

I decided to find out.

What I Did

Experiment A: Real MCP Tool Calls

I connected 6 MCP servers simultaneously:

  • arXiv (academic preprints)
  • PubMed (biomedical literature)
  • Firecrawl (web search and scraping)
  • Context7 (library documentation)
  • Memory (persistent knowledge graph)
  • Filesystem (file operations)

Then I ran 3 cross-domain case studies totaling 17 real tool calls, with a 0% failure rate.

The key discovery: insights that are impossible within any single server become visible when you combine them. For example, arXiv papers on AI mental health chatbots and PubMed clinical trials on the same topic turned out to be two disconnected research communities working on the same problem with zero cross-citations. This gap was only visible by querying both databases and linking the results through a knowledge graph.
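The gap check itself is simple enough to sketch. The snippet below uses mock records in place of live arXiv/PubMed MCP responses; the record fields, IDs, and function name are illustrative, not the actual server schemas:

```python
# Minimal sketch of the cross-citation check. Mock data stands in for
# live arXiv/PubMed MCP results; field names here are illustrative.

def cross_citations_between(corpus_a, corpus_b):
    """Return cross-citation pairs between two corpora (empty = gap)."""
    ids_a = {p["id"] for p in corpus_a}
    ids_b = {p["id"] for p in corpus_b}
    # A cross-citation is any paper in one corpus citing a paper in the other.
    return [
        (p["id"], r) for p in corpus_a for r in p["refs"] if r in ids_b
    ] + [
        (p["id"], r) for p in corpus_b for r in p["refs"] if r in ids_a
    ]

arxiv_hits = [  # mock preprints on "AI mental health chatbots"
    {"id": "arXiv:2401.00001", "refs": ["arXiv:2305.11111"]},
    {"id": "arXiv:2402.00002", "refs": []},
]
pubmed_hits = [  # mock clinical-trial records on the same topic
    {"id": "PMID:38000001", "refs": ["PMID:37000009"]},
]

cross_citations = cross_citations_between(arxiv_hits, pubmed_hits)
print(len(cross_citations))  # 0: two communities, zero cross-references
```

In the real experiment the two result sets came from separate MCP servers and were linked through the Memory server's knowledge graph; the logic is the same, just over live data.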

Experiment B: 7-Model Benchmark

I took the data collected from those MCP servers and sent identical prompts to 7 different LLMs:

| Model | Latency | Tokens | KG Entities | KG Relations |
| --- | ---: | ---: | ---: | ---: |
| GPT-5.4 | 54.7s | 2,352 | 14 | 15 |
| DeepSeek R1 | 33.9s | 4,296 | 6 | 4 |
| Mistral Large 3 | 6.5s | 1,857 | 9 | 8 |
| Llama 4 Maverick | 3.0s | 1,374 | 3 | 3 |
| Gemini 2.5 Flash | 15.6s | 4,592 | 12 | 11 |
| Claude Sonnet 4.5 | 21.3s | 2,411 | 13 | 11 |
| Claude Haiku 4.5 | 9.3s | 2,136 | 8 | 6 |

Every model successfully produced 5 cross-domain insights. But the depth varied wildly: GPT-5.4 built a knowledge graph with 14 entities and 15 relations, while Llama 4 Maverick produced only 3 entities in 3 seconds.
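Since the table reports both latency and graph size, a rough depth-per-second figure falls out directly. This snippet (plain Python, numbers copied from the table above; the metric itself is just an illustration, not part of the benchmark) computes it:

```python
# Entities-per-second for each model, derived from the benchmark table.
results = {  # model: (latency in seconds, KG entities)
    "GPT-5.4":           (54.7, 14),
    "DeepSeek R1":       (33.9, 6),
    "Mistral Large 3":   (6.5, 9),
    "Llama 4 Maverick":  (3.0, 3),
    "Gemini 2.5 Flash":  (15.6, 12),
    "Claude Sonnet 4.5": (21.3, 13),
    "Claude Haiku 4.5":  (9.3, 8),
}

throughput = {
    model: round(entities / latency, 2)
    for model, (latency, entities) in results.items()
}

for model, eps in sorted(throughput.items(), key=lambda kv: -kv[1]):
    print(f"{model:18s} {eps:5.2f} entities/s")
```

Raw latency and graph depth tell different stories, which is why the table reports both rather than a single score.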

The Finding All 7 Models Agreed On

Here is the most interesting part. Without being told what to look for, all 7 models independently identified the same research gap:

LangChain provides MultiServerMCPClient (the mechanism). Benchmarks evaluate individual tool calls. But nobody documents composition patterns: how to orchestrate multi-server workflows effectively.

I call this the mechanism-pattern gap. It is like how HTTP existed for years before REST patterns told people how to use it effectively.
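The mechanism half of that gap is small enough to show. Here is a hedged sketch of wiring the six servers from Experiment A into one MultiServerMCPClient (from the `langchain-mcp-adapters` package; the config schema shown is my reading of that library and may differ by version, so check its docs):

```python
# Sketch of the "mechanism": one client fronting all six MCP servers.
# Server commands match the list in this post; config keys follow the
# langchain-mcp-adapters stdio transport (assumed, verify per version).
SERVERS = {
    "arxiv":      {"command": "uvx", "args": ["arxiv-mcp-server"], "transport": "stdio"},
    "pubmed":     {"command": "npx", "args": ["@cyanheads/pubmed-mcp-server"], "transport": "stdio"},
    "memory":     {"command": "npx", "args": ["@modelcontextprotocol/server-memory"], "transport": "stdio"},
    "firecrawl":  {"command": "npx", "args": ["firecrawl-mcp"], "transport": "stdio"},
    "context7":   {"command": "npx", "args": ["@upstash/context7-mcp"], "transport": "stdio"},
    "filesystem": {"command": "npx", "args": ["@modelcontextprotocol/server-filesystem", "."], "transport": "stdio"},
}

async def load_all_tools():
    # Requires: pip install langchain-mcp-adapters (import kept local so
    # the config above stays usable without the dependency installed).
    from langchain_mcp_adapters.client import MultiServerMCPClient
    client = MultiServerMCPClient(SERVERS)
    return await client.get_tools()  # one flat tool list across all 6 servers
```

That flat tool list is exactly where the mechanism stops and the missing pattern documentation begins: nothing tells the model which tools to compose, in what order, or why.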

5 Composition Patterns

From my experiments, I identified 5 recurring patterns:

  1. Sequential Pipeline - Server A output feeds Server B query
  2. Parallel Fan-Out - Same query to multiple servers at once
  3. Cross-Reference Verification - Validate findings across servers
  4. Iterative Refinement - Cross-server context narrows queries
  5. Domain Bridging - Synthesize insights from unrelated domains

Domain Bridging is the most valuable. It produces insights that exist in neither source alone.
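To make the first two patterns concrete, here is a minimal sketch with plain callables standing in for MCP tool calls; the function names and result shapes are illustrative, not a real server API:

```python
# Patterns 1 and 2, with plain functions standing in for MCP tool calls.
from concurrent.futures import ThreadPoolExecutor

def sequential_pipeline(search_arxiv, search_pubmed, topic):
    """Pattern 1: Server A's output feeds Server B's query."""
    papers = search_arxiv(topic)
    # Reuse the top preprint's title as a more specific PubMed query.
    refined_query = papers[0]["title"]
    return search_pubmed(refined_query)

def parallel_fan_out(servers, query):
    """Pattern 2: same query to multiple servers at once."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in servers.items()}
        return {name: f.result() for name, f in futures.items()}

# Mock servers for demonstration.
def fake_arxiv(q):
    return [{"title": f"survey of {q}"}]

def fake_pubmed(q):
    return [{"title": f"trial: {q}"}]

print(sequential_pipeline(fake_arxiv, fake_pubmed, "AI chatbots"))
print(parallel_fan_out({"arxiv": fake_arxiv, "pubmed": fake_pubmed}, "AI chatbots"))
```

The remaining three patterns are compositions of these two: verification and refinement are pipelines with a comparison step, and domain bridging is a fan-out followed by synthesis across the results.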

Try It Yourself

All servers are open source and require no API keys:

```shell
uvx arxiv-mcp-server
npx @cyanheads/pubmed-mcp-server
npx @modelcontextprotocol/server-memory
npx firecrawl-mcp
npx @upstash/context7-mcp
npx @modelcontextprotocol/server-filesystem
```

The benchmark script and full results are on GitHub:

github.com/doganarif/mcp-bench

Paper: Zenodo DOI

If you are an arXiv endorser for cs.SE and this seems reasonable, I would appreciate the help: endorsement link


I am Arif Dogan, a software engineer and independent researcher in Berlin. I build things with Go, Python, and LLMs. arif.sh | GitHub
