benoit Pecqueur
Smart MCP

TECHNICAL DEEP DIVE · MCP ARCHITECTURE

Intelligent Request Routing in MCP Servers Using Embedded LLMs

How a small, purpose-built language model inside your MCP server can slash token costs, eliminate client-side orchestration complexity, and deliver dramatically smarter multi-tool responses.

CorpusIQ Engineering · April 2026 · ~14 min read

To understand why this matters, we need to look at how the two dominant AI platforms handle
multi-tool orchestration today — and why both approaches leave significant problems on the table.

§ 02 — CURRENT APPROACHES

How Today's Platforms Handle Multi-Tool Orchestration

Before describing what a better solution looks like, it's worth mapping the terrain of existing
approaches. Two patterns dominate the market, each reflecting the architectural philosophy of the
platform that popularized it.

[Figure: side-by-side comparison. Client-side routing (e.g. ChatGPT / Custom GPTs): the host LLM decides which tools to call, receives all raw results, and synthesizes them itself, flooding its context. Embedded LLM routing (the approach we built): the host LLM calls the MCP server once; an embedded router LLM inside the server understands intent, selects tools, and synthesizes before returning, so the host LLM context stays clean.]

Figure 1. Two routing architectures compared. Left: client-side orchestration floods the host LLM context with raw tool results. Right: the embedded LLM router handles all orchestration server-side, returning a single synthesized response.

The Client-Side Approach: Powerful but Expensive
Platforms like OpenAI's Custom GPT framework and many LangChain-based implementations
push the orchestration problem to the client — meaning the host LLM itself. When a user asks a
complex question, the assistant examines every available tool, decides which to call, executes them
in sequence or parallel, receives all the raw results, and then synthesizes a response.
This works. It also consumes enormous amounts of context window. When you have fifteen connected data sources and each tool call returns even a modest payload, you can easily exceed 20,000 tokens of tool result data before the model writes a single word of the response. At scale, this is untenable in both cost and latency.
The Search-Based Approach: Smarter but Incomplete
Some architectures address this by exposing a search mechanism that lets the model query for
relevant tools using keywords, load only the matched tools, and proceed. This reduces the tool list
in context — but keyword search misses semantic intent. If a user asks "What's making our revenue
drop?" and the relevant tool is named GET_SHOPIFY_ORDER_SUMMARY , the semantic distance between
the query vocabulary and the tool name may cause the search to miss it entirely. The host LLM still
receives raw tool outputs, so the token cost of results hasn't been addressed — only the cost of tool
selection.
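The failure mode is easy to reproduce. Here is a minimal sketch of a naive keyword matcher; the tool names follow the article's example, while the matcher itself is a hypothetical stand-in for a real keyword search index:

```python
def keyword_match(query: str, tool_names: list[str]) -> list[str]:
    """Naive keyword routing: a tool matches if any query word appears in its name."""
    words = {w.strip("?.,'s").lower() for w in query.split()}
    return [t for t in tool_names if any(w and w in t.lower() for w in words)]

tools = ["GET_SHOPIFY_ORDER_SUMMARY", "GET_KLAVIYO_CAMPAIGN_REVENUE"]
matched = keyword_match("What's making our revenue drop?", tools)
# "revenue" happens to match the Klaviyo tool, but the Shopify order tool,
# the one that actually holds the sales data, is never selected.
print(matched)
```

No amount of tokenization tuning fixes this: the vocabulary gap between how users phrase questions and how engineers name tools is semantic, not lexical.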

§ 03 — THE ARCHITECTURE

The Embedded Router: An LLM Inside Your MCP Server
The insight at the core of our approach is simple to state and non-obvious to implement: move the
routing intelligence inside the MCP server itself. Instead of asking the host LLM to figure out
which of your twenty tools to call, embed a small, fast, purpose-built language model inside the
server that handles this decision transparently.
From the host LLM's perspective, it makes a single tool call. From the user's perspective, they
get a synthesized answer that draws from exactly the right data sources. Everything in between —
tool selection, parallel execution, result synthesis, error handling — happens inside the MCP server,
invisible to the host model.
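The shape of that single entry point can be sketched in a few lines. Everything here is a placeholder: `route` and `synthesize` stand in for calls to the embedded model, and the tool table stands in for live connectors; the structure, not the implementations, is the point.

```python
import asyncio

# Stand-ins for live connectors (hypothetical names and payloads).
TOOLS = {
    "get_orders": lambda params: {"revenue": 42000},
    "get_campaigns": lambda params: {"attributed_revenue": 9500},
}

def route(query: str) -> dict:
    """Stand-in for the embedded router: returns a structured execution plan."""
    return {"tools": ["get_orders", "get_campaigns"],
            "params": {"date_range": "last_2_months"}}

def synthesize(query: str, results: dict) -> str:
    """Stand-in for the synthesis step: merges tool outputs into one answer."""
    total = sum(v for r in results.values() for v in r.values())
    return f"Combined revenue across sources: {total}"

async def answer(query: str) -> str:
    """The single tool the host LLM sees; all orchestration happens below it."""
    plan = route(query)

    async def call(name):
        return name, TOOLS[name](plan["params"])

    results = dict(await asyncio.gather(*(call(n) for n in plan["tools"])))
    return synthesize(query, results)

print(asyncio.run(answer("Why did email revenue drop?")))
```

The host LLM only ever interacts with `answer`; the plan, the parallel calls, and the raw payloads never cross the protocol boundary.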

[Figure: three-layer architecture. Host layer: user query, host LLM making one MCP tool call, single request, response. MCP server layer (your implementation): embedded router LLM (intent classification + tool plan generation), execution (parallel tool calls with error handling), synthesis (merge results + format output). Data layer: tool registry + schema embeddings covering CRM, Analytics, Email, Accounting, Storage, Ads, eCommerce, Calendar, etc., with connectors such as HubSpot (CRM), Shopify (eCommerce), Klaviyo (Email/SMS), and QuickBooks (Accounting).]

Figure 2. Full architecture of the embedded LLM routing system. The host LLM makes a single MCP tool call; everything inside the dashed green boundary is handled by the server — including intent classification, tool selection, parallel execution, and synthesis.

This architecture has a profound implication: the host LLM's context window is never
contaminated with raw tool outputs. It sends a natural language question. It receives a clean,
synthesized answer. The routing complexity — which was previously the host model's problem —
is fully encapsulated.

1 MCP call from the host LLM → N tools called internally (in parallel) → 1 synthesized response returned.

§ 04 — THE ROUTING MECHANISM

How the Embedded Router Classifies Intent
The embedded router is not a general-purpose LLM. It is a small, fast model that has been given a
precise and narrow job: read a natural language query, reason about which tools are needed to
answer it, and produce a structured execution plan. It does not write prose. It does not explain itself.
It outputs a deterministic JSON plan that the execution engine can act on immediately.
The router operates against a tool schema registry — a compact representation of every
available tool that includes not just the tool's name and parameters, but semantic descriptions of
what kinds of questions it can answer, what data it returns, and what its relationship is to other tools
in the system.

"Why did our email revenue drop last month compared to the month before?"

EMBEDDED ROUTER LLM — REASONING

- Temporal comparison → need date-filtered metrics
- "Email revenue" → campaign + flow performance tools
- "Drop" implies comparative analysis, not just a snapshot

```json
// router output: structured execution plan
{
  "tools": [
    "get_klaviyo_campaign_revenue",
    "get_klaviyo_flow_performance"
  ],
  "params": { "date_range": "last_2_months", "compare": true },
  "synthesis_goal": "revenue_comparison_with_delta"
}
```

Figure 3. The embedded router analyzes query intent, maps it to semantic tool categories, and outputs a structured execution plan — all before a single external API call is made.
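Because the plan comes from a language model, the execution engine should validate it before acting on it. A minimal sketch, using the tool names from Figure 3 (the `parse_plan` helper is hypothetical, not part of any MCP SDK):

```python
import json

# Tools the server actually implements (names from Figure 3).
REGISTERED = {"get_klaviyo_campaign_revenue", "get_klaviyo_flow_performance"}

def parse_plan(raw: str) -> dict:
    """Parse the router's JSON plan and reject hallucinated tool names."""
    plan = json.loads(raw)
    unknown = set(plan.get("tools", [])) - REGISTERED
    if unknown:
        raise ValueError(f"router selected unregistered tools: {sorted(unknown)}")
    plan.setdefault("params", {})  # tolerate plans that omit parameters
    return plan

raw = """{
  "tools": ["get_klaviyo_campaign_revenue", "get_klaviyo_flow_performance"],
  "params": {"date_range": "last_2_months", "compare": true},
  "synthesis_goal": "revenue_comparison_with_delta"
}"""
plan = parse_plan(raw)
```

Rejecting unknown tool names at this boundary is what makes it safe to act on the plan "immediately": the model's output is treated as untrusted input until it checks out against the registry.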

[Figure: five query surface forms ("Why is revenue dropping?", "Sales are lower, what happened?", "Compare this week vs last week", "Show me the revenue trend", "Our numbers look off since the email campaign ended") all collapse to the intent REVENUE / COMPARE and the same tool calls: get_shopify_order_summary (date_range: "last_2_months"), get_klaviyo_campaign_revenue (compare: true), get_ga4_realtime (session + conversion data). No query word appears in any tool name; the match is semantic, not keyword.]

Figure 4. Semantic intent classification: five different phrasings of the same underlying question all collapse to the same intent category and tool calls — something keyword search cannot reliably achieve.

§ 05 — TOKEN ECONOMICS

The Token Cost Advantage: A Detailed Breakdown
Let's put concrete numbers to this. Assume a deployment with fifteen connected data sources,
where the average tool call returns 1,500 tokens of raw output. A complex multi-source question
might require calling five tools.

[Figure: bar chart of estimated token consumption for a multi-source query needing 5 tools. Client-side orchestration (all 15 tools in context): 38,000 tokens. Search-based routing (search + results): 18,000 tokens. Embedded LLM router: ~4,000 tokens, an 89% reduction vs client-side.]

Figure 5. Comparative token consumption across routing approaches for a query requiring 5 tool calls (15 tools connected, average 1,500 tokens per result). The embedded router keeps the host LLM context clean — it only sees the synthesized answer, not raw tool outputs.

The key insight: in the embedded router approach, raw tool outputs never enter the host LLM's
context window. The MCP server consumes them internally. The host only receives the final
synthesized answer — typically 300–600 tokens regardless of how many tools were called. Token
costs grow sub-linearly with the number of connected data sources.
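A back-of-the-envelope version of that accounting can be written down directly. The per-tool definition size and answer size below are assumptions for illustration, not the measured figures behind the chart (which also include multi-turn overhead):

```python
# Illustrative token accounting. Sizes marked "assumed" are not from the chart.
n_tools = 15          # connected data sources
def_tokens = 500      # assumed tokens per tool definition in host context
result_tokens = 1500  # average raw tokens per tool result (from the article)
tools_called = 5      # tools needed for this query
answer_tokens = 500   # assumed size of the synthesized answer

# Client-side: every tool definition plus every raw result lands in host context.
client_side = n_tools * def_tokens + tools_called * result_tokens

# Embedded router: one wrapper tool definition plus one synthesized answer.
embedded = def_tokens + answer_tokens

print(f"client-side: {client_side}, embedded: {embedded}, "
      f"reduction: {1 - embedded / client_side:.0%}")
```

The structural point survives any reasonable choice of constants: `client_side` grows with both `n_tools` and `tools_called`, while `embedded` is flat, which is exactly why token costs grow sub-linearly as sources are added.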

| Characteristic | Client-side | Search-based | Embedded router |
|---|---|---|---|
| Host LLM context pollution | High — all raw results | Medium — matched results | None — synthesized only |
| Semantic routing accuracy | High (LLM decides) | Low (keyword match) | High (LLM decides) |
| Round-trips for routing | 1 (then parallel calls) | 2–3 (search → load → call) | 0 (internal to server) |
| Cross-tool synthesis quality | High | Medium | High |
| Scales to 20+ tools | Poorly | Moderately | Well |
| Implementation complexity | Low (client handles it) | Medium | High (your server must implement it) |

§ 06 — REQUEST LIFECYCLE

Anatomy of a Routed Request
Let's trace a complete request from user query to synthesized response, examining each stage and
what decisions are made along the way.

1. QUERY INGESTION (~0ms): the host LLM calls the MCP server endpoint with the raw natural language query.
2. INTENT CLASSIFICATION (~200ms): the embedded router LLM reads the query plus the tool schema registry and generates a plan.
3. PARALLEL EXECUTION (~800ms): selected tools are called concurrently; errors in any single tool are caught without aborting the entire plan.
4. RESULT SYNTHESIS (~1100ms): the router LLM merges all tool outputs into a single coherent answer.
5. RESPONSE RETURNED (~1400ms): the host LLM receives a clean, structured answer; no raw payloads are ever exposed.

Figure 6. Complete request lifecycle. Timing estimates assume average API latency. The parallel execution phase (stage 3) is the dominant cost — running N tools in parallel rather than serially is critical to keeping total latency acceptable.

DESIGN PRINCIPLE: GRACEFUL DEGRADATION
When a tool call fails during parallel execution — a rate limit, a downstream API timeout — the
embedded router synthesizes the best possible answer from the tools that succeeded. It explicitly
annotates the gap: "Revenue data from Shopify is unavailable — API timeout. Klaviyo email attribution
and GA4 session data were retrieved successfully." This transparency is far more useful than a generic
error.
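In Python, that behavior falls out of running the plan with `asyncio.gather(..., return_exceptions=True)` and sorting outcomes into results and gaps. The two tool coroutines below are simulated stand-ins for real connectors:

```python
import asyncio

async def get_shopify_order_summary(params):
    raise TimeoutError("Shopify API timeout")  # simulated downstream failure

async def get_klaviyo_campaign_revenue(params):
    return {"email_revenue": 9500}

async def execute(plan, params):
    """Run planned tools in parallel; record failures instead of aborting."""
    names = list(plan)
    outcomes = await asyncio.gather(
        *(plan[name](params) for name in names), return_exceptions=True)
    results, gaps = {}, {}
    for name, out in zip(names, outcomes):
        if isinstance(out, Exception):
            gaps[name] = f"{type(out).__name__}: {out}"  # annotate the gap
        else:
            results[name] = out
    return results, gaps

plan = {"get_shopify_order_summary": get_shopify_order_summary,
        "get_klaviyo_campaign_revenue": get_klaviyo_campaign_revenue}
results, gaps = asyncio.run(execute(plan, {"date_range": "last_2_months"}))
```

The `gaps` dict is then handed to the synthesis step, which is what lets the final answer state plainly which sources were unavailable and why.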

§ 07 — SCHEMA DESIGN

Designing Tool Schemas That the Router Can Reason About

The quality of routing is only as good as the schema information the router has access to. Writing
tool schemas that are simultaneously machine-parseable and semantically rich enough for an LLM
to reason about is perhaps the most underappreciated engineering challenge in this architecture.
Standard MCP tool schemas describe what parameters a function takes. Our routing schemas
also describe what kinds of questions the tool can answer, which other tools it is complementary to,
and what data it returns in plain language.

```json
// Standard MCP schema — adequate for execution, poor for routing
{
  "name": "get_shopify_order_summary",
  "description": "Get summary of recent orders",
  "inputSchema": { "start_date": "string", "end_date": "string" }
}

// Routing-aware schema — enables semantic classification
{
  "name": "get_shopify_order_summary",
  "intent_tags": ["revenue", "sales", "ecommerce", "orders", "trend"],
  "answers_questions_like": [
    "How much did we sell this month?",
    "Why is revenue dropping?",
    "What is our average order value?"
  ],
  "complementary_tools": ["get_klaviyo_campaign_revenue", "run_ga4_report"],
  "returns": "Revenue total, order count, AOV, top products by date range"
}
```

The COMPLEMENTARY_TOOLS field is particularly powerful. It teaches the router which tools are most useful together, enabling it to proactively build multi-tool execution plans for queries that mention only one domain but would benefit from correlated data.
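One way to act on that field is a simple plan-expansion pass after tool selection. This is a sketch under the assumption that expansion is a deterministic post-processing step rather than part of the router's own reasoning; registry contents mirror the schema example above:

```python
# Minimal registry: each tool lists the tools it is complementary to.
REGISTRY = {
    "get_shopify_order_summary": {
        "complementary_tools": ["get_klaviyo_campaign_revenue", "run_ga4_report"],
    },
    "get_klaviyo_campaign_revenue": {"complementary_tools": []},
    "run_ga4_report": {"complementary_tools": []},
}

def expand(selected):
    """Add each tool's complementary tools to the plan, deduplicated, in order."""
    plan, seen = [], set()
    for name in selected:
        for tool in [name, *REGISTRY[name]["complementary_tools"]]:
            if tool not in seen:
                seen.add(tool)
                plan.append(tool)
    return plan
```

With this pass, a query that the router maps only to `get_shopify_order_summary` still pulls in the Klaviyo and GA4 data needed to explain, not just report, a revenue change.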

CAUTION: SCHEMA DRIFT
Tool schemas are live documents. When you add a new data source, update an existing tool's return
signature, or deprecate a connector, the router's schema registry must be updated in lockstep. Stale
schemas are a silent failure mode — the router makes plausible-sounding decisions based on outdated
information. Automated schema validation as part of your CI pipeline is not optional; it's foundational.
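What that CI check needs to assert is small. A sketch of the consistency pass, with hypothetical tool names (one deliberately stale to show what the check catches):

```python
# Tools the server actually implements.
IMPLEMENTED = {"get_shopify_order_summary", "get_klaviyo_campaign_revenue",
               "run_ga4_report"}

# Routing schemas currently in the registry.
ROUTING_SCHEMAS = {
    "get_shopify_order_summary": {"intent_tags": ["revenue", "orders"]},
    "get_klaviyo_campaign_revenue": {"intent_tags": ["email", "revenue"]},
    "get_old_ga_report": {"intent_tags": ["sessions"]},  # stale: tool was renamed
}

def check_registry(implemented, schemas):
    """Fail CI when implementations and routing schemas drift apart."""
    missing = implemented - schemas.keys()   # tool shipped without a schema
    stale = schemas.keys() - implemented     # schema for a removed/renamed tool
    untagged = {n for n, s in schemas.items() if not s.get("intent_tags")}
    return sorted(missing), sorted(stale), sorted(untagged)

missing, stale, untagged = check_registry(IMPLEMENTED, ROUTING_SCHEMAS)
```

Wiring `check_registry` into the test suite turns schema drift from a silent routing-quality regression into a red build.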

§ 08 — WHY IT MATTERS

What This Architecture Unlocks

The embedded LLM router is not an optimization — it's an architectural shift that changes what's
possible in multi-tool AI deployments.

1. CONTEXT WINDOW PRESERVATION
The host LLM's context remains available for the actual conversation. When tool outputs don't consume the context window, the model can maintain longer histories, follow more complex instructions, and produce higher-quality synthesized responses.

2. UNLIMITED TOOL SCALE
With traditional approaches, adding a fifteenth or twentieth tool increases context window pressure significantly. With embedded routing, each new tool adds a small entry to the schema registry — but the host LLM never sees it unless it's needed. Fifty connected data sources remain performant on simple queries.

3. CROSS-SOURCE SYNTHESIS WITHOUT CLIENT-SIDE COMPLEXITY
Correlating data from Klaviyo, Shopify, and Google Analytics previously required careful client-side orchestration. The embedded router handles this transparently — it recognizes multi-source queries, calls all relevant tools, and returns a single synthesized answer.

4. REUSABLE ROUTING INTELLIGENCE ACROSS HOST MODELS
Because routing logic lives in the MCP server, you can swap host LLMs — Claude, GPT-4, an open-source model — without rebuilding orchestration. The server presents identical behavior regardless of what's calling it.

5. OBSERVABILITY AND CONTROL
Every routing decision is logged inside the MCP server. You can audit which tools were selected for which queries, identify misrouting patterns, improve schemas, and tune behavior — none of which is possible when routing happens inside the host LLM's opaque reasoning.
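Even a crude aggregation over that log is actionable. A sketch, with hypothetical log entries, showing a first-pass misrouting audit (here, an email question routed to an orders tool):

```python
from collections import Counter

# Hypothetical routing log entries as the server might record them.
ROUTING_LOG = [
    {"query": "Why is revenue dropping?", "tools": ["get_shopify_order_summary"]},
    {"query": "Compare this week vs last week", "tools": ["get_shopify_order_summary"]},
    {"query": "How did the email campaign do?", "tools": ["get_shopify_order_summary"]},
]

def tool_selection_counts(log):
    """Aggregate which tools the router selects across logged queries."""
    return Counter(tool for entry in log for tool in entry["tools"])

counts = tool_selection_counts(ROUTING_LOG)
```

A tool that dominates the counts across semantically unrelated queries, as in the third entry, is a signal that its schema is over-matching and its `intent_tags` need tightening.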

"The paradox of multi-tool AI is that more capability creates more complexity —
unless you have a layer that absorbs that complexity before it reaches the host
model. The embedded router is that layer."

§ 09 — HONEST ASSESSMENT

Trade-offs and When Not to Use This Pattern
Architectural honesty requires acknowledging where this pattern adds cost and complexity. The
embedded router is not a free lunch.

OPERATIONAL OVERHEAD
Running an embedded LLM inside your MCP server means running inference infrastructure. For low-traffic deployments, this cost may outweigh the token savings from optimized routing. The pattern becomes economical at scale — typically when you have more than five connected tools and handle a meaningful daily query volume.

WHEN SIMPLER APPROACHES ARE BETTER
If your MCP server exposes only 2–4 tools covering distinct, non-overlapping domains, the routing
problem is trivial. A simple rule-based dispatcher or exposing all tools to the host LLM is perfectly
adequate. The embedded router pays dividends when you have many tools with overlapping semantic
domains — where a question about "revenue" might reasonably involve three different systems.

The schema maintenance burden is also real. Every tool must have a well-written, semantically rich
schema. This requires ongoing attention as tools evolve, connectors are added, and the vocabulary
of user queries shifts. Teams that underinvest in schema quality will find routing accuracy
degrading silently and without obvious error messages.
Finally, pre-synthesis inside the server means you are making opinionated decisions about how
to combine and present data before it reaches the host LLM. In most cases this is desirable. But if
your use case requires the host model to reason directly over raw data — statistical analysis,
anomaly detection, complex derived metrics — pre-synthesis may discard information the model
needs.

§ 10 — CONCLUSION

The Next Frontier in MCP Server Design
The Model Context Protocol has solved the integration problem — connecting AI to the tools and
data sources it needs. The next challenge is the orchestration problem: making multi-tool queries
feel effortless, economical, and intelligent.

Embedded LLM routing moves intelligence to where the tools live, keeping the host model's
context clean, reducing token costs dramatically, and enabling semantic routing accuracy that
keyword search cannot match. It is not the only path forward — but it is the pattern that we have
found most scalable as the number of connected data sources grows from five to twenty to fifty.
The best AI integrations are ones where the routing infrastructure is invisible. Users ask natural
questions and receive coherent answers drawn from exactly the right sources. The complexity lives
in the server. The conversation stays clean.

CORPUSIQ ENGINEERING · APRIL 2026 · ALL ARCHITECTURES DESCRIBED HEREIN ARE PROPRIETARY
