Rhumb

Runtime MCP Discovery Needs Trust Filters Before Giant Indexes Become Useful

A giant MCP index sounds obviously useful.

More servers should mean better coverage.
More coverage should mean better agent capability.
And if an agent can discover tools at runtime instead of depending on a hand-curated list, that sounds like real progress.

It is progress, but only if the discovery layer solves the right problem.

Once an agent can browse a large live catalog for itself, discovery stops being a convenience feature and becomes part of the control plane.

That changes the design goal.
The question is no longer just:

How many tools can the agent find?

It becomes:

How many safe, relevant, caller-appropriate tools can the agent see before it starts choosing?

That is a much harder question.
It is also the one that matters.


1. Why giant indexes feel like progress

The current MCP ecosystem has a real discovery problem.

There are too many demos, too many abandoned experiments, too many half-working endpoints, and too many directories that make every entry look equally real.
A large index feels like a cure for that because it offers breadth.

Instead of one tool or one narrow catalog, the agent gets access to a broad ecosystem:

  • local helpers
  • remote MCP servers
  • read-only knowledge tools
  • write-capable integrations
  • adjacent AI tools outside strict MCP packaging

That is useful in one specific sense.
It increases recall.

If the right tool exists somewhere in the ecosystem, a large index improves the odds that the agent can discover it.
But recall is only one layer of runtime usefulness.

The harder layer is selection.
And selection gets more dangerous as the candidate pool gets broader.


2. Runtime discovery changes the problem from browsing to mediation

A human browsing a directory can apply common sense before clicking anything.

They can notice:

  • this one is local and low risk
  • this one writes to production systems
  • this one looks stale
  • this one probably needs auth I do not have
  • this one is not worth the blast radius

An agent does not inherit that judgment automatically.
If the runtime hands the model one giant candidate pool, the model is being asked to solve several different problems at once:

  • what is relevant
  • what is available
  • what is safe
  • what is allowed
  • what is worth the side-effect risk

That is too much to collapse into one ranking step.

This is the moment where discovery becomes part of the control plane.
The runtime is no longer just describing what exists.
It is shaping what choices the model is even allowed to consider.

That means the discovery layer should be designed less like search infrastructure and more like policy-aware mediation.
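
One way to make that mediation concrete is to attach the operational metadata to each catalog entry as structured fields rather than burying it in free-text descriptions. A minimal sketch, assuming hypothetical field names (none of these are part of the MCP spec):

```python
from dataclasses import dataclass
from enum import Enum

class TrustClass(Enum):
    LOCAL = "local"                      # local or read-mostly helper
    REVERSIBLE_WRITE = "reversible_write"
    HIGH_SIDE_EFFECT = "high_side_effect"
    REMOTE_SHARED = "remote_shared"      # remote or shared business integration

@dataclass(frozen=True)
class ToolCandidate:
    name: str
    trust_class: TrustClass
    side_effects: frozenset  # e.g. {"inspect"}, {"write", "egress"}
    auth_shape: str          # "none", "api_key", "delegated", "scoped"
    healthy: bool            # did the last handshake/auth probe succeed?

# With metadata like this, the runtime can answer "what is safe" and
# "what is allowed" structurally, instead of asking the model to infer
# it from prose tool descriptions.
read_issue = ToolCandidate("read_issue", TrustClass.LOCAL,
                           frozenset({"inspect"}), "none", True)
run_shell = ToolCandidate("run_shell", TrustClass.HIGH_SIDE_EFFECT,
                          frozenset({"execute"}), "api_key", True)
```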


3. The wrong abstraction is “best search over the whole catalog”

A lot of discovery systems implicitly aim for the same outcome:

  • gather as many tools as possible
  • attach descriptions and metadata
  • search them semantically
  • let the model pick the best match

That sounds reasonable until the catalog mixes wildly different trust classes.

A local read-mostly helper and a remote write-capable business system should not appear as interchangeable ranking candidates just because they both match the same task description.

That is how wrong-tool selection gets normalized.
The model is not just choosing relevance anymore.
It is choosing blast radius.

If the only safety layer is “hope the ranker prefers the harmless one,” then the system has already failed at discovery design.

The real job of the discovery layer is to remove bad candidate classes before semantic ranking begins.

That is why better embeddings are not enough.
Better search over the wrong pool still produces the wrong kind of risk.
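
Concretely, "remove bad candidate classes first" means admission is a hard filter, and the ranker never even sees the excluded pool. A sketch under assumed names, with a trivial keyword scorer standing in for real semantic search:

```python
# Each candidate: (name, trust_class, description). The trust-class
# labels here are hypothetical, not part of any MCP spec.
CATALOG = [
    ("read_issue", "local_read",  "read a ticket from the issue tracker"),
    ("run_shell",  "high_effect", "run a shell command on the host"),
    ("update_crm", "remote_write", "update a customer record in the CRM"),
]

def admit(candidates, allowed_classes):
    """Hard trust filter: excluded classes never reach the ranker."""
    return [c for c in candidates if c[1] in allowed_classes]

def rank(candidates, query):
    """Stand-in for semantic ranking: naive keyword overlap."""
    words = set(query.split())
    return sorted(candidates, key=lambda c: -len(words & set(c[2].split())))

# For a read-only task, write/execute surfaces are removed before
# relevance is ever computed -- the ranker cannot pick what it never sees.
pool = admit(CATALOG, allowed_classes={"local_read"})
best = rank(pool, "read the issue tracker")[0]
```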


4. What trust filters should come before ranking

If runtime discovery is going to scale safely, the candidate pool needs to be narrowed by operational metadata first.

The most important filters are not exotic.
They are the same things human operators ask about immediately.

Trust class

What kind of surface is this?

  • local or read-mostly helper
  • reversible write tool
  • high-side-effect execution surface
  • remote or shared business integration

That one distinction already changes what the agent should be allowed to consider for a given task.

Auth shape

What credential model is involved?

  • public or no-auth
  • static API key
  • delegated user auth
  • tenant-bound or scoped runtime credential

This matters because a candidate the agent cannot actually authenticate to is not a real candidate.
A candidate that authenticates through the wrong principal may be worse than unavailable.
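
Both failure modes can be checked mechanically if the runtime knows the caller's credential shapes and principal. A sketch with hypothetical field names:

```python
def auth_viable(candidate, caller):
    """A candidate the caller cannot authenticate to is not a real
    candidate; one that authenticates as the wrong principal is worse."""
    if candidate["auth"] == "none":
        return True
    if candidate["auth"] not in caller["credentials"]:
        return False  # auth can never complete for this caller
    # Reject hidden principal mismatches, not just missing credentials.
    return candidate.get("principal", caller["principal"]) == caller["principal"]

caller = {"credentials": {"api_key"}, "principal": "tenant-a"}
```

Filtering on this before ranking removes both fake availability and the quieter hazard of acting as someone else.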

Side-effect class

What can this thing actually do?

  • inspect
  • write
  • execute
  • egress

Those should not be hidden behind generic tool descriptions.
If the agent is deciding between “read issue” and “run shell command,” the runtime should make that difference explicit before the model starts reasoning.

Caller-visible scope

What can this principal see right now?
A global catalog is not the same thing as the live allowed surface for the current caller, tenant, session, or environment.

Freshness and viability

Is the service actually operational?

  • handshake works
  • auth can complete
  • failures classify cleanly
  • stale or dead entries are suppressed

A giant index without freshness becomes a context tax disguised as capability.
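
Freshness can be enforced with a cheap probe whose result classifies cleanly, so dead entries are suppressed instead of spent as context. A sketch (the status labels and the staleness window are assumptions, not spec values):

```python
import time

STALE_AFTER = 24 * 3600  # seconds; a policy knob, not a spec value

def classify(entry, now):
    """Map an index entry to a clean status instead of a vague failure."""
    if not entry["handshake_ok"]:
        return "dead"
    if not entry["auth_ok"]:
        return "auth_broken"
    if now - entry["last_seen"] > STALE_AFTER:
        return "stale"
    return "live"

def suppress_dead(index, now):
    """Only live entries are allowed to cost the agent any context."""
    return [e for e in index if classify(e, now) == "live"]

now = time.time()
index = [
    {"name": "a", "handshake_ok": True,  "auth_ok": True, "last_seen": now},
    {"name": "b", "handshake_ok": False, "auth_ok": True, "last_seen": now},
    {"name": "c", "handshake_ok": True,  "auth_ok": True, "last_seen": now - 90000},
]
live = suppress_dead(index, now)
```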


5. The useful discovery surface is the smallest caller-safe subset

This is the part many ecosystems still get backward.

They optimize for maximum exposed inventory.
But the agent does not need the biggest possible catalog.
It needs the best bounded candidate set.

A useful runtime-discovery system does not say:

“Here are 14,000 things. Good luck.”

It says something more like:

“For this caller, in this environment, under this policy, here are the 12 candidates that are both relevant enough and safe enough to consider.”

That is a much stronger product outcome.

It lowers context pressure.
It lowers wrong-tool risk.
It lowers the chance that the model confuses broad power with appropriate power.
And it makes auditability much cleaner because the candidate set itself reflects policy, not just search quality.

The useful discovery surface is not the largest global directory.
It is the smallest caller-safe subset that still preserves enough choice to route well.

That is what runtime mediation should optimize for.
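
That "smallest caller-safe subset" can be stated as a one-function contract: policy filtering first, then a top-k cut with a relevance floor so the surviving set stays both small and genuinely useful. A sketch with hypothetical parameters:

```python
def bounded_surface(candidates, safe, score, query, k=12, floor=0.0):
    """Smallest caller-safe subset: apply the policy filter first, then
    keep only the top-k candidates that clear a relevance floor."""
    pool = [c for c in candidates if safe(c)]
    scored = sorted(((score(c, query), c) for c in pool),
                    key=lambda t: t[0], reverse=True)
    return [c for s, c in scored[:k] if s >= floor]

# Toy catalog: even-numbered tools are policy-safe, relevance rises with i.
cands = [{"name": f"tool{i}", "safe": i % 2 == 0, "rel": i / 10}
         for i in range(10)]
surface = bounded_surface(cands,
                          safe=lambda c: c["safe"],
                          score=lambda c, q: c["rel"],
                          query="anything", k=3, floor=0.2)
```

The cap and the floor are both policy decisions, which is the point: the size of the candidate set reflects mediation, not just search quality.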


6. A better ladder for runtime MCP discovery

The cleanest way to think about dynamic tool discovery is as a ladder.

1. Discoverable

The service exists in an index.
This is the lowest bar.
It only proves that the entry is known.

2. Caller-visible

This principal can actually see it right now.
The global catalog has already been narrowed by environment, tenant, policy, or auth preconditions.

3. Trust-classed

The runtime exposes read/write/execute/egress shape and local/remote/shared trust class clearly enough that candidate selection is not blind to side effects.

4. Auth-viable

The intended caller can actually complete auth and receive the expected scope.
No fake availability.
No hidden principal mismatch.

5. Rankable

Only after the pool is bounded by the earlier layers should semantic search, rules, classifiers, or LLM ranking choose among the remaining candidates.

That ordering matters.
If ranking happens before trust filtering, the system is asking the model to do control-plane work that should have been solved upstream.
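
The ladder composes naturally as an ordered pipeline in which each rung can only shrink the pool, and ranking runs last. A sketch with hypothetical predicates and a trivial stand-in ranker:

```python
def discovery_pipeline(catalog, caller, policy, query, rank):
    """Rungs 1-4 are hard filters; rung 5 (ranking) sees only survivors."""
    pool = catalog                                         # 1. discoverable
    pool = [t for t in pool if caller in t["visible_to"]]  # 2. caller-visible
    pool = [t for t in pool if t["trust"] in policy]       # 3. trust-classed
    pool = [t for t in pool if t["auth_viable"]]           # 4. auth-viable
    return rank(pool, query)                               # 5. rankable

catalog = [
    {"name": "read_issue", "visible_to": {"alice"}, "trust": "read",
     "auth_viable": True},
    {"name": "run_shell",  "visible_to": {"alice"}, "trust": "execute",
     "auth_viable": True},
    {"name": "update_crm", "visible_to": {"bob"},   "trust": "write",
     "auth_viable": True},
]
# Identity ranker as a placeholder; a real system plugs semantic
# search or an LLM ranker in here, and only here.
result = discovery_pipeline(catalog, "alice", {"read"},
                            "find the bug report",
                            rank=lambda pool, q: pool)
```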


7. What Rhumb should evaluate here

This is a strong evaluation lane because current discovery discussions often over-reward catalog size and underweight admission control.

A useful methodology would ask:

  • does the system expose caller-specific visibility, or only a global list
  • are trust class and side-effect class visible before selection
  • is auth shape legible before the agent commits to a candidate
  • can stale, dead, or auth-broken entries be suppressed automatically
  • does the runtime bound the pool before semantic ranking
  • can operators audit which candidates were considered and why
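
The last question above, auditability, is cheap to support if every exclusion records its reason at filter time rather than being reconstructed later. A sketch with hypothetical check names:

```python
def filter_with_audit(candidates, checks):
    """Apply named admission checks; log why each candidate was dropped."""
    admitted, audit = [], []
    for c in candidates:
        # First failing check (by name), or None if all checks pass.
        reason = next((name for name, ok in checks if not ok(c)), None)
        if reason is None:
            admitted.append(c)
            audit.append((c["name"], "admitted"))
        else:
            audit.append((c["name"], f"rejected: {reason}"))
    return admitted, audit

checks = [
    ("caller_visible", lambda c: c["visible"]),
    ("trust_allowed",  lambda c: c["trust"] == "read"),
]
candidates = [
    {"name": "read_issue", "visible": True,  "trust": "read"},
    {"name": "run_shell",  "visible": True,  "trust": "execute"},
    {"name": "update_crm", "visible": False, "trust": "write"},
]
admitted, audit = filter_with_audit(candidates, checks)
```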

Those questions get closer to what teams actually care about.
They separate search quality from control quality.
And for agent systems, that separation is essential.


8. Bigger catalogs are only better when the runtime is stricter

There is nothing wrong with giant indexes by themselves.
They are useful infrastructure.

The mistake is treating size as the main story.

As runtime discovery gets broader, the runtime has to get stricter:

  • stricter about scope
  • stricter about trust classification
  • stricter about auth viability
  • stricter about side-effect labeling
  • stricter about what the model is even allowed to rank

Otherwise the system gets the worst of both worlds.
It gains coverage, but loses control.

And in agent systems, losing control is usually more expensive than lacking one more connector.

So yes, a runtime index of thousands of MCP services is interesting.
But the real milestone is not that the agent can see the whole catalog.

It is that the agent never sees the wrong part of it.
