Static MCP Scores Are a Baseline. Runtime Trust Is the Missing Overlay
A fresh critique of static MCP quality scoring got one important thing right.
A score on its own is not enough.
But the stronger conclusion is not that scoring is useless. It is that static scoring and runtime trust solve different parts of the same operator problem.
Before first use, you need a baseline.
You need to know what a service appears to be.
What auth shape does it use? What kind of failure semantics does it expose? Is the visible capability surface bounded? Is it read-mostly, write-capable, or effectively open-ended? Does it look like something you would trust in a solo local workflow, or in a shared unattended system?
That is what structural evaluation is for.
After deployment, you need something else.
You need to know whether the live system is still behaving like the trust class and readiness model you thought you were exposing.
Has auth drifted? Are callers hitting new failure clusters? Did latency move? Did the service stay reachable but become operationally brittle for the exact workloads that matter?
That is what runtime trust is for.
The mistake is treating either one as the whole answer.
1. Static scores still solve a real problem
A static score is most useful before the first call.
It helps answer questions like:
- does this surface look structurally safe enough to evaluate further?
- what kind of integration cost is it likely to impose?
- is this a local helper, a remote shared surface, or something closer to production infrastructure?
- does the service expose bounded capabilities, legible auth, typed failures, and clear operator semantics?
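To make that concrete, here is a minimal sketch of what a baseline profile could capture before the first call. Everything here is illustrative: MCP defines no standard scoring schema, and the field names and trust classes below are assumptions, not an existing spec.

```python
from dataclasses import dataclass
from enum import Enum

class TrustClass(Enum):
    """Illustrative trust classes; the taxonomy is an assumption."""
    LOCAL_READ_MOSTLY = "local-read-mostly"
    REMOTE_SHARED = "remote-shared"
    WRITE_CAPABLE = "write-capable"
    OPEN_ENDED = "open-ended"

@dataclass
class BaselineProfile:
    """Structural facts observable before any live use."""
    auth_legible: bool        # is the auth shape documented and machine-checkable?
    failures_typed: bool      # does it expose typed, recoverable failure semantics?
    capability_bounded: bool  # is the visible tool surface bounded?
    trust_class: TrustClass

def shortlist_ok(p: BaselineProfile) -> bool:
    """First-pass filter: structurally safe enough to evaluate further."""
    if p.trust_class is TrustClass.OPEN_ENDED:
        return False  # unbounded surfaces need manual review, not a score
    return p.auth_legible and p.failures_typed and p.capability_bounded
```

The point of the sketch is that the baseline is a classification, not a popularity count: nothing in it depends on stars or launch-day traffic.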
Without a baseline, operators are choosing blind.
They are left with GitHub stars, launch-day excitement, directory presence, or vague claims about compatibility.
That is not a real readiness model.
A good baseline score compresses structural information that matters before runtime evidence exists.
It tells you what kind of thing you are dealing with.
It creates a first-pass filter for shortlist building.
It helps distinguish a promising service from a brittle demo, even before you have enough live observations to say much about current behavior.
That is especially important in MCP, where a directory entry or a successful handshake can make two services look more similar than they really are.
A server can be reachable and still be a poor fit for unattended use.
It can expose lots of tools and still have weak scope boundaries.
It can pass the protocol floor and still lack the auth and failure behavior that make real operation safe.
Static evaluation matters because it gives operators a map before they start driving.
2. What runtime trust sees that static analysis misses
The critique of static scoring becomes valid the moment live behavior starts moving underneath the model.
That happens all the time.
A service that looked healthy on paper can drift in ways a baseline evaluation will not catch quickly enough:
- auth that was once workable becomes flaky or more human-dependent
- latency or timeout behavior degrades under real load
- failure modes cluster in one caller path but not another
- handshake success stays high while post-auth execution reliability drops
- a provider remains reachable but no longer feels operator-safe in real unattended use
Runtime trust is useful because it captures what real callers are actually seeing now.
But the useful runtime signal is not just “it responded.”
A bare reachability bit collapses too many distinctions.
Better runtime trust asks:
- was the service reachable?
- did handshake complete?
- was auth viable for this caller class?
- did the tool behave within the expected trust boundary?
- were failures typed, recoverable, and legible?
- did the surface behave like a read-only helper, a bounded write surface, or something riskier than advertised?
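Those questions can be captured as a per-caller observation record rather than a single uptime bit. The shape below is a sketch under that framing; the field and signal names are invented for illustration, not any registry's actual schema.

```python
from dataclasses import dataclass

@dataclass
class RuntimeObservation:
    """One caller's view of a live MCP service, beyond "it responded"."""
    reachable: bool
    handshake_ok: bool
    auth_viable: bool        # viable for this caller class, not in the abstract
    within_boundary: bool    # behaved like its advertised trust class
    failures_legible: bool   # typed and recoverable, not opaque crashes

def trust_signal(obs: RuntimeObservation) -> str:
    """Collapse an observation into an operator-facing signal, checked in
    the order problems appear in the stack: transport first, behavior last."""
    if not obs.reachable:
        return "unreachable"
    if not obs.handshake_ok:
        return "protocol-floor-failure"
    if not obs.auth_viable:
        return "auth-drift"
    if not obs.within_boundary:
        return "boundary-violation"
    if not obs.failures_legible:
        return "opaque-failures"
    return "operator-safe"
```

Note that a service can be reachable with a clean handshake and still land on "auth-drift" or "boundary-violation", which is exactly the distinction a raw uptime feed loses.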
That is where runtime trust becomes valuable.
It stops being uptime theater and starts becoming an operator overlay.
3. Behavioral data without structural context can still mislead
This is where the “runtime trust fixes everything” story breaks.
Behavioral feeds are not automatically trustworthy just because they are live.
A raw stream of success and failure reports can blur important differences:
- one caller may be using a very different auth path than another
- a read-only lookup surface and a write-capable execution surface should not be interpreted with the same risk model
- one recent outage can dominate perception even when the structural design is still sound
- a service can look “healthy” in aggregate while being a bad fit for the workflows that matter to you
Without structural context, behavioral trust can overfit to noise.
You end up with a feed that says a service is “good” or “bad” without explaining why, for whom, and under what conditions.
That is not much better than stars.
It is just fresher ambiguity.
This is especially important in MCP because the same broad label can hide very different surfaces.
A local read-mostly tool, a remote multi-tenant gateway, and a write-capable MCP wrapper might all register as “working,” but they do not belong in the same trust bucket.
Their operator risk is different.
Their blast radius is different.
Their recovery story is different.
So runtime trust is most useful when it is interpreted through structural context, not treated as a replacement for it.
4. The better model is baseline score plus live trust overlay
The cleaner way to think about this is as a layered system.
Layer 1: Baseline evaluation
What does this service appear to be before live use?
What trust class does it belong to?
How legible are auth, scope, failure semantics, and operator boundaries?
Layer 2: Live runtime overlay
What are real callers seeing right now?
Is auth still viable?
Are failures drifting?
Is latency degrading?
Are current behaviors consistent with the baseline trust class?
Layer 3: Drift interpretation
Where is live behavior diverging from structural expectation?
Is the service still behaving like a bounded read-mostly surface, or is it acting riskier than its baseline model suggested?
Has the protocol floor stayed intact while execution trust declined?
Layer 4: Operator decision
Should the service stay promoted, be demoted, be quarantined for certain caller classes, or be treated as degraded until the overlay improves?
That is a much stronger system than either static score alone or behavioral feed alone.
Static score gives the initial map.
Runtime trust updates the conditions.
Drift interpretation tells you when the map and the road no longer match.
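One hedged way to wire the four layers together is a small decision function: the baseline gates entry, and the live overlay moves a service between promoted, demoted, degraded, and quarantined. The overlay keys here are hypothetical signal names chosen for this sketch, not an existing API.

```python
def operator_decision(baseline_ok: bool, overlay: dict) -> str:
    """Layer 4 sketch: fold baseline (Layer 1) and live overlay (Layers 2-3)
    into a routing decision. `overlay` maps illustrative signal names to bools."""
    if not baseline_ok:
        return "reject"          # never entered the trust class at all
    if overlay.get("boundary_drift"):
        return "quarantine"      # acting riskier than its baseline model suggested
    if overlay.get("auth_drift") or overlay.get("failure_clusters"):
        return "degraded"        # protocol floor intact, execution trust declining
    if overlay.get("latency_regression"):
        return "demote"          # still usable, no longer preferred
    return "promoted"
```

The ordering encodes the drift-interpretation layer: a boundary violation outranks an auth or failure cluster, which outranks a latency regression, because blast radius differs by trust class.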
5. What this means for MCP directories and trust registries
If directories and trust registries want to become genuinely useful for operators, they should stop forcing one-dimensional judgments.
The goal should not be one number that tries to compress the whole story.
The goal should be a baseline plus a freshness-aware overlay.
That could mean showing things like:
- structural score or baseline readiness classification
- freshness window for live observations
- auth viability signals, not just responsiveness
- trust-class-aware runtime evidence
- distinction between reachability, handshake success, post-auth usability, and operator-safe behavior
- drift alerts when live behavior stops matching the baseline model
This matters because a lot of current MCP evaluation still collapses into one of two weak answers.
Either a static directory entry with stars and metadata, or a live feed that mostly says whether something answered.
Neither is enough.
The useful question is more specific:
Is this service behaving, right now, like the kind of thing we thought we were exposing?
That is the question operators actually care about.
6. Readiness should be framed as a changing surface, not a fixed label
This is the part that matters most.
Readiness is not a permanent badge.
It is a moving relationship between structure and behavior.
A service can be well-designed and currently degraded.
A service can be noisy in the short term but structurally strong.
A service can look alive at the transport layer while becoming less safe operationally.
A service can pass handshake, expose tools, and still fail the real question: whether unattended callers can use it predictably inside the expected trust boundary.
That is why static scores are best understood as a baseline, not a verdict.
And runtime trust is best understood as an overlay, not a replacement.
Put differently:
- static scoring answers what this surface appears to be
- runtime trust answers what this surface is doing now
- operator judgment answers whether current behavior still matches the trust class we want to allow
That is the model MCP evaluation should grow toward.
Because the goal is not to win an argument about static versus live systems.
The goal is to help operators decide, with less guesswork, whether a service still deserves to sit inside an agent's action loop.