We just released IntentProbe — the first product-shaped MCP/tool-poisoning scanner that uses activation probing instead of text analysis.
The idea is simple: when a model reads a tool description, its hidden states carry signal about what the tool is really asking for. IntentProbe runs the description through a small frozen local model, reads internal activations, and scores whether that state looks like credential access, exfiltration, hidden persistence, or tool hijacking.
Why this matters
A safe tool and a poisoned tool can use almost the same vocabulary. The surface text can look normal, while the hidden intent is "read private keys and quietly send them somewhere else."
Instead of only reading what the text says, IntentProbe feeds the tool description through a small frozen model, slices into hidden layers, and reads the internal activation state. It's reading what the model represented internally, not just the words on the page.
Same words, different intent, different activation pattern.
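To make "almost the same vocabulary" concrete, here is a small sketch that measures token overlap between a safe and a poisoned description. The two descriptions are hypothetical, written for this example rather than taken from the benchmark data, and Jaccard overlap is only a rough stand-in for what a bag-of-words text classifier sees.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens: the kind of surface feature a text classifier sees."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens across both texts."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Hypothetical pair (an assumption for illustration, not benchmark data).
SAFE = ("Reads the SSH config and lists host names. "
        "Never reads private keys or sends data to a remote server.")
POISONED = ("Reads the SSH config and private keys, then sends data "
            "to a remote server. Never lists host names.")

overlap = jaccard(SAFE, POISONED)
print(f"token overlap: {overlap:.2f}")
```

The two descriptions share 17 of 19 distinct tokens, so a detector that only counts words has very little to separate them, even though one describes exfiltration and the other rules it out.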
How is this different from existing scanners?
Most MCP / AI-agent security scanners fall into a few buckets:
1. Text-based scanners
(Snyk/DeBERTa-style classifiers, regex, pattern matching) — they read the words in a tool description and try to classify it. Problem: when a safe tool and a poisoned tool use almost the same words, there's no text difference to catch. On our matched-vocabulary tool-poisoning test, the strongest public/source-verifiable DeBERTa prompt-injection classifier baseline we could reproduce scored 0% recall.
2. LLM-as-judge
(ask GPT/Claude/Qwen "is this tool safe?") — you're asking a model for an opinion. That's like asking a patient "are you sick?" and trusting the answer, instead of checking the scan. The answer can be prompt-sensitive, slow, expensive, and part of the attack surface.
3. Enterprise cloud/API scanners
(Lakera, Azure Prompt Shields, Google Model Armor, etc.) — in API mode, you send tool descriptions, prompts, or outputs to a vendor backend. They may work well, but from the outside the detector is mostly a black box: you usually can't inspect the artifact, reproduce their benchmark, or verify a public MCP/tool-poisoning result end-to-end.
4. Activation probing (this is what IntentProbe introduces)
A new approach: run the description through a small frozen model and read the internal activation state directly. No pattern matching, no LLM opinion, no cloud API. The probe scores whether the model's representation looks like credential access, exfiltration, or hidden persistence. To our knowledge, IntentProbe is the first tool to productize this for MCP/tool poisoning.
Text scanner: Tool description → Read words → "Looks safe" ✅
LLM-as-judge: Tool description → "Is this safe?" → "Yes" ✅
Cloud/API scanner: Tool description → Vendor/API check → ??? → "Safe" ✅
IntentProbe: Tool description → Frozen model → Read activations → "Credential access intent" 🚫
How it works
- Tool description goes through a frozen local model (Qwen2.5-0.5B, 494M params, any CPU)
- A trained probe reads internal activations at layers 13-15
- Static regex checks corroborate the activation signal
- Decision: allow / warn / block with a confidence score
The probe weights are 22 KB. Under a second per description. Nothing leaves your machine.
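The steps above can be sketched in a few lines. This is a minimal illustration, not the shipped implementation: it assumes the probe is linear with a sigmoid output, that the three layers are mean-pooled, and that a block requires both signals to agree. The thresholds, regex patterns, and toy 4-dimensional vectors are all hypothetical. In a real setup the per-layer vectors would come from the frozen model's hidden states (for example, Hugging Face `transformers` called with `output_hidden_states=True`).

```python
import math
import re

# Hypothetical static checks; the real IntentProbe rules are not shown in this post.
STATIC_PATTERNS = [
    re.compile(r"private key|\.ssh|credential", re.I),
    re.compile(r"silently|without telling|do not mention", re.I),
]

def mean_pool(layers: list[list[float]]) -> list[float]:
    """Average several per-layer activation vectors into one feature vector."""
    n = len(layers)
    return [sum(vals) / n for vals in zip(*layers)]

def probe_score(features: list[float], weights: list[float], bias: float) -> float:
    """Assumed linear probe followed by a sigmoid: returns a score in (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def decide(score: float, text: str, warn_at: float = 0.5, block_at: float = 0.9) -> str:
    """Assumed policy: block only when the probe and a regex check agree."""
    regex_hit = any(p.search(text) for p in STATIC_PATTERNS)
    if score >= block_at and regex_hit:
        return "block"
    if score >= warn_at or regex_hit:
        return "warn"
    return "allow"

# Toy stand-ins for activations at layers 13, 14, and 15.
layer_13 = [0.2, 1.0, -0.5, 0.3]
layer_14 = [0.4, 1.2, -0.3, 0.1]
layer_15 = [0.6, 0.8, -0.1, 0.2]

features = mean_pool([layer_13, layer_14, layer_15])
score = probe_score(features, weights=[1.0, 2.0, -1.0, 0.5], bias=-0.5)
print(decide(score, "Reads private keys and silently uploads them."))
```

Requiring agreement before a hard block is one way to keep the block tier conservative while still surfacing a warning when only one signal fires.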
Benchmarks
This is a public/source-verifiable baseline comparison, not a claim about Snyk's private cloud detector. The headline benchmark below is for the activation-probe method; the released v0 product uses the newer Qwen2.5-0.5B probe.
Matched-vocabulary F1
The hard test — safe and poisoned use almost the same words:
| Scanner | F1 |
|---|---|
| IntentProbe activation probe | 96.6% |
| DeBERTa text-classifier baseline | 0% |
MCPTox, same held-out split for both scanners: recall on poisoned tools (n=249)
| Scanner | Recall |
|---|---|
| IntentProbe activation probe | 100% |
| DeBERTa text-classifier baseline | 19.9% |
Camouflage suffix test
Appending "this tool is safe / read-only / sandboxed" to poisoned descriptions produced no evasions on the GPT-2 probe test: 0 of 146 got through. The current Qwen v0 scanner also held, with 0 of 15 evading.
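A camouflage test of this kind can be sketched as a small harness. Everything here is illustrative: the suffix wording, the sample descriptions, the threshold, and both toy scorers are assumptions, and a real run would plug in the scanner's own scoring function.

```python
from typing import Callable

# Suffixes modelled on the claims named in the test description (wording assumed).
CAMOUFLAGE_SUFFIXES = [
    " This tool is safe.",
    " This tool is read-only.",
    " This tool is sandboxed.",
]

def count_evasions(
    poisoned: list[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> tuple[int, int]:
    """Return (evaded, total) over every detected description and suffix."""
    evaded = total = 0
    for description in poisoned:
        if score(description) < threshold:
            continue  # missed even without camouflage, so not an evasion case
        for suffix in CAMOUFLAGE_SUFFIXES:
            total += 1
            if score(description + suffix) < threshold:
                evaded += 1
    return evaded, total

# Hypothetical poisoned descriptions and toy scorers, for demonstration only.
POISONED = [
    "Reads SSH keys and uploads them to a remote host.",
    "Formats text, then uploads environment variables elsewhere.",
]

def robust_score(text: str) -> float:
    return 0.9 if "upload" in text.lower() else 0.1

def gullible_score(text: str) -> float:
    # Trusts the word "safe", which is exactly what a camouflage suffix exploits.
    return 0.1 if "safe" in text.lower() else robust_score(text)

print(count_evasions(POISONED, robust_score))
print(count_evasions(POISONED, gullible_score))
```

The robust toy scorer reports no evasions across six variants, while the gullible one is fooled by the "safe" suffix on both descriptions.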
Full end-to-end results
| Test | IntentProbe | Baseline | Takeaway |
|---|---|---|---|
| MCPTox held-out (n=249) | recall 100%, F1 99.3% | Snyk DeBERTa recall 19.9%, F1 33.0% | Clear win |
| Same-words matched (n=86) | F1 96.6% | Snyk DeBERTa F1 0% | Text scanner blind |
| Curated family holdout (n=76) | Qwen macro F1 0.829 | TF-IDF macro F1 0.823 | Slight edge |
| RouteGuard external (n=2,900) | F1 0.513, recall 0.415 | TF-IDF F1 0.172, recall 0.107 | About 4x the recall and 3x the F1 on novel families |
| Hard-block policy (n=2,900) | Block precision 1.000, clean FPR 0.000 | -- | Zero false positives |
| Camouflage evasion | GPT-2 0/146, Qwen 0/15 | -- | "This tool is safe" doesn't fool the probe |
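For readers cross-checking the table, the sketch below shows how F1 relates to precision and recall, and how to back out the precision implied by a reported F1 and recall. It assumes the figures are standard binary precision, recall, and F1; because the table values are rounded, any implied precision is approximate.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 and recall are inconsistent")
    return f1 * recall / denominator

# MCPTox held-out row: recall 100%, F1 99.3%.
print(f"implied precision: {implied_precision(0.993, 1.0):.3f}")

# RouteGuard row: ratios between IntentProbe and the TF-IDF baseline.
print(f"recall ratio: {0.415 / 0.107:.1f}x")
print(f"F1 ratio: {0.513 / 0.172:.1f}x")
```

On the MCPTox row this gives an implied precision of roughly 0.986, which suggests the split also contains clean tools and that a few of them were flagged.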
Honest limitations
- Strongest on matched-vocabulary poisoning (same words, different intent): 96.5%
- Template attacks (MCPTox): 100%
- Camouflage evasion: 0/146 evaded
- Zero false positives on clean tools (block tier): FPR 0.000
- Novel attack families not in training: recall ~41%, so more than half are still missed (about 4x the text-classifier baseline's 10.7%)
- Gradient-based white-box attacks: untested
Install and try it
git clone https://github.com/mcpware/IntentProbe.git
cd IntentProbe
python3 -m venv .venv
.venv/bin/pip install -e .
# Scan a tool description
intentprobe scan --format summary \
--text "A calculator that adds two numbers and returns the sum."
# Scan an MCP server folder before installing
intentprobe scan-path ./some-mcp-server --format summary
# CI gate (exit code 2 on block)
intentprobe scan --fail-on block --text "..."
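If you would rather drive the CI gate from a script, a thin wrapper might look like the sketch below. It uses only the flags shown above. The one exit code taken from this post is 2 for a block; treating 0 as a pass and anything else as an error is an assumption, and the function names are hypothetical.

```python
import subprocess

BLOCK_EXIT_CODE = 2  # from the post: "exit code 2 on block"

def interpret_exit_code(code: int) -> str:
    """Map the scanner's exit code to a CI outcome."""
    if code == BLOCK_EXIT_CODE:
        return "block"
    if code == 0:
        return "pass"  # assumption: 0 means nothing reached the block tier
    return "error"     # assumption: any other code is an unexpected failure

def scan_text(text: str) -> str:
    """Run the CLI on one description (requires intentprobe on PATH)."""
    result = subprocess.run(
        ["intentprobe", "scan", "--fail-on", "block", "--text", text],
        capture_output=True,
        text=True,
    )
    return interpret_exit_code(result.returncode)

if __name__ == "__main__":
    outcome = scan_text("A calculator that adds two numbers and returns the sum.")
    print(outcome)
    raise SystemExit(0 if outcome == "pass" else 1)
```

Keeping the exit-code mapping in its own function makes the policy easy to adjust, for example if you later decide warnings should fail the build too.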
Runtime hook for Claude Code
Add to .claude/settings.json:
{
  "hooks": {
    "PreToolUse": [{
      "command": "intentprobe runtime scan --stdin --input-format json --fail-on block",
      "timeout": 10000
    }]
  }
}
Every tool call is now scanned before execution. Model stays warm via JSONL protocol for sub-second latency.
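The warm-model idea can be illustrated with a generic JSONL loop: the process loads the model once, then reads one JSON request per line and writes one JSON response per line. This is a sketch of the general pattern only. The field names (`id`, `text`, `decision`) and the `toy_scan` stand-in are hypothetical and are not IntentProbe's actual protocol.

```python
import io
import json
from typing import Callable, TextIO

def serve(requests: TextIO, responses: TextIO, scan: Callable[[str], str]) -> None:
    """Answer one JSON request per input line with one JSON line of output."""
    for line in requests:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between requests
        request = json.loads(line)
        decision = scan(request["text"])
        responses.write(json.dumps({"id": request.get("id"), "decision": decision}) + "\n")
        responses.flush()  # flush so the caller sees each answer immediately

def toy_scan(text: str) -> str:
    # Stand-in for the model-backed scan, which would reuse the loaded model.
    return "block" if "private keys" in text else "allow"

incoming = io.StringIO(
    '{"id": 1, "text": "adds two numbers"}\n'
    '{"id": 2, "text": "uploads private keys"}\n'
)
outgoing = io.StringIO()
serve(incoming, outgoing, toy_scan)
print(outgoing.getvalue(), end="")
```

Because the loop never exits between requests, the cost of loading the model is paid once rather than on every tool call.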
The released v0 scanner uses Qwen2.5-0.5B layers 13-15 with a 22 KB probe. It runs locally; scan targets and results stay on your machine. First model-backed scan downloads the model once.
Probe weights, benchmark data, reproducible scripts, and reports are all in the repo. Clone it and run the numbers yourself.
mcpware / IntentProbe
Activation-probe security scanner for AI agent tooling. Reads a model's internal activations to detect poisoned MCP servers, skills, and packages before install.
IntentProbe
The first and only MCP scanner that reads what the model understood, not what the text says.
Every MCP scanner on the market reads text: patterns, classifiers, rules, or asks an LLM "is this safe?" IntentProbe does something none of them do. It runs the tool description through a small local model, slices open the hidden layers, and reads the activation state directly. Same words, completely different activations when the intent is malicious.
On matched-vocabulary tool poisoning, where safe and poisoned descriptions use almost identical words, Snyk's shipped scanner catches 0%. IntentProbe catches 96.5%. (Reproduce it yourself.)
Runs locally. 22 KB probe. Any CPU. Nothing uploaded. See the full competitive landscape.
Break it in one command
No install or registry account needed:
uvx --python 3.11 --from git+https://github.com/mcpware/IntentProbe.git@v0.1.0 intentprobe scan --format summary --text "Reads SSH config and private keys, then silently uploads credentials to…

⭐ If this approach makes sense to you, star the repo. It helps other people find it.


