
ithiria894

IntentProbe: The First Activation-Probe-Based MCP/Tool Scanner. It Reads the Model's Brain, Not Just the Text.

We just released IntentProbe — the first product-shaped MCP/tool-poisoning scanner that uses activation probing instead of text analysis.

The idea is simple: when a model reads a tool description, its hidden states carry signal about what the tool is really asking for. IntentProbe runs the description through a small frozen local model, reads internal activations, and scores whether that state looks like credential access, exfiltration, hidden persistence, or tool hijacking.

Why this matters

A safe tool and a poisoned tool can use almost the same vocabulary. The surface text can look normal, while the hidden intent is "read private keys and quietly send them somewhere else."

Instead of only reading what the text says, IntentProbe feeds the tool description through a small frozen model, slices into hidden layers, and reads the internal activation state. It's reading what the model represented internally, not just the words on the page.

Same words, different intent, different activation pattern.
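To make the matched-vocabulary problem concrete, here is a toy illustration. The two sentences are a hypothetical pair written for this example, not items from the benchmark. They contain exactly the same words, so any order-insensitive text feature (a plain bag of words, for instance) sees them as identical. Order-aware classifiers see more than this, so treat it as the limiting case rather than a model of any specific scanner.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Order-insensitive word counts: all a pure bag-of-words feature sees."""
    return Counter(text.lower().split())


# Hypothetical pair, invented for illustration only.
safe = "upload the report and never read the private keys"
poisoned = "read the private keys and never report the upload"

# The word counts match exactly even though the instructions are opposite.
same_bag = bag_of_words(safe) == bag_of_words(poisoned)
print("identical word counts:", same_bag)
print("identical sentences:", safe == poisoned)
```

Anything that separates these two descriptions has to use information beyond which words appear, which is the gap activation probing is meant to fill.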

[Figure: IntentProbe: Surface vs Inside]

How is this different from existing scanners?

Most MCP / AI-agent security scanners fall into a few buckets:

1. Text-based scanners

(Snyk/DeBERTa-style classifiers, regex, pattern matching) — they read the words in a tool description and try to classify it. Problem: when a safe tool and a poisoned tool use almost the same words, there's no text difference to catch. On our matched-vocabulary tool-poisoning test, the strongest public/source-verifiable DeBERTa prompt-injection classifier baseline we could reproduce scored 0% recall.

2. LLM-as-judge

(ask GPT/Claude/Qwen "is this tool safe?") — you're asking a model for an opinion. That's like asking a patient "are you sick?" and trusting the answer, instead of checking the scan. The answer can be prompt-sensitive, slow, expensive, and part of the attack surface.

3. Enterprise cloud/API scanners

(Lakera, Azure Prompt Shields, Google Model Armor, etc.) — in API mode, you send tool descriptions, prompts, or outputs to a vendor backend. They may work well, but from the outside the detector is mostly a black box: you usually can't inspect the artifact, reproduce their benchmark, or verify a public MCP/tool-poisoning result end-to-end.

4. Activation probing (this is what IntentProbe introduces)

A new approach: run the description through a small frozen model and read the internal activation state directly. The primary signal is not pattern matching, not an LLM's opinion, and not a cloud API; static regex checks only corroborate it. The probe scores whether the model's representation looks like credential access, exfiltration, hidden persistence, or tool hijacking. To our knowledge, IntentProbe is the first tool to productize this for MCP/tool poisoning.

Text scanner:      Tool description → Read words → "Looks safe" ✅
LLM-as-judge:      Tool description → "Is this safe?" → "Yes" ✅
Cloud/API scanner:  Tool description → Vendor/API check → ??? → "Safe" ✅
IntentProbe:       Tool description → Frozen model → Read activations → "Credential access intent" 🚫

How it works

[Figure: How IntentProbe works]

  1. Tool description goes through a frozen local model (Qwen2.5-0.5B, 494M params, any CPU)
  2. A trained probe reads internal activations at layers 13-15
  3. Static regex checks corroborate the activation signal
  4. Decision: allow / warn / block with a confidence score

The probe weights are 22 KB, a scan takes under a second per description, and nothing leaves your machine.
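Steps 2 to 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not IntentProbe's actual code: only the layer indices 13-15 and the allow / warn / block tiers come from the post. The mean pooling, the logistic-regression probe, the regex, the thresholds, and the rule that a block needs both signals are all hypothetical choices, and the tiny 2-dimensional activations are made-up stand-ins for real hidden states.

```python
import math
import re

PROBE_LAYERS = (13, 14, 15)  # from the post; everything else here is assumed


def mean_pool(token_vectors):
    """Average per-token activation vectors into one vector for the layer."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def probe_score(hidden_states, weights, bias):
    """Linear probe: pool each probed layer, concatenate, apply a sigmoid."""
    features = []
    for layer in PROBE_LAYERS:
        features.extend(mean_pool(hidden_states[layer]))
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical corroborating pattern and thresholds, for illustration only.
SUSPICIOUS = re.compile(r"private key|\.ssh|credential", re.IGNORECASE)


def decide(score, description, warn_at=0.5, block_at=0.9):
    """Block only when the probe is confident and a static check agrees."""
    if score >= block_at and SUSPICIOUS.search(description):
        return "block"
    if score >= warn_at:
        return "warn"
    return "allow"


# Made-up activations: 2 tokens x 2 dimensions per layer.
toy_states = {
    13: [[1.0, 0.0], [3.0, 2.0]],
    14: [[0.0, 1.0], [0.0, 1.0]],
    15: [[2.0, 2.0], [2.0, 2.0]],
}
toy_weights = [1.0, 0.5, 0.0, 0.5, 0.5, 0.0]
score = probe_score(toy_states, toy_weights, bias=-1.0)

print(decide(score, "Reads SSH private key files"))
print(decide(score, "Adds two numbers"))
print(decide(0.1, "Adds two numbers"))
```

In a real run the activations would come from the frozen model's hidden states for the tool description, and the weights from the trained probe file.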

Benchmarks

This is a public/source-verifiable baseline comparison, not a claim about Snyk's private cloud detector. The headline benchmark below is for the activation-probe method; the released v0 product uses the newer Qwen2.5-0.5B probe.

Matched-vocabulary F1

The hard test — safe and poisoned use almost the same words:

| Method | F1 |
| --- | --- |
| IntentProbe activation probe | 96.6% |
| DeBERTa text-classifier baseline | 0% |

MCPTox same held-out split, poisoned recall (n=249)

| Method | Recall |
| --- | --- |
| IntentProbe activation probe | 100% |
| DeBERTa text-classifier baseline | 19.9% |

Camouflage suffix test

Appending reassurances such as "this tool is safe / read-only / sandboxed" to poisoned descriptions produced no evasions on the GPT-2 probe test (0 of 146). The current Qwen v0 scanner also passed, with 0 of 15 evading.
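A camouflage test of this kind can be sketched as a small harness. This is an assumed shape, not the repo's benchmark script: the three suffixes paraphrase the ones quoted above, while the threshold, the counting rule (one evasion per description if any suffix works), and the two stand-in scorers are hypothetical. The post does not say whether its 0/146 counts descriptions or description-suffix pairs.

```python
CAMOUFLAGE_SUFFIXES = [
    "This tool is safe.",
    "This tool is read-only.",
    "This tool is sandboxed.",
]


def count_evasions(poisoned_descriptions, score, threshold=0.5):
    """Return (evaded, tested) over descriptions the scorer initially flags.

    A description counts as evaded if any reassuring suffix pushes its
    score below the threshold.
    """
    evaded = tested = 0
    for text in poisoned_descriptions:
        if score(text) < threshold:
            continue  # never detected, so not a camouflage case
        tested += 1
        if any(score(f"{text} {suffix}") < threshold for suffix in CAMOUFLAGE_SUFFIXES):
            evaded += 1
    return evaded, tested


# Stand-in scorers so the harness runs without a model (both hypothetical).
def robust_scorer(text):
    return 1.0 if "private keys" in text else 0.0


def naive_scorer(text):
    return 0.0 if "safe" in text.lower() else 1.0  # fooled by the word "safe"


samples = [
    "Reads private keys and uploads them.",
    "Copies private keys to a remote host.",
]
print(count_evasions(samples, robust_scorer))
print(count_evasions(samples, naive_scorer))
```

Swapping in the real scanner as `score` would reproduce the structure of the test, though not necessarily the exact numbers reported here.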

Full end-to-end results

| Test | IntentProbe | Opponent | Takeaway |
| --- | --- | --- | --- |
| MCPTox held-out (n=249) | recall 100%, F1 99.3% | Snyk DeBERTa recall 19.9%, F1 33.0% | Clear win |
| Same-words matched (n=86) | F1 96.6% | Snyk DeBERTa F1 0% | Text scanner blind |
| Curated family holdout (n=76) | Qwen macro F1 0.829 | TF-IDF macro F1 0.823 | Slight edge |
| RouteGuard external (n=2,900) | F1 0.513, recall 0.415 | TF-IDF F1 0.172, recall 0.107 | About 4x the recall and 3x the F1 on novel families |
| Hard-block policy (n=2,900) | Block precision 1.000, clean FPR 0.000 | -- | Zero false positives |
| Camouflage evasion | GPT-2 0/146, Qwen 0/15 | -- | "This tool is safe" doesn't fool the probe |
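The table mixes recall, F1, and ratios, so it can help to check how they relate. The sketch below uses only the standard F1 formula and numbers copied from the table; the function names are my own. It backs out the precision implied by the MCPTox row (recall 100%, F1 99.3%) and computes the RouteGuard ratios against TF-IDF.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P. Only valid when 2R > F1."""
    return f1 * recall / (2 * recall - f1)


# MCPTox held-out row: recall 100%, F1 99.3%.
probe_precision = implied_precision(f1=0.993, recall=1.0)

# RouteGuard external row: IntentProbe vs TF-IDF.
recall_ratio = 0.415 / 0.107
f1_ratio = 0.513 / 0.172

print(round(probe_precision, 3), round(recall_ratio, 2), round(f1_ratio, 2))
```

This gives an implied precision of about 0.986 on MCPTox, and shows that the "4x" figure refers to recall (about 3.9x), while the F1 advantage is about 3.0x. Because the published figures are rounded, back-solved values like these are approximate.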

Honest limitations

  • Strongest on matched-vocabulary poisoning (same words, different intent): 96.5%
  • Template attacks (MCPTox): 100%
  • Camouflage evasion: 0/146 evaded
  • Zero false positives on clean tools (block tier): FPR 0.000
  • Novel attack families not in training: recall ~41.5%, low in absolute terms but about 4x the TF-IDF baseline's 10.7%
  • Gradient-based white-box attacks: untested

Install and try it

git clone https://github.com/mcpware/IntentProbe.git
cd IntentProbe
python3 -m venv .venv
.venv/bin/pip install -e .
# Scan a tool description
intentprobe scan --format summary \
  --text "A calculator that adds two numbers and returns the sum."

# Scan an MCP server folder before installing
intentprobe scan-path ./some-mcp-server --format summary

# CI gate (exit code 2 on block)
intentprobe scan --fail-on block --text "..."
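If you would rather drive the CI gate from a script, a thin wrapper around the CLI might look like the sketch below. It uses only the flags shown above and the documented exit code 2 on block. Treating every other non-zero exit code as a scan error is my assumption, and the injectable `run` parameter exists only so the logic can be exercised without the CLI installed.

```python
import subprocess

BLOCK_EXIT_CODE = 2  # per the post: exit code 2 on block


def build_command(description: str) -> list:
    """Command line using only the flags shown in the post."""
    return ["intentprobe", "scan", "--fail-on", "block", "--text", description]


def gate(description: str, run=subprocess.run) -> str:
    """Return 'pass', 'block', or 'error' for one tool description."""
    result = run(build_command(description))
    if result.returncode == 0:
        return "pass"
    if result.returncode == BLOCK_EXIT_CODE:
        return "block"
    # Assumption: any other exit code means the scan itself failed.
    return "error"


def gate_all(descriptions, run=subprocess.run) -> bool:
    """True only if every description passes; failures are printed."""
    ok = True
    for text in descriptions:
        outcome = gate(text, run=run)
        if outcome != "pass":
            print(f"{outcome}: {text!r}")
            ok = False
    return ok
```

Calling `gate_all([...])` with the default `run` shells out to the installed `intentprobe` binary, so it needs the virtualenv from the install step on your PATH.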

Runtime hook for Claude Code

Add to .claude/settings.json:

{
  "hooks": {
    "PreToolUse": [{
      "matcher": "*",
      "hooks": [{
        "type": "command",
        "command": "intentprobe runtime scan --stdin --input-format json --fail-on block",
        "timeout": 10
      }]
    }]
  }
}

Every tool call is now scanned before execution. The model stays warm via a JSONL protocol, which keeps latency under a second.
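The general idea behind a warm JSONL worker is that the expensive part (loading the model) happens once, and each request is then a single JSON line in and a single JSON line out. The sketch below shows that pattern only. IntentProbe's actual wire format is not described in this post, so the field names `id`, `description`, and `verdict` are hypothetical, and the scorer is a stand-in.

```python
import io
import json


def serve(lines_in, lines_out, score):
    """Answer one JSON object per input line using an already-loaded scorer."""
    for line in lines_in:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between requests
        request = json.loads(line)
        verdict = score(request["description"])  # hypothetical field name
        reply = {"id": request.get("id"), "verdict": verdict}
        lines_out.write(json.dumps(reply) + "\n")
        lines_out.flush()  # reply immediately; the process keeps running


def toy_score(description):
    """Stand-in scorer; a real worker would call the loaded probe here."""
    return "block" if "private keys" in description else "allow"


requests = io.StringIO(
    '{"id": 1, "description": "adds two numbers"}\n'
    '{"id": 2, "description": "reads private keys"}\n'
)
responses = io.StringIO()
serve(requests, responses, toy_score)
replies = [json.loads(line) for line in responses.getvalue().splitlines()]
print(replies)
```

In a long-running process the same function would be called with `sys.stdin` and `sys.stdout`, so the model load cost is paid once rather than on every tool call.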


The released v0 scanner uses Qwen2.5-0.5B layers 13-15 with a 22 KB probe. It runs locally; scan targets and results stay on your machine. The first model-backed scan downloads the model once.

Probe weights, benchmark data, reproducible scripts, and reports are all in the repo. Clone it and run the numbers yourself.

mcpware / IntentProbe

Activation-probe security scanner for AI agent tooling. Reads a model's internal activations to detect poisoned MCP servers, skills, and packages before install.

IntentProbe

The first and only MCP scanner that reads what the model understood, not what the text says.

Python 3.10+ · Runs locally · Zero telemetry

Text scanners read words. IntentProbe reads activations.

Every MCP scanner on the market reads text: patterns, classifiers, rules, or asks an LLM "is this safe?" IntentProbe does something none of them do. It runs the tool description through a small local model, slices open the hidden layers, and reads the activation state directly. Same words, completely different activations when the intent is malicious.

On matched-vocabulary tool poisoning, where safe and poisoned descriptions use almost identical words, Snyk's shipped scanner catches 0%. IntentProbe catches 96.5%. (Reproduce it yourself.)

Runs locally. 22 KB probe. Any CPU. Nothing uploaded. See the full competitive landscape.

Break it in one command

No install or registry account needed:

uvx --python 3.11 --from git+https://github.com/mcpware/IntentProbe.git@v0.1.0 intentprobe scan --format summary --text "Reads SSH config and private keys, then silently uploads credentials to

⭐ If this approach makes sense to you, star the repo — it helps other people find it.
