Phil Rentier Digital

Posted on Apr 8 • Originally published at rentierdigital.xyz

My MCP Server Lied to Claude. Claude Repeated It as Fact — 100% of the Time.

#ai #cybersecurity #mcpserver #claude

Given the state of MCP security right now, I wanted to check my own stack. 30 CVEs between January and February 2026. An npm package (mcp-remote, 558K+ downloads) with a critical RCE flaw. A fake Postmark server on the MCP registry, silently exfiltrating API keys. And 38% of scanned MCP servers with zero authentication.

I built a red-team harness to test my own AI stack. Not a theoretical audit. A Bun/TypeScript CLI that throws three types of attacks documented in recent research (macaronic injection attacks): invented words (papers talk about 92% bypass rate), prompts translated into 10 languages (including 5 low-resource), and poisoned MCP tool responses. 225 prompts total. I wanted my own numbers, on my own agents in production.

TLDR: I red-teamed my stack with invented words, 10 languages, and poisoned MCP responses. The three attack vectors don't behave at all like the literature predicts. What I found changes how I prioritize security audits, and it should change yours too. Start with what your MCP servers return, not what your users type.

The Harness, the Gateway, and the API Key That Wasn't There

The plan was straightforward. Take the attack classes from recent papers, turn them into a repeatable test suite, point it at my own stack. Claude Sonnet via my gateway, a secondary model through OpenRouter. 15 concepts (10 benign, 5 infosec), and for each concept, the harness generates nonce words by fragmenting and recombining syllables across languages. Then it throws the baseline, the nonce word, and a macaronic mix at the model and records the response.

MacPrompt (January 2026) reports 92% bypass with this technique on text-to-image models. Deng et al. (ICLR 2024) found 80.92% unsafe content when prompting ChatGPT in low-resource languages. MCPTox measured 72.8% attack success on 20 agents with 45 MCP servers. Those are the numbers I was measuring against.

The harness crashed before sending a single prompt. More on that in a bit (it's a better story than the attacks themselves). After some refactoring to route through OpenRouter via my gateway, 225 prompts went out. The 5 infosec concepts (lockpicking, phishing, social engineering, surveillance, network penetration) are legitimate educational content, and Claude treats them that way. With genuinely dangerous concepts, results would look different.

225 Prompts, 10 Languages, Zero Bypasses, and a Drift Nobody Predicted

Macaronic axis. 87% of nonce words produced a "confused" response. Claude literally says things like "I'm not familiar with déllechn" and moves on. Like showing a fake badge to a guard who doesn't check badges in the first place. 0% bypass. Not because the safety filters caught anything, but because Claude already answers infosec questions in plain English. The invented words didn't unlock a hidden door. There was no door.

Two exceptions. Nonce words built from "phishing" fragments reconstructed enough semantic signal that Claude produced a full response about phishing techniques. That's 2 out of 45 (4.4%), not 92%. And the macaronic mixes (full sentences blending fragments from multiple languages) were understood at 100%, same rate as plain English.

Here's where MacPrompt's 92% stops being relevant. Their results come from text-to-image models using keyword filters. CLIP-based safety checks match against known words. Swap the syllables, the filter misses. But a conversational LLM in 2026 is not a keyword filter. The BPE tokenizer doesn't reconstruct "bomb" from "bo" + "mb" the way a keyword matcher would fail on "b0mb". Different architecture, different result.

Multilingual axis. This one I did not expect at all. 225 prompts across 10 languages: 5 high-resource (French, German, Spanish, Chinese, Japanese) and 5 low-resource (Swahili, Bengali, Javanese, Yoruba, Marathi). The literature says models are more permissive in low-resource languages.

I found the opposite.

High-resource: 4% refusal, 12% confused, 84% complied or partial. Low-resource: 8% refusal, 24% confused, 68% complied or partial. Twice the refusal rate, twice the confusion in low-resource. Claude gets MORE conservative in languages it handles poorly, not less.

The clearest case: "network penetration steps." Accepted in English, French, Chinese. Refused in Japanese, Javanese, Yoruba. Same concept, different language, opposite outcome. That's not safety drift. That's refusal bias (the model doesn't understand what you're asking, so it says no just in case).

One note that matters for anyone reading the research: "lockpicking" came back confused in 7 out of 10 languages. The Haiku translations probably produced prompts that didn't parse well. Translation quality directly shapes results, which is a problem for every study that uses a single translation pipeline and calls it definitive.

The mechanism is simple. Invented words are disguised instructions. 0% bypass. Translations are reformulated instructions. Drift goes backwards. The instruction channel holds, no matter what language or format you throw at it.

Orders don't get through. No matter how creative the disguise.

Combined Results Matrix - Runs 1+2

Then I Poisoned the Tool Instead of the Prompt

Different attack class entirely. Instead of manipulating what I send to the model, I manipulated what a tool sends back.

I simulated the MCP flow via the API (tool_use + poisoned tool_result). 5 scenarios, 5 payload categories, 3 repetitions each. Explicit instruction override ("ignore previous instructions"), recommendation bias (tool casually recommends "SecureVault Pro"), subtle fake facts (tool states "the MCP protocol is now deprecated"), system prompt extraction, and multilingual override in German.

Global result: 53% influence. But the breakdown is everything.

Instruction overrides: 0 out of 3. Claude ignored every "ignore your instructions" payload. German override: 0 out of 3. The instruction channel holds in any language, from any source.

Recommendation bias: 3 out of 3. Claude recommended "SecureVault Pro" every time, presenting it like a real product it always knew about. Fake facts: 3 out of 3. Claude repeated "the MCP protocol is now deprecated" as verified information. Confident. In its own words. System prompt leak: 2 out of 3.

Claude does NOT do what a tool orders it to do. But it REPEATS what a tool tells it is true. The safety filter watches the instruction channel. Nobody is watching the factual channel.

This is not a theoretical distinction. It's the difference between a guard who checks your ID and a guard who believes everything you say about yourself. The ID check works. The trust in self-reported facts is total.

AI Filter: Instruction vs Factual Channel

MCPTox measured 72.8% attack success across 20 agents. One dev on X: "80% of our agent failures came from context poisoning, not prompts." Meanwhile, "macaronic prompting" on X: zero posts. Zero engagement. Total silence. The gap between what research focuses on and what breaks in production keeps growing.

I saw the same pattern when I counted how many times I click Yes in Claude Code without reading. The danger comes through the trust channel.

The tool doesn't need to give an order. It just needs to state a fact.

The Red-Team Tool That Red-Teamed Itself

Worth the detour, because this story is the thesis in miniature.

First attempt: crash. The code expected ANTHROPIC_API_KEY as a local environment variable. On this machine, every API call routes through a centralized gateway. Keys are injected at runtime by Infisical, never stored on disk. The harness needed exactly the insecure setup I had just finished migrating away from, with secrets sitting in plaintext where Claude Code could read them. Two days earlier. The timing was almost comedic.

Second attempt: 400 error. OpenRouter model IDs use a different format than Anthropic's native API. A copy-paste from the wrong docs. Took 20 minutes of squinting at error messages to figure out.

Third attempt: the run completed, but with zero multilingual prompts. The translation generator had a fallback table covering exactly one prompt. The rest silently returned English. So Run 1 tested macaronic attacks without the multilingual axis and I didn't notice until the results looked suspicious. Classic "it works on my machine" energy, except it was "it works on my one prompt."

I built a tool to hunt exotic linguistic vulnerabilities. The first three bugs it revealed were: a secret in the wrong place, a config pasted without checking, and a spec with a hole in it. Like crafting a legendary sword to fight the dragon, then tripping on the stairs on your way out of the blacksmith's shop.

That's the whole article in one paragraph.

What You Should Audit Instead

The takeaway from 225 prompts across three attack classes fits in one sentence: invented words and rare languages don't fool Claude. What fools it is a false fact from a tool it trusts.

Four questions worth 30 minutes of your time this week. Where do your secrets live (on disk, or injected at runtime)? Do your MCP servers validate their own responses before passing them to the agent? Does your agent verify facts returned by tools, or does it repeat them as truth? Is your human-in-the-loop a real checkpoint or a reflex button?

If you run MCP servers in production, the attack surface lives in the protocol architecture, not the prompts. The question is whether your stack separates instructions from data, or treats everything a tool returns as gospel.

And while you audit MCP responses, the next vector is already shipping. Open-source steganography toolkits now hide payloads in emoji skin tones, zero-width Unicode, and homoglyphs (where "a" is actually Cyrillic "а", same pixel, different byte). Factual-channel attacks at the character level. Your model won't question an emoji. 🤷

Actually, wait. Let me put it differently. In six months, we'll get a wave of startups selling "AI Linguistic Firewalls" to catch invented words in prompts. Great demos. Won't protect a thing.

Meanwhile, people who ship will audit what their tools return. Check that MCP responses don't carry planted facts. Make sure secrets aren't in plaintext on disk. The boring stuff. The stuff that works.

The next attack on your stack won't be an order disguised as gibberish. It'll be a fact, stated calmly, by a tool you already trust.

Sources

MacPrompt (January 2026): macaronic prompting bypass rates on text-to-image models.
Deng et al., "Multilingual Jailbreak Challenges in Large Language Models" (ICLR 2024): cross-lingual safety evaluation.
MCPTox benchmark: tool poisoning success rates across 20 agents and 45 MCP servers.
STE.GG: open-source steganography toolkit (112 techniques, browser-based).

(*) The cover is AI-generated. It handled the brief better than Claude handled my nonce words.