AgentShield

Posted on May 2 • Edited on May 11 • Originally published at agentshield.pro

Claude, Gemini, and Copilot Got Hijacked — Here's What Went Wrong

#ai #llm #security #cybersecurity

Researchers from Johns Hopkins University successfully hijacked three of the most widely-used AI agents — Anthropic's Claude Code, Google's Gemini CLI, and Microsoft's GitHub Copilot — through indirect prompt injection attacks.

The attacks were straightforward. The results were devastating. And the vendor response was silence.

What Happened

Researcher Aonan Guan and colleagues demonstrated three distinct attacks:

Attack 1 — Claude Code Security Review

Guan embedded malicious instructions directly in a PR title. Claude executed the commands and leaked credentials — including the Anthropic API key and GitHub access tokens — in its JSON response posted as a PR comment. The attacker could then edit the PR title to cover their tracks.

Attack 2 — Google Gemini CLI Action

By injecting a fake "trusted content section" into an issue comment, the researchers overrode Gemini's safety instructions and caused it to publish its own API key as a visible issue comment.

Attack 3 — GitHub Copilot Agent

Malicious instructions were hidden in HTML comments — invisible in GitHub's rendered Markdown, but fully visible to the AI agent. When a developer assigned the issue to Copilot, the agent executed the hidden instructions, bypassing three separate runtime security layers.

All three vendors paid bug bounties. None assigned CVEs. None published advisories.

Vendor	Agent	Bounty	CVE	Advisory
Anthropic	Claude Code	$100	None	None
Google	Gemini CLI	$1,337	None	None
Microsoft	GitHub Copilot	$500	None	None

As Guan stated: "If they don't publish an advisory, those users may never know they are vulnerable — or under attack."

Why These Attacks Work

The fundamental problem is architectural. Large language models process everything in their context window as a single stream of text. They cannot reliably distinguish between instructions from a trusted source (the developer) and instructions injected by an attacker (hidden in a PR title, an issue comment, or an HTML tag).

No amount of system prompting, safety training, or internal guardrails can fully solve this. The LLM doesn't know where the text came from — it just processes it.

This is why you need an external security boundary.

How Defense in Depth Stops Each Attack

The principle is the same as a WAF — you don't rely on the application to protect itself. You put defense at the boundary. Here's what a layered approach looks like:

Attack 1: Malicious PR Title

Input Normalization: Normalizes the text, decodes any encoding tricks
Pattern Guard: Catches "ignore previous instructions" and command execution patterns
Semantic Classifier: Detects the intent — privilege escalation attempt

Result: Blocked before the model ever sees the input.

Attack 2: Fake Trust Injection

Pattern Guard: Detects trust injection patterns ("trusted content section", "override safety", "new instructions from admin")
Semantic Classifier: Recognizes social engineering at the prompt level — intent to manipulate trust hierarchy

Result: Flagged as social engineering, blocked.

Attack 3: Hidden HTML Comments

Input Normalization: Strips and flags hidden content — HTML comments, invisible Unicode, zero-width joiners, steganographic techniques
Output Guard: Even if an attack partially bypasses input screening, output guards catch credential exfiltration — API keys, tokens, private keys — before they're published

Result: Both the hidden input AND the data theft are caught.

Why Multiple Layers Matter

Each attack was catchable by multiple layers. That's the point. Single-layer defenses have single points of failure. A defense-in-depth architecture means an attacker would need to simultaneously bypass input normalization, pattern matching, semantic classification, output filtering, policy enforcement, and audit logging.

The three biggest AI companies in the world couldn't prevent prompt injection attacks on their own agents. The attacks were trivial. The response was to update a README.

If you're building AI agents that integrate with GitHub, process user input, handle financial transactions, or access sensitive systems — you need an external security layer at the boundary.

We built AgentShield to do exactly this — a prompt injection classifier with F1 0.956 across 5 of 6 public datasets (4,666 samples; jackhhao role-play analyzed separately), p50 2.44ms. Self-hosted Docker image available, EU-hosted API with a free tier. Benchmark | API Docs | GitHub

DEV Community