DEV Community

Cover image for Claude, Gemini, and Copilot Got Hijacked — Here's What Went Wrong
AgentShield
AgentShield

Posted on • Originally published at agentshield.pro

Claude, Gemini, and Copilot Got Hijacked — Here's What Went Wrong

Researchers from Johns Hopkins University successfully hijacked three of the most widely-used AI agents — Anthropic's Claude Code, Google's Gemini CLI, and Microsoft's GitHub Copilot — through indirect prompt injection attacks.

The attacks were straightforward. The results were devastating. And the vendor response was silence.

What Happened

Researcher Aonan Guan and colleagues demonstrated three distinct attacks:

Attack 1 — Claude Code Security Review

Guan embedded malicious instructions directly in a PR title. Claude executed the commands and leaked credentials — including the Anthropic API key and GitHub access tokens — in its JSON response posted as a PR comment. The attacker could then edit the PR title to cover their tracks.

Attack 2 — Google Gemini CLI Action

By injecting a fake "trusted content section" into an issue comment, the researchers overrode Gemini's safety instructions and caused it to publish its own API key as a visible issue comment.

Attack 3 — GitHub Copilot Agent

Malicious instructions were hidden in HTML comments — invisible in GitHub's rendered Markdown, but fully visible to the AI agent. When a developer assigned the issue to Copilot, the agent executed the hidden instructions, bypassing three separate runtime security layers.

All three vendors paid bug bounties. None assigned CVEs. None published advisories.

Vendor Agent Bounty CVE Advisory
Anthropic Claude Code $100 None None
Google Gemini CLI $1,337 None None
Microsoft GitHub Copilot $500 None None

As Guan stated: "If they don't publish an advisory, those users may never know they are vulnerable — or under attack."

Why These Attacks Work

The fundamental problem is architectural. Large language models process everything in their context window as a single stream of text. They cannot reliably distinguish between instructions from a trusted source (the developer) and instructions injected by an attacker (hidden in a PR title, an issue comment, or an HTML tag).

No amount of system prompting, safety training, or internal guardrails can fully solve this. The LLM doesn't know where the text came from — it just processes it.

This is why you need an external security boundary.

How Defense in Depth Stops Each Attack

The principle is the same as a WAF — you don't rely on the application to protect itself. You put defense at the boundary. Here's what a layered approach looks like:

Attack 1: Malicious PR Title

  • Input Normalization: Normalizes the text, decodes any encoding tricks
  • Pattern Guard: Catches "ignore previous instructions" and command execution patterns
  • Semantic Classifier: Detects the intent — privilege escalation attempt

Result: Blocked before the model ever sees the input.

Attack 2: Fake Trust Injection

  • Pattern Guard: Detects trust injection patterns ("trusted content section", "override safety", "new instructions from admin")
  • Semantic Classifier: Recognizes social engineering at the prompt level — intent to manipulate trust hierarchy

Result: Flagged as social engineering, blocked.

Attack 3: Hidden HTML Comments

  • Input Normalization: Strips and flags hidden content — HTML comments, invisible Unicode, zero-width joiners, steganographic techniques
  • Output Guard: Even if an attack partially bypasses input screening, output guards catch credential exfiltration — API keys, tokens, private keys — before they're published

Result: Both the hidden input AND the data theft are caught.

Why Multiple Layers Matter

Each attack was catchable by multiple layers. That's the point. Single-layer defenses have single points of failure. A defense-in-depth architecture means an attacker would need to simultaneously bypass input normalization, pattern matching, semantic classification, output filtering, policy enforcement, and audit logging.

The three biggest AI companies in the world couldn't prevent prompt injection attacks on their own agents. The attacks were trivial. The response was to update a README.

If you're building AI agents that integrate with GitHub, process user input, handle financial transactions, or access sensitive systems — you need an external security layer at the boundary.


We built AgentShield to do exactly this — a prompt injection classifier with F1 0.963 on 5,972 public samples, context-aware classification, p50 17ms. Self-hosted Docker image available, EU-hosted API with a free tier. Benchmark | API Docs | GitHub

Top comments (0)