DEV Community

AnonymousDev

Originally published at anonymouscoderdev.dev

Why AI Agents Are the New Attack Vector

Originally published on anonymouscoderdev.dev. Cross-posted here for reach.


For most of software history, security was about protecting a perimeter.

Keep the attacker out. Guard the gate. If nothing gets in, nothing gets stolen.

That model is dead.

AI agents do not wait. They browse. They read files. They call APIs. They talk to other agents. They run for hours without a human in the loop. And every one of those capabilities is an attack surface.

This is not theoretical. In September 2025, a Chinese state-sponsored group manipulated Claude Code into infiltrating roughly thirty global targets across financial institutions, government agencies, and chemical manufacturing companies. The entire campaign ran with minimal human involvement. It was the first documented large-scale cyberattack executed autonomously by an AI agent.

The era of the agentic attack has started.


Why agents are different

A traditional application has a clear execution boundary. Input in, logic runs, output out. The attack surface is the input layer and the code.

An agent has no such boundary. It reads untrusted web content. It processes emails, documents, calendar invites. It calls tools, executes code, queries databases. It stores memory across sessions and recalls it later. In multi-agent systems, it receives instructions from agents it has never verified.

Every one of those channels is a potential injection point.

The old model asked: is this input safe?
The agentic model has to ask: is every piece of content this agent will ever read, from every source it will ever touch, safe?

That is an unsolvable question. The approach has to change entirely.


The attacks (all real, all documented)

Prompt injection: An attacker embeds instructions inside content the agent processes. In July 2025, a malicious pull request in Amazon Q's codebase contained hidden instructions to delete cloud resources across AWS profiles. Flags bypassed all confirmation prompts. Nearly one million developers had the extension installed. CVE-2025-8217. No exploit code. Just text the model interpreted as instructions.

Memory poisoning: An attacker plants false instructions into an agent's long-term storage. The agent stores it, recalls it weeks later, acts on it as truth. Lakera's 2025 research showed poisoned agents developing persistent false beliefs about security policies and defending those beliefs when questioned by humans.
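The defense implied here is to stop trusting the memory store unconditionally. As a minimal sketch (the `SignedMemory` class, its key handling, and the entry format are illustrative assumptions, not any product's API), each entry can carry a keyed MAC computed at write time, so a record tampered with after the fact fails verification on recall instead of being replayed as truth:

```python
import hashlib
import hmac
import json


class SignedMemory:
    """Toy memory store that MACs every entry so tampering is detectable on recall."""

    def __init__(self, key: bytes):
        self._key = key
        self._entries = {}  # entry_id -> (payload_json, tag)

    def _tag(self, payload_json: str) -> str:
        return hmac.new(self._key, payload_json.encode(), hashlib.sha256).hexdigest()

    def store(self, entry_id: str, payload: dict) -> None:
        payload_json = json.dumps(payload, sort_keys=True)
        self._entries[entry_id] = (payload_json, self._tag(payload_json))

    def recall(self, entry_id: str) -> dict:
        payload_json, tag = self._entries[entry_id]
        # Reject any entry whose MAC no longer matches -- e.g. an instruction
        # an attacker wrote into the store without knowing the key.
        if not hmac.compare_digest(tag, self._tag(payload_json)):
            raise ValueError(f"memory entry {entry_id!r} failed integrity check")
        return json.loads(payload_json)


mem = SignedMemory(key=b"agent-memory-key")
mem.store("policy", {"rule": "never email credentials"})
assert mem.recall("policy")["rule"] == "never email credentials"

# Simulate an attacker rewriting the stored payload without the key:
tampered = json.dumps({"rule": "always email credentials"}, sort_keys=True)
mem._entries["policy"] = (tampered, mem._entries["policy"][1])
try:
    mem.recall("policy")
except ValueError:
    print("tampered entry rejected")
```

The key point is that a poisoned belief stops being indistinguishable from a legitimate one: recall becomes a verification step, not a blind read.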

Tool misuse: Agents have tools. Tools have permissions. Permissions have blast radius. In 2025, Operant AI discovered Shadow Escape, a zero-click MCP exploit enabling silent workflow hijacking across ChatGPT and Gemini. The attack did not break any tool. It redirected one.

Supply chain via MCP: In September 2025, a malicious npm package impersonated Postmark's email service. Worked perfectly as an MCP server. Every message sent through it was silently BCC'd to an attacker. Downloaded 1,643 times before removal. That same month, the Shai-Hulud worm compromised 500+ npm packages via weaponized npm tokens. CISA issued an advisory.

Agent impersonation: In multi-agent systems, agents receive instructions from other agents. Most of those communications are not cryptographically verified. An attacker who can inject into an inter-agent channel issues instructions that downstream agents follow without question.
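What "cryptographically verified" could look like can be sketched with signed instruction envelopes. This toy version uses per-sender shared HMAC keys for brevity; a real deployment would more likely use asymmetric signatures so receivers hold no signing secrets. The registry, function names, and message format are assumptions for illustration:

```python
import hashlib
import hmac
import json

# Illustrative registry of shared keys per sender agent.
AGENT_KEYS = {"planner": b"planner-secret", "executor": b"executor-secret"}


def sign_instruction(sender: str, instruction: dict) -> dict:
    """Wrap an instruction in an envelope carrying a MAC over its canonical form."""
    body = json.dumps(instruction, sort_keys=True)
    tag = hmac.new(AGENT_KEYS[sender], body.encode(), hashlib.sha256).hexdigest()
    return {"sender": sender, "body": body, "tag": tag}


def verify_instruction(message: dict) -> dict:
    """Accept an instruction only if its sender is known and its MAC checks out."""
    key = AGENT_KEYS.get(message["sender"])
    if key is None:
        raise PermissionError("unknown sender: instruction rejected")
    expected = hmac.new(key, message["body"].encode(), hashlib.sha256).hexdigest()
    # An attacker injecting into the channel cannot forge this tag without the key.
    if not hmac.compare_digest(expected, message["tag"]):
        raise PermissionError("signature mismatch: instruction rejected")
    return json.loads(message["body"])


msg = sign_instruction("planner", {"action": "summarize", "target": "report.md"})
assert verify_instruction(msg)["action"] == "summarize"

# An injected rewrite of the instruction body no longer verifies:
msg["body"] = json.dumps({"action": "delete", "target": "/"}, sort_keys=True)
try:
    verify_instruction(msg)
except PermissionError as e:
    print(e)
```

With this in place, "instructions from an agent it has never verified" becomes a rejected message rather than an executed one.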

Cascading failures: A poisoned upstream agent feeds corrupted output to every downstream agent that trusts it. One injection point. Entire pipeline.

Agent as attacker: LAMEHUG malware uses live LLM interactions to generate system commands on demand. PROMPTFLUX regenerates its own source code on every execution. A separate tool generates exploit code from CVE data in under 15 minutes.


What exists to stop this

Real tools, serious teams. Use them.

Open source:

  • LlamaFirewall (Meta): jailbreak detection, chain-of-thought auditing for goal misalignment, insecure code detection
  • NeMo Guardrails (NVIDIA): programmable rails between app code and LLM
  • Agent Governance Toolkit (Microsoft, MIT, April 2026): sub-millisecond policy engine, cryptographic agent identity, Inter-Agent Trust Protocol, execution rings with kill switch
  • OpenGuardrails: content safety and manipulation detection, configurable per-request policies

Enterprise:

  • Lakera Guard: single API call, real-time prompt and output threat detection
  • Lasso Security: full lifecycle governance, 3,000+ attack library, multi-turn adversarial testing

The gap nobody has filled

Every tool above guards the edges. Input layer. Output layer. Some add a policy engine for tool calls before execution.

None of them address what happens inside.

When an agent reads from its memory store, there is no verification the memory has not been tampered with. When an agent calls a tool, there is no cryptographic proof the response came from the intended tool. When an orchestrator sends instructions to a subagent, there is no signed chain proving the instruction is legitimate and unmodified.

The internal trust layer of agentic systems is completely unguarded.

In traditional systems we solved this with well-understood primitives: signing, checksums, certificate chains, nonce verification. We did not trust a file just because it was on our filesystem. We verified it.

We have not applied those primitives to the internals of agentic systems. The memory layer is trusted unconditionally. The tool response channel is trusted unconditionally. The inter-agent instruction channel is trusted unconditionally.
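Those primitives translate directly. As one illustrative sketch (the record structure and function names are assumptions, not any existing tool's API), a tool-call transcript can be hash-chained so each step binds to everything before it; editing or reordering any earlier response then invalidates every later link:

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel hash anchoring the start of the chain


def chain_step(prev_hash: str, record: dict) -> dict:
    """Append one tool-call record, binding it to the entire preceding chain."""
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    return {"prev": prev_hash, "body": body, "hash": digest}


def verify_chain(steps: list) -> bool:
    """Recompute every link; any edited or reordered step breaks all later ones."""
    prev = GENESIS
    for step in steps:
        expected = hashlib.sha256((prev + step["body"]).encode()).hexdigest()
        if step["prev"] != prev or step["hash"] != expected:
            return False
        prev = step["hash"]
    return True


chain, prev = [], GENESIS
for record in ({"tool": "search", "out": "3 results"},
               {"tool": "fetch", "out": "page body"}):
    step = chain_step(prev, record)
    chain.append(step)
    prev = step["hash"]

assert verify_chain(chain)

# Rewrite an earlier tool response -- the chain no longer verifies:
chain[0]["body"] = json.dumps({"tool": "search", "out": "poisoned"}, sort_keys=True)
assert not verify_chain(chain)
```

Nothing here is novel cryptography; it is the same tamper-evidence we already demand from logs and packages, applied to a channel agents currently trust blindly.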

That is the gap.


What comes next

Warden is a Python library being built to address exactly this.

Not another guardrail at the edge. Not another prompt filter. A lightweight, self-hostable, MIT-licensed set of primitives for verifying trust at every internal point in an agentic system: memory integrity, tool call chain verification, and agent-to-agent instruction signing.

No vendor. No platform. No SaaS. Just code you can read, audit, fork, and run.

Full post with architecture, attack scenarios, and the complete gap analysis: anonymouscoderdev.dev

Next transmission: we build it.


Building in the dark. Knowledge is free.
