Poxek AI

Posted on Jun 30

The Supply Chain Attack Vector Everyone Is Ignoring in AI Agents

#ai #programming #supplychain #attack

Most conversations about securing AI agents still revolve around prompt injection as if it’s purely a model problem. “Sanitize the input.” “Add better guardrails.” “Use a stronger system prompt.”

This framing misses where some of the most effective attacks are actually happening.

In recent demonstrations, autonomous agents were compromised through poisoned configuration files and code in repositories. Malicious instructions placed in what the agent treats as trusted source material caused it to harvest cloud credentials, enumerate internal infrastructure, and extract CI/CD keys — all without any direct manipulation of the model’s reasoning through user input. The agent simply did what it was built to do: read the code/config in its environment and act on it.

This is indirect prompt injection delivered through the supply chain.

Why This Is Different
Traditional prompt injection assumes the attacker has to reach the model through the “user” channel. The poisoned repository approach bypasses that entirely.

The agent has legitimate permission (often necessary for its function) to read from repositories, configuration files, or dependency manifests. Once those sources are compromised, the agent becomes an unwitting executor of attacker instructions.

This is not a new class of bug. It’s the same supply chain and trust issues that have plagued software development for years, now weaponized against systems that can act autonomously.

We saw similar patterns in 2025 with incidents like:
• Cline: a crafted GitHub issue title turned an authenticated coding session into a package installer affecting ~4,000 machines.
• LiteLLM: a backdoored release on PyPI that was pulled ~47,000 times in three hours.
• MCP servers: ~200,000 exposed with no authentication by design.
In each case, the compromise didn’t require breaking the AI model. It required abusing the authority the agent already possessed because of how the surrounding system was designed.
The Guardrail Blind Spot

Current defensive tooling for agents largely focuses on the prompt layer and tool-use restrictions. These are useful, but they assume the data the agent consumes is relatively clean or at least auditable in real time.

When the poison lives in a Git repository, a config file the agent
is expected to load, or a dependency it autonomously pulls, those assumptions collapse.

Many teams still treat “our repo” as a trusted boundary. That boundary is disappearing the moment agents start making decisions based on what they read there.

Practical Reality Check

If your agent can:
• Read code/config from external or even internal repositories
• Execute or act on what it reads
• Trigger pipelines, modify files, or call APIs

…then you have a supply chain attack surface that traditional application security controls were never designed to protect against autonomous execution.

Signing commits helps. Pinning dependencies helps. But these are partial measures. An agent operating at scale will eventually encounter poisoned or malicious content that looks legitimate enough to act on.

What Actually Moves the Needle

From an offensive security perspective, the teams making progress are treating every external (and many internal) sources an agent reads as untrusted by default. They are:
• Implementing provenance and integrity checks before agents act on code or config
• Severely limiting what an agent can do even when operating on “trusted” sources
• Monitoring for behavioral anomalies when agents interact with repositories or dependencies
• Designing workflows where high-impact actions require explicit confirmation rather than autonomous execution

The uncomfortable truth is that many current agent architectures were built by teams optimizing for capability first and security second. That order is now creating exactly the conditions for supply chain attacks to succeed at machine speed.

The question isn’t whether poisoned repositories will become a standard attack vector against agents. They already are.

The real question is whether your agent design assumes the code it consumes is safe — or whether it assumes the opposite.

Top comments (3)

KL3FT3Z • Jul 16

Great piece, Sergey. You nailed the core problem: the industry is still fighting the last war. Prompt injection gets all the headlines, but supply chain compromise against autonomous execution is where the real damage lives - and you're absolutely right that "capability first, security second" is the architectural sin enabling it.
A few thoughts that amplify your argument from the offensive side:

The trust boundary collapse is worse than most teams realize. You mentioned poisoned repos and configs, but there's a compounding factor: agents don't just read trusted sources - they act on them without a human "does this look right?" pause. In traditional DevSecOps, a developer might notice a suspicious dependency bump or an unusual config key. An agent with pip install or npm install permissions treats every manifest as ground truth. The attack surface isn't "how many systems have this dependency" - it's "how many agents have permission to install packages without approval."
MCP servers are the invisible intermediary you hinted at. With ~200K exposed MCP servers running without authentication by design, we're looking at a trust layer that sits between the agent and its model provider. A compromised proxy doesn't just steal data - it can alter every response the model sends back. The agent trusts the response because it came from "the model." The user trusts the agent because it came from "the agent." Nobody validates the intermediary. This is the scanner-as-vector problem applied to agent cognition.
"Slopsquatting" is the AI-native evolution of this attack. LLMs hallucinate package names with disturbing confidence. Attackers are already registering those hallucinated names on PyPI and npm before developers even run the code. Unlike typosquatting, the attacker doesn't guess - the model tells them exactly which fake package to register. When an agent autonomously resolves a hallucinated dependency, it installs attacker-controlled code without a human ever seeing the package name.
What I'd add to your "what moves the needle" list: SBOM for agent skills/tools. If your agent loads skills dynamically, you need an SBOM for the skill layer itself - not just the underlying OS packages. Most teams can't tell you which skills their agents executed last Tuesday. Behavioral anomaly detection at the agent level. Monitor for execution patterns that deviate from baseline: sudden file system enumeration, unexpected API calls, or credential access outside normal workflow. Agents need a digital immune system, not just perimeter guards. Zero-trust agent architecture. Assume every source the agent consumes is compromised. Design workflows where even "trusted" internal repos can't trigger high-impact actions without cryptographic verification + explicit confirmation. Your closing question is the right one: does your agent design assume the code it consumes is safe, or the opposite? Most teams haven't even asked themselves this yet. Articles like this are how we change that. Thanks for putting this out there - the community needs more voices calling out architectural blind spots before they become headlines.

Poxek AI • Jul 16

Thanks for the thoughtful expansion — especially the distinction between dependency reach and autonomous installation authority. That is exactly where the blast radius changes: an agent does not merely suggest a dependency; it may resolve, install, and execute it within the same workflow. Slopsquatting makes this particularly dangerous because the model itself can generate the attacker’s target namespace.

I also strongly agree that we need an SBOM-like inventory for agent skills, tools, MCP servers, prompts, and dynamically loaded instructions — plus an execution ledger showing what the agent loaded, trusted, and invoked during each run. Without that, incident response becomes largely forensic guesswork.

One small nuance regarding MCP: an MCP server does not necessarily sit between the agent and the model provider or directly modify model responses. More commonly, it mediates access to tools and data. But the security consequence may be just as serious: a compromised server can poison tool output and context, expose credentials, or present malicious capabilities that the agent treats as legitimate.

“Agents need a digital immune system” is an excellent framing. Provenance, least privilege, behavioral monitoring, and confirmation gates have to work together; none of them is sufficient alone. Thanks for adding such a strong offensive perspective — I’d be glad to compare notes on practical testing approaches for these architectures.

KL3FT3Z • Jul 16

Thanks for the clarification - you're absolutely right, and that distinction makes the threat model sharper.
If the MCP server mediates tool access rather than model responses, the compromise surface shifts from "response tampering" to "capability forgery." That is, in some ways, more dangerous. The agent doesn't question a tool's output because it assumes the tool is legitimate - and if the compromised server presents a malicious capability as a standard tool, the agent executes it with the same trust it would give to a benign one. The poisoned context enters through the "trusted tool" channel, which bypasses most of the detection layers teams have built for prompt-level attacks.
On practical testing approaches - a few patterns we've been exploring in red team exercises against agent architectures:

Adversarial skill / tool substitution If the agent dynamically discovers or loads skills (via MCP, plugin registry, or internal repo), we test what happens when a legitimate skill is replaced by a homoglyph or when the resolution path is redirected. The key question: does the agent verify the skill's provenance before execution, or does it execute based on name/location alone?
Context poisoning via tool output Rather than attacking the model directly, we poison the data layer the MCP server returns. For example, a compromised "database query" MCP tool that returns subtly altered financial records or schema definitions. The agent then reasons over corrupted context and makes autonomous decisions — the attack lives entirely in the tool layer, leaving no trace in the prompt logs.
Autonomous dependency resolution chains This is where your slopsquatting point gets really interesting in practice. We set up a test environment where the agent is given a task that requires a package it doesn't currently have. Then we observe: does it search, resolve, install, and import without human confirmation? If yes, we register hallucinated package names and measure time-to-compromise. The results are usually measured in seconds, not minutes.
Execution ledger integrity If the agent maintains an internal log of what it executed, we test whether that ledger can be tampered with by a compromised skill or tool. If an attacker can both execute malicious code and sanitize the logs, incident response becomes genuinely impossible - you lose both the event and the evidence.
Privilege escalation through tool chaining Individual tools may be properly sandboxed, but when the agent chains them together in a workflow, the composition often breaks the isolation model. We test whether a low-privilege tool output can be fed into a high-privilege tool execution without re-authorization. The agent treats it as "the same workflow" - security treats it as a boundary violation. The common thread across all of these: the agent's trust model is built on assumptions that don't hold under adversarial conditions. Testing isn't just about finding bugs - it's about proving that the architecture's trust assumptions fail in predictable, exploitable ways. Would love to compare notes directly - especially on how you're approaching MCP server validation and whether you've seen any promising patterns for cryptographic provenance of tool responses. Always better to map these attack surfaces together than alone.