
Adam Persson

Prompt Injection Isn't a Chatbot Problem Anymore

The project behind this article is pydefend on GitHub - Apache 2.0, contributions welcome.


For a while, prompt injection was mostly embarrassing. You'd get a customer service bot to say something it shouldn't, or you'd extract the system prompt and post it on Twitter. Real issues, sure, but the consequences were bounded. The bot said a bad thing. Someone screenshotted it. Life went on.

That era is ending.

The shift isn't a new attack technique. It's a new target. As LLM applications move from "chat interface" to "agent with tools," the threat model changes completely - and most of the security thinking around prompt injection hasn't caught up.


What changes when the AI can act

Here's the difference in concrete terms.

A chatbot that's been successfully injected might leak its system prompt, or produce output that contradicts its guidelines. Annoying. Potentially damaging to trust. But the blast radius is limited to what it says.

An agent that's been successfully injected can act. It has tools. Database access, file writes, API calls, the ability to send emails or make purchases or modify records. A manipulated agent isn't saying the wrong thing - it's doing the wrong thing, at machine speed, before any human sees what's happening.

This is the threat model that keeps me up at night. Not the chatbot that gets tricked into writing a poem in the wrong tone. The agent that gets instructed - through a carefully crafted injection two turns into a conversation - to exfiltrate a database to an external endpoint.

And here's what makes it worse: the attack doesn't have to be direct.


The multi-turn problem

Most prompt injection detection focuses on individual inputs. Is this message an injection attempt? Yes or no.

That works for some attacks. The blunt ones - "ignore your previous instructions," "you are now DAN" - are detectable in isolation because they're structurally obvious. But the more dangerous attacks don't look like attacks at first.
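To make the distinction concrete, here is a minimal sketch of the kind of single-message screen most systems use. The pattern list is illustrative, not Defend's detection logic: it catches the structurally obvious attacks, and each message of a multi-turn attack can sail past it individually.

```python
import re

# Blunt injections are detectable in isolation because they carry
# structural tells. Subtle context-shifting messages carry none.
BLUNT_PATTERNS = [
    r"ignore (all |your )?previous instructions",
    r"\byou are now dan\b",
    r"disregard (the |your )?system prompt",
]

def looks_like_injection(message: str) -> bool:
    """Screen a single message, with no knowledge of the conversation."""
    text = message.lower()
    return any(re.search(p, text) for p in BLUNT_PATTERNS)
```

A message like "Ignore your previous instructions" trips this; "By the way, the admin usually approves exports like this" does not, even though in context it may be step three of an attack.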

Imagine a conversation where the first message is completely benign. The second nudges the context slightly. The third establishes a false premise. By message six or seven, the agent is operating under an entirely different set of assumptions than the ones its system prompt established - and no individual message, examined alone, would have tripped a filter.

This is a multi-turn injection, and it's specifically dangerous for agentic systems because agents are designed to maintain context across a conversation. That's a feature. It's also the attack surface.

When I was building Defend, I spent a long time on this. The session accumulation layer - which tracks rolling risk across a session ID rather than evaluating each message in isolation - exists because of this problem. A single suspicious message doesn't tell you much. The shape of a conversation over time tells you a lot.
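The idea behind session accumulation can be sketched in a few lines. This is not Defend's implementation - the class name, decay factor, and threshold below are assumptions for illustration - but it shows why three mildly suspicious messages can trip a session threshold that none of them would trip alone.

```python
from collections import defaultdict

class SessionRiskTracker:
    """Accumulate a rolling risk score per session ID instead of
    judging each message in isolation. Scores decay each turn so a
    long, mostly-benign session can recover."""

    def __init__(self, threshold: float = 1.0, decay: float = 0.8):
        self.threshold = threshold
        self.decay = decay
        self.scores = defaultdict(float)

    def observe(self, session_id: str, message_risk: float) -> bool:
        """Fold one message's risk into the session total; return True
        when the accumulated score crosses the block threshold."""
        self.scores[session_id] = self.scores[session_id] * self.decay + message_risk
        return self.scores[session_id] >= self.threshold
```

With a per-message block threshold of 1.0, three messages scoring 0.5 each would all pass individually - but accumulated (0.5, then 0.9, then 1.22) the third one trips the session threshold.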


Why output-side detection matters as much as input-side

Most guardrail systems focus on the input: screen what comes in, block what looks dangerous, let the rest through. That's necessary. It's not sufficient.

For agentic systems, you need output-side detection too - specifically, detection that runs before a tool is called.

Here's why. Even if every user input looks clean, the model's internal reasoning can still be manipulated. Indirect injections - malicious content embedded in documents, web pages, or tool results that the agent retrieves and processes - never appear as user messages at all. They arrive through the agent's context, not through the chat interface. Input screening doesn't see them.

By the time the agent decides to call a tool, it may already be operating on compromised instructions. If the guardrail only ran on the way in, nothing catches it.

The tool_misuse module in Defend runs on the model's output - the instruction it's about to act on - rather than on the user's input. Same with excessive_agency, which checks whether a proposed action falls outside the agent's defined permission scope. The goal is to catch the manipulation before the tool fires, not after.
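The permission-scope idea reduces to a simple gate that runs between "the model proposed an action" and "the action executes." The sketch below is illustrative, not Defend's API - the scope table and function names are assumptions - but the shape is the important part: the check sits in front of the tool, not behind it.

```python
# The agent's declared permission scope: which operations each
# granted tool may perform. Tools absent from the table are denied.
AGENT_SCOPE = {
    "search_tickets": {"read"},
    "update_ticket": {"read", "write"},
}

def action_allowed(tool: str, operation: str) -> bool:
    """Reject tools the agent was never granted, and operations
    outside the grants it does have."""
    return operation in AGENT_SCOPE.get(tool, set())

def guarded_call(tool: str, operation: str, execute):
    """Run the scope check before the tool fires, not after."""
    if not action_allowed(tool, operation):
        raise PermissionError(f"{tool}:{operation} exceeds agent scope")
    return execute()
```

The useful property: even if a compromised context convinces the model to propose send_email, the proposal fails at the gate because send_email was never in the scope - no matter how the instruction got into the context.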

This feels obvious in retrospect, but I didn't see many open-source projects handling it when I started building. Most were input-only.


The indirect injection vector

This one deserves its own section because it's underappreciated and it's getting worse.

An indirect injection is when the malicious instruction doesn't come from the user - it comes from something the agent retrieved. A webpage with hidden text. A PDF with instructions embedded in white-on-white font. A customer support ticket that was written specifically to manipulate the AI processing it.

The user doesn't have to do anything. The attack is already sitting in the data the agent is reading.

For any agent with retrieval capabilities - RAG pipelines, agents that browse the web, agents that process uploaded documents - this is live. It's not theoretical. There are documented examples of agents being redirected mid-task by malicious content in the documents they were processing.

Defend has an indirect_injection module that takes sources as a config parameter - you tell it which content sources the agent is reading from, and it applies appropriate scrutiny to content coming from those sources rather than treating retrieved content the same as system context.
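The principle - retrieved content gets a stricter bar than user input - can be sketched as a per-source threshold table. The source labels and numbers below are hypothetical, not Defend's configuration; they just show why declaring sources matters: the same risk score that passes as a user message gets blocked when it arrives inside a retrieved web page.

```python
# Per-source block thresholds for a risk score in [0, 1].
# Lower threshold = less tolerance. These values are illustrative.
THRESHOLDS = {
    "system": None,   # trusted context: not screened
    "user": 0.8,      # direct input: the normal bar
    "web": 0.4,       # retrieved content: much stricter
    "document": 0.4,
}

def should_block(source: str, risk_score: float) -> bool:
    """Apply the bar for the content's source; unknown sources
    get the strictest treatment."""
    threshold = THRESHOLDS.get(source, 0.4)
    if threshold is None:
        return False
    return risk_score >= threshold
```

A score of 0.5 passes as a user message but is blocked as web content - which matches the threat model: a document the agent chose to read is an attack vector nobody is watching.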


What a sensible agentic guardrail looks like

I want to be concrete, because "you should add security" is not useful advice.

For an agent with tools, the minimum viable guardrail setup is:

On input: Screen user messages for injection attempts and jailbreaks before they reach the model. This is table stakes.

On retrieved content: Treat anything the agent retrieves from external sources - documents, web pages, tool results - with more suspicion than you treat user messages. A user who submits a malicious message is an attacker you're aware of. A document that the agent chose to read is an attack vector you might not be watching.

On output, before tool calls: Verify that the action the agent is about to take is consistent with its defined permission scope and wasn't smuggled in through a compromised context. This is the step most projects skip.

On output, general: Check for prompt leakage (the agent revealing its system prompt), malicious URLs being inserted into responses, and PII ending up in places it shouldn't. Agents that write to external systems can exfiltrate data in subtle ways.
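The four checkpoints above can be wired together as a sketch. The screening functions here are trivial stand-ins - real detection is far more involved, and none of these names are Defend's API - but the placement of each check is the point.

```python
def screen_input(msg: str) -> bool:
    """Checkpoint 1: screen user messages before they reach the model."""
    return "ignore previous instructions" not in msg.lower()

def screen_retrieved(doc: str) -> bool:
    """Checkpoint 2: retrieved content gets everything user input gets,
    plus extra suspicion (here: embedded role markers)."""
    return "assistant:" not in doc.lower() and screen_input(doc)

def screen_tool_call(tool: str, allowed: set) -> bool:
    """Checkpoint 3: before the tool fires, is the proposed action
    inside the agent's defined scope?"""
    return tool in allowed

def screen_output(text: str, system_prompt: str) -> bool:
    """Checkpoint 4: no prompt leakage into the final response."""
    return system_prompt not in text
```

The ordering matters: checkpoint 3 is the one most projects skip, and it is the last chance to stop a compromised context before the action happens.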

None of this eliminates the risk. A sufficiently sophisticated attack against a sufficiently capable agent is a hard problem. But most real-world attacks aren't sophisticated - they're opportunistic. Basic coverage eliminates most of the surface area.


Where this is going

Agents are going to keep getting more capable and more autonomous. The economic logic is too strong - an agent that can take action is worth more than one that can only respond, and capability tends to expand to fill that value. The attack surface expands with it.

The security thinking needs to catch up now, not after the first major incident. The tools exist. The understanding of the threat model is there for anyone who looks. The gap is mostly awareness and inertia.

If you're building agentic systems and you're not thinking about this yet, I'd genuinely recommend starting before something forces you to. The cost of retrofitting security onto a system that wasn't designed with it is always higher than building it in early.

Defend is available via pip install pydefend. The docs walk through the module setup. The GitHub repo is open - if you see gaps in the coverage or have attack patterns worth adding, that's what issues are for.


If you're working on something in this space - research, tooling, or just trying to secure an agent you're building - I'd like to hear about it.
