DEV Community

I shipped 35 bugs in my AI chatbot. The scariest one was on the output side.

Rapls on June 15, 2026

I ran my own AI chatbot plugin through a security review before release, and it came back with 35 bugs. Three were critical. The one that made my s...
 
Mykola Kondratiuk

output sanitization is not new - treating LLM output like your own code is the actual bug.

Rapls

Right, and that's the reframe the whole post is built on. Sanitization is old news as a technique. The actual bug is upstream of it, in the mental model: the moment you treat model output as your own code, you've already granted it a level of trust it never earned. Output escaping is just the symptom-level fix. The real fix is reclassifying where the output sits: it's external input that happens to arrive from your own LLM, not a trusted internal value. Same category as a form field or a third-party API response. Once it's filed under "untrusted input," sanitization stops being a special AI precaution and becomes the boring thing you already do at every other trust boundary. The novelty was never the defense; it was people forgetting which side of the boundary the output was on.

Mykola Kondratiuk

yeah and the input trust boundary is equally broken - the model has no distinction between developer instructions and attacker data. so we end up patching two holes with one bandage.

Rapls

Right, and that symmetry is the part that makes it one bug, not two. On the way in, the model can't tell a developer instruction from attacker data riding in on a document. On the way out, the caller can't tell a safe value from an injected one riding in on the model's reply. Same failure, mirrored: a trust boundary the model itself can't enforce, because the model has no concept of where the trust is supposed to change.

Which is why the one bandage has to go outside the model, on both sides. You can't ask the thing that can't see the boundary to defend it. Inbound, you keep untrusted content in a channel the model is told to treat as data, never as instructions. You don't rely on it obeying that; you constrain what an instruction could even do. Outbound, you sanitize at the sink. Both are the same move: stop expecting the model to police a line it can't perceive, and put a deterministic check on the side of the line where someone actually can. The model is a pipe. It carries whatever you put in, in both directions.
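
A minimal sketch of the outbound half, "sanitize at the sink", in Python. The function name `render_reply` and the `bot-reply` class are hypothetical, chosen only for illustration; the point is that escaping lives in the function that builds the HTML, so it applies no matter where the string came from.

```python
import html


def render_reply(model_output: str) -> str:
    """Sink: build the HTML fragment that is shown to the user.

    The model's reply is escaped here, at the sink, exactly like a form
    field would be. Nothing upstream is trusted to have done it already.
    """
    escaped = html.escape(model_output, quote=True)
    return '<div class="bot-reply">' + escaped + "</div>"


# A reply carrying injected markup comes out inert.
reply = 'Sure! <img src=x onerror="alert(1)">'
print(render_reply(reply))
```

The same idea carries over to other sinks (SQL parameters, shell arguments, URLs): each sink applies the encoding or validation appropriate to its own context.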

xulingfeng

The output side hits close to home. Same pattern when we test AI systems — all the effort goes into "how dirty can we make the input," nobody thinks about sanitizing what comes back out. That "LLM output is untrusted input" line belongs on every CI/CD pipeline.

Rapls

That asymmetry you describe is the whole thing. We pour energy into "how dirty can the input get" and treat the return trip as if it came back clean. The model is just a pipe, and a pipe carries whatever you put in it, in both directions.

The CI/CD angle is a good one. The hard part is that output checks are context-dependent, so it's less one rule and more a set: escape before render, validate against a schema where the shape is known, allowlist any URL the output wants to fetch. What I'd love at the pipeline level is a lint that flags model output reaching a sink (render, fetch, query) without passing through something first. Closer to a taint check than a single gate.
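
Two of those checks, schema validation and a URL allowlist, can be sketched together. This assumes the model is asked to return a JSON object with a single `url` field; the host in `ALLOWED_HOSTS` and the function names are hypothetical placeholders, not anything from the plugin in the post.

```python
import json
from urllib.parse import urlsplit

# Hypothetical allowlist: the only hosts the application may fetch from.
ALLOWED_HOSTS = {"api.example.com"}


def is_allowed_url(url: str) -> bool:
    """Accept only https URLs whose host is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def parse_fetch_request(raw: str) -> str:
    """Check the shape first, then the URL, before anything is fetched."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if (
        not isinstance(data, dict)
        or set(data) != {"url"}
        or not isinstance(data["url"], str)
    ):
        raise ValueError("model output does not match the expected shape")
    if not is_allowed_url(data["url"]):
        raise ValueError("URL is not on the allowlist")
    return data["url"]


print(parse_fetch_request('{"url": "https://api.example.com/v1/items"}'))
```

Comparing the parsed hostname against the allowlist, rather than searching the URL string for a substring, is what rejects lookalikes such as `https://api.example.com.evil.test/`.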

Rahul S

The taint check idea is right, but there's a practical wall: json.loads() kills taint lineage. The raw model output string is tainted, sure, but the moment you parse it into structured data, parsed["url"] is a fresh string with no provenance — the parser created new objects. Same thing happens with regex extraction, template destructuring, any data transformation really. Traditional taint tracking in Perl/PHP worked because the runtime propagated taint through string operations. JSON deserialization breaks that chain completely because it's not a string operation, it's object construction. So a lint that tracks "model output reaching a sink" would need to survive structured data transformations, which is closer to information flow control than classical taint analysis. Not impossible, but it's a fundamentally harder problem than what existing SAST tools solve.
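
The lineage break is easy to reproduce. In the sketch below, the `Tainted` subclass is a stand-in for a runtime taint flag (Python has no built-in taint mode, so this is purely illustrative): the marker is on the raw string, and gone from every value the parser constructs.

```python
import json


class Tainted(str):
    """Marker subclass standing in for a runtime taint flag."""


raw = Tainted('{"url": "https://example.com/x"}')
parsed = json.loads(raw)

print(isinstance(raw, Tainted))            # True
print(isinstance(parsed["url"], Tainted))  # False: the parser built a plain str
```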

Rapls

This is the correction the taint idea needed. You're right that json.loads() severs the lineage: the moment the string becomes objects, parsed["url"] is a fresh value the parser minted, and classical taint tracking propagated through string ops, not object construction. Regex extraction and destructuring break it the same way. So tracking "model output reaches a sink" across transforms is information flow control, not the taint analysis SAST tools ship today. Agreed it's the harder problem.

Where that pushes me is away from chasing lineage through the transform, and toward treating the parse boundary as the place to re-taint. Instead of trying to keep provenance alive across json.loads(), mark everything that comes out of it as untrusted by construction, because it came from model output, and re-validate at the sink regardless of what the variable's history looks like. You lose the precision of true lineage tracking and accept the false positives, but for this threat model "re-suspect everything downstream of a model-output parse" is a cheaper and safer default than trying to thread taint through object construction. Coarser than IFC, but shippable. Treat the parse as a trust boundary, not a transformation.
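
One way to sketch "treat the parse as a trust boundary" is to wrap everything the parser returns, so a sink cannot reach the raw value without passing a check. The `Untrusted` wrapper, `parse_model_output`, and `is_small_int` are hypothetical names for illustration, and the wrapper is a convention rather than an enforcement mechanism (Python will not stop code from reading `_value` directly).

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Untrusted:
    """Wrapper for any value derived from model output."""

    _value: Any

    def validate(self, check: Callable[[Any], bool]) -> Any:
        """Intended as the only way out: the value must pass a check."""
        if not check(self._value):
            raise ValueError("model-derived value failed validation")
        return self._value


def parse_model_output(raw: str) -> dict[str, Untrusted]:
    """Parse boundary: every field comes out wrapped, whatever its history."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {key: Untrusted(value) for key, value in data.items()}


def is_small_int(value: Any) -> bool:
    """Example sink-side check: an integer between 1 and 100."""
    return isinstance(value, int) and not isinstance(value, bool) and 0 < value <= 100


fields = parse_model_output('{"limit": 25, "note": "<b>hi</b>"}')
limit = fields["limit"].validate(is_small_int)
print(limit)  # 25
```

This is deliberately coarse: nothing tracks where a value came from inside the JSON, only that it crossed the parse boundary.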

Mateo Ruiz

The "double-trust problem" is a great way to frame this.

A lot of developers have learned to distrust user input, but many still implicitly trust model output because it came from the AI rather than a human. In reality, model output is often a blend of user input, retrieved content, and model-generated text, so treating it as trusted data creates a dangerous blind spot.

The point about output driving actions is especially important. Once agents start calling tools, fetching URLs, or triggering workflows, validation has to happen outside the model. We've seen that the safest pattern is to treat the model as a decision-support layer and keep permission checks, URL validation, and execution controls in deterministic code.

"Treat model output as untrusted input" is probably one of the most valuable security principles AI builders can adopt right now.

Rapls

"Decision-support layer, with the controls in deterministic code" is the cleanest statement of it. The model gets to suggest the action; it doesn't get to be the action. Permission checks, URL validation, execution gates all live in code you can read and test, and the model's output is just one more input into that code, not a command that bypasses it.

Your point about the blend is the part people miss. It's tempting to think "I trust my own model," but the output isn't purely the model's: it's user input plus retrieved content plus generation, fused into one string with the provenance washed out. You can't trust the mix more than its least trustworthy ingredient, and one of those ingredients is whatever a stranger typed or whatever sat on a page you crawled.

The line I keep coming back to: the model decides, deterministic code disposes. Keep the irreversible part on the side you can audit.
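
A minimal sketch of that split, assuming the model proposes an action as JSON. The action names, roles, and handlers are all hypothetical; what matters is that the lookup table and the permission check live in ordinary code that the model's output cannot alter.

```python
import json


def search_docs(query: str) -> str:
    return f"searched for {query!r}"


def delete_post(post_id: int) -> str:
    return f"deleted post {post_id}"


# Hypothetical action table: name -> (handler, roles allowed to trigger it).
ACTIONS = {
    "search_docs": (search_docs, {"visitor", "admin"}),
    "delete_post": (delete_post, {"admin"}),
}


def dispose(model_output: str, user_role: str) -> str:
    """Deterministic gate: the model's output is a proposal, nothing more."""
    proposal = json.loads(model_output)
    if not isinstance(proposal, dict):
        raise PermissionError("proposal must be a JSON object")
    name = proposal.get("action")
    if not isinstance(name, str) or name not in ACTIONS:
        raise PermissionError(f"unknown action: {name!r}")
    handler, allowed_roles = ACTIONS[name]
    if user_role not in allowed_roles:
        raise PermissionError(f"role {user_role!r} may not run {name!r}")
    args = proposal.get("args", {})
    if not isinstance(args, dict):
        raise PermissionError("args must be a JSON object")
    # Per-argument validation would also belong here; omitted for brevity.
    return handler(**args)


proposal = '{"action": "delete_post", "args": {"post_id": 7}}'
print(dispose(proposal, user_role="admin"))  # deleted post 7
```

The role comes from the application's own session, never from the model's output, so an injected "I am an admin" in the reply changes nothing.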

Srashti

'The model decides, deterministic code disposes.' That line needs to be pinned on every AI dev's wall. Whether you're building web plugins or handling automated data pipelines in Python, keeping the execution logic strictly deterministic outside the model is the only way to build safely. Thanks for sharing these mistakes so others don't have to make them.

Rapls

Thanks, that line came out of getting burned, so I'm glad it travels. And you're right that it isn't WordPress-specific. The shape is the same in a Python pipeline: the model can decide what to do, but the moment its decision becomes an action with consequences, a deterministic layer it can't talk its way past has to be the thing that actually executes. The domain changes, the boundary doesn't. Appreciate you reading it.

skycandykey1

Awesome! That's so useful for everyone.

Rapls

Thanks, glad it was useful. If it saves one output-side bug out there, it did its job.

Julian Neagu

This is the gap most LLM apps still have: everyone hardens prompt injection, then forgets the model output becomes a new attack surface.
Once you treat output as just another untrusted boundary, most of the “weird” bugs collapse into standard web security categories.
Feels less like AI security and more like re-applying OWASP in a new place.

TxDesk

The double-trust problem is the part that stays underweighted. People internalized 'distrust user input' years ago but still treat model output as clean because it came from the model, when it's really a blend of user input and RAG-pulled content wearing the model's voice. On my side I treat anything the model emits about on-chain state as a claim to verify against the actual chain read, never as fact. And your Hole 2 is the one I'd underline hardest: keeping privileged actions off the model's direct output, so the executing side decides what's allowed rather than 'the output said so, run it.' That separation is what holds even when indirect injection lands.