DEV Community

I shipped 35 bugs in my AI chatbot. The scariest one was on the output side.

Rapls on June 15, 2026

I ran my own AI chatbot plugin through a security review before release, and it came back with 35 bugs. Three were critical. The one that made my s...

Mykola Kondratiuk • Jun 20

output sanitization is not new - treating LLM output like your own code is the actual bug.

Rapls • Jun 20

Right, and that's the reframe the whole post is built on. Sanitization is old news as a technique. The actual bug is upstream of it, in the mental model: the moment you treat model output as your own code, you've already granted it a trust level it never earned. Output escaping is just the symptom-level fix. The real fix is reclassifying where the output sits: it's external input that happens to arrive from your own LLM, not a trusted internal value. Same category as a form field or a third-party API response. Once it's filed under "untrusted input," sanitization stops being a special AI precaution and becomes the boring thing you already do at every other trust boundary. The novelty was never the defense; it was people forgetting which side of the boundary the output was on.

Mykola Kondratiuk • Jun 20

yeah and the input trust boundary is equally broken - the model has no distinction between developer instructions and attacker data. so we end up patching two holes with one bandage.

Rapls • Jun 20

Right, and that symmetry is the part that makes it one bug, not two. On the way in, the model can't tell a developer instruction from attacker data riding in on a document. On the way out, the caller can't tell a safe value from an injected one riding in on the model's reply. Same failure, mirrored: a trust boundary the model itself can't enforce, because the model has no concept of where the trust is supposed to change.

Which is why the one bandage has to go outside the model, on both sides. You can't ask the thing that can't see the boundary to defend it. Inbound, you keep untrusted content in a channel the model is told to treat as data, never as instructions. You don't rely on it obeying that; you constrain what an instruction could even do. Outbound, you sanitize at the sink. Both are the same move: stop expecting the model to police a line it can't perceive, and put a deterministic check on the side of the line where someone actually can. The model is a pipe. It carries whatever you put in, in both directions.

Mykola Kondratiuk • Jun 22

the symmetry framing collapses it to one fix: validate at the extraction point, not at sanitization. the model can't help; the structure of the call has to.

xulingfeng • Jun 16

The output side hits close to home. Same pattern when we test AI systems — all the effort goes into "how dirty can we make the input," nobody thinks about sanitizing what comes back out. That "LLM output is untrusted input" line belongs on every CI/CD pipeline.

Rapls • Jun 16

That asymmetry you describe is the whole thing. We pour energy into "how dirty can the input get" and treat the return trip as if it came back clean. The model is just a pipe, and a pipe carries whatever you put in it, in both directions.

The CI/CD angle is a good one. The hard part is that output checks are context-dependent, so it's less one rule and more a set: escape before render, validate against a schema where the shape is known, allowlist any URL the output wants to fetch. What I'd love at the pipeline level is a lint that flags model output reaching a sink (render, fetch, query) without passing through something first. Closer to a taint check than a single gate.
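
To make "a set, not one rule" concrete, here is a rough Python sketch of the three checks, one per sink. Everything named here is an assumption for illustration: the `ALLOWED_HOSTS` allowlist, the `title`/`url` shape, and the function names are hypothetical, not from the plugin.

```python
import html
import json
from urllib.parse import urlsplit

# Hypothetical allowlist; a real one would come from configuration.
ALLOWED_HOSTS = {"api.example.com"}


def check_render(text: str) -> str:
    """Render sink: escape before the text goes anywhere near HTML."""
    return html.escape(text)


def check_shape(raw: str) -> dict:
    """Known-shape sink: accept only an object with exactly these string fields."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"title", "url"}:
        raise ValueError("unexpected shape")
    if not all(isinstance(data[key], str) for key in ("title", "url")):
        raise ValueError("unexpected field type")
    return data


def check_url(url: str) -> str:
    """Fetch sink: https only, and the host must be on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.hostname not in ALLOWED_HOSTS:
        raise ValueError("URL not on the allowlist")
    return url


data = check_shape('{"title": "<b>hi</b>", "url": "https://api.example.com/v1"}')
print(check_render(data["title"]))  # &lt;b&gt;hi&lt;/b&gt;
print(check_url(data["url"]))       # https://api.example.com/v1
```

Note that `check_url` compares the parsed hostname rather than doing a substring match, so something like `https://api.example.com@evil.test/` is rejected.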

Rahul S • Jun 16

The taint check idea is right, but there's a practical wall: json.loads() kills taint lineage. The raw model output string is tainted, sure, but the moment you parse it into structured data, parsed["url"] is a fresh string with no provenance — the parser created new objects. Same thing happens with regex extraction, template destructuring, any data transformation really. Traditional taint tracking in Perl/PHP worked because the runtime propagated taint through string operations. JSON deserialization breaks that chain completely because it's not a string operation, it's object construction. So a lint that tracks "model output reaching a sink" would need to survive structured data transformations, which is closer to information flow control than classical taint analysis. Not impossible, but it's a fundamentally harder problem than what existing SAST tools solve.
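
A small Python demonstration of the lineage break, as a sketch only. `Tainted` is a hypothetical marker class (a plain `str` subclass), standing in for whatever a taint tracker would attach to the raw model output.

```python
import json


class Tainted(str):
    """Hypothetical marker: a str subclass meaning 'this came from the model'."""


raw = Tainted('{"url": "https://example.com"}')
parsed = json.loads(raw)

print(type(raw).__name__)            # Tainted
print(type(parsed["url"]).__name__)  # str
```

The input carries the marker; the value that comes out of the parser is an ordinary `str` with no trace of where it came from.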

Rapls • Jun 16

This is the correction the taint idea needed. You're right that json.loads() severs the lineage: the moment the string becomes objects, parsed["url"] is a fresh value the parser minted, and classical taint tracking propagated through string ops, not object construction. Regex extraction and destructuring break it the same way. So tracking "model output reaches a sink" across transforms is information flow control, not the taint analysis SAST tools ship today. Agreed it's the harder problem.

Where that pushes me is away from chasing lineage through the transform, and toward treating the parse boundary as the place to re-taint. Instead of trying to keep provenance alive across json.loads(), mark everything that comes out of it as untrusted by construction, because it came from model output, and re-validate at the sink regardless of what the variable's history looks like. You lose the precision of true lineage tracking and accept the false positives, but for this threat model "re-suspect everything downstream of a model-output parse" is a cheaper and safer default than trying to thread taint through object construction. Coarser than IFC, but shippable. Treat the parse as a trust boundary, not a transformation.
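
One way "re-taint at the parse boundary" might look in Python, as a minimal sketch. The `Untrusted` wrapper, `parse_model_output`, and `as_small_int` are all hypothetical names for illustration, and the underscore on `_value` is only a convention; Python won't actually stop a determined caller from reaching in.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Untrusted:
    """Wrapper for anything parsed out of model output.

    The intended way to get at the value is validate(), which forces a check
    at the point of use regardless of the variable's history.
    """

    _value: Any

    def validate(self, check: Callable[[Any], T]) -> T:
        return check(self._value)


def parse_model_output(raw: str) -> dict[str, Untrusted]:
    """The parse is the trust boundary: every field comes out wrapped."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {key: Untrusted(value) for key, value in data.items()}


def as_small_int(value: Any) -> int:
    """Example sink-side validator: an integer from 1 to 100, nothing else."""
    if type(value) is not int or not 0 < value <= 100:
        raise ValueError("limit out of range")
    return value


fields = parse_model_output('{"limit": 25}')
limit = fields["limit"].validate(as_small_int)
print(limit)  # 25
```

This is deliberately coarse: every field is suspect, whether or not an attacker could have influenced it, which is the false-positive cost accepted above.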

Srashti • Jun 19

'The model decides, deterministic code disposes.' That line needs to be pinned on every AI dev's wall. Whether you're building web plugins or handling automated data pipelines in Python, keeping the execution logic strictly deterministic outside the model is the only way to build safely. Thanks for sharing these mistakes so others don't have to make them.

Rapls • Jun 19

Thanks, that line came out of getting burned, so I'm glad it travels. And you're right that it isn't WordPress-specific. The shape is the same in a Python pipeline: the model can decide what to do, but the moment its decision becomes an action with consequences, a deterministic layer it can't talk its way past has to be the thing that actually executes. The domain changes, the boundary doesn't. Appreciate you reading it.

Mateo Ruiz • Jun 16

The "double-trust problem" is a great way to frame this.

A lot of developers have learned to distrust user input, but many still implicitly trust model output because it came from the AI rather than a human. In reality, model output is often a blend of user input, retrieved content, and model-generated text, so treating it as trusted data creates a dangerous blind spot.

The point about output driving actions is especially important. Once agents start calling tools, fetching URLs, or triggering workflows, validation has to happen outside the model. We've seen that the safest pattern is to treat the model as a decision-support layer and keep permission checks, URL validation, and execution controls in deterministic code.

"Treat model output as untrusted input" is probably one of the most valuable security principles AI builders can adopt right now.

Rapls • Jun 16

"Decision-support layer, with the controls in deterministic code" is the cleanest statement of it. The model gets to suggest the action; it doesn't get to be the action. Permission checks, URL validation, execution gates all live in code you can read and test, and the model's output is just one more input into that code, not a command that bypasses it.

Your point about the blend is the part people miss. It's tempting to think "I trust my own model," but the output isn't purely the model's: it's user input plus retrieved content plus generation, fused into one string with the provenance washed out. You can't trust the mix more than its least trustworthy ingredient, and one of those ingredients is whatever a stranger typed or whatever sat on a page you crawled.

The line I keep coming back to: the model decides, deterministic code disposes. Keep the irreversible part on the side you can audit.
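
A rough Python sketch of that split, under stated assumptions: the model's reply is taken to be a JSON proposal like `{"action": ..., "args": ...}`, and the action names, handlers, and permission strings are all hypothetical stand-ins.

```python
import json
from typing import Callable


def lookup_order(args: dict) -> str:
    return f"order {args['order_id']}: shipped"  # stand-in for a real lookup


def refund_order(args: dict) -> str:
    return f"refunded {args['order_id']}"  # stand-in for the irreversible part


# Action name -> (handler, permission the *user* must hold).
ACTIONS: dict[str, tuple[Callable[[dict], str], str]] = {
    "lookup_order": (lookup_order, "read"),
    "refund_order": (refund_order, "refund"),
}


def dispose(model_output: str, user_permissions: set[str]) -> str:
    """The model proposes; this function decides whether anything runs."""
    proposal = json.loads(model_output)
    if not isinstance(proposal, dict):
        raise ValueError("proposal must be a JSON object")
    name, args = proposal.get("action"), proposal.get("args")
    if not isinstance(name, str) or name not in ACTIONS:
        raise PermissionError("unknown action")
    handler, needed = ACTIONS[name]
    if needed not in user_permissions:
        raise PermissionError(f"missing permission: {needed}")
    if not isinstance(args, dict) or type(args.get("order_id")) is not int:
        raise ValueError("order_id must be an integer")
    return handler(args)


print(dispose('{"action": "lookup_order", "args": {"order_id": 7}}', {"read"}))
# order 7: shipped
```

The permission check reads from the caller's session, not from anything in the model's reply, so a reply that asks for `refund_order` on behalf of a read-only user is refused no matter how it is worded.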

Mudassir Khan • Jun 22

"everyone guards the input. the output leaks" is the sentence i'm keeping because i've watched it happen across three projects and it always catches someone off guard.

the HTML injection case is the most visible, but we hit a quieter variant: model output feeding into a database query string in a tool call. tool call looked typed. it wasn't. model had helpfully constructed what looked like a valid param and we passed it straight through.

treating model output as untrusted user input for validation is the mental model that flipped how we write these integrations. if you wouldn't trust a form field, don't trust the model.

did the review find all 35 in one pass or did it take multiple rounds?

Rapls • Jun 22

"If you wouldn't trust a form field, don't trust the model" is the whole thing in one line. And your DB variant is the scarier half of it, precisely because it's quiet. An HTML injection at least shows up on the page; a model-built param flowing into a query renders nothing and looks typed on the way through, so it sails past both the model and a fast review. The tool call looking typed is exactly the trap.

Which answers your question: it took multiple rounds, not one pass. Eleven, actually. And the reason it took that many is the reason your DB case is dangerous. Each pass I was only looking for one kind of thing. Early rounds caught the visible behavior bugs. The output-side holes didn't surface until a later pass where I'd switched my lens to "treat everything the model emits as untrusted input," and the scariest one came out near the end. The visible bugs come out first; the quiet ones only show up once you re-read the same code with a different suspicion. One pass would have shipped your exact DB variant without ever flagging it.
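
For the DB variant specifically, a minimal Python sketch of the form-field treatment, using an in-memory SQLite table. The `orders` schema and the `find_orders` helper are made up for illustration; the point is only that the model-built value is validated and then bound as a parameter, never spliced into the query string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.executemany("INSERT INTO orders (customer) VALUES (?)", [("alice",), ("bob",)])


def find_orders(customer_from_model: object) -> list[tuple]:
    """Treat the tool-call argument like a form field: check it, then bind it."""
    if not isinstance(customer_from_model, str) or len(customer_from_model) > 64:
        raise ValueError("customer must be a short string")
    return conn.execute(
        "SELECT id, customer FROM orders WHERE customer = ?",
        (customer_from_model,),
    ).fetchall()


print(find_orders("alice"))             # [(1, 'alice')]
print(find_orders("alice' OR '1'='1"))  # []
```

With the placeholder, the injected-looking value is just a customer name that matches nothing.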

skycandykey1 • Jun 16

Awesome! That's so useful for everyone.

Rapls • Jun 16

Thanks, glad it was useful. If it saves one output-side bug out there, it did its job.

TxDesk • Jun 17

The double-trust problem is the part that stays underweighted. People internalized 'distrust user input' years ago but still treat model output as clean because it came from the model, when it's really a blend of user input and RAG-pulled content wearing the model's voice. On my side I treat anything the model emits about on-chain state as a claim to verify against the actual chain read, never as fact. And your Hole 2 is the one I'd underline hardest: keeping privileged actions off the model's direct output, so the executing side decides what's allowed rather than 'the output said so, run it.' That separation is what holds even when indirect injection lands.

Julian Neagu • Jun 17

This is the gap most LLM apps still have: everyone hardens prompt injection, then forgets the model output becomes a new attack surface.
Once you treat output as just another untrusted boundary, most of the “weird” bugs collapse into standard web security categories.
Feels less like AI security and more like re-applying OWASP in a new place.