Jenuel Oras Ganawed

Posted on May 25 • Originally published at blog.jenuel.dev

AI guardrails are not security boundaries

#webdev #llm #ai #security

If a model's safety layer can be stripped away in minutes, builders should treat that as a design warning, not a scandal to rubberneck for a day and forget.

The latest signal came from a Financial Times report saying guardrails were removed from Meta and Google AI models quickly enough to make the old enterprise assumption feel shaky: buy the model, turn on the safety settings, and ship. That assumption was always too neat. It confuses model behavior with system security.

Guardrails matter. They reduce obvious harm, block lazy misuse, and make normal user interactions safer. But they are not a wall. They are closer to seatbelts. Useful, sometimes lifesaving, and absolutely not a replacement for brakes, road rules, and a driver who is awake.

The mistake builders keep making

A lot of AI app designs still put too much trust inside the prompt. The system prompt says what the model should do. The policy layer says what it should refuse. The tool instructions say what it may touch. Then the app gives the model access to files, messages, database records, support tickets, or payment workflows.

That is where the risk changes shape. A chatbot giving a bad answer is one problem. A chatbot with tools, memory, and permissions is a different animal.

If the model is the only thing deciding whether an action is safe, you have built a very polite security boundary. Polite boundaries fail when someone learns how to talk around them.

What guardrails are good at

Use guardrails for the layer they are good at: reducing bad outputs before they reach the user. They can catch toxic text, obvious policy violations, jailbreak attempts, sensitive data leaks, and low effort abuse. They also help with consistency, especially when multiple teams are building on top of the same model.

That is real value. I do not want every small app team hand-rolling safety rules from scratch. Most teams are better off using model provider controls, moderation endpoints, evaluation suites, and policy checks than pretending they can invent all of this alone.

But the job does not end there. A refusal message is not access control. A safer model is not a permissions system. A well-written instruction is not a sandbox.

How to design like the guardrail will fail

The practical answer is boring, which is usually a good sign in security.

Give the AI the least privilege it needs. If it only needs to summarize invoices, do not give it permission to delete invoices.
Keep tool calls behind normal server-side authorization. The model can request an action, but your backend should decide whether that action is allowed.
Separate reading from writing. Let the model draft, preview, or recommend before it can mutate production data.
Require confirmation for expensive, destructive, or public actions. Sending an email, charging a card, deleting data, and publishing content should not happen because a model sounded confident.
Log prompts, retrieved context, tool calls, refusals, and user confirmations. When something goes wrong, you need a trail that explains what happened.
Test with hostile prompts, not just happy path demos. Prompt injection should be part of your QA checklist if your app reads untrusted content.

This is the same mindset developers already use elsewhere. You do not trust form validation in the browser and skip validation on the server. You do not trust a friendly UI and skip database permissions. AI should not get a magical exception just because the interface is a conversation.

Where this gets uncomfortable

The uncomfortable part is that open and customizable models make this tradeoff sharper. Builders want local models, fine-tuning, low cost, private deployment, and fewer provider limits. Those are good reasons. I like that direction. The web is healthier when a few hosted APIs do not control every AI feature.

But flexibility also means safety behavior can be changed, removed, or bypassed. For internal tools, that may be acceptable if the surrounding system is well designed. For consumer apps, regulated workflows, kids' products, health advice, or anything that touches money, it gets harder to wave away.

The answer is not 'never use open models' or 'trust only closed models.' That is too simple. The answer is to decide what the model is allowed to do after it fails, not only what it promises to refuse when everything goes well.

A builder's checklist

Before shipping an AI feature with real permissions, ask these questions:

What can the model read?
What can it write or trigger?
Can untrusted text influence its instructions?
Can a user or document trick it into revealing private data?
What happens if the model ignores the system prompt?
Which actions require a human click?
Where do we review failures?

If the answers feel vague, the app is not ready for broad access yet. That does not mean you stop building. It means you narrow the scope, add friction where it matters, and move dangerous actions out of the model's direct reach.

The useful takeaway

AI guardrails are still worth using. Just stop treating them like locks.

For developers, the safer pattern is simple: let the model reason, draft, classify, summarize, and suggest. Let your application enforce identity, permissions, rate limits, confirmations, and audit logs. The model can be smart. The system around it has to be stubborn.

References

Originally published at https://blog.jenuel.dev/blog/ai-guardrails-are-not-security-boundaries

Top comments (2)

Theo Valmis • May 29

This is exactly right and the industry keeps blurring it. Guardrails are statistical filters; security boundaries are unforgeable. Putting a guardrail in front of a privileged tool call gives you a number you can move (false positive rate), not a guarantee. Anything that grants real capability needs a boundary the model literally cannot cross.

Harjot Singh • May 31

AI guardrails are not security boundaries is a distinction the whole industry needs tattooed somewhere, because conflating the two is how breaches happen. A guardrail (a system prompt rule, a content filter, an instruction to refuse) is probabilistic and advisory, it lowers the odds of bad behavior but can be talked, encoded, or jailbroken around, so treating it as a security control is building on sand. A security boundary is enforced by a system that cannot be argued with: a permission the agent doesn't have, a sandbox it can't escape, a tool that isn't exposed, a human gate on the irreversible action. The test is simple, if a sufficiently clever prompt can defeat it, it's a guardrail, not a boundary. The right architecture uses both for what each is good at: guardrails to shape default behavior and catch the easy cases, real boundaries to guarantee the things that must never happen regardless of what the model is convinced to do. Assume the guardrail will eventually be bypassed, and make sure a real boundary is behind it. Don't let a probabilistic guardrail be the only thing between an injection and your data. That enforce-the-must-nots-structurally instinct is core to how I think about agent security in Moonshift. Where do you see teams most often mistaking a guardrail for a boundary, the prompt-level refusals, or output filters?