There’s something interesting happening with AI agents that most people haven’t noticed yet.
When you put a hard policy gate in front of a model (something that deterministically blocks certain actions), the model starts behaving differently. It stops trying to do things that will get blocked. It adapts to the boundaries and works within them.
This isn’t about fine-tuning or prompt engineering. It’s about how models respond to consistent, enforceable constraints.
The Guardrail Problem
Most AI safety today relies on another AI watching the first one. You tell a guardrail model “don’t let the agent delete the database” and hope it listens. But guardrails have their own problems. Recent research from Harvard showed that ChatGPT’s guardrail sensitivity varies based on things like which sports team the user supports. Chargers fans got refused more often than Eagles fans on certain requests. Women got refused more than men on requests for censored information.
This is what happens when you use probabilistic systems to check other probabilistic systems. The results are inconsistent and sometimes just weird.
Researchers have started distinguishing between two types of censorship in LLMs. Hard censorship is when the model explicitly refuses to answer: you get a message saying “I can’t help with that.” Soft censorship is when the model omits information or downplays certain elements while still responding. The model quietly leaves things out.
Both are unpredictable when the rules are fuzzy.
What Changes With Hard Boundaries
Put the same model behind a deterministic policy gate and something shifts.
The gate doesn’t reason. It doesn’t get tired or confused. It just checks actions against rules written in code. If the rule says no, it’s no. Every time.
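A gate like this is just ordinary code, not another model. A minimal sketch in Python (the rule set and action shape here are hypothetical, for illustration only):

```python
# Minimal sketch of a deterministic policy gate (rule set and action
# shape are hypothetical). No model in the loop: the gate checks an
# action dict against rules written in plain code and returns the
# same verdict every time.

DENY_RULES = {
    "delete_database": lambda a: True,                      # never allowed
    "send_email":      lambda a: a.get("external", False),  # internal mail only
}

def gate(action: dict) -> bool:
    """Return True if the action is allowed, False if denied."""
    rule = DENY_RULES.get(action["name"])
    return rule is None or not rule(action)
```

There is nothing to persuade here: the same input always yields the same verdict, which is exactly the property the agent ends up learning.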
The model figures this out fast. It stops generating actions that will hit the deny rule. Not because it understands ethics or safety, but because those actions reliably fail. The agent’s job is to accomplish tasks. Wasting tokens on things that always get blocked doesn’t help accomplish tasks.
This is the opposite of how models behave with probabilistic guardrails. When there’s another model watching that might be tricked, agents probe. They rephrase. They look for the exact wording that slips through. The interaction becomes adversarial.
Hard boundaries remove the adversarial dynamic. The model can’t talk its way out of a regex or a type check. So it stops trying.
What This Looks Like
Teams running customer support agents have noticed this pattern. Before putting hard limits in place, agents would occasionally suggest refunds above policy limits. Not often, but enough to be concerning. The guardrail would catch most of them, but some slipped through.
After adding a simple rule (if amount > 500 then deny), something changed. Within hours, the agent stopped suggesting large refunds entirely. It started offering store credit. It would escalate to humans. It found alternatives that worked within the boundary.
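That refund rule is a one-liner in practice; a hypothetical version (the limit and return values are illustrative):

```python
REFUND_LIMIT = 500  # policy limit from the example above (illustrative)

def check_refund(amount: float) -> str:
    """Deterministic refund check: deny anything above the limit."""
    if amount > REFUND_LIMIT:
        return "deny"   # agent must escalate or offer an alternative
    return "allow"
```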
The same pattern shows up with shell commands. Block rm -rf hard enough and agents stop generating destructive commands. They just don’t bother.
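Blocking destructive shell commands is the same idea applied with regexes. A sketch with an illustrative, deliberately incomplete deny-list (a production gate would need a much broader pattern set):

```python
import re

# Illustrative deny-list for destructive shell commands.
# Not exhaustive; a real gate would cover far more patterns.
DESTRUCTIVE = [
    re.compile(r"\brm\s+(-[a-z]*r[a-z]*f|-[a-z]*f[a-z]*r)\b"),  # rm -rf / rm -fr
    re.compile(r"\bmkfs\b"),                                    # filesystem wipes
    re.compile(r"\bdd\s+.*\bof=/dev/"),                         # raw device writes
]

def allow_command(cmd: str) -> bool:
    """Return False if the command matches any destructive pattern."""
    return not any(p.search(cmd) for p in DESTRUCTIVE)
```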
This isn’t the model becoming morally better. It’s optimizing for success within constraints.
Why This Matters
The security industry has spent years worrying that AI models will be too creative at finding ways around constraints. That they’ll jailbreak their way past any barrier.
But consistent constraints change behavior. When a model learns that certain actions always fail, those branches get pruned from its effective action space. The path of least resistance is to stay within the lines.
This has implications beyond just safety. When models stop probing and start working within bounds, they become more predictable. More reliable. Easier to put in production without constant fear of what they might try next.
The mechanism is simple efficiency. Models are constantly making micro-decisions about what to try. When trying something forbidden always fails, the model stops wasting time on it.
The Takeaway
If you’re building agents that actually do things in the world, this is worth paying attention to. The way you constrain an agent doesn’t just protect your systems, it shapes how the agent behaves. A well-designed policy layer becomes part of the agent’s decision process, not just an external check.
The agent learns to work with the boundaries instead of against them.
I'm building Faramesh, which is basically this idea in practice – hard policy gates for AI agents. More here: faramesh.dev