DEV Community

Brian Hall


Don't use an LLM to decide what your AI agent is allowed to do

I'm in a group called AARM. It's a bunch of people trying to work out how you actually secure what an AI agent can do once it's running, and the basic idea is that the control has to sit right at the action. You check a tool call before it runs, and the agent can't wriggle around the check. So everyone in there already agrees that telling an agent "please don't" isn't a security model.

What gets me is that even in that room, I keep seeing people reach for an LLM to be the thing that makes the call. The agent goes to do something, you take that action and hand it to a second model, ask it whether it's fine, and whatever it answers is what happens. A model watching the model. I don't really get it, and I want to walk through why, because I think people lean on this without sitting with what it actually buys them.

What you're actually defending against

Go back to why you want a guard on the agent in the first place. It's there because the agent can be talked into things. Some prompt injection sitting in a page it reads, a tool result that quietly hands it a new instruction, a user who words a request just so. The agent is a thing you can reason with, and the worry is that the wrong person reasons with it.

Now look at what the LLM-judge setup does about that. It puts a second thing you can reason with in front of the first one. That's the part I get stuck on, because it's the same weakness wearing a different hat. If somebody can craft input that bends the agent, there's a real chance the same sort of input bends the judge too, since under the hood it's the same kind of system responding to the same kind of pressure.

Maybe it holds. Maybe you've prompted the judge more carefully and it's tougher to push around. But "harder to talk into things" is a strange thing to be resting on when not getting talked into things is the entire job you hired this layer for.

Same question, different answer

There's a second problem, and in day-to-day terms it's the one that actually bugs me. You can ask a model the same question twice and get two different answers. That isn't a bug you patch out; it's just what the thing is. It's sampling. It isn't a function that hands back the same output every time you give it the same input.

For most of what we build, that's completely fine, and honestly it's part of why models are useful. But once the question is something like whether the agent gets to drop the production database, that property turns into a real liability. The same action can get waved through on Tuesday and stopped on Wednesday, and there's no reason you can actually point at, because there isn't one. There's just a different roll of the dice. Good luck writing that up for an auditor, or explaining it to yourself at two in the morning when you're trying to figure out how something got through that shouldn't have.

A rule doesn't behave that way. A rule like "deny delete on production" means the production database does not get deleted, every single time, no exceptions. You can read the rule, you can test it, you can pull up the log six months later and see exactly what got asked and what came back. The decision is something you can actually stand behind, which is the whole reason it can be the part you trust.
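To make the contrast concrete, here's a minimal sketch of what a rule check like that could look like. Everything in it is an assumption for illustration: the `Rule` shape, the `RULES` list, the first-match-wins ordering, and the default-deny fallback are hypothetical, not any particular tool's format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rule:
    effect: str       # "permit" or "deny"
    action: str       # e.g. "delete", or "*" to match any action
    environment: str  # e.g. "production", or "*" to match any environment


# Hypothetical policy for illustration. The first matching rule wins.
RULES = (
    Rule("deny", "delete", "production"),
    Rule("permit", "*", "staging"),
)


def decide(action: str, environment: str, rules=RULES) -> Tuple[str, Optional[Rule]]:
    """Return (decision, matching rule).

    No model, no randomness, no clock: the same inputs always
    produce the same decision, and the matching rule is the reason.
    """
    for rule in rules:
        if rule.action in ("*", action) and rule.environment in ("*", environment):
            return rule.effect, rule
    # Nothing matched: fall back to deny rather than permit.
    return "deny", None
```

Calling `decide("delete", "production")` comes back as a deny along with the rule that caused it, and it does that on Tuesday, on Wednesday, and six months from now. The returned rule is the "reason you can point at".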

This isn't an argument against LLMs

I want to be careful here, because it's easy to take this too far, and the version where models have no place anywhere near security is also wrong.

Models are great at a lot of this. Looking at an action and noticing something's off about it. Telling you a piece of text is sensitive. Putting a rough score on how risky something seems. Picking up on a pattern across a string of calls that no fixed rule was ever going to catch. That's all real, and for a lot of it a model is the best tool you've got. The issue was never an LLM being near the security boundary. It's the LLM being the boundary, the thing that says the final yes or no.

So where I land is layered. Let the model do the soft work it's genuinely good at, watching for the weird thing, flagging it, telling you to go take a look. Just don't let it be what opens the gate. The actual call on whether a real action runs has to sit on something that gives the same answer every time and can show its work afterward. The model can feed into that all it wants. It just can't be the thing that decides.
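A rough sketch of that layering, under some assumptions: the model's output shows up as a `risk_score` between 0 and 1, and a high score can only push a call toward human review, never turn a deny into a permit. The names `HARD_DENY`, `gate`, and `defer_threshold`, and the 0.7 cutoff, are all made up for the example.

```python
# Hypothetical hard rules: (action, environment) pairs that never run,
# no matter what any model thinks about them.
HARD_DENY = {
    ("delete", "production"),
    ("transfer_funds", "production"),
}


def gate(action: str, environment: str, risk_score: float,
         defer_threshold: float = 0.7) -> str:
    """Decide permit / deny / defer for one tool call.

    risk_score is advisory input from a model. It is only consulted
    after the hard rules, and it can only tighten the outcome.
    """
    # 1. Hard rules first. The score is not even looked at here.
    if (action, environment) in HARD_DENY:
        return "deny"
    # 2. The model's signal can escalate to a human, nothing more.
    if risk_score >= defer_threshold:
        return "defer"
    # 3. Otherwise the call goes through.
    return "permit"
```

The point of the ordering is that the worst a confused or manipulated model can do is send too many things to review, or miss something a rule didn't cover. It can't open a gate that a rule has closed, and if you log the score alongside the decision, the decision is still reproducible from its inputs.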

Where it actually bites

The closer your agent gets to anything that matters (money, prod, customer data), the less theoretical any of this is. If the worst it can do is write a bad paragraph, then fine, none of this is worth losing sleep over and you should go do something more useful with your afternoon. But the moment it can move money or drop a table, what's allowed to run can't come down to a coin flip, and it really can't live inside the same kind of system you were trying to protect yourself from to begin with.

Put the smart, context-aware stuff where it's strong, which is noticing when something's wrong. Put the hard line somewhere the agent can't talk its way past.

That last part is the thinking behind Faramesh, the open-source thing I've been building. The permit/deny/defer decision is deterministic, with no model sitting in that path, and every call lands in a signed log. But the tool is kind of beside the point. Even if you go build your own version of this, keep the final decision off the model. That piece should be boring on purpose.
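If you do build your own, here's one way the signed-log part could look. To be clear, this is not Faramesh's format or API, just a sketch using Python's standard `hmac` and `hashlib` modules. The entry fields, the chaining of each entry to the previous signature, and the function names are all assumptions, and real key management is left out entirely.

```python
import hashlib
import hmac
import json


def _sign(body: dict, key: bytes) -> str:
    # Serialize with sorted keys so the same entry always signs the same way.
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append_entry(log: list, key: bytes, action: str, decision: str) -> dict:
    """Append a signed entry that also commits to the previous entry's signature."""
    prev_sig = log[-1]["sig"] if log else ""
    body = {"action": action, "decision": decision, "prev": prev_sig}
    entry = dict(body, sig=_sign(body, key))
    log.append(entry)
    return entry


def verify(log: list, key: bytes) -> bool:
    """Return True only if every entry is intact and the chain is unbroken."""
    prev_sig = ""
    for entry in log:
        body = {k: entry[k] for k in ("action", "decision", "prev")}
        if entry["prev"] != prev_sig:
            return False
        if not hmac.compare_digest(entry["sig"], _sign(body, key)):
            return False
        prev_sig = entry["sig"]
    return True
```

Because each entry includes the signature of the one before it, quietly editing or dropping an old decision breaks verification for everything from that point on, which is what lets you trust the log when you pull it up months later.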
