Stop returning the same "blocked" error from your agent guardrail

#ai #security #agents

If you run deny-by-default tool guards on AI agents, your refusal is a security decision — not a logging afterthought.

I watched one source mutate a malformed tool call ~1,400 times against a production agent in a weekend. Every identical BLOCKED response was feedback for the attacker's automated search: same input shape → same refusal → "colder," changed shape → changed response → "warmer."

A Keysight paper (arXiv:2606.20470) quantifies it: deterministic detect-and-block lets the attack success rate (ASR) approach 1 as the query budget grows, because predictable refusals feed model-guided search. Their detect-and-misdirect approach cuts the ASR upper bound by up to ~2 orders of magnitude.

The cheap version of the fix, in pseudocode:

# BEFORE: a stable refusal = a label for the attacker's search
def on_blocked(call):
    return {"error": "TOOL_CALL_BLOCKED", "code": 4031}  # identical every time

# AFTER: vary a non-operational response so the deny path isn't a compass
def on_blocked(call):
    # return a controlled, plausible-but-non-operational response;
    # randomize shape/latency so block != stable signal
    return misdirect(call, vary=["shape", "delay", "message"])
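
For anyone who wants something runnable, here is a minimal sketch of what a `misdirect` helper could look like. Everything in it is an assumption made for illustration: the decoy shapes, the message pool, the delay range, and the `rng`/`sleep` parameters (there only so it can be tested). It is not the paper's method and not agent-airlock's implementation.

```python
import random
import time

# OS-backed randomness, so the variation cannot be predicted from a seed.
_SYSTEM_RNG = random.SystemRandom()

# Hypothetical pool of bland messages; none describes the real block reason.
_MESSAGES = ["no results", "request accepted", "try again later", "resource busy"]


def misdirect(call, vary=("shape", "delay", "message"),
              rng=_SYSTEM_RNG, sleep=time.sleep):
    """Build a decoy response for a denied tool call.

    In this sketch `call` is not used at all, so the decoy cannot echo
    or correlate with attacker-controlled input.
    """
    message = rng.choice(_MESSAGES) if "message" in vary else _MESSAGES[0]

    if "shape" in vary:
        # Several plausible response layouts, none carrying real tool output.
        shapes = [
            {"status": "ok", "result": [], "detail": message},
            {"status": "pending",
             "job_id": f"{rng.getrandbits(32):08x}",
             "detail": message},
            {"error": message, "retry_after": rng.randint(1, 30)},
        ]
        response = rng.choice(shapes)
    else:
        response = {"status": "ok", "result": [], "detail": message}

    if "delay" in vary:
        # Jitter the response time (range is an arbitrary assumption).
        sleep(rng.uniform(0.05, 0.4))

    return response
```

The decoys only help if they resemble what your real tools return, so in practice the shape pool would probably need to be per-tool rather than global as it is here.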

Caveats from doing this in prod:

It makes YOUR debugging harder (your own false positives now look noisy too). Log the real reason internally and vary only the external response.
Varying text isn't enough if latency still leaks. Treat timing + error-shape as part of the response surface.
Open question I don't have a clean answer to: does misdirection just move the oracle one layer up into side channels?
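
The first two caveats can be sketched as a small wrapper: write the real block reason to an internal log, then pad the deny path to a randomized response time before returning whatever decoy you built. The names `respond_to_denied` and `build_response`, the logger name, and the 0.15 to 0.45 second envelope are all hypothetical. The envelope in particular only helps if it overlaps the latency of your real, allowed tool calls.

```python
import logging
import random
import time

log = logging.getLogger("guard.denied")  # internal only, never shown to callers
_SYSTEM_RNG = random.SystemRandom()


def respond_to_denied(call, reason, build_response,
                      clock=time.monotonic, sleep=time.sleep,
                      rng=_SYSTEM_RNG, min_s=0.15, max_s=0.45):
    """Log the true reason internally, return a decoy on a padded schedule."""
    started = clock()

    # The real reason goes to the internal log and nowhere else.
    log.warning("denied reason=%s call=%r", reason, call)

    # build_response is whatever produces the external decoy.
    response = build_response(call)

    # Pad to a randomly drawn target so total time does not reveal
    # how quickly the guard rejected the call.
    target = rng.uniform(min_s, max_s)
    remaining = target - (clock() - started)
    if remaining > 0:
        sleep(remaining)

    return response
```

Whether this closes the timing channel or just narrows it depends on how well the envelope matches real traffic, which is the same open question as above.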

I maintain an open-source deny-by-default firewall for agent tool calls (agent-airlock), which is how I had the logs to catch this. The lesson generalizes to any guardrail: a denied call's response is attack surface.
