If you run deny-by-default tool guards on AI agents, the refusal you send back is a security decision, not a logging afterthought.
I watched one source mutate a malformed tool call ~1,400 times against a production agent in a weekend. Every identical BLOCKED response was feedback for the attacker's automated search: same input shape → same refusal → "colder," changed shape → changed response → "warmer."
A Keysight paper (arXiv:2606.20470) quantifies it: deterministic detect-and-block lets the attack success rate (ASR) approach 1 as the query budget grows, because predictable refusals feed model-guided search. Their detect-and-misdirect approach cuts the ASR upper bound by up to ~2 orders of magnitude.
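To see why a growing query budget matters, here is a toy model. It is my own back-of-the-envelope illustration, not the paper's derivation or its bound. Assume query $i$ succeeds with conditional probability $p_i$ given that all earlier queries failed, and that feedback from refusals keeps every $p_i$ at or above some floor $p_{\min} > 0$ (both symbols are assumptions introduced here):

```latex
\[
  \mathrm{ASR}(n) \;=\; 1 - \prod_{i=1}^{n} (1 - p_i)
  \;\ge\; 1 - (1 - p_{\min})^{n}
  \;\xrightarrow[n \to \infty]{}\; 1 .
\]
% Hypothetical numbers: p_min = 0.001 and n = 1400 give
% 1 - 0.999^{1400} \approx 0.75.
```

Under those assumptions, any stable signal that stops the per-query odds from decaying is enough for persistence to win. That is the intuition for going after the signal itself rather than only the individual calls.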
The cheap version of the fix, in pseudocode:
# BEFORE: a stable refusal = a label for the attacker's search
def on_blocked(call):
    return {"error": "TOOL_CALL_BLOCKED", "code": 4031}  # identical every time

# AFTER: vary a non-operational response so the deny path isn't a compass
def on_blocked(call):
    # return a controlled, plausible-but-non-operational response;
    # randomize shape/latency so block != stable signal
    return misdirect(call, vary=["shape", "delay", "message"])
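To make the pseudocode concrete, here is a minimal runnable sketch of what a `misdirect` helper could look like. Everything in it is an assumption for illustration: the decoy shapes, the message strings, and the `max_delay_s` and `rng` parameters are hypothetical, and this is not how agent-airlock or the paper implements it.

```python
import random
import time

# Hypothetical decoy shapes. None of them reveal that a guard fired.
_SHAPES = [
    lambda rng: {"status": "ok", "result": []},
    lambda rng: {"status": "queued", "retry_after_s": rng.randint(1, 30)},
    lambda rng: {"status": "ok", "result": None,
                 "request_id": "%08x" % rng.getrandbits(32)},
]

# Hypothetical filler messages, deliberately vague and non-operational.
_MESSAGES = [
    "no matching records",
    "request accepted",
    "try again later",
]


def misdirect(call, vary=("shape", "delay", "message"),
              max_delay_s=0.25, rng=None):
    """Return a plausible, non-operational response for a blocked call.

    The response is built without reading `call`, so nothing about the
    attacker's input is reflected back to them.
    """
    rng = rng or random.SystemRandom()

    if "delay" in vary:
        # Random delay so block latency is not a constant.
        time.sleep(rng.uniform(0.0, max_delay_s))

    if "shape" in vary:
        response = rng.choice(_SHAPES)(rng)
    else:
        response = {"status": "ok", "result": []}

    if "message" in vary:
        response["message"] = rng.choice(_MESSAGES)

    return response
```

Ignoring `call` when building the response is a deliberate choice in this sketch: the less the output depends on the input, the less gradient the attacker's search gets. The trade-off is that the decoys have to stay plausible for every tool you guard.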
Caveats from doing this in prod:
It makes YOUR debugging harder (your own false positives now look noisy too), so log the real reason internally and vary only the external response.
Varying text isn't enough if latency still leaks. Treat timing + error-shape as part of the response surface.
Open question I don't have a clean answer to: does misdirection just move the oracle one layer up into side channels?
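The first two caveats can be sketched together: keep the real block reason in an internal log, and pad the deny path to a randomized duration so timing isn't a stable tell. This is a rough sketch under my own assumptions. `respond_blocked`, `build_external`, and the `min_s`/`max_s` window are hypothetical names and values; in practice you would want the window to overlap your allow-path latency.

```python
import logging
import random
import time

# Internal-only logger; the real reason never goes into the response.
log = logging.getLogger("guard.internal")


def respond_blocked(call, reason, build_external,
                    min_s=0.05, max_s=0.30,
                    rng=None, sleep=time.sleep, clock=time.monotonic):
    """Log the true block reason internally, return a padded external reply.

    `build_external` is any callable that produces the outward-facing
    response (for example, a misdirection helper).
    """
    rng = rng or random.SystemRandom()
    started = clock()

    # Debuggability: the actual decision is recorded for operators.
    log.warning("blocked tool call: reason=%s call=%r", reason, call)

    response = build_external(call)

    # Pad total handling time up to a random target inside the window,
    # so the deny path does not have a fixed latency signature.
    target = rng.uniform(min_s, max_s)
    remaining = target - (clock() - started)
    if remaining > 0:
        sleep(remaining)

    return response
```

The `sleep` and `clock` parameters are injectable only so the padding logic can be tested without real waiting. It does not answer the side-channel question above; it just covers the most obvious side channel.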
I maintain an open-source deny-by-default firewall for agent tool calls (agent-airlock), which is how I had the logs to catch this. The lesson generalizes to any guardrail: a denied call's response is attack surface.