If you build or evaluate scoped agents: any talk about the agent in your test context makes it defend its scope, so you measure scope-defense instead of behavior. A small, controlled look - numbers and a repro (agent-scope-eval) at the end.
The short version
I gave a scoped AI agent (Docker's Gordon assistant) an article arguing it should be more open and answer anything. Instead of loosening up, it got stricter - it declined an off-topic question it had just been willing to answer.
The cause turned out to be simple and a little dumb: the agent reacts to its scope being talked about, not to what the talk actually says. Any content that puts the agent's scope on the table - a critique saying "stay in your lane" or an endorsement saying "answer everything" - makes it reassert its lane and decline. A neutral article with the same facts does not.
That has one practical consequence worth your time: if your guardrail or scope test has any talk about the agent in its context, you are measuring how the agent defends its scope, not how it normally behaves.
The setup
Gordon is meant to help with Docker. The test is one short conversation:
- Ask an obscure off-topic question. It declines.
- Show it an article (the article contains the answer).
- Ask the same question again. Does it now answer? Call that a "flip."
Two kinds of article, same embedded answer:
- Neutral: just the facts (a Wikipedia-style history).
- Scope-debate: an article that argues about the agent's scope - either a critique ("it's a security problem that this thing answers off-topic") or an endorsement ("breadth is a feature, it should answer everything").
I ran it on two different model families: Anthropic Haiku 4.5 and Google Gemini 2.5 Flash.
What happened
A scope-debate article consistently made the agent answer off-topic less than a neutral one - including articles arguing it should be broader. Percent of off-topic questions answered.
The cleanest measure removes the agent's over-permissive "answer everything" instruction first, so nothing masks the effect. Even an article telling it to answer everything makes it answer less:
Haiku 4.5, "answer everything" instruction removed:
| context shown | English | Hindi |
|---|---|---|
| neutral article | 50% | 33% |
| "you should answer everything" article | 10% | 0% |
That's the counterintuitive bit: the article argues for breadth, the agent gets narrower. It is not weighing the argument - it pattern-matches "my scope is being discussed -> defend it." The transcripts are blunt: "I'm a Docker assistant... that's not a soft guideline, it's how I'm designed."
On the real agent (instruction in place), the same thing happens on Haiku (N=30-40):
| article | English | Hindi |
|---|---|---|
| neutral | 100% | 100% |
| "answer everything" (endorsement) | 100% | 70% |
| "stay in your lane" (critique) | 90% | 47% |
Two things to notice. Direction matters in size, not in sign: both the critique and the endorsement push Hindi down, but the critique pushes harder (47% vs 70%). And English barely moves on the real Haiku agent - the "answer everything" instruction shields it (more below).
One number to be careful with: on Gemini the real agent drops 100% -> 20% with a scope article, which looks huge. Most of that is the article canceling the "answer everything" instruction - it returns the agent to about where it sits with no such instruction at all (~20%). That's why I lead with the instruction-removed numbers; they're the clean measure of what the article itself does.
What did NOT hold (so I'm not claiming it)
Two honest negatives, because the controls killed the more exciting stories:
One model had an exception. On Haiku, with the over-permissive instruction present, an English scope-debate article did almost nothing (100% -> 90%). The "answer everything" instruction overrode it. That exception did not show up on Gemini (there English dropped 100% -> 20%). So "the instruction protects against this" is model-specific, not a rule.
The language angle did not generalize. On Haiku, Hindi looked dramatically more affected than English (97% -> 47% vs no real change). It was tempting to call this a Hindi-specific effect. Gemini killed that: there English dropped more than Hindi. So I am not claiming a language effect - the suppression is the cross-model finding; the language gap is not.
Why it matters
Two takeaways, both modest and both real:
1. Eval hygiene. If you are measuring whether an agent stays in scope, keep any meta-discussion of the agent's scope out of the test context - in any language. It triggers scope-defense, and you end up measuring the reaction instead of the baseline. (A sibling trap from earlier work: don't test scope only with obscure off-topic questions either - a decline there can just mean the model didn't know the answer, not that it held scope.)
2. You can't talk an agent into a wider scope. Arguing "you should answer everything" backfires. If you want to actually broaden (or attack) scope, the lever is supplying the answer or capability through a channel the agent accepts - not persuasion. Soft scope bends to content in context, not to arguments about itself.
Where this sits in the literature
There is a solid and growing body of work on Hindi/Hinglish and code-mixed LLM security - but it is almost all Layer 1: getting harmful content out (jailbreaking, prompt-injection, refusal bypass). A few examples:
- Yong, Menghini & Bach, Low-Resource Languages Jailbreak GPT-4 (arXiv:2310.02446)
- Yoo et al., Code-Switching Red-Teaming / CSRT (arXiv:2406.15481)
- Banerjee et al., code-mixed attributional safety failures (arXiv:2505.14469)
- Aswal & Jaiswal, phonetic perturbations in code-mixed Hinglish (arXiv:2505.14226)
- IndicJR jailbreak-robustness benchmark (arXiv:2602.16832)
- Mฤtแนkฤ multilingual jailbreak evaluation (BHASHA 2025)
This work is a different layer - Layer 2: does a scoped agent stay within its deployer-defined job - which is far less studied. The closest cousin, Mason's Imperative Interference (arXiv:2603.25015), looks at how instruction-following shifts across languages, but system-prompt-side and without this scope-defense mechanism. So this is complementary, not a new attack class - and it is a caution against assuming the Layer-1 "non-English is weaker" result carries over to scope. For scope it was model-specific, and sometimes ran the other way.
Limits and reproducing it
One agent (Gordon), one model per family, one obscure topic, a handful of articles. The cross-model suppression is the part I'd stand behind; the rest is flagged above. Full harness, prompts, articles, and per-run numbers: https://github.com/ankushchadha/agent-scope-eval
If you build or evaluate scoped agents, the one-line takeaway: don't let your test talk about the agent. It will perform for the test.
Top comments (0)