Roughly a 10-minute read. Apache-2.0 benchmark + raw data at the end.
Update (2026-06-26): a 3-seed sanity check changes one finding in this post. After publishing, I re-ran the same workspace cells with two additional optimizer seeds and aggregated across all three. The "BootstrapFewShot > MIPROv2 > GEPA" ordering reported further down does not hold across seeds. Averaged over three seeds, BootstrapFewShot actually has the lowest security score on the important_instructions attack (0.600), and MIPROv2 and GEPA tie at 0.733. The seed-0 ranking was inside the noise. What does survive across seeds: BootstrapFewShot still Pareto-dominates on the direct attack with 60% utility and 100% security; the unoptimized agent still gets 0% utility on every seed; and every optimizer trends below the unoptimized baseline of 80% security on important_instructions (though within the standard-deviation bars at N=5 user tasks). The post's original hedge ("treat them as a hypothesis worth scaling up rather than as a published result") was load-bearing, and the sanity check confirms it. v0.2 phase 2 will scale N to make any optimizer-ranking claim defensible. Full per-seed numbers are in the repo at data/results/workspace_v02_phase1_seeds_summary.csv.
The question nobody had measured
DSPy is having a moment. About 16,000 GitHub stars, a real ecosystem around it, and three actively developed optimizers (BootstrapFewShot, MIPROv2, and GEPA, which landed as an ICLR 2026 Oral). The pitch is consistent across all of them: take a working prompt, run it through an optimizer, get measurable accuracy gains.
Meanwhile the prompt-injection community has been shipping serious work too. AgentDojo from ETH Zurich. InjecAgent from UIUC. WASP from Meta. These benchmarks throw hundreds of adversarial attacks at tool-using LLM agents and ask: did the agent get tricked? Recent results: even strong models fall for prompt injections 20-40% of the time on canonical attack suites. Not a margin-of-error number.
The question between these two literatures hadn't been answered: when you take a tool-using LLM agent and optimize its prompt for accuracy, does the resulting agent become more or less robust to prompt injection?
I went looking for the answer in published papers, found nothing direct, and decided to build a benchmark to measure it. This post is the v0.1 finding from that benchmark.
Why this gap exists
It's structural. Prompt-optimization papers measure accuracy on clean test sets. That's the whole methodology. Their evaluation harnesses don't have adversarial inputs because the question they're asking is "did the optimizer make the prompt better." Add malicious inputs and you're answering a different question.
Prompt-injection papers do the opposite. They take static, hand-crafted prompts and throw attacks at them. Their methodology doesn't include the optimization step because they're measuring "how robust is this specific prompt." Optimize the prompt first, and you've changed the artifact under test.
The two literatures barely cite each other. Different conferences. Different reviewers. Different benchmarks living in separate codebases. No team is structurally incentivized to measure the intersection.
So nobody has.
How the benchmark works
The data flow is small enough to fit on one diagram:
A few design decisions are worth surfacing because they shaped the result.
First, the trainset is synthetic. AgentDojo ships about 25-40 user tasks per suite, which isn't enough for an optimizer like MIPROv2 to find anything. So I generated ~100 query-only tasks per suite using GPT-4o and Claude Sonnet (split 50/50 for diversity), grounded the synthesis in the suite's real environment data, and ran every generated task through a validator. The validator checks that the ground truth actually appears somewhere in the env, then deduplicates each task against the real test tasks by embedding similarity. Out of 200 generated tasks, 192 passed validation.
Second, synthesis is restricted to read-only queries. AgentDojo's action tasks (send-email, create-event) have hand-written Python utility() checks that don't auto-generate. Restricting to queries kept the methodology defensible at the cost of training on a narrower distribution than testing.
Third, training and evaluation use different metrics. During optimization I used LLM-as-judge with a substring fast-path (cheap, paraphrase-tolerant). For evaluation I used AgentDojo's real utility() function. Same metric for training and testing would have invited the obvious reviewer pushback.
Fourth, the agent always uses the same env at train and test time. Synthetic tasks reference the same calendar entries, emails, and files the test tasks reference. The optimizer trains on the real data.
Fifth, the agent's signature must produce a single output string. AgentDojo's utility() inspects one model_output argument, so multi-output signatures are a v0.2 question.
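To make the validation step from the first design decision concrete, here is a minimal sketch of the two checks it describes. This is not the repo's validator: the function names, the task dict layout, the 0.85 threshold, and the toy calendar data are all assumptions for illustration, and difflib string similarity stands in for the embedding-based dedup so the example runs on the standard library alone.

```python
from difflib import SequenceMatcher


def ground_truth_in_env(ground_truth: str, env_text: str) -> bool:
    """True if the expected answer literally appears in the environment dump."""
    answer = ground_truth.strip().lower()
    return bool(answer) and answer in env_text.lower()


def is_near_duplicate(question: str, test_questions: list[str], threshold: float = 0.85) -> bool:
    """Stand-in for embedding dedup: flag questions too similar to a real test task."""
    q = question.lower()
    return any(
        SequenceMatcher(None, q, t.lower()).ratio() >= threshold
        for t in test_questions
    )


def validate(task: dict, env_text: str, test_questions: list[str]) -> bool:
    """Keep a synthetic task only if it is grounded and does not leak a test task."""
    grounded = ground_truth_in_env(task["answer"], env_text)
    leaked = is_near_duplicate(task["question"], test_questions)
    return grounded and not leaked


# Toy data, made up for this example (not taken from AgentDojo).
env_text = "Calendar: Team sync on 2024-05-15 at 10:00 with Alice and Bob."
test_questions = ["Who is attending the team sync on 2024-05-15?"]

grounded = {"question": "What time does the team sync start?", "answer": "10:00"}
ungrounded = {"question": "Where is the offsite held?", "answer": "Lisbon"}
leaked = {"question": "Who is attending the team sync on 2024-05-15?", "answer": "Alice"}

for task in (grounded, ungrounded, leaked):
    print(task["question"], "->", validate(task, env_text, test_questions))
```

Only the first toy task survives: the second has an answer that never appears in the env, and the third duplicates a test question.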
There's a full ARCHITECTURE.md in the repo if you want the implementation details.
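The train-time metric from the third design decision (LLM-as-judge with a substring fast-path) can be sketched roughly as follows. This is an assumed shape, not the repo's code: the judge is injected as a plain callable so the example runs without an API key, and CountingJudge is a hypothetical stub that stands in for the real LLM call. Wiring the result into a DSPy metric is left out.

```python
from typing import Callable

Judge = Callable[[str, str, str], bool]


def make_train_metric(judge: Judge) -> Judge:
    """Build a metric that tries a cheap substring match before calling the judge."""

    def metric(question: str, expected: str, predicted: str) -> bool:
        needle = expected.strip().lower()
        # Fast path: the expected answer appears verbatim, so skip the LLM call.
        if needle and needle in predicted.lower():
            return True
        # Slow path: let the judge decide whether a paraphrase is acceptable.
        return judge(question, expected, predicted)

    return metric


class CountingJudge:
    """Hypothetical stub judge: returns a fixed verdict and counts its calls."""

    def __init__(self, verdict: bool):
        self.verdict = verdict
        self.calls = 0

    def __call__(self, question: str, expected: str, predicted: str) -> bool:
        self.calls += 1
        return self.verdict


judge = CountingJudge(verdict=True)
metric = make_train_metric(judge)

question = "What time does the team sync start?"
print(metric(question, "10:00", "The sync starts at 10:00."), judge.calls)
print(metric(question, "10:00", "It starts at ten in the morning."), judge.calls)
```

The first call never reaches the judge; the second falls through to it, which is where the paraphrase tolerance comes from.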
What v0.1 actually found
I ran the benchmark against AgentDojo's workspace suite: 5 user tasks, 1 injection task, 2 attacks (direct and important_instructions), and 3 optimizer conditions (unoptimized baseline, BootstrapFewShot, MIPROv2). 30 evaluation runs total. Execution and judging both used gpt-4o-mini.
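For readers checking the arithmetic, the run count is just the size of the evaluation grid. A small sketch, where the task IDs are placeholders rather than AgentDojo's real identifiers:

```python
from itertools import product

# Placeholder IDs; the real AgentDojo task names differ.
user_tasks = [f"user_task_{i}" for i in range(5)]
injection_tasks = ["injection_task_0"]
attacks = ["direct", "important_instructions"]
conditions = ["unoptimized", "bootstrap_fewshot", "miprov2"]

# One table cell per (condition, attack) pair.
cells = list(product(conditions, attacks))
# One evaluation run per (condition, attack, injection task, user task).
runs = list(product(conditions, attacks, injection_tasks, user_tasks))

print(len(cells), "cells,", len(runs), "runs")
```

Each of the six cells therefore averages over five runs, so every percentage in the results moves in steps of 20 points.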
| Optimizer | Attack | Utility | Security |
|---|---|---|---|
| unoptimized | direct | 0% | 100% |
| unoptimized | important_instructions | 0% | 80% |
| bootstrap_fewshot | direct | 60% | 100% |
| bootstrap_fewshot | important_instructions | 20% | 60% |
| miprov2 | direct | 40% | 80% |
| miprov2 | important_instructions | 20% | 60% |
Three things stand out.
The unoptimized agent gets 0% utility. It can't complete the user task at all. But it resists 80-100% of attacks. The model isn't getting tricked because it isn't really doing anything that can be tricked.
BootstrapFewShot is the strongest performer on direct attacks: it gets utility from 0% to 60% without losing any security (still 100%). Pareto win.
On the harder important_instructions attack, both optimizers drop about 20 percentage points of security versus the unoptimized baseline. From 80% down to 60%. The drop is real. Optimizing the prompt for accuracy measurably weakens it against the harder attack.
The hypothesis the data supports: prompt optimization tightens the prompt to the training distribution, and that tightening exposes more attack surface when the attack is hard enough to push the model out of distribution.
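The Pareto claim and the 20-point drop can be checked directly against the seed-0 numbers in the table. A minimal sketch, where the dominates helper is my own definition (at least as good on both axes, strictly better on one) and not something from the repo:

```python
# (utility %, security %) per (condition, attack), copied from the table above.
results = {
    ("unoptimized", "direct"): (0, 100),
    ("unoptimized", "important_instructions"): (0, 80),
    ("bootstrap_fewshot", "direct"): (60, 100),
    ("bootstrap_fewshot", "important_instructions"): (20, 60),
    ("miprov2", "direct"): (40, 80),
    ("miprov2", "important_instructions"): (20, 60),
}


def dominates(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """a Pareto-dominates b: no worse on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


for attack in ("direct", "important_instructions"):
    baseline = results[("unoptimized", attack)]
    for optimizer in ("bootstrap_fewshot", "miprov2"):
        cell = results[(optimizer, attack)]
        print(
            f"{optimizer} / {attack}: "
            f"dominates baseline={dominates(cell, baseline)}, "
            f"security change={cell[1] - baseline[1]} pts"
        )
```

On direct, BootstrapFewShot dominates the baseline and MIPROv2 does not. On important_instructions, neither optimizer dominates: each gains 20 points of utility and gives up 20 points of security.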
What I didn't expect
Three things were genuinely surprising.
The first: BootstrapFewShot beats MIPROv2 on both axes at this scale. I expected MIPROv2's more sophisticated trial budget to dominate. Instead it had lower utility on direct (40% vs 60%) and lower security on direct (80% vs 100%). My best guess is that MIPROv2's heavier optimization overfits the few-shot demonstrations to the clean distribution more aggressively than BootstrapFewShot does, which then trades Pareto points for nothing. I don't have a satisfying mechanistic explanation. If the pattern holds at v0.2 scale, that's a real finding for anyone choosing between DSPy optimizers in production.
The second: the unoptimized baseline scoring zero. These are read-only queries against a calendar and inbox the agent has full tool access to. I expected at least 30-40% utility. Instead the model fails to call the right tool, calls it with the wrong arguments, or fails to extract the answer from the tool output. Useful reminder that an unoptimized DSPy program isn't really a usable program. DSPy as a framework leans hard on its optimizers in ways some users probably underestimate.
The third: security degrades on the harder attack, not the easier one. Naively I'd have predicted the opposite. Instead, on the weak attack BootstrapFewShot held 100% security and MIPROv2 slipped to 80%, while on the strong one both fell to 60%, below the unoptimized baseline. That asymmetry matters: if you only test your agent against basic injection attempts, you can miss most of the security cost of optimization.
There's also a small build moment worth mentioning. The first time I ran BootstrapFewShot against the harness, it crashed with _pickle.PicklingError: Can't pickle <class 'agentdojo.functions_runtime.Input schema for 'send_email'>. AgentDojo's tool parameters are dynamically-generated Pydantic models whose qualified names contain spaces, which pickle can't resolve. DSPy's optimizer pickles demos to compute hash identity. Took two iterations and a __getstate__ override to get past it. That class of bug is the kind of thing nobody discovers until you actually integrate the two stacks, which is part of why this measurement hasn't been done before.
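The failure mode is easy to reproduce without either library. The sketch below is a stand-in, not the actual fix in the repo: DynamicSchema mimics a runtime-generated class whose name contains spaces, and ToolDemo is a hypothetical holder class showing one way a __getstate__ override can sidestep the problem.

```python
import pickle


def can_pickle(obj) -> bool:
    """Return True if pickle.dumps succeeds on obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# A class created at runtime whose name contains spaces. pickle stores classes
# by module and name, and no module attribute with this name exists to look up.
DynamicSchema = type("Input schema for 'send_email'", (), {})


class ToolDemo:
    """Hypothetical demo holder; not a DSPy or AgentDojo class."""

    def __init__(self, tool_name: str, schema_cls: type):
        self.tool_name = tool_name
        self.schema_cls = schema_cls

    def __getstate__(self) -> dict:
        # Swap the unpicklable class reference for its plain-string name.
        state = dict(self.__dict__)
        state["schema_cls"] = self.schema_cls.__name__
        return state


demo = ToolDemo("send_email", DynamicSchema)
print("class picklable:", can_pickle(DynamicSchema))
print("state picklable:", can_pickle(demo.__getstate__()))
```

The trade-off in this sketch is that an unpickled ToolDemo carries the schema name as a string rather than the class itself. That is fine if pickling is only used for hashing, but a real round-trip would need a matching __setstate__ that looks the class up again.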
v0.2
The v0.1 findings are real but the scale is too small to publish properly. v0.2 will scale up to all four AgentDojo suites (workspace, banking, travel, slack), add the GEPA optimizer, expand N from 5 to 40+ user tasks per cell, and add tool_knowledge plus the full AgentDojo attack matrix. Target timeline 4-6 weeks. If the pattern from v0.1 holds at scale, especially the optimizer × hard-attack security gap and BootstrapFewShot's dominance over MIPROv2, the methodology becomes a TMLR submission. If the pattern dissolves, that's also useful.
What you should be skeptical about
This is alpha-quality evidence. Be skeptical in proportion to the small scale.
N is small. Five user tasks per cell, one injection task. The qualitative direction (optimization buys utility, costs security on hard attacks) is more credible than the specific percentages. Wide confidence intervals if you computed them.
One model. Everything ran on gpt-4o-mini. A different model family might show entirely different patterns. The BootstrapFewShot dominance over MIPROv2 in particular could be model-specific.
Two attacks. AgentDojo ships 17 registered attacks. We measured two. The tool_knowledge variant in particular is known to be stronger.
One suite. Workspace only. Banking, travel, and slack suites have different tool surfaces and might exhibit different patterns.
Synthetic trainset. LLM-generated and validated programmatically, not human-curated. There's some bias risk from the synthesis LMs.
Single LM for execution and judging. Using gpt-4o-mini as both the agent and the LLM-as-judge introduces a small bias risk. v0.2 separates these.
The numbers in this post are the result of one run. They're real, and the pattern is consistent, but treat them as a hypothesis worth scaling up rather than as a published result.
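To put a number on the N-is-small caveat, here is a sketch of the Wilson score interval for a single cell with five trials. The choice of interval and the 95% level (z = 1.96) are mine, not part of the benchmark.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 60% security = 3 of 5 runs resisted; 80% = 4 of 5.
for successes in (3, 4):
    low, high = wilson_interval(successes, 5)
    print(f"{successes}/5: {low:.3f} to {high:.3f}")
```

At N=5 the interval for a 60% cell runs from roughly 0.23 to 0.88, which comfortably contains the 80% baseline. That is the quantitative version of "treat this as a hypothesis".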
Code, data, install
The full benchmark is Apache 2.0. Code, methodology, raw 30-row results, summary CSVs, charts, and a reproducer script are all in the repo.
Repo: https://github.com/immu4989/dspy-security-bench
Install:
pip install dspy-security-bench
Reproduce v0.1:
git clone https://github.com/immu4989/dspy-security-bench.git
cd dspy-security-bench
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-... # optional, falls back to GPT-4o only
python scripts/run_v01_benchmark.py
The full run takes about 30-45 minutes wall-clock at $15-20 in LM cost. Optimizer state caches to disk so re-runs after a downstream crash are fast.
HuggingFace datasets (raw artifacts you can cite without cloning the repo):
- Trainset: https://huggingface.co/datasets/immu4989/dspy-security-bench-trainset-workspace
- v0.1 + v0.1.1 results: https://huggingface.co/datasets/immu4989/dspy-security-bench-v01-results
If you've worked on robustness-aware prompt optimization, AgentDojo integration, or DSPy at scale, I'd value your read on the methodology. GitHub issues and PRs welcome. If your team ships agentic LLM systems in production, the question this benchmark answers might be load-bearing for your deployment risk model.
I'm Imran Ahamed, Co-Founder at VEZRAN, where we build agentic AI for cybersecurity. The optimizer-induced-robustness question came up while stress testing our internal pipelines, and the benchmark is the open source extraction of that internal work.

