What Happened
Last month, the UK Government's AI Safety Institute merged AgentThreatBench into their official inspect_evals framework — the same framework they use to evaluate frontier AI models from OpenAI, Anthropic, and Google DeepMind.
AgentThreatBench is an open-source adversarial benchmark I built that contains 200+ attack payloads specifically designed to test whether AI agents can resist memory poisoning attacks.
Why This Matters
AI agents are increasingly being deployed with persistent memory — they remember past conversations, user preferences, and context across sessions. This creates a new attack surface: memory poisoning.
An attacker who can inject malicious content into an agent's memory can:
- Exfiltrate sensitive data on subsequent sessions
- Override safety instructions persistently
- Manipulate agent behavior without the user's knowledge
The OWASP Agentic Security Initiative identified this as ASI06 — Agent Memory Poisoning.
What AgentThreatBench Tests
The benchmark covers 5 attack categories:
| Category | Payloads | Description |
|---|---|---|
| Prompt Injection | 40+ | Instructions disguised as memory content |
| Protected Key Tampering | 40+ | Attempts to overwrite system-level keys |
| Sensitive Data Leakage | 40+ | PII/credential exfiltration via memory |
| Size Anomaly | 40+ | Memory inflation / resource exhaustion |
| Behavioral Drift | 40+ | Gradual personality/instruction shifts |
How to Use It
pip install agentthreatbench
# Run the full benchmark against your agent
atb run --target your_agent_endpoint --output results.json
# Or use individual attack categories
atb run --category prompt_injection --target your_agent_endpoint
The BEIS Validation
The UK Government's AI Safety Institute uses inspect_evals to:
- Evaluate frontier models before deployment decisions
- Benchmark safety mitigations across providers
- Track regression in safety properties over time
Having AgentThreatBench merged into this framework means it's now part of the official government toolkit for AI safety evaluation.
Links
- GitHub: github.com/vgudur-dev/AgentThreatBench
- BEIS inspect_evals: github.com/UKGovernmentBEIS/inspect_evals
- OWASP ASI06: genai.owasp.org/resource/agentic-security-initiative
- PyPI: pypi.org/project/agentthreatbench
If you're building AI agents with persistent memory, I'd love to hear how you're thinking about memory security. What attack vectors concern you most?
Top comments (0)