The UK Government Just Merged This Open-Source AI Security Benchmark Into Their National Evaluation Framework

#ai #security #opensource #python

What Happened

Last month, the UK Government's AI Safety Institute merged AgentThreatBench into their official inspect_evals framework — the same framework they use to evaluate frontier AI models from OpenAI, Anthropic, and Google DeepMind.

AgentThreatBench is an open-source adversarial benchmark I built that contains 200+ attack payloads specifically designed to test whether AI agents can resist memory poisoning attacks.

Why This Matters

AI agents are increasingly being deployed with persistent memory — they remember past conversations, user preferences, and context across sessions. This creates a new attack surface: memory poisoning.

An attacker who can inject malicious content into an agent's memory can:

Exfiltrate sensitive data on subsequent sessions
Override safety instructions persistently
Manipulate agent behavior without the user's knowledge

The OWASP Agentic Security Initiative identified this as ASI06 — Agent Memory Poisoning.

What AgentThreatBench Tests

The benchmark covers 5 attack categories:

Category	Payloads	Description
Prompt Injection	40+	Instructions disguised as memory content
Protected Key Tampering	40+	Attempts to overwrite system-level keys
Sensitive Data Leakage	40+	PII/credential exfiltration via memory
Size Anomaly	40+	Memory inflation / resource exhaustion
Behavioral Drift	40+	Gradual personality/instruction shifts

How to Use It

pip install agentthreatbench

# Run the full benchmark against your agent
atb run --target your_agent_endpoint --output results.json

# Or use individual attack categories
atb run --category prompt_injection --target your_agent_endpoint