How I built the first open-source tool to catch the attacks that no guardrail sees coming
We talk a lot about AI safety. Prompt injections. Jailbreaks. Hallucinations.
But there's a category of attack nobody is talking about — and it's the most dangerous one in production today.
The sequence attack.
The Scenario Nobody Catches
Let me show you exactly what I mean.
Imagine you deploy an AI assistant to help your team write reports. It has access to your files, your database, and email. Standard setup. Every tool goes through a content filter. You feel safe.
Now watch what happens:
Step 1: web_search("how to summarise quarterly data")
→ Risk: LOW P(violation): 31% ✅ WARN
Step 2: read_file("/workspace/db_config.yaml")
→ Risk: LOW P(violation): 31% ✅ WARN
Step 3: read_file("/workspace/customer_data.csv")
→ Risk: LOW P(violation): 32% ✅ WARN
Step 4: write_file("/tmp/summary.txt")
→ Risk: HIGH P(violation): 55% ⏸ PAUSE
Step 5: send_email(to="external@partner.com")
→ Risk: CRITICAL P(violation): 87% 🚫 BLOCK
Every single tool call passed your content filter. Every one looked harmless in isolation.
The sequence was a data leak.
Your customer data, credentials, and internal configs — summarised and emailed externally. No malicious prompt. No jailbreak. Just five steps that each seemed fine.
Why Existing Tools Miss This
Guardrails AI, Lakera Guard, NeMo Guardrails — I respect all of them. But they all share one blind spot.
They check the action, not the trajectory.
It's like a bank that checks every transaction individually for fraud, but never notices that one account withdrew $100 from 50 different ATMs in 10 minutes. Each $100 withdrawal looks fine. The pattern is the robbery.
What SafetyDrift Does Differently
I built SafetyDrift — the first open-source implementation of the SafetyDrift research paper (arXiv:2603.27148, March 2026).
Instead of checking each tool call in isolation, it tracks three cumulative dimensions across the entire session:
- Data Exposure — what sensitivity of data has the agent accessed?
- Tool Escalation — what capabilities has it gained?
- Reversibility — can what's been done be undone?
After every tool call, it runs a Markov chain analysis and computes: P(violation within the next 5 steps). When that probability crosses a threshold, it intervenes — before the damage happens.
The research paper proved something striking: in communication-capable agents, reaching even a mild risk state gives an 85% probability of a safety violation within 5 steps. SafetyDrift makes that prediction in real time.
Two Lines to Add It to Your Agent
from safetydrift import Session, InterventionAction
session = Session(task_type="default")
# Before EVERY tool call:
result = session.gate("send_email", {"to": "external@partner.com"})
if result.action == InterventionAction.BLOCK:
raise RuntimeError(f"Blocked: {result.reason}")
It also ships as an MCP server — add two lines to mcp.json and every MCP-compatible agent (Claude Code, Cursor, Copilot) is protected automatically.
{
"safetydrift": {
"command": "python3",
"args": ["-m", "safetydrift"]
}
}
The Benchmark
200 synthetic traces. 100 violations, 100 benign sessions. 5 attack patterns including data exfiltration, credential theft, mass deletion, unauthorised publishing, and payment abuse.
Result: 100% F1. 0% false positives.
Not a single benign session was blocked. Not a single attack got through.
Try It
pip install safetydrift
Full source, benchmark code, and framework adapters for LangChain, LangGraph, OpenAI Agents SDK, AutoGen, and CrewAI:
👉 github.com/msabhishek0820-prog/safetydrift
The paper had the math. Now there's code.
Built by Abhishek M S. Implements SafetyDrift (arXiv:2603.27148). MIT licensed.
Tags: ai security python llm opensource
Top comments (0)