DEV Community

msabhishek0820-prog
msabhishek0820-prog

Posted on

Your AI Agent Is Leaking Data Right Now — And Every Tool Call Looks Safe

How I built the first open-source tool to catch the attacks that no guardrail sees coming


We talk a lot about AI safety. Prompt injections. Jailbreaks. Hallucinations.

But there's a category of attack nobody is talking about — and it's the most dangerous one in production today.

The sequence attack.


The Scenario Nobody Catches

Let me show you exactly what I mean.

Imagine you deploy an AI assistant to help your team write reports. It has access to your files, your database, and email. Standard setup. Every tool goes through a content filter. You feel safe.

Now watch what happens:

Step 1: web_search("how to summarise quarterly data")
        → Risk: LOW    P(violation): 31%   ✅ WARN

Step 2: read_file("/workspace/db_config.yaml")
        → Risk: LOW    P(violation): 31%   ✅ WARN

Step 3: read_file("/workspace/customer_data.csv")
        → Risk: LOW    P(violation): 32%   ✅ WARN

Step 4: write_file("/tmp/summary.txt")
        → Risk: HIGH   P(violation): 55%   ⏸ PAUSE

Step 5: send_email(to="external@partner.com")
        → Risk: CRITICAL P(violation): 87% 🚫 BLOCK
Enter fullscreen mode Exit fullscreen mode

Every single tool call passed your content filter. Every one looked harmless in isolation.

The sequence was a data leak.

Your customer data, credentials, and internal configs — summarised and emailed externally. No malicious prompt. No jailbreak. Just five steps that each seemed fine.


Why Existing Tools Miss This

Guardrails AI, Lakera Guard, NeMo Guardrails — I respect all of them. But they all share one blind spot.

They check the action, not the trajectory.

It's like a bank that checks every transaction individually for fraud, but never notices that one account withdrew $100 from 50 different ATMs in 10 minutes. Each $100 withdrawal looks fine. The pattern is the robbery.


What SafetyDrift Does Differently

I built SafetyDrift — the first open-source implementation of the SafetyDrift research paper (arXiv:2603.27148, March 2026).

Instead of checking each tool call in isolation, it tracks three cumulative dimensions across the entire session:

  • Data Exposure — what sensitivity of data has the agent accessed?
  • Tool Escalation — what capabilities has it gained?
  • Reversibility — can what's been done be undone?

After every tool call, it runs a Markov chain analysis and computes: P(violation within the next 5 steps). When that probability crosses a threshold, it intervenes — before the damage happens.

The research paper proved something striking: in communication-capable agents, reaching even a mild risk state gives an 85% probability of a safety violation within 5 steps. SafetyDrift makes that prediction in real time.


Two Lines to Add It to Your Agent

from safetydrift import Session, InterventionAction

session = Session(task_type="default")

# Before EVERY tool call:
result = session.gate("send_email", {"to": "external@partner.com"})

if result.action == InterventionAction.BLOCK:
    raise RuntimeError(f"Blocked: {result.reason}")
Enter fullscreen mode Exit fullscreen mode

It also ships as an MCP server — add two lines to mcp.json and every MCP-compatible agent (Claude Code, Cursor, Copilot) is protected automatically.

{
  "safetydrift": {
    "command": "python3",
    "args": ["-m", "safetydrift"]
  }
}
Enter fullscreen mode Exit fullscreen mode

The Benchmark

200 synthetic traces. 100 violations, 100 benign sessions. 5 attack patterns including data exfiltration, credential theft, mass deletion, unauthorised publishing, and payment abuse.

Result: 100% F1. 0% false positives.

Not a single benign session was blocked. Not a single attack got through.


Try It

pip install safetydrift
Enter fullscreen mode Exit fullscreen mode

Full source, benchmark code, and framework adapters for LangChain, LangGraph, OpenAI Agents SDK, AutoGen, and CrewAI:

👉 github.com/msabhishek0820-prog/safetydrift

The paper had the math. Now there's code.


Built by Abhishek M S. Implements SafetyDrift (arXiv:2603.27148). MIT licensed.


Tags: ai security python llm opensource

Top comments (0)