msabhishek0820-prog

Posted on Jul 3

Your AI Agent Is Leaking Data Right Now — And Every Tool Call Looks Safe

#claude #openai #langchain #aisafety

How I built the first open-source tool to catch the attacks that no guardrail sees coming

We talk a lot about AI safety. Prompt injections. Jailbreaks. Hallucinations.

But there's a category of attack nobody is talking about — and it's the most dangerous one in production today.

The sequence attack.

The Scenario Nobody Catches

Let me show you exactly what I mean.

Imagine you deploy an AI assistant to help your team write reports. It has access to your files, your database, and email. Standard setup. Every tool goes through a content filter. You feel safe.

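Now watch what happens. Each step is annotated with a cumulative, session-level risk estimate, not with the verdict of the per-call content filter: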

Step 1: web_search("how to summarise quarterly data")
        → Risk: LOW    P(violation): 31%   ✅ WARN

Step 2: read_file("/workspace/db_config.yaml")
        → Risk: LOW    P(violation): 31%   ✅ WARN

Step 3: read_file("/workspace/customer_data.csv")
        → Risk: LOW    P(violation): 32%   ✅ WARN

Step 4: write_file("/tmp/summary.txt")
        → Risk: HIGH   P(violation): 55%   ⏸ PAUSE

Step 5: send_email(to="external@partner.com")
        → Risk: CRITICAL P(violation): 87% 🚫 BLOCK

Every single tool call passed your content filter. Every one looked harmless in isolation.

The sequence was a data leak.

Your customer data, credentials, and internal configs — summarised and emailed externally. No malicious prompt. No jailbreak. Just five steps that each seemed fine.

Why Existing Tools Miss This

Guardrails AI, Lakera Guard, NeMo Guardrails — I respect all of them. But they all share one blind spot.

They check the action, not the trajectory.

It's like a bank that checks every transaction individually for fraud, but never notices that one account withdrew $100 from 50 different ATMs in 10 minutes. Each $100 withdrawal looks fine. The pattern is the robbery.

What SafetyDrift Does Differently

I built SafetyDrift — the first open-source implementation of the SafetyDrift research paper (arXiv:2603.27148, March 2026).

Instead of checking each tool call in isolation, it tracks three cumulative dimensions across the entire session:

Data Exposure — what sensitivity of data has the agent accessed?
Tool Escalation — what capabilities has it gained?
Reversibility — can what's been done be undone?
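
One way to picture these three dimensions is as a small state object that only ratchets upward as the session proceeds. The sketch below is illustrative only: the `DriftState` class, the `TOOL_PROFILE` table, and the 0-3 scores are assumptions made for this example, not SafetyDrift's actual data model.

```python
from dataclasses import dataclass

# Hypothetical per-tool scores on a 0-3 scale. These numbers are
# illustrative assumptions, not values from SafetyDrift or the paper.
TOOL_PROFILE = {
    "web_search": {"exposure": 0, "escalation": 0, "irreversibility": 0},
    "read_file":  {"exposure": 2, "escalation": 0, "irreversibility": 0},
    "write_file": {"exposure": 0, "escalation": 1, "irreversibility": 1},
    "send_email": {"exposure": 0, "escalation": 3, "irreversibility": 3},
}


@dataclass(frozen=True)
class DriftState:
    exposure: int = 0          # most sensitive data touched so far
    escalation: int = 0        # most powerful capability used so far
    irreversibility: int = 0   # hardest-to-undo action taken so far

    def update(self, tool: str) -> "DriftState":
        profile = TOOL_PROFILE[tool]
        # Each dimension is cumulative: it can rise but never falls,
        # because data that was read stays read for the rest of the session.
        return DriftState(
            max(self.exposure, profile["exposure"]),
            max(self.escalation, profile["escalation"]),
            max(self.irreversibility, profile["irreversibility"]),
        )


def run_session(tools):
    """Return the cumulative state after each tool call."""
    state = DriftState()
    history = []
    for tool in tools:
        state = state.update(tool)
        history.append(state)
    return history


history = run_session(
    ["web_search", "read_file", "read_file", "write_file", "send_email"]
)
for step, state in enumerate(history, start=1):
    print(step, state)
```

In this toy table `send_email` carries no data exposure of its own. It only looks dangerous because the session state already records that sensitive files were read earlier.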

After every tool call, it runs a Markov chain analysis and computes P(violation within the next 5 steps). When that probability crosses a threshold, it intervenes, before the damage happens.

The research paper reports something striking: in communication-capable agents, reaching even a mild risk state gives an 85% probability of a safety violation within 5 steps. SafetyDrift makes that prediction in real time.
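
To make the "P(violation within the next 5 steps)" idea concrete, here is a minimal sketch of a k-step hitting probability on a small absorbing Markov chain. The four states, the transition probabilities, and the 0.5 threshold are made-up values for illustration. They are not the paper's fitted model and are not meant to reproduce its 85% figure.

```python
# Hypothetical risk states. VIOLATION is absorbing: once entered, never left.
SAFE, MILD, ELEVATED, VIOLATION = range(4)

# Made-up transition matrix (an assumption for this example):
# TRANSITIONS[i][j] = P(next state is j | current state is i).
TRANSITIONS = [
    [0.90, 0.10, 0.00, 0.00],  # SAFE
    [0.00, 0.60, 0.30, 0.10],  # MILD
    [0.00, 0.00, 0.50, 0.50],  # ELEVATED
    [0.00, 0.00, 0.00, 1.00],  # VIOLATION (absorbing)
]


def p_violation_within(start, k, transitions=TRANSITIONS):
    """Probability of having reached VIOLATION within k steps from `start`.

    Because VIOLATION is absorbing, the probability mass sitting in that
    state after k steps equals the probability of hitting it at any point
    during those k steps.
    """
    n = len(transitions)
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(k):
        # One step of the chain: row vector times transition matrix.
        dist = [
            sum(dist[i] * transitions[i][j] for i in range(n))
            for j in range(n)
        ]
    return dist[VIOLATION]


def should_intervene(state, horizon=5, threshold=0.5):
    """Hypothetical policy: intervene once the k-step risk crosses a threshold."""
    return p_violation_within(state, horizon) >= threshold


for name, state in [("SAFE", SAFE), ("MILD", MILD), ("ELEVATED", ELEVATED)]:
    print(name, round(p_violation_within(state, 5), 3), should_intervene(state))
```

The useful property is that the estimate depends on where the session currently sits in the chain, not on how harmless the next individual call looks.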

Two Lines to Add It to Your Agent

from safetydrift import Session, InterventionAction

session = Session(task_type="default")

# Before EVERY tool call:
result = session.gate("send_email", {"to": "external@partner.com"})

if result.action == InterventionAction.BLOCK:
    raise RuntimeError(f"Blocked: {result.reason}")

It also ships as an MCP server: add one entry to mcp.json and every MCP-compatible agent (Claude Code, Cursor, Copilot) is protected automatically.

{
  "safetydrift": {
    "command": "python3",
    "args": ["-m", "safetydrift"]
  }
}

The Benchmark

200 synthetic traces: 100 violation sessions and 100 benign sessions, covering 5 attack patterns (data exfiltration, credential theft, mass deletion, unauthorised publishing, and payment abuse).

Result: 100% F1. 0% false positives.

Not a single benign session was blocked. Not a single attack got through.
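
For readers who want to check the arithmetic, the reported figures follow from the confusion counts implied above: 100 attacks caught, none missed, and none of the 100 benign sessions blocked. The helper below is a generic sketch of the standard precision, recall, and F1 formulas. It is not the project's benchmark code, and the function name is my own.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }


# Counts implied by the post: every attack blocked, no benign session blocked.
metrics = classification_metrics(tp=100, fp=0, fn=0, tn=100)
print(metrics)
```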

Try It

pip install safetydrift

Full source, benchmark code, and framework adapters for LangChain, LangGraph, OpenAI Agents SDK, AutoGen, and CrewAI:

👉 github.com/msabhishek0820-prog/safetydrift

The paper had the math. Now there's code.

Built by Abhishek M S. Implements SafetyDrift (arXiv:2603.27148). MIT licensed.

Tags: ai security python llm opensource
