The Bot Club

We built runtime threat detection for AI agents — here's what we found after monitoring 1M+ agent calls

If you're building AI agents in production, you've probably wondered: what's actually happening at runtime? We spent six months finding out, and the answers changed how we think about agent security entirely.

AgentGuard (https://agentguard.tech) is the runtime security layer we built from those findings. This post covers the threat taxonomy, architecture decisions, and the real attack patterns we see in the wild.


What we built

AgentGuard is a runtime security layer for AI agents. It sits between the agent's decision engine and its tool calls, inspecting each action against a policy engine before it executes and logging structured telemetry for post-hoc analysis.

The core is a lightweight sidecar that intercepts tool call requests, evaluates them against a configurable threat model, and either allows, flags, or blocks each call based on severity. It's designed to run with sub-50ms overhead on common agent frameworks.
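To make the allow/flag/block pattern concrete, here's a minimal sketch of that interception loop. The names (`ToolCall`, `PolicyEngine`, the signature list) are illustrative assumptions, not AgentGuard's actual API, and the scorer is a placeholder where a real system would combine signatures, classifiers, and context:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    FLAG = "flag"
    BLOCK = "block"

@dataclass
class ToolCall:
    tool: str
    payload: dict

class PolicyEngine:
    """Evaluates a tool call against severity thresholds before it executes."""

    def __init__(self, block_threshold: float = 0.9, flag_threshold: float = 0.5):
        self.block_threshold = block_threshold
        self.flag_threshold = flag_threshold

    def score(self, call: ToolCall) -> float:
        # Placeholder scorer: count naive suspicious-string hits in the payload.
        suspicious = ["ignore previous", "exfiltrate", "api_key"]
        text = str(call.payload).lower()
        return min(1.0, sum(0.5 for s in suspicious if s in text))

    def evaluate(self, call: ToolCall) -> Verdict:
        s = self.score(call)
        if s >= self.block_threshold:
            return Verdict.BLOCK
        if s >= self.flag_threshold:
            return Verdict.FLAG
        return Verdict.ALLOW
```

The key design point is that the verdict is computed before the tool call executes, so a BLOCK never reaches the downstream system.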


The threat taxonomy

After monitoring 1M+ agent calls across multiple production environments, we categorized threats into four buckets:

1. Prompt injection via tool call payload

This is the most common. An attacker (or a compromised document in the agent's context) crafts a tool call that the agent wouldn't normally make on its own — typically exfiltrating context or chaining into downstream systems. We see this in roughly 1 in 3,000 calls in production, but the ratio varies dramatically by use case.
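One cheap way to catch "a tool call the agent wouldn't normally make" is to compare each call against a per-deployment baseline of tool usage. This is a hedged sketch of that idea, not how AgentGuard necessarily implements it; the tool names and `min_rate` cutoff are made up for illustration:

```python
from collections import Counter

def anomalous(call_tool: str, baseline: Counter, min_rate: float = 0.001) -> bool:
    """Flag tools that the deployment has rarely or never invoked before.

    baseline maps tool name -> historical call count for this deployment.
    """
    total = sum(baseline.values())
    rate = baseline[call_tool] / total if total else 0.0
    # Rarely-seen tools don't get blocked outright; they get routed to
    # slower, more expensive inspection tiers.
    return rate < min_rate
```

A frequency baseline alone won't catch injection through a commonly used tool, which is why payload inspection still matters; it's a cheap first filter, not a detector on its own.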

2. Tool call chaining abuse

Agents that can call multiple tools in sequence are susceptible to having that chain redirected. We observed cases where an intermediate tool result was poisoned (a search tool returning attacker-controlled results), causing downstream tools to act on false information.
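The poisoned-intermediate-result problem can be framed as taint propagation: mark output from attacker-reachable tools as untrusted, and escalate scrutiny on any downstream step that consumes it. A minimal sketch, assuming hypothetical tool names and a simplified chain representation:

```python
# Assumption: these example tools can return attacker-controlled content.
UNTRUSTED_TOOLS = {"web_search", "fetch_url"}

def propagate_taint(chain):
    """chain: list of (tool_name, uses_prior_output) pairs, in call order.

    Returns one taint flag per step: True means the step either pulled
    attacker-reachable content itself or consumed output from a tainted step.
    """
    tainted = False
    flags = []
    for tool, uses_prior in chain:
        tainted = (uses_prior and tainted) or tool in UNTRUSTED_TOOLS
        flags.append(tainted)
    return flags
```

In this framing, a `send_email` step at the end of a chain that started with a web search is tainted even though the email tool itself is trusted, which is exactly the redirection case described above.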

3. Context poisoning

Long-running agents accumulate context from external sources — emails, documents, chat history. We found that in multi-turn sessions longer than 30 exchanges, the signal-to-noise ratio in context degrades enough that agents become meaningfully more susceptible to injection-style attacks.

4. Permission escalation via natural language

Less common but highest severity. In agents with broad tool permissions, we observed deliberate attempts to expand scope through conversational framing — "can you also..." style escalation that bypasses normal authorization checks.
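One structural defense against conversational escalation is to pin the tool scope granted at session start and check every call against it, so no amount of "can you also..." framing widens the set. A hedged sketch with made-up tool names:

```python
def make_scope_guard(granted: frozenset):
    """Freeze the session's tool scope at creation time.

    The returned check never consults conversation state, so natural-language
    escalation attempts can't widen the granted set mid-session.
    """
    def check(tool: str) -> bool:
        return tool in granted
    return check

# Scope fixed when the session is created, before any user input arrives.
guard = make_scope_guard(frozenset({"read_calendar", "search_docs"}))
```

The point of the closure is that re-authorization requires creating a new guard through an out-of-band channel, not persuading the agent.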


Architecture highlights

The detection engine runs three models in parallel:

  • A lightweight rule-based matcher for known attack signatures (sub-1ms, used as a fast gate)
  • A fine-tuned classifier for structural anomalies (5–15ms)
  • A larger reasoning model invoked only on flagged calls (80–200ms, async in most cases)
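The tiering above is a classic cheap-first cascade: each stage only runs if the previous one couldn't settle the call. Here's a sketch of the control flow with trivial stand-ins for all three stages (the signatures, scores, and keywords are invented for illustration, not the real models):

```python
def rule_matcher(call: str) -> bool:
    """Tier 1 stand-in: fast gate on known attack signatures."""
    return any(sig in call.lower() for sig in ("ignore previous", "system prompt"))

def classifier(call: str) -> float:
    """Tier 2 stand-in: structural-anomaly score in [0, 1]."""
    return 0.8 if "base64" in call else 0.1

def reasoning_model(call: str) -> str:
    """Tier 3 stand-in: expensive model, invoked only on flagged calls."""
    return "block" if "password" in call else "allow"

def evaluate(call: str, threshold: float = 0.5) -> str:
    if rule_matcher(call):            # tier 1: known signature -> block immediately
        return "block"
    if classifier(call) < threshold:  # tier 2: low anomaly score -> fast allow
        return "allow"
    return reasoning_model(call)      # tier 3: only the flagged minority pays this cost
```

Because most traffic exits at tiers 1 and 2, the expensive model's 80–200ms cost only lands on the small flagged fraction, which is what keeps the median near the classifier's latency rather than the reasoning model's.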

End-to-end median latency with full stack: ~23ms. p99: ~90ms. We consider anything over 200ms a failure.


What we're still figuring out

The hardest problem isn't detection — it's false positive triage. Agents do weird but legitimate things, and the cost of interrupting a workflow is high. We're actively working on an explainability layer so security teams can audit flags without having to replay full call traces.

The taxonomy above is based on our current production data. We're sharing it because we think the industry needs a common vocabulary for agent security — not a proprietary threat model that only works in our environment.


Try it

Free tier at https://agentguard.tech — works with LangChain, AutoGen, and raw OpenAI API agents.

Free tier covers 10K agent calls/month. Paid plans start at $299/month for 100K calls. We're not trying to price-gate security — the free tier is genuinely useful at small scale.

Questions? Drop them in the comments — we're here.

— The Bot Club team
