Runaway AI agents are expensive. Stories of agents burning through thousands of dollars overnight come up regularly on Reddit and Hacker News — no budget limit, no loop detection, no kill switch. The agent keeps calling GPT-4 in an infinite loop until someone wakes up and pulls the plug.
I built reivo-guard to prevent this. It's an open-source guardrail library that detects and stops runaway AI agents — with sub-microsecond overhead.
This post walks through the architecture decisions behind each detection layer.
## The Problem: Agents Don't Know When to Stop
LLM agents fail in predictable ways:
- **Infinite loops** — The agent keeps asking the same question, or semantically similar variations
- **Cost explosions** — Token consumption spikes 100x with no warning
- **Quality degradation** — Responses get worse over time but the agent keeps going
- **Cliff-edge failures** — Everything works until 100% budget, then hard crash
Among the tools I evaluated (Helicone, Portkey, LangSmith, Lunary, LiteLLM), most either observe these failures (dashboards, alerts) or enforce static rules (rate limits, budget caps). I wanted something that detects and acts adaptively — so I built it.
## Architecture Overview
```
guard.before()  →  Budget check, loop detection, session validation
       ↓
   LLM API call
       ↓
guard.after()   →  Cost tracking, quality verification, trend analysis
```
Guard functions are side-effect-free on the hot path — state lives in a key-value store interface (GuardStore), so it works in serverless (Cloudflare Workers, Lambda) or as a library.
The key insight: split checks into sync (blocking) and async (post-response).
| Check | Sync/Async | Why |
|---|---|---|
| Budget enforcement | Sync | Must block before spending |
| Hash loop detection | Sync | O(20), sub-microsecond |
| EWMA anomaly | Sync | O(1), sub-microsecond |
| TF-IDF cosine loop | Async | O(W × V) where W=window, V=vocab; runs in `waitUntil()` |
| LLM-as-Judge quality | Async | ~100ms external call |
| Quality trend | Sync | O(50), lightweight |
## Layer 1: Loop Detection (Two Algorithms)
### Hash Match (The Fast Path)
The simplest detector: keep a sliding window of prompt hashes and count exact matches.
```typescript
const window = hashes.slice(-LOOP_HASH_WINDOW);        // last 20
const matchCount = window.filter(h => h === newHash).length + 1;
return { isLoop: matchCount >= LOOP_HASH_THRESHOLD };  // ≥5 matches
```
Why this works: Most agent loops are exact duplicates. The agent asks "What is the capital of France?" five times in a row. Hash match catches this with sub-microsecond overhead.
Why window=20, threshold=5? Agents legitimately retry 2-3 times (network errors, rate limits). 5 matches in 20 requests means 25% of recent traffic is identical — that's a loop, not a retry.
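Putting the window check together with a hash function, the full path can be sketched like this (a self-contained illustration — `hashPrompt` and the FNV-1a choice are my assumptions, not necessarily what the library uses; any stable string hash works):

```typescript
const LOOP_HASH_WINDOW = 20;
const LOOP_HASH_THRESHOLD = 5;

// Cheap, non-cryptographic string hash (FNV-1a). Collisions are tolerable:
// a false match just nudges the counter, and 5 matches are required anyway.
function hashPrompt(prompt: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < prompt.length; i++) {
    h ^= prompt.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Count how many of the last 20 prompts hash identically to the new one.
function detectLoopByHash(
  hashes: number[],
  prompt: string
): { isLoop: boolean; matchCount: number } {
  const newHash = hashPrompt(prompt);
  const window = hashes.slice(-LOOP_HASH_WINDOW);
  const matchCount = window.filter(h => h === newHash).length + 1; // +1 for the new prompt
  return { isLoop: matchCount >= LOOP_HASH_THRESHOLD, matchCount };
}
```

With threshold 5, four identical prompts already in the window plus the incoming fifth trips the detector; a couple of legitimate retries do not.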
### TF-IDF Cosine Similarity (The Smart Path)
Hash match misses rephrased loops: "What's the capital of France?" vs "Tell me France's capital city." Same intent, different hash.
The cosine detector builds TF-IDF vectors from prompt text and computes pairwise similarity:
1. Tokenize: lowercase, split on `\W+`, filter length > 1
2. TF: `freq / tokenCount` per document
3. IDF: `log(n / docFrequency)` across all documents
4. Cosine: `dot(a, b) / (||a|| × ||b||)`
Threshold: 0.92. This is deliberately high. At 0.92, the prompts need to share ~85% of their meaningful vocabulary. "How do I sort a list in Python?" and "Python list sorting method?" score ~0.89, below threshold. But four variations of the same question cross it.
Why not embeddings? TF-IDF runs locally in <1ms. Embedding APIs add 50-200ms latency and cost money. For loop detection, lexical similarity is good enough — and it's free.
This runs async (waitUntil()) so it never blocks the response path.
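The four steps above can be sketched end-to-end (an illustrative implementation under my own naming, not the library's exact code — it compares the newest prompt in the window against each earlier one and reports the maximum similarity):

```typescript
// Tokenize: lowercase, split on non-word chars, drop single characters.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(t => t.length > 1);
}

// Build TF-IDF vectors over the whole window, then return the highest cosine
// similarity between the newest document and any earlier one.
function maxCosineSimilarity(docs: string[]): number {
  const tokenized = docs.map(tokenize);
  const n = tokenized.length;

  // Document frequency per term across the window.
  const df = new Map<string, number>();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) df.set(term, (df.get(term) ?? 0) + 1);
  }

  // TF-IDF weight = (freq / tokenCount) * log(n / docFrequency)
  const vectors = tokenized.map(tokens => {
    const tf = new Map<string, number>();
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
    const vec = new Map<string, number>();
    for (const [term, freq] of tf) {
      vec.set(term, (freq / tokens.length) * Math.log(n / df.get(term)!));
    }
    return vec;
  });

  const cosine = (a: Map<string, number>, b: Map<string, number>): number => {
    let dot = 0, na = 0, nb = 0;
    for (const [t, w] of a) { dot += w * (b.get(t) ?? 0); na += w * w; }
    for (const w of b.values()) nb += w * w;
    return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
  };

  const newest = vectors[vectors.length - 1];
  let max = 0;
  for (let i = 0; i < vectors.length - 1; i++) {
    max = Math.max(max, cosine(vectors[i], newest));
  }
  return max;
}
```

One caveat worth knowing: with `log(n / docFrequency)`, a term that appears in every document in the window gets zero weight, so similarity is driven entirely by the vocabulary that distinguishes the window's prompts — which is exactly what you want for loop detection.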
## Layer 2: Budget Enforcement with Graceful Degradation
Hard budget cutoffs create terrible UX. You're mid-conversation, and suddenly: 403 Forbidden. No warning, no wind-down.
Instead, reivo-guard implements four degradation levels:
| Usage | Level | What Happens |
|---|---|---|
| < 80% | `normal` | Full access |
| 80–95% | `aggressive` | Force cheaper model routing |
| 95–100% | `new_sessions_only` | Existing sessions continue, new ones blocked |
| ≥ 100% | `blocked` | All requests rejected |
```typescript
function getDegradationLevel(usedUsd: number, limitUsd: number) {
  const ratio = usedUsd / limitUsd;
  if (ratio >= 1.0)  return { level: 'blocked', blockAll: true, ... };
  if (ratio >= 0.95) return { level: 'new_sessions_only', blockNewSessions: true, ... };
  if (ratio >= 0.80) return { level: 'aggressive', forceAggressiveRouting: true, ... };
  return { level: 'normal', ... };
}
```
Why 80%? At 80% budget consumption, you start routing to cheaper models (GPT-4o-mini instead of GPT-4o). The user barely notices quality difference for most tasks, but cost drops 10-20x.
Alert deduplication: Thresholds fire at 50%, 80%, 100% — but only once each. No alert storms.
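The deduplication logic is small enough to show in full (a sketch with illustrative names — the real library persists the fired set in its `GuardStore` rather than in memory):

```typescript
const ALERT_THRESHOLDS = [0.5, 0.8, 1.0];

// Returns the thresholds newly crossed by this usage update. `fired` records
// which thresholds have already alerted, so each one fires at most once.
function newlyCrossedThresholds(ratio: number, fired: Set<number>): number[] {
  const crossed: number[] = [];
  for (const t of ALERT_THRESHOLDS) {
    if (ratio >= t && !fired.has(t)) {
      fired.add(t);
      crossed.push(t);
    }
  }
  return crossed;
}
```

A jump straight from 40% to 85% fires both the 50% and 80% alerts at once, and subsequent requests in the 80–95% band stay silent.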
Note: Portkey and LiteLLM also offer degradation strategies (fallback chains and budget caps respectively). reivo-guard's approach is more granular (4 levels with progressive restrictions) but theirs are more battle-tested at scale.
## Layer 3: Anomaly Detection (EWMA)
Budget limits catch expected overuse. EWMA catches unexpected spikes.
If an agent normally uses 1,000 tokens per request and suddenly jumps to 100,000 — that's an anomaly, even if there's budget remaining.
Exponentially Weighted Moving Average tracks both the mean and variance of token consumption:
```typescript
// Detect anomaly against the current (pre-update) statistics
const stdDev = Math.sqrt(state.ewmaVariance);
const zScore = (newValue - state.ewmaValue) / stdDev;
const isAnomaly = zScore > ANOMALY_Z_THRESHOLD; // z > 3.0

// Then fold the new observation into the running statistics
const diff = newValue - state.ewmaValue;
const newEwma = state.ewmaValue + EWMA_ALPHA * diff;
const newVariance = (1 - EWMA_ALPHA) * (state.ewmaVariance + EWMA_ALPHA * diff * diff);
```
A note on the variance formula: this is a Welford-style EWMA variance update rather than the textbook α*(x-μ)² + (1-α)*σ². Both converge to the same result, but this form is slightly more numerically stable for streaming updates since it uses the pre-update diff.
Why EWMA, not a simple moving average?
- O(1) space: just two numbers (mean + variance), no window buffer
- Adapts to trends: if usage gradually increases, that's not an anomaly
- Converges fast: ~10 samples and the variance is reliable
Why α=0.3? Aggressive enough to track trend shifts, but not so aggressive that a single outlier moves the baseline. A spike of 10x will trigger z > 3.0 (anomaly) but won't corrupt the baseline mean for subsequent checks.
Critical ordering: You must call detectAnomaly() before updateEwma(). If you update first, the variance absorbs the spike and the z-score drops. This is the kind of bug that only shows up in production.
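One way to make that ordering impossible to get wrong is to package detect-and-update as a single step (a sketch — the state shape and function name are illustrative):

```typescript
const EWMA_ALPHA = 0.3;
const ANOMALY_Z_THRESHOLD = 3.0;

interface EwmaState {
  ewmaValue: number;    // running mean
  ewmaVariance: number; // running variance
}

function checkAndUpdate(
  state: EwmaState,
  newValue: number
): { isAnomaly: boolean; state: EwmaState } {
  // 1. Detect against the PRE-update statistics, so a spike can't hide itself.
  //    Guard against zero variance during warmup (no signal yet → no anomaly).
  const stdDev = Math.sqrt(state.ewmaVariance);
  const zScore = stdDev > 0 ? (newValue - state.ewmaValue) / stdDev : 0;
  const isAnomaly = zScore > ANOMALY_Z_THRESHOLD;

  // 2. Only then fold the observation into the running mean and variance.
  const diff = newValue - state.ewmaValue;
  return {
    isAnomaly,
    state: {
      ewmaValue: state.ewmaValue + EWMA_ALPHA * diff,
      ewmaVariance: (1 - EWMA_ALPHA) * (state.ewmaVariance + EWMA_ALPHA * diff * diff),
    },
  };
}
```

Because the caller receives the new state as a return value, there is no way to read the post-update variance before the anomaly decision has been made.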
## Layer 4: Quality Verification
Cost and loops are necessary but not sufficient. An agent can stay within budget, never loop, but produce garbage outputs. We need quality signals.
### Logprobs (OpenAI & Google)
When available, logprobs are the cheapest quality signal — they come free with the response.
```typescript
// Map mean logprob to a 0-1 score
score = Math.max(0, Math.min(1, 1 + meanLogprob / 2));
// logprob  0 → score 1.0 (certain)
// logprob -1 → score 0.5 (medium)
// logprob -2 → score 0.0 (uncertain)
```
This is a simple linear mapping. Logprobs are logarithmic, so a nonlinear mapping might be more principled, but in practice this threshold-based approach (flag below -1.0) works well enough for the binary "retry or not" decision.
If the mean logprob falls below -1.0 (~37% average token confidence), the response is flagged for potential retry with a better model.
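Wrapped as a function over the per-token logprobs (which OpenAI's chat API returns when `logprobs: true` is set), the scoring step looks like this — a minimal sketch, with the function name my own:

```typescript
// Convert per-token logprobs into the 0-1 quality score described above.
function logprobQualityScore(tokenLogprobs: number[]): number {
  if (tokenLogprobs.length === 0) return 1; // no signal → don't flag
  const mean = tokenLogprobs.reduce((a, b) => a + b, 0) / tokenLogprobs.length;
  // Clamp the linear mapping into [0, 1].
  return Math.max(0, Math.min(1, 1 + mean / 2));
}
```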
### LLM-as-Judge (Anthropic & Fallback)
Anthropic doesn't expose logprobs. So we use GPT-4o-mini as a judge — truncate the prompt (500 chars) and response (1000 chars), ask for a 0-1 quality score.
Cost: <$0.0001 per judgment. At this price, you can judge every response.
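The truncation step is what keeps the judgment that cheap. A sketch of the judge input construction (the prompt wording and function name are illustrative, not the library's actual template — only the 500/1000-char limits come from the text above):

```typescript
// Build the judge prompt: truncate inputs so the judgment stays <$0.0001.
function buildJudgeInput(prompt: string, response: string): string {
  return [
    "Rate the quality of this response from 0 to 1. Reply with only the number.",
    `Prompt: ${prompt.slice(0, 500)}`,
    `Response: ${response.slice(0, 1000)}`,
  ].join("\n");
}
```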
### Quality Trend Detection
Individual quality scores fluctuate. What matters is the trend. If quality degrades over a session, the model should auto-upgrade:
```
Compare: avg(last 5 scores) vs avg(earlier scores)
If delta ≤ -0.15 AND recent avg < 0.5 → upgrade model
```
This creates an automatic feedback loop: cheap model → quality drops → upgrade to better model → quality recovers.
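The trend rule above can be sketched as a single function (constants from the text; the name and the minimum-sample guard are my own assumptions):

```typescript
const TREND_DELTA = -0.15; // recent avg must drop by at least this much
const TREND_FLOOR = 0.5;   // and be below this absolute level

// Compare the mean of the last 5 scores against the mean of everything before.
function shouldUpgradeModel(scores: number[]): boolean {
  if (scores.length < 6) return false; // need both a "recent" and an "earlier" window
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const recentAvg = avg(scores.slice(-5));
  const earlierAvg = avg(scores.slice(0, -5));
  return recentAvg - earlierAvg <= TREND_DELTA && recentAvg < TREND_FLOOR;
}
```

Requiring both conditions avoids upgrading when quality is merely noisy: a drop from 0.95 to 0.75 is a real decline but still above the floor, while a session that was always at 0.4 has no downward trend to react to.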
## Performance
Individual guard checks add sub-microsecond overhead (the combined Python path is a few microseconds) — negligible vs. LLM API latency (100–3000 ms).
| Operation | Time | Notes |
|---|---|---|
| `checkBudget()` | ~70 ns | Pure arithmetic |
| `detectLoopByHash()` | ~200 ns | Array scan, n=20 |
| `getDegradationLevel()` | ~25 ns | Three comparisons |
| `guard.before()` (Python) | ~2.5 µs | All sync checks combined |
| `guard.after()` (Python) | ~0.3 µs | Cost tracking |
Measured by dividing wall-clock time of 100K iterations on Apple M3. These numbers should be taken as order-of-magnitude — at this scale, JIT warmup, GC pauses, and measurement overhead all matter. The benchmark code is in the repo if you want to reproduce or challenge the methodology.
The point isn't the exact nanosecond count — it's that guard overhead is 5-6 orders of magnitude smaller than the LLM call it's protecting.
## What I'd Do Differently
**Start with Python first.** The AI ecosystem runs on Python. I started with TypeScript because my proxy runs on Cloudflare Workers, but standalone adoption would've been faster with Python-first.

**Simpler API surface.** The TypeScript API exposes individual functions (`checkBudget`, `detectLoopByHash`, `getDegradationLevel`). The Python API has a simpler `guard.before()` / `guard.after()` pattern. The Python approach is better for most users.

**Skip TF-IDF for v1.** Hash match catches 90%+ of real loops. Cosine similarity is cool engineering, but it hasn't triggered in my testing where hash match didn't already catch it. (To be fair, my test traffic is limited — this may change with more diverse usage patterns.)
## Try It
```shell
npx reivo-guard-demo   # Interactive demo
```
GitHub: github.com/tazsat0512/reivo-guard — MIT licensed, TypeScript + Python.
If you've had your own runaway agent story, I'd love to hear it in the comments.