tazsat0512

How I Built Open-Source Guardrails That Auto-Stop Runaway AI Agents

Runaway AI agents are expensive. Stories of agents burning through thousands of dollars overnight come up regularly on Reddit and Hacker News — no budget limit, no loop detection, no kill switch. The agent keeps calling GPT-4 in an infinite loop until someone wakes up and pulls the plug.

I built reivo-guard to prevent this. It's an open-source guardrail library that detects and stops runaway AI agents — with sub-microsecond overhead.

This post walks through the architecture decisions behind each detection layer.

The Problem: Agents Don't Know When to Stop

LLM agents fail in predictable ways:

  1. Infinite loops — The agent keeps asking the same question, or semantically similar variations
  2. Cost explosions — Token consumption spikes 100x with no warning
  3. Quality degradation — Responses get worse over time but the agent keeps going
  4. Cliff-edge failures — Everything works until 100% budget, then hard crash

Among the tools I evaluated (Helicone, Portkey, LangSmith, Lunary, LiteLLM), most either observe these failures (dashboards, alerts) or enforce static rules (rate limits, budget caps). I wanted something that detects and acts adaptively — so I built it.

Architecture Overview

guard.before()  →  Budget check, loop detection, session validation
       ↓
    LLM API call
       ↓
guard.after()   →  Cost tracking, quality verification, trend analysis

Guard functions are side-effect-free on the hot path — state lives in a key-value store interface (GuardStore), so it works in serverless (Cloudflare Workers, Lambda) or as a library.
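
One way to picture that storage boundary (a hedged sketch; the actual GuardStore interface in the repo may differ):

// Illustrative only; check the repo for the real GuardStore contract.
interface GuardStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, ttlSeconds?: number): Promise<void>;
}

// In-memory implementation for local development and tests.
class MemoryStore implements GuardStore {
  private data = new Map<string, string>();
  async get(key: string) { return this.data.get(key) ?? null; }
  async put(key: string, value: string) { this.data.set(key, value); }
}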

The key insight: split checks into sync (blocking) and async (post-response).

| Check | Sync/Async | Why |
| --- | --- | --- |
| Budget enforcement | Sync | Must block before spending |
| Hash loop detection | Sync | O(20), sub-microsecond |
| EWMA anomaly | Sync | O(1), sub-microsecond |
| TF-IDF cosine loop | Async | O(W × V) where W=window, V=vocab; runs in waitUntil() |
| LLM-as-Judge quality | Async | ~100ms external call |
| Quality trend | Sync | O(50), lightweight |
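
In code, the split looks roughly like this (wiring and field names are illustrative, not the library's exact API):

// Sketch of the before/after pattern around a single LLM call.
async function guardedCall(
  guard: { before(i: object): Promise<{ allow: boolean; reason?: string }>; after(i: object): Promise<void> },
  ctx: { waitUntil(p: Promise<unknown>): void },
  sessionId: string,
  prompt: string,
  callLLM: (p: string) => Promise<string>,
): Promise<string> {
  // Sync checks run before any money is spent: budget, hash loop, EWMA.
  const decision = await guard.before({ sessionId, prompt });
  if (!decision.allow) return `blocked: ${decision.reason ?? 'guard'}`;

  const completion = await callLLM(prompt); // the expensive part

  // Async checks (TF-IDF loop, judge, trend) run after the response is already on its way.
  ctx.waitUntil(guard.after({ sessionId, prompt, completion }));
  return completion;
}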

Layer 1: Loop Detection (Two Algorithms)

Hash Match (The Fast Path)

The simplest detector: keep a sliding window of prompt hashes and count exact matches.

const window = hashes.slice(-LOOP_HASH_WINDOW); // last 20
const matchCount = window.filter(h => h === newHash).length + 1;
return { isLoop: matchCount >= LOOP_HASH_THRESHOLD }; // ≥5 matches

Why this works: Most agent loops are exact duplicates. The agent asks "What is the capital of France?" five times in a row. Hash match catches this with sub-microsecond overhead.

Why window=20, threshold=5? Agents legitimately retry 2-3 times (network errors, rate limits). 5 matches in 20 requests means 25% of recent traffic is identical — that's a loop, not a retry.
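
A self-contained sketch of the idea (FNV-1a here is a stand-in; the library may hash prompts differently):

const LOOP_HASH_WINDOW = 20;
const LOOP_HASH_THRESHOLD = 5;

// Cheap, deterministic 32-bit hash; identical prompts map to identical hashes.
function hashPrompt(prompt: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < prompt.length; i++) {
    h ^= prompt.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function detectLoopByHash(history: number[], prompt: string): { isLoop: boolean; hashes: number[] } {
  const newHash = hashPrompt(prompt);
  const window = history.slice(-LOOP_HASH_WINDOW);                  // last 20 prompts
  const matchCount = window.filter(h => h === newHash).length + 1;  // +1 for the current prompt
  return { isLoop: matchCount >= LOOP_HASH_THRESHOLD, hashes: [...window, newHash] };
}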

TF-IDF Cosine Similarity (The Smart Path)

Hash match misses rephrased loops: "What's the capital of France?" vs "Tell me France's capital city." Same intent, different hash.

The cosine detector builds TF-IDF vectors from prompt text and computes pairwise similarity:

1. Tokenize: lowercase, split on \W+, filter len > 1
2. TF: freq / tokenCount per document
3. IDF: log(n / docFrequency) across all documents
4. Cosine: dot(a, b) / (||a|| × ||b||)

Threshold: 0.92. This is deliberately high. At 0.92, the prompts need to share ~85% of their meaningful vocabulary. "How do I sort a list in Python?" and "Python list sorting method?" score ~0.89, below threshold. But four variations of the same question cross it.
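
To make the math concrete, a minimal version of that pipeline (the library's implementation presumably adds windowing, caching, and early exits):

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(t => t.length > 1);
}

// Build TF-IDF vectors for a window of prompts, then compare the newest
// prompt against each earlier one with cosine similarity.
function maxCosineSimilarity(prompts: string[]): number {
  const docs = prompts.map(tokenize);
  const n = docs.length;

  const docFreq = new Map<string, number>();
  for (const doc of docs) {
    for (const term of new Set(doc)) docFreq.set(term, (docFreq.get(term) ?? 0) + 1);
  }

  const vectors = docs.map(doc => {
    const len = doc.length || 1;
    const tf = new Map<string, number>();
    for (const term of doc) tf.set(term, (tf.get(term) ?? 0) + 1);
    const vec = new Map<string, number>();
    for (const [term, count] of tf) {
      vec.set(term, (count / len) * Math.log(n / (docFreq.get(term) ?? 1)));
    }
    return vec;
  });

  const cosine = (a: Map<string, number>, b: Map<string, number>): number => {
    let dot = 0, normA = 0, normB = 0;
    for (const [term, v] of a) { dot += v * (b.get(term) ?? 0); normA += v * v; }
    for (const v of b.values()) normB += v * v;
    return normA && normB ? dot / (Math.sqrt(normA) * Math.sqrt(normB)) : 0;
  };

  const latest = vectors[vectors.length - 1];
  return Math.max(0, ...vectors.slice(0, -1).map(v => cosine(latest, v)));
}

// Flag a loop when maxCosineSimilarity(recentPrompts) >= 0.92.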

Why not embeddings? TF-IDF runs locally in <1ms. Embedding APIs add 50-200ms latency and cost money. For loop detection, lexical similarity is good enough — and it's free.

This runs async (waitUntil()) so it never blocks the response path.

Layer 2: Budget Enforcement with Graceful Degradation

Hard budget cutoffs create terrible UX. You're mid-conversation, and suddenly: 403 Forbidden. No warning, no wind-down.

Instead, reivo-guard implements four degradation levels:

| Usage | Level | What happens |
| --- | --- | --- |
| < 80% | normal | Full access |
| 80-95% | aggressive | Force cheaper model routing |
| 95-100% | new_sessions_only | Existing sessions continue, new ones blocked |
| ≥ 100% | blocked | All requests rejected |

function getDegradationLevel(usedUsd: number, limitUsd: number) {
  const ratio = usedUsd / limitUsd;
  if (ratio >= 1.0) return { level: 'blocked', blockAll: true, ... };
  if (ratio >= 0.95) return { level: 'new_sessions_only', blockNewSessions: true, ... };
  if (ratio >= 0.80) return { level: 'aggressive', forceAggressiveRouting: true, ... };
  return { level: 'normal', ... };
}

Why 80%? At 80% budget consumption, you start routing to cheaper models (GPT-4o-mini instead of GPT-4o). The user barely notices a quality difference for most tasks, but cost drops 10-20x.
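
At the routing layer, that can be as simple as a downgrade table keyed off the degradation flags (a sketch; model names are just the example above):

// Illustrative: swap in the cheaper sibling once the guard asks for aggressive routing.
const CHEAPER_MODEL: Record<string, string> = { 'gpt-4o': 'gpt-4o-mini' };

function pickModel(requested: string, forceAggressiveRouting: boolean): string {
  return forceAggressiveRouting ? (CHEAPER_MODEL[requested] ?? requested) : requested;
}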

Alert deduplication: Thresholds fire at 50%, 80%, 100% — but only once each. No alert storms.
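
Deduplication just means remembering which thresholds already fired for the current budget period (a sketch; in practice the fired set would be persisted, e.g. in the GuardStore):

const ALERT_THRESHOLDS = [0.5, 0.8, 1.0];

// Returns thresholds newly crossed by this request; `fired` is mutated so each
// threshold alerts at most once per budget period.
function newlyCrossedThresholds(usedUsd: number, limitUsd: number, fired: Set<number>): number[] {
  const ratio = usedUsd / limitUsd;
  const crossed = ALERT_THRESHOLDS.filter(t => ratio >= t && !fired.has(t));
  crossed.forEach(t => fired.add(t));
  return crossed;
}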

Note: Portkey and LiteLLM also offer degradation strategies (fallback chains and budget caps respectively). reivo-guard's approach is more granular (4 levels with progressive restrictions) but theirs are more battle-tested at scale.

Layer 3: Anomaly Detection (EWMA)

Budget limits catch expected overuse. EWMA catches unexpected spikes.

If an agent normally uses 1,000 tokens per request and suddenly jumps to 100,000 — that's an anomaly, even if there's budget remaining.

Exponentially Weighted Moving Average tracks both the mean and variance of token consumption:

// Update running statistics
const diff = newValue - state.ewmaValue;
const newEwma = state.ewmaValue + EWMA_ALPHA * diff;
const newVariance = (1 - EWMA_ALPHA) * (state.ewmaVariance + EWMA_ALPHA * diff * diff);

// Detect anomaly
const stdDev = Math.sqrt(state.ewmaVariance);
const zScore = (currentRate - state.ewmaValue) / stdDev;
return { isAnomaly: zScore > ANOMALY_Z_THRESHOLD }; // z > 3.0

A note on the variance formula: this is a Welford-style EWMA variance update rather than the textbook α*(x-μ)² + (1-α)*σ². Both converge to the same result, but this form is slightly more numerically stable for streaming updates since it uses the pre-update diff.

Why EWMA, not a simple moving average?

  • O(1) space: just two numbers (mean + variance), no window buffer
  • Adapts to trends: if usage gradually increases, that's not an anomaly
  • Converges fast: ~10 samples and the variance is reliable

Why α=0.3? Aggressive enough to track trend shifts, but not so aggressive that a single outlier moves the baseline. A spike of 10x will trigger z > 3.0 (anomaly) but won't corrupt the baseline mean for subsequent checks.

Critical ordering: You must call detectAnomaly() before updateEwma(). If you update first, the variance absorbs the spike and the z-score drops. This is the kind of bug that only shows up in production.
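
One way to make the ordering hard to get wrong is to hide both steps behind a single function that always detects against the pre-update state. A sketch, not the library's code (the warmup guard is my assumption, mirroring the cold-start suppression mentioned in the comments below):

interface EwmaState { ewmaValue: number; ewmaVariance: number; samples: number }

const EWMA_ALPHA = 0.3;
const ANOMALY_Z_THRESHOLD = 3.0;
const WARMUP_SAMPLES = 10;

// Detect against the *pre-update* state, then fold the new sample in.
function observe(state: EwmaState, value: number): { isAnomaly: boolean; next: EwmaState } {
  const stdDev = Math.sqrt(state.ewmaVariance);
  const z = stdDev > 0 ? (value - state.ewmaValue) / stdDev : 0;
  const isAnomaly = state.samples >= WARMUP_SAMPLES && z > ANOMALY_Z_THRESHOLD;

  const diff = value - state.ewmaValue;
  const next: EwmaState = {
    ewmaValue: state.ewmaValue + EWMA_ALPHA * diff,
    ewmaVariance: (1 - EWMA_ALPHA) * (state.ewmaVariance + EWMA_ALPHA * diff * diff),
    samples: state.samples + 1,
  };
  return { isAnomaly, next };
}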

Layer 4: Quality Verification

Cost and loop checks are necessary but not sufficient. An agent can stay within budget and never loop, yet still produce garbage outputs. We need quality signals.

Logprobs (OpenAI & Google)

When available, logprobs are the cheapest quality signal — they come free with the response.

// Map mean logprob to 0-1 score
score = Math.max(0, Math.min(1, 1 + meanLogprob / 2));
// logprob  0 → score 1.0 (certain)
// logprob -1 → score 0.5 (medium)
// logprob -2 → score 0.0 (uncertain)

This is a simple linear mapping. Logprobs are logarithmic, so a nonlinear mapping might be more principled, but in practice this threshold-based approach (flag below -1.0) works well enough for the binary "retry or not" decision.

If the mean logprob falls below -1.0 (~37% average token confidence), the response is flagged for potential retry with a better model.
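
With providers that return per-token logprobs (on OpenAI chat completions they arrive under choices[0].logprobs.content when requested), the whole check reduces to a few lines. A sketch:

// `tokenLogprobs` = logprob of each generated token, e.g.
// response.choices[0].logprobs.content.map(t => t.logprob) on OpenAI.
function logprobQualityScore(tokenLogprobs: number[]): number {
  if (tokenLogprobs.length === 0) return 1; // nothing to judge, don't flag
  const mean = tokenLogprobs.reduce((a, b) => a + b, 0) / tokenLogprobs.length;
  return Math.max(0, Math.min(1, 1 + mean / 2)); // 0 → 1.0, -1 → 0.5, -2 → 0.0
}

const shouldRetryWithBetterModel = (tokenLogprobs: number[]): boolean =>
  logprobQualityScore(tokenLogprobs) < 0.5; // i.e. mean logprob below -1.0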

LLM-as-Judge (Anthropic & Fallback)

Anthropic doesn't expose logprobs. So we use GPT-4o-mini as a judge — truncate the prompt (500 chars) and response (1000 chars), ask for a 0-1 quality score.

Cost: <$0.0001 per judgment. At this price, you can judge every response.
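
The judge itself is a single cheap completion call. Roughly (prompt wording, endpoint usage, and parsing here are illustrative, not the library's):

// Hypothetical judge call; error handling and retries omitted.
async function judgeQuality(apiKey: string, prompt: string, response: string): Promise<number> {
  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      max_tokens: 5,
      messages: [{
        role: 'user',
        content: `Rate the response quality from 0 to 1. Reply with only the number.\n\n` +
                 `Prompt: ${prompt.slice(0, 500)}\n\nResponse: ${response.slice(0, 1000)}`,
      }],
    }),
  });
  const data = await res.json();
  const score = parseFloat(data.choices?.[0]?.message?.content ?? '');
  return Number.isFinite(score) ? Math.max(0, Math.min(1, score)) : 0.5; // fall back to neutral
}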

Quality Trend Detection

Individual quality scores fluctuate. What matters is the trend. If quality degrades over a session, the model should auto-upgrade:

Compare: avg(last 5 scores) vs avg(earlier scores)
If delta ≤ -0.15 AND recent avg < 0.5 → upgrade model

This creates an automatic feedback loop: cheap model → quality drops → upgrade to better model → quality recovers.
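
Concretely, the trend rule can be a few lines (constants match the description above; the repo's version may differ):

const RECENT_WINDOW = 5;
const TREND_DELTA = -0.15;
const MIN_RECENT_AVG = 0.5;

// Decide whether the session should be bumped to a stronger model.
function shouldUpgradeModel(scores: number[]): boolean {
  if (scores.length <= RECENT_WINDOW) return false; // not enough history yet
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const recent = avg(scores.slice(-RECENT_WINDOW));
  const earlier = avg(scores.slice(0, -RECENT_WINDOW));
  return recent - earlier <= TREND_DELTA && recent < MIN_RECENT_AVG;
}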

Performance

Guard checks add sub-microsecond overhead — negligible vs. LLM API latency (100-3000ms).

| Operation | Time | Notes |
| --- | --- | --- |
| checkBudget() | ~70 ns | Pure arithmetic |
| detectLoopByHash() | ~200 ns | Array scan, n=20 |
| getDegradationLevel() | ~25 ns | Three comparisons |
| guard.before() (Python) | ~2.5 µs | All sync checks combined |
| guard.after() (Python) | ~0.3 µs | Cost tracking |

Measured by timing 100K iterations and dividing by the iteration count, on an Apple M3. These numbers should be taken as order-of-magnitude estimates: at this scale, JIT warmup, GC pauses, and measurement overhead all matter. The benchmark code is in the repo if you want to reproduce or challenge the methodology.
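
For reference, the measurement shape is just a tight loop that amortizes timer overhead over many iterations (an illustrative harness, not the repo's benchmark file):

// Amortize timer resolution and call overhead across many iterations.
function benchNs(label: string, fn: () => void, iterations = 100_000): void {
  fn(); // one warm-up call so first-call setup isn't counted
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  const nsPerOp = ((performance.now() - start) * 1e6) / iterations;
  console.log(`${label}: ~${nsPerOp.toFixed(0)} ns/op`);
}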

The point isn't the exact nanosecond count — it's that guard overhead is 5-6 orders of magnitude smaller than the LLM call it's protecting.

What I'd Do Differently

  1. Start with Python first. The AI ecosystem runs on Python. I started with TypeScript because my proxy runs on Cloudflare Workers, but standalone adoption would've been faster with Python-first.

  2. Simpler API surface. The TypeScript API exposes individual functions (checkBudget, detectLoopByHash, getDegradationLevel). The Python API has a simpler guard.before() / guard.after() pattern. The Python approach is better for most users.

  3. Skip TF-IDF for v1. Hash match catches 90%+ of real loops. Cosine similarity is cool engineering but hasn't triggered in my testing where hash match didn't already catch it. (To be fair, my test traffic is limited — this may change with more diverse usage patterns.)

Try It

npx reivo-guard-demo  # Interactive demo

GitHub: github.com/tazsat0512/reivo-guard — MIT licensed, TypeScript + Python.


If you've had your own runaway agent story, I'd love to hear it in the comments.

Top comments (6)

Michael "Mike" K. Saleme

Good timing. We've been running constitutional governance for autonomous agents in production since January, and the question of "what stops the guardrail from becoming a false-negative factory" is still the hard part.

What's your failure mode when the guardrail itself throws an exception? Fail-open or fail-closed?

Most implementations default to fail-open (let the agent proceed), which defeats the purpose. We landed on fail-closed with explicit override paths. But that creates a different problem: legitimate actions get blocked, and the human path becomes the bottleneck.

Curious how you're handling that tradeoff.

tazsat0512

Great question — this is genuinely one of the harder design decisions.

Currently reivo-guard is implicitly fail-closed: if the guard itself throws an unexpected exception, it propagates to the caller and the LLM call doesn't proceed. This is a side effect of the library design (guard runs inline before the call), not a deliberate policy choice.

For the raise_on_block=False path, before() returns a GuardDecision and intentional blocks are data, not exceptions. But an unhandled bug in the guard code itself — say a ZeroDivisionError in the EWMA variance calculation — would crash the caller. So it's fail-closed by accident, not by design.

I think you're right that fail-closed is the correct default for guardrails. The whole point is preventing runaway spend — a guardrail that silently disappears when it hits an edge case is worse than useless, it's a false sense of security.

That said, I don't yet have the explicit override path you're describing. The honest answer is: this library is at v0.3 and hasn't hit the "legitimate actions blocked at scale" problem because it doesn't have enough users yet. Your constitutional governance setup is further along on this axis than I am.

The design I'm considering:

guard = Guard(
    on_guard_error="closed",  # or "open", or a callable
)

Where "closed" = block + log, "open" = allow + log + alert, and a callable lets you implement circuit-breaker patterns (e.g., fail open after N consecutive guard errors, then auto-recover).

The false-negative factory concern is real too. Right now the EWMA z-score and CUSUM detectors auto-calibrate from running statistics — which means a slowly drifting baseline could normalize what should be anomalous. We partially address this with CUSUM (designed specifically for gradual drift), but the "guardrail that learns to accept the new normal" failure mode is something I'm still thinking about. Periodic baseline resets? External reference thresholds? Curious what you've landed on for that.

Would love to hear more about how you're handling the human-override bottleneck. That feels like the next problem I'll hit once there are real users.

Michael "Mike" K. Saleme

Great breakdown on the design tradeoffs. A few things from our side:

On the "guardrail that learns to accept the new normal" - this is exactly what we formalized as Normalization of Deviance in multi-agent systems (DOI: 10.5281/zenodo.19195516 (doi.org/10.5281/zenodo.19195516)). The core finding: a 19-day silent failure where all telemetry read healthy but output was zero. EWMA and CUSUM both missed it because the drift was within variance. What caught it was stateful session tracking - comparing what the agent did against what it should have done at the constitutional level, not just statistical baselines.

Periodic baseline resets help but they are fragile (reset too early and you lose trend data, too late and you have already normalized the deviation). External reference thresholds are better - we use hard constitutional constraints that do not adapt:
"This agent must never execute more than N tool calls per session," regardless of what the running average says.

On the human-override bottleneck, we route through escalation chains with time-bounded auto-approve. If a human does not respond within T seconds, the system defaults to the constitutional constraint (fail-closed). This prevents the "human is the bottleneck" problem without removing the human from the loop entirely. The tradeoff: some legitimate actions get delayed by T seconds. In practice, T=30s works for most agent operations.

Your on_guard_error callable pattern is smart. Circuit-breaker with auto-recover is the right architecture - just make sure the "open" window logs enough context to reconstruct what happened. We have seen cases where the guard error itself was the interesting signal (malformed input that the guard could not parse = likely adversarial).

Published two more preprints today that touch on these exact questions - the anchor paper on protocol-level testing (DOI: 10.5281/zenodo.19343034) and the community scaling paper (DOI: 10.5281/zenodo.19343108).
Happy to compare notes further.

tazsat0512

Thanks for sharing those preprints — read through both over the weekend.

The normalization of deviance framing resonated. We've already seen a version of this in testing: EWMA adapts to a slowly drifting baseline and stops flagging what should be anomalous. Your point about constitutional constraints (hard limits that don't bend to statistics) is the right fix for that failure mode.

Since your first comment, we shipped v0.2.0 with some changes that touch on this:

  • Guard class with before/after pattern — single entry point that runs budget, loop, anomaly, and rate checks inline (fail-closed by default, crashes the caller on unhandled errors)
  • abs(z-score) + warmup for anomaly detection — catches negative spikes too, and suppresses false positives during cold start
  • Rate limiting with slot-only-on-allow — blocked requests don't consume rate budget

What's not there yet but planned:

  • on_guard_error policy (closed/open/callable) with circuit-breaker recovery — your suggestion to treat guard errors themselves as adversarial signals is a good design principle
  • Constitutional constraints as a separate layer from statistical detection — hard caps like "max N tool calls per session" that never auto-calibrate away

One question on your escalation chain design: with T=30s auto-approve, how do you handle the case where the human reviewer is available but slow (say, reading context for 45s)? Does the timer reset on interaction, or is it a hard deadline?

Zero runtime deps btw — the whole library is self-contained. Felt relevant given this week's axios incident.

Michael "Mike" K. Saleme

On the timer: hard deadline with interaction resets. Any reviewer action (opens alert, sends "hold") pauses the countdown and adds another T. Capped at 3 extensions, then fails closed with audit log. 85% resolve in the first window, 3% hit the cap; almost always off-hours.

Your v0.2.0 changes track with what we learned. One thing to watch: keep the constitutional constraint layer structurally separate from statistical detection, not just logically. If they share state, the adaptive layer will eventually influence the hard caps through config drift. We learned that the hard way.

tazsat0512

Great insight on keeping constraint layers isolated — config drift between layers is exactly the kind of subtle failure that's hard to catch in testing but devastating in production.

This is shaping our v0.3 design. We're leaning toward each guard layer owning its own config snapshot rather than sharing mutable state. Your "hard deadline + interaction reset" pattern is something I want to implement as a first-class option.

Really appreciate the production war stories — this kind of feedback from real deployments is invaluable for an early-stage OSS project. If you ever want to try Reivo Guard on a staging workload, I'd love to hear how it holds up. 🙏