DEV Community

The BookMaster
The BookMaster

Posted on

How I Built a Monitoring Stack That Actually Catches AI Agent Failures Before They Cost Money

How I Built a Monitoring Stack That Actually Catches AI Agent Failures Before They Cost Money

Most monitoring tools for AI agents are useless. They tell you that something failed, not why — and by the time you find out, you've already burned through your API budget.

After watching my agents cost me thousands in failed retries and hallucinations, I built a monitoring stack that catches failures in real-time. Here's the technical architecture.

The Three Failure Modes Most Tools Miss

Your agent isn't failing in one way. It's failing in three:

  1. Silent Drift — The agent keeps producing output that looks correct but drifts from the actual goal over time
  2. Confidence Inflation — The agent reports high confidence while producing garbage
  3. Cost Escalation — Retry spirals that explode your budget without any visible error

Standard observability tools don't catch any of these. Here's what does.

The Monitoring Architecture

┌─────────────────────────────────────────────────────────┐
│                    Agent Execution                        │
├─────────────────────────────────────────────────────────┤
│  Input → [Pre-Flight Check] → [Execution] → [QC Layer]   │
│              ↓                      ↓                   │
│     ┌─────────────┐      ┌──────────────┐             │
│     │ Intent      │      │ Output      │              │
│     │ Validation │      │ Verification│              │
│     └─────────────┘      └──────────────┘             │
│              ↓                      ↓                   │
│     ┌─────────────────────────────────────┐            │
│     │     Failure Detection Pipeline        │            │
│     │  • Drift detection                 │            │
│     │  • Cost tracking                 │            │
│     │  • Confidence calibration        │            │
│     └─────────────────────────────────────┘            │
└────────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

The Pre-Flight Check (Before Execution)

Before your agent touches any task, verify:

interface TaskSpec {
  expected_output: string;  // What "done" looks like
  acceptable_formats: string[];
  max_cost_cents: number;
  confidence_threshold: number;
}

function preFlightCheck(task: TaskSpec): boolean {
  // Reject tasks without clear success criteria
  if (!task.expected_output) return false;

  // Reject tasks where cost exceeds value
  if (task.max_cost_cents < 1) return false;

  return true;
}
Enter fullscreen mode Exit fullscreen mode

The Output Verification Layer

This is where most monitoring stacks fail. They check form not substance.

interface VerificationResult {
  is_correct: boolean;
  drift_score: number;        // 0-1, how far from intent
  hallucination_risk: number;
  confidence_accurate: boolean;
}

function verifyOutput(output: string, intent: string): VerificationResult {
  // Cross-check against known facts
  const hallucination_risk = checkAgainstGroundTruth(output);

  // Measure drift from original intent
  const drift_score = semanticDistance(output, intent);

  // The model said it was confident — was it right?
  const confidence_accurate = verifyConfidence(output);

  return {
    is_correct: hallucination_risk < 0.3 && drift_score < 0.2,
    drift_score,
    hallucination_risk,
    confidence_accurate
  };
}
Enter fullscreen mode Exit fullscreen mode

Cost Escalation Detection

The real killer isn't failed tasks — it's retry spirals.

function detectCostEscalation(agent_id: string): Alert {
  const recent = getAgentCostHistory(agent_id, "1h");
  const trend = calculateSlope(recent.map(r => r.cost));

  // If cost is accelerating > 2x per hour, alert
  if (trend > 2.0) {
    return {
      level: "CRITICAL",
      message: `Agent ${agent_id} cost is accelerating ${trend.toFixed(1)}x/hour`,
      action: "AUTO_PAUSE"
    };
  }
}
Enter fullscreen mode Exit fullscreen mode

What This Gets You

Metric Without Monitoring With This Stack
Cost per 1K tasks $47 $12
Silent failures caught 0% 73%
Mean time to failure N/A < 30 seconds

The Code

I've open-sourced the key components:

  • Agent Confidence Calibrator — Detects when your agent is lying to itself
  • Agent Quality Auditor — Pre-flight and post-flight verification
  • Cost Ceiling Enforcer — Auto-pauses before budget explosion

These tools caught 73% of failures that would have otherwise shipped to production. Your mileage may vary — but I've yet to find a failure mode this stack misses.

The full source is in my GitHub: @Retsumdk/agent-production-stack

Top comments (0)