Abdul Rehman

Posted on Jun 19

AI Agents in Production: Error Handling, Fallbacks, and Cost Control

#aiagents #llm #production #errors

I watched an LLM pipeline burn $400 in 90 minutes once. Not because the model was expensive, but because a single unhandled 429 rate-limit error triggered an infinite retry loop against GPT-4. No fallback. No circuit breaker. No cost alert. Just a runaway process that kept hammering the API until the billing dashboard lit up.

That was early in my job board platform work, where I was processing 10,000+ job listings daily through an LLM scoring pipeline. The system worked great in testing. In production, it found every edge case the API could throw at it.

Here's what I learned about making AI agents actually reliable.

The Retry Pattern That Doesn't Burn Money

Most retry logic I see in production code is naive. A try-catch wrapper with a fixed delay and a prayer. That works until you hit a sustained outage and every retry fires at the same interval, creating a thundering herd against an already struggling API.

The fix is exponential backoff with jitter. But the important part isn't the math, it's the circuit breaker on top of it.

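interface RetryConfig {
  maxRetries: number;              // total attempts per call
  baseDelayMs: number;             // delay before the first retry
  maxDelayMs: number;              // upper bound on any single delay
  circuitBreakerThreshold: number; // consecutive failures before the circuit opens
  circuitBreakerResetMs: number;   // how long the circuit stays open
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

class LLMClient {
  private consecutiveFailures = 0;
  private circuitOpen = false;
  private circuitOpenAt = 0;

  // The provider call is injected here; pass in your own SDK wrapper.
  constructor(private callLLM: (prompt: string) => Promise<string>) {}

  async callWithRetry(
    prompt: string,
    config: RetryConfig
  ): Promise<string> {
    // While the circuit is open, fail fast instead of touching the API.
    if (this.circuitOpen) {
      const elapsed = Date.now() - this.circuitOpenAt;
      if (elapsed < config.circuitBreakerResetMs) {
        throw new Error('Circuit breaker open, skipping request');
      }
      // The reset window has passed: close the circuit and try again.
      this.circuitOpen = false;
      this.consecutiveFailures = 0;
    }

    for (let attempt = 0; attempt < config.maxRetries; attempt++) {
      try {
        const result = await this.callLLM(prompt);
        this.consecutiveFailures = 0;
        return result;
      } catch (error) {
        this.consecutiveFailures++;
        if (this.consecutiveFailures >= config.circuitBreakerThreshold) {
          this.circuitOpen = true;
          this.circuitOpenAt = Date.now();
          throw error;
        }
        // Don't sleep after the final attempt; there is nothing left to retry.
        if (attempt === config.maxRetries - 1) break;
        // Exponential backoff plus up to 1 s of jitter, capped at maxDelayMs.
        const delay = Math.min(
          config.baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000,
          config.maxDelayMs
        );
        await sleep(delay);
      }
    }
    throw new Error('Max retries exceeded');
  }
}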

The circuit breaker is the key. After N consecutive failures, the client stops trying entirely for a window. This prevents the cascade where every queued job hits a dead API simultaneously, each one retrying, each one burning time and money.
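To make the delay math concrete, here is a minimal sketch of the backoff calculation on its own. The `backoffDelay` helper and its injectable `random` parameter are illustrative names, not part of the client above. It also uses a slightly different variant: it caps the exponential term first and then applies "full jitter" (a random delay between 0 and the cap), which spreads retries more evenly than adding a fixed-size jitter.

```typescript
// Hypothetical helper: computes the wait before retry number `attempt` (0-based).
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random
): number {
  // Exponential growth: base, 2x base, 4x base, ... capped at maxDelayMs.
  const ceiling = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
  // Full jitter: pick a uniformly random delay in [0, ceiling).
  return Math.floor(random() * ceiling);
}

// Upper bounds for the first five attempts with a 500 ms base and a 5000 ms cap.
const ceilings = [0, 1, 2, 3, 4].map((attempt) =>
  Math.min(500 * 2 ** attempt, 5000)
);
// ceilings is [500, 1000, 2000, 4000, 5000]
```

The cap matters more than it looks: without it, ten failed attempts at a 500 ms base would ask for a delay of over eight minutes.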

On the job board platform, this pattern cut our LLM-related error rate from about 4% of calls to under 0.1%. The remaining errors were genuine API outages that we couldn't do anything about anyway.

Fallback Chains Across Providers

One provider is a single point of failure. Two providers with a fallback chain is a system that keeps running when things break.

I built a fallback chain for a client project that needed reliable structured extraction from legal documents. The primary model was GPT-4o for accuracy. The fallback was Claude 3.5 Sonnet. The last resort was Gemini 2.0 Flash, which was fast and cheap but less reliable for the specific extraction schema.

type ProviderConfig = {
  name: string;
  call: (prompt: string) => Promise<string>;
  costPerCall: number;
  timeoutMs: number;
};

const fallbackChain: ProviderConfig[] = [
  { name: 'gpt-4o', call: callOpenAI, costPerCall: 0.015, timeoutMs: 30000 },
  { name: 'claude-3.5', call: callAnthropic, costPerCall: 0.012, timeoutMs: 45000 },
  { name: 'gemini-flash', call: callGemini, costPerCall: 0.001, timeoutMs: 20000 },
];

async function callWithFallback(prompt: string): Promise<{
  result: string;
  provider: string;
  cost: number;
}> {
  for (const provider of fallbackChain) {
    try {
      const result = await withTimeout(provider.call(prompt), provider.timeoutMs);
      return { result, provider: provider.name, cost: provider.costPerCall };
    } catch (error) {
      logFallbackEvent(provider.name, error);
      continue;
    }
  }
  throw new Error('All providers failed');
}

The important detail: each provider has a different timeout. GPT-4o is fast but expensive. Claude is slower but more reliable for certain tasks. Gemini Flash is the cheap safety net. If the primary times out after 30 seconds, the fallback gets 45 seconds because it needs more time to produce the same quality.
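`callWithFallback` relies on a `withTimeout` helper that isn't shown. A minimal sketch of what such a helper might look like, assuming plain promises and no SDK-level cancellation:

```typescript
// Hypothetical helper: rejects if `promise` does not settle within `timeoutMs`.
// Note: this only stops waiting; it does not cancel the underlying request.
function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
    // Whichever happens first wins; clear the timer so it doesn't linger.
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}
```

One consequence worth keeping in mind: the timeouts in a chain add up. With the values above, a request that times out on every provider waits 30 + 45 + 20 = 95 seconds before it sees 'All providers failed'.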

This pattern kept the document extraction pipeline running through multiple OpenAI outages. The client never noticed because Claude picked up the slack within seconds.

Cost Monitoring That Actually Catches Problems

Structured logging is where most teams stop. They log the request, the response, the latency. But they don't log the cost per call, the model used, or the fallback chain depth.

I added a structured log entry for every LLM call in the job board pipeline:

interface LLMCallLog {
  timestamp: string;
  model: string;
  provider: string;
  promptTokens: number;
  completionTokens: number;
  cost: number;
  latencyMs: number;
  fallbackDepth: number;
  success: boolean;
  errorType?: string;
}

This let me build a simple dashboard that showed cost per hour, per model, and per pipeline stage. When the AI description rewrite pipeline was running, I could see exactly which stage was burning money and whether the cost was justified by the output quality.
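The aggregation behind a dashboard like that can be small. Here is a rough sketch, assuming ISO 8601 timestamps; `CostLogEntry` and `costByHourAndModel` are illustrative names, and the sample log entries are made up:

```typescript
// Subset of the LLMCallLog fields that the aggregation needs.
interface CostLogEntry {
  timestamp: string; // ISO 8601, e.g. "2024-01-01T14:05:00Z"
  model: string;
  cost: number; // dollars
}

// Hypothetical helper: sums cost per (hour, model) bucket.
function costByHourAndModel(logs: CostLogEntry[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of logs) {
    // The first 13 characters of an ISO timestamp identify the hour.
    const key = `${log.timestamp.slice(0, 13)}|${log.model}`;
    totals.set(key, (totals.get(key) ?? 0) + log.cost);
  }
  return totals;
}

// Made-up sample data for illustration only.
const sampleLogs: CostLogEntry[] = [
  { timestamp: "2024-01-01T14:05:00Z", model: "gpt-4o", cost: 0.5 },
  { timestamp: "2024-01-01T14:40:00Z", model: "gpt-4o", cost: 0.25 },
  { timestamp: "2024-01-01T14:41:00Z", model: "gemini-flash", cost: 0.125 },
  { timestamp: "2024-01-01T15:02:00Z", model: "gpt-4o", cost: 1 },
];
const totals = costByHourAndModel(sampleLogs);
// totals.get("2024-01-01T14|gpt-4o") is 0.75
```

Adding a pipeline-stage field to the key would give the per-stage view in the same way.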

The real value came from tracing anomalies. When cost spiked, I could trace it to a specific model, a specific prompt pattern, or a specific error type. With this logging feeding a simple cost alert, the $400 incident I mentioned earlier would have been caught in under 5 minutes.
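A spike check on top of those logs doesn't need to be sophisticated. As a rough sketch, assuming you already have a list of hourly cost totals (the `isCostSpike` name and the default multiplier of 3 are arbitrary choices for illustration):

```typescript
// Hypothetical check: flags the latest hour if it costs more than
// `multiplier` times the average of the preceding hours.
function isCostSpike(hourlyCosts: number[], multiplier = 3): boolean {
  if (hourlyCosts.length < 2) return false; // no baseline to compare against
  const latest = hourlyCosts[hourlyCosts.length - 1];
  const history = hourlyCosts.slice(0, -1);
  const baseline = history.reduce((sum, c) => sum + c, 0) / history.length;
  return latest > baseline * multiplier;
}

// A steady $2/hour pipeline that suddenly spends $20 in an hour is flagged;
// a drift to $3 is not.
const spike = isCostSpike([2, 2, 2, 20]);
const drift = isCostSpike([2, 2, 2, 3]);
```

For a runaway retry loop, running the same check on a shorter window, such as per minute, would shorten the time to detection.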

Idempotency Guarantees for LLM Operations

LLM calls are not idempotent by nature. Same prompt, same model, different output. But the operations triggered by those outputs need to be idempotent, especially when retries are involved.

On the job board platform, the scoring pipeline was meant to process each listing exactly once. But if a retry fired after the first call had succeeded and before the response was saved, the listing was scored twice, consuming double the tokens and potentially overwriting the first score.

The fix was to check for an existing score before processing, and to tag each attempt with a request ID:

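async function scoreListing(listingId: string): Promise<ScoreResult> {
  // Idempotency check: reuse the stored score instead of calling the LLM again.
  const existingScore = await db.scores.findOne({ listingId });
  if (existingScore) {
    return existingScore;
  }

  // Tag this attempt so the stored row can be traced back to one call.
  const requestId = crypto.randomUUID();
  // retryConfig: a RetryConfig value, assumed to be defined elsewhere.
  const result = await llmClient.callWithRetry(
    buildScoringPrompt(listingId),
    retryConfig
  );

  // Assumes ScoreResult has this record shape, matching what findOne returns.
  const record = { listingId, requestId, score: result, createdAt: new Date() };
  await db.scores.insertOne(record);

  return record;
}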

The stored score row, looked up by listingId, is what makes the operation idempotent. If the retry fires after the insert succeeds but before the function returns, the next call finds the existing score and returns it without calling the LLM again. This pattern eliminated duplicate scoring entirely.
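One caveat with check-then-insert: the lookup and the insert are separate steps, so two attempts running at the same moment can both miss the existing row. A unique constraint on listingId closes that window, because the second insert is rejected instead of silently creating a duplicate. The sketch below illustrates the idea with an in-memory stand-in for the table; `ScoreStore` and `saveScoreOnce` are hypothetical names, and a real database would surface the conflict as a duplicate-key error rather than a boolean.

```typescript
// In-memory stand-in for a table with a unique index on listingId.
class ScoreStore {
  private rows = new Map<string, number>();

  // Returns true if the row was inserted, false if the listing already had one.
  insertIfAbsent(listingId: string, score: number): boolean {
    if (this.rows.has(listingId)) return false;
    this.rows.set(listingId, score);
    return true;
  }

  get(listingId: string): number | undefined {
    return this.rows.get(listingId);
  }
}

// Hypothetical save step: the first writer wins, later writers read the
// stored score instead of overwriting it.
function saveScoreOnce(
  store: ScoreStore,
  listingId: string,
  score: number
): number {
  store.insertIfAbsent(listingId, score);
  return store.get(listingId) as number;
}

const store = new ScoreStore();
const first = saveScoreOnce(store, "listing-1", 7);
const second = saveScoreOnce(store, "listing-1", 9); // duplicate attempt
// first is 7 and second is 7: the duplicate did not overwrite the score
```

This still spends tokens on the duplicate call; it only guarantees that the stored result stays stable.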

The Hidden Cost of Reliability

Building these patterns costs development time. The retry logic, the fallback chain, the structured logging, the idempotency checks, all of it adds complexity. But the alternative is worse.

I've seen teams spend weeks debugging intermittent failures that turned out to be a missing circuit breaker. I've seen cost overruns that could have been caught with a simple logging dashboard. I've seen production outages that a fallback chain would have prevented entirely.

The gap between a demo and a production deployment is exactly this work. The demo works because the API is up, the load is low, and the inputs are clean. Production is where APIs go down, rate limits hit, and inputs arrive in every possible broken format.

If your team is building AI agents and finding that reliability is the bottleneck between demo and deployment, that's the kind of thing I help with. Happy to compare notes on what's breaking in your pipeline.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.

Top comments (1)

Armorer Labs • Jun 22

The retry classifier point is worth sharpening. Rate-limit (429) and 5xx deserve exponential backoff, but 4xx errors — context-length-exceeded, content-policy-rejected, bad-request — are deterministic; retrying them just wastes tokens and surfaces the same error. Same for billing errors (insufficient_quota): a retry loop on a depleted quota can drain a budget in minutes, so we classify errors at the call site before they reach the retry counter.
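A minimal sketch of that kind of classifier, for illustration. The `ApiError` shape and the `classifyError` name are assumptions, and real SDKs expose status and error codes in their own ways. The `insufficient_quota` check comes first because some providers report a depleted quota with a 429 status.

```typescript
type RetryDecision = "retry" | "fail_fast";

// Hypothetical error shape: an HTTP status plus an optional provider error code.
interface ApiError {
  status: number;
  code?: string;
}

function classifyError(error: ApiError): RetryDecision {
  // A depleted quota will not recover on its own, whatever the status says.
  if (error.code === "insufficient_quota") return "fail_fast";
  // Rate limits and server errors are transient: back off and retry.
  if (error.status === 429 || error.status >= 500) return "retry";
  // Remaining 4xx errors are deterministic for the same request.
  return "fail_fast";
}

const rateLimited = classifyError({ status: 429 });
const badRequest = classifyError({ status: 400 });
// rateLimited is "retry", badRequest is "fail_fast"
```

Only errors classified as "retry" would then count toward the retry loop and the circuit breaker.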

Two notes on the idempotency pattern: (1) the DB unique index on requestId is doing more work than the application-level check — the in-app check is racy if two retries land in parallel; the unique index makes the second insert fail deterministically. (2) For LLM scoring specifically, idempotency-by-result-hash can be stronger than idempotency-by-requestId because the same prompt on a non-deterministic model can still produce different output; keying on (listingId, scoreHash) prevents the silent re-score.

Disclosure: I work on Armorer Labs — we build an open-source local control plane for running AI agents as managed apps, so 'cost per pipeline stage' attribution is the bit I've spent the most time staring at. Per-call logs without per-stage attribution make anomalies very hard to localize.