Saurav Bhattacharya

Posted on Jun 6

Your AI Agent Drifted Last Night and You Didn't Notice

#ai #agents #testing #devops

The Silent Failure Mode

Your agent passed every test in CI. It ran fine in staging. Then it quietly started returning subtly wrong answers in production at 2 AM, and nobody noticed until a customer complained three days later.

This is agent drift — the gradual degradation of agent output quality without any hard failures. No exceptions thrown. No schema violations. Just slowly worsening responses that slip past your monitoring.

Here's my thesis: the hardest production failures aren't crashes — they're quality degradation that happens between your eval checkpoints. You need continuous runtime detection, not just pre-deployment testing.

Three Drift Vectors

After running production agents 24/7, I've identified three distinct drift patterns:

1. Context Staleness

Your agent pulls context from documents, knowledge bases, and API responses. That context decays as the underlying sources change:

interface StalenessDetector {
  source: string;
  maxAgeMs: number;
  check: (context: RetrievedContext) => StalenessResult;
}

const stalenessChecks: StalenessDetector[] = [
  {
    source: 'knowledge-base',
    maxAgeMs: 24 * 60 * 60 * 1000, // 24 hours
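    check: (ctx) => {
      const now = Date.now();
      const dayMs = 24 * 60 * 60 * 1000;
      // Age of the index itself, compared against this source's 24-hour maxAgeMs.
      const indexAge = now - ctx.lastIndexedAt;
      const chunkAges = ctx.chunks.map(c => now - c.sourceLastModified);
      const staleCount = chunkAges.filter(age => age > 7 * dayMs).length;
      // Guard the empty-retrieval case: 0 / 0 is NaN and Math.max() with no arguments is -Infinity.
      const staleFraction = chunkAges.length > 0 ? staleCount / chunkAges.length : 0;
      return {
        stale: indexAge > dayMs || staleFraction > 0.3,
        staleFraction,
        oldestChunkAge: chunkAges.length > 0 ? Math.max(...chunkAges) : 0,
        recommendation: staleCount > 0
          ? `${staleCount} chunks older than 7 days`
          : indexAge > dayMs
            ? 'index older than 24 hours'
            : 'fresh'
      };
    }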
  },
  {
    source: 'api-response-cache',
    maxAgeMs: 60 * 60 * 1000, // 1 hour
    check: (ctx) => {
      const cached = ctx.apiResponses.filter(r => r.fromCache);
      const expired = cached.filter(r => Date.now() - r.cachedAt > 60 * 60 * 1000);
      return {
        stale: expired.length > 0,
        staleFraction: expired.length / Math.max(cached.length, 1),
        recommendation: expired.length > 0
          ? `${expired.length} cached API responses expired`
          : 'fresh'
      };
    }
  }
];

The insidious part: stale context doesn't cause errors. Your agent happily generates confident answers based on outdated information. The output looks fine — it's just wrong.
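The monitoring loop later in this post calls a checkStaleness helper that is never shown. Here is a minimal sketch of one way it could work, assuming each detector receives the current time as an argument so the result is reproducible in tests. The SourceCheck, SourceReport, and ChunkContext types and the chunkAgeCheck detector are hypothetical stand-ins, not the types used above.

```typescript
// Hypothetical shapes: they mirror the fields used by the detectors in this
// post, but the post's real types are not shown.
interface StalenessResult {
  stale: boolean;
  staleFraction: number;
  recommendation: string;
}

interface SourceCheck<C> {
  source: string;
  check: (context: C, now: number) => StalenessResult;
}

interface SourceReport extends StalenessResult {
  source: string;
}

// Run every detector against the same context and tag each result with its
// source, so the caller knows which source to refresh.
function checkStaleness<C>(
  context: C,
  checks: SourceCheck<C>[],
  now: number = Date.now(),
): SourceReport[] {
  return checks.map(({ source, check }) => ({ source, ...check(context, now) }));
}

// Example context: chunks carrying a last-modified timestamp (ms since epoch).
interface ChunkContext {
  chunks: { sourceLastModified: number }[];
}

const DAY_MS = 24 * 60 * 60 * 1000;

const chunkAgeCheck: SourceCheck<ChunkContext> = {
  source: 'knowledge-base',
  check: (ctx, now) => {
    const total = ctx.chunks.length;
    const staleCount = ctx.chunks.filter(c => now - c.sourceLastModified > 7 * DAY_MS).length;
    const staleFraction = total === 0 ? 0 : staleCount / total;
    return {
      stale: staleFraction > 0.3,
      staleFraction,
      recommendation: staleCount > 0 ? `${staleCount} chunks older than 7 days` : 'fresh',
    };
  },
};

// One chunk is 10 days old and one is 1 day old, so half the context is stale.
const sampleNow = Date.now();
const stalenessReports = checkStaleness(
  { chunks: [{ sourceLastModified: sampleNow - 10 * DAY_MS }, { sourceLastModified: sampleNow - DAY_MS }] },
  [chunkAgeCheck],
  sampleNow,
);
console.log(stalenessReports);
```

Tagging each result with its source is a design choice: it lets the refresh step target only the sources that went stale instead of rebuilding all context.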

2. Behavioral Drift

The agent's response patterns shift over time. Maybe the underlying model got a silent update. Maybe prompt injection attempts are subtly reshaping behavior. Maybe token distributions are shifting due to accumulated conversation context.

interface DriftBaseline {
  dimension: string;
  expectedDistribution: { mean: number; stddev: number };
  windowSize: number;
}

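class BehavioralDriftDetector {
  private baselines: Map<string, DriftBaseline> = new Map();
  private observations: Map<string, number[]> = new Map();

  // Register (or replace) the baseline for a dimension. Without one, observe() returns null.
  setBaseline(baseline: DriftBaseline): void {
    this.baselines.set(baseline.dimension, baseline);
    this.observations.delete(baseline.dimension); // start a fresh window
  }

  observe(dimension: string, value: number): DriftAlert | null {
    const baseline = this.baselines.get(dimension);
    if (!baseline) return null;

    // Rolling window of the most recent observations for this dimension.
    const recent = this.observations.get(dimension) ?? [];
    recent.push(value);
    if (recent.length > baseline.windowSize) recent.shift();
    this.observations.set(dimension, recent);

    // Wait until the window is at least half full before judging drift.
    if (recent.length < baseline.windowSize * 0.5) return null;

    // A zero-variance baseline would make the z-score Infinity or NaN.
    if (baseline.expectedDistribution.stddev <= 0) return null;

    const currentMean = recent.reduce((a, b) => a + b, 0) / recent.length;
    const zScore = Math.abs(currentMean - baseline.expectedDistribution.mean)
      / baseline.expectedDistribution.stddev;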

    if (zScore > 2.5) {
      return {
        dimension,
        severity: zScore > 4 ? 'critical' : 'warning',
        currentMean,
        expectedMean: baseline.expectedDistribution.mean,
        zScore,
        message: `${dimension} drifted ${zScore.toFixed(1)} sigma from baseline`
      };
    }
    return null;
  }
}

The key insight: you're not evaluating individual outputs. You're evaluating the distribution of outputs over time. A single long response means nothing. A gradual increase in average response length across 100 runs? That's signal.
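To make the difference concrete, here is a small worked sketch that compares a single extreme response with a sustained moderate shift. The baseline numbers (mean 400 tokens, standard deviation 50, window of 20) are made up for illustration, and makeWindowedDetector is a hypothetical, stripped-down version of the detector above.

```typescript
// Minimal rolling-window mean check. Names and thresholds are illustrative.
interface WindowBaseline {
  mean: number;
  stddev: number;
  windowSize: number;
}

// Returns a function that records one value and yields the z-score of the
// window mean when it exceeds the threshold, or null otherwise.
function makeWindowedDetector(baseline: WindowBaseline, threshold: number = 2.5) {
  const recent: number[] = [];
  return (value: number): number | null => {
    recent.push(value);
    if (recent.length > baseline.windowSize) recent.shift();
    if (recent.length < baseline.windowSize) return null; // not enough data yet
    const mean = recent.reduce((a, b) => a + b, 0) / recent.length;
    const z = Math.abs(mean - baseline.mean) / baseline.stddev;
    return z > threshold ? z : null;
  };
}

const lengthBaseline: WindowBaseline = { mean: 400, stddev: 50, windowSize: 20 };

// Case 1: nineteen typical responses followed by one 2,000-token outlier.
// Judged alone, the outlier sits (2000 - 400) / 50 = 32 sigma from the mean,
// but the window mean is (19 * 400 + 2000) / 20 = 480, only 1.6 sigma away.
const observeWithOutlier = makeWindowedDetector(lengthBaseline);
let outlierAlert: number | null = null;
for (let i = 0; i < 19; i++) outlierAlert = observeWithOutlier(400);
outlierAlert = observeWithOutlier(2000);

// Case 2: every response is moderately longer (550 tokens).
// The window mean is 550, which is (550 - 400) / 50 = 3 sigma away.
const observeWithShift = makeWindowedDetector(lengthBaseline);
let shiftAlert: number | null = null;
for (let i = 0; i < 20; i++) shiftAlert = observeWithShift(550);

console.log({ outlierAlert, shiftAlert });
```

The single outlier never crosses the 2.5 sigma line once it is averaged into the window, while the quieter but persistent shift does.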

3. Hallucination Creep

Hallucination rates aren't constant. They vary with input complexity, context quality, and model state. The dangerous pattern: hallucination rate slowly climbs from 2% to 8% over a week, crossing your acceptable threshold without ever triggering a single hard failure.

interface HallucinationCanary {
  name: string;
  detect: (output: AgentOutput, groundTruth: GroundTruth) => HallucinationSignal;
}

const canaries: HallucinationCanary[] = [
  {
    name: 'entity-grounding',
    detect: (output, truth) => {
      const claimedEntities = extractEntities(output.raw);
      const groundedEntities = extractEntities(truth.sourceDocuments);
      const ungrounded = claimedEntities.filter(e => 
        !groundedEntities.some(g => semanticMatch(e, g, 0.85))
      );
      return {
        hallucinated: ungrounded.length > 0,
        ungroundedEntities: ungrounded,
        groundingRate: 1 - (ungrounded.length / Math.max(claimedEntities.length, 1))
      };
    }
  },
  {
    name: 'numeric-consistency',
    detect: (output, truth) => {
      const claimedNumbers = extractNumericClaims(output.raw);
      const sourceNumbers = extractNumericClaims(truth.sourceDocuments);
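      // A claim is consistent if some source number for the same entity is within 5% of it.
      const inconsistent = claimedNumbers.filter(claim =>
        !sourceNumbers.some(src =>
          src.entity === claim.entity &&
          // Scale by |src.value| so negative values compare correctly and a
          // zero source value requires an exact match instead of dividing by zero.
          Math.abs(src.value - claim.value) <= 0.05 * Math.abs(src.value)
        )
      );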
      return {
        hallucinated: inconsistent.length > 0,
        inconsistentClaims: inconsistent,
        consistencyRate: 1 - (inconsistent.length / Math.max(claimedNumbers.length, 1))
      };
    }
  }
];
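Each canary above judges a single output, but the creep described in this section only shows up as a rate over many outputs. One possible way to track that is sketched below, assuming a fixed-size window of sampled canary results and an acceptable rate of 5%. The HallucinationRateTracker class, the feed helper, and the window size are all hypothetical.

```typescript
interface RateReading {
  rate: number;      // fraction of sampled runs flagged as hallucinated
  breached: boolean; // true when the rate exceeds the acceptable maximum
}

// Rolling hallucination rate over the most recent sampled canary results.
class HallucinationRateTracker {
  private readonly outcomes: boolean[] = [];
  private readonly windowSize: number;
  private readonly maxRate: number;

  constructor(windowSize: number, maxRate: number) {
    this.windowSize = windowSize;
    this.maxRate = maxRate;
  }

  // Record one canary outcome. Returns null until the window is full, so a
  // handful of early samples cannot produce a misleading rate.
  record(hallucinated: boolean): RateReading | null {
    this.outcomes.push(hallucinated);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
    if (this.outcomes.length < this.windowSize) return null;
    const flagged = this.outcomes.filter(h => h).length;
    const rate = flagged / this.outcomes.length;
    return { rate, breached: rate > this.maxRate };
  }
}

// Feed `total` outcomes, of which the first `hallucinatedCount` are flagged,
// and return the last reading.
function feed(tracker: HallucinationRateTracker, total: number, hallucinatedCount: number): RateReading | null {
  let last: RateReading | null = null;
  for (let i = 0; i < total; i++) last = tracker.record(i < hallucinatedCount);
  return last;
}

const rateTracker = new HallucinationRateTracker(100, 0.05);
const earlyWeek = feed(rateTracker, 100, 2); // 2 of 100 flagged
const lateWeek = feed(rateTracker, 100, 8);  // 8 of the next 100 flagged
console.log({ earlyWeek, lateWeek });
```

Keeping one tracker per canary name would preserve the category of the failure, so an alert can say whether entities or numbers are the problem.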

The Runtime Monitoring Loop

Detection is one half. The other half is what you do about it. Here's the runtime loop I've converged on:

async function monitorAgentRun(run: AgentRun): Promise<MonitorResult> {
  // 1. Pre-execution: Check context freshness
  const stalenessResults = await checkStaleness(run.context);
  if (stalenessResults.some(r => r.stale)) {
    await refreshStaleContext(run, stalenessResults);
  }

  // 2. Post-execution: Lightweight drift check on every run
  const driftAlerts = trackDimensions(run.output, {
    responseLength: estimateTokens(run.output.raw),
    toolCalls: run.output.toolCallCount,
    latency: run.durationMs,
    confidenceProxy: run.output.metadata?.confidence ?? null
  });

  // 3. Sampled: Hallucination canary (expensive, run on 10% sample)
  let hallucinationResult = null;
  if (Math.random() < 0.1 && run.groundTruthAvailable) {
    hallucinationResult = await runCanaries(run.output, run.groundTruth);
  }

  // 4. Alert on threshold breach
  if (driftAlerts.some(a => a.severity === 'critical')) {
    await alertOncall('agent-drift-critical', driftAlerts);
  }

  return { stalenessResults, driftAlerts, hallucinationResult };
}

Notice the layering: staleness checks are pre-execution (you can fix stale context before the agent runs). Drift detection is post-execution and cheap (runs on every invocation). Hallucination canaries are expensive and sampled.
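The sampling decision itself can be pulled out of the loop so the rate is configurable per risk level. The sketch below is one way to do that, assuming two hypothetical risk tiers and an injectable random source for testing. RiskTier, SamplingPolicy, and shouldRunCanaries are illustrative names, not part of the loop above.

```typescript
type RiskTier = 'standard' | 'high';

interface SamplingPolicy {
  // Fraction of runs to check per tier, between 0 and 1.
  rates: Record<RiskTier, number>;
}

// Decide whether to run the expensive canaries for this run.
// `rand` defaults to Math.random but can be replaced with a fixed function
// so the decision is deterministic in tests.
function shouldRunCanaries(
  tier: RiskTier,
  groundTruthAvailable: boolean,
  policy: SamplingPolicy,
  rand: () => number = Math.random,
): boolean {
  if (!groundTruthAvailable) return false; // canaries need something to compare against
  return rand() < policy.rates[tier];
}

// 10% of standard runs, every high-risk run.
const samplingPolicy: SamplingPolicy = { rates: { standard: 0.1, high: 1.0 } };

const highRiskChecked = shouldRunCanaries('high', true, samplingPolicy, () => 0.999);
const standardSkipped = shouldRunCanaries('standard', true, samplingPolicy, () => 0.5);
const standardChecked = shouldRunCanaries('standard', true, samplingPolicy, () => 0.05);
const noGroundTruth = shouldRunCanaries('high', false, samplingPolicy, () => 0);
console.log({ highRiskChecked, standardSkipped, standardChecked, noGroundTruth });
```

Because Math.random returns values in [0, 1), a rate of 1.0 always samples and a rate of 0 never does.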

What I Got Wrong Initially

Three mistakes I made building this:

1. Alerting on individual outliers. An agent producing one long response isn't drift. I burned weeks chasing false positives before switching to windowed statistical detection.

2. Not versioning baselines. When you intentionally change agent behavior (new prompt, new model), your baselines need to reset. Otherwise every intentional improvement triggers drift alerts.

3. Treating hallucination as binary. "The agent hallucinated" is useless. What did it hallucinate? Entities? Numbers? URLs? The category determines the fix.
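For the second mistake, one possible fix is to key every baseline by the agent version it was measured on. The sketch below assumes a version is identified by a prompt version and a model name. AgentVersion, VersionedBaselines, and the example version strings are hypothetical.

```typescript
interface AgentVersion {
  promptVersion: string;
  model: string;
}

interface BaselineStats {
  mean: number;
  stddev: number;
}

// Stores baselines per (prompt version, model, dimension), so an intentional
// change starts from an empty baseline instead of tripping the old one.
class VersionedBaselines {
  private readonly store = new Map<string, BaselineStats>();

  // JSON-encode the parts so separators inside a version string cannot collide.
  private key(version: AgentVersion, dimension: string): string {
    return JSON.stringify([version.promptVersion, version.model, dimension]);
  }

  set(version: AgentVersion, dimension: string, stats: BaselineStats): void {
    this.store.set(this.key(version, dimension), stats);
  }

  // Returns undefined for a version with no baseline yet, which callers can
  // treat as "still collecting data, do not alert".
  get(version: AgentVersion, dimension: string): BaselineStats | undefined {
    return this.store.get(this.key(version, dimension));
  }
}

const versionedBaselines = new VersionedBaselines();
const oldVersion: AgentVersion = { promptVersion: 'v1', model: 'model-a' };
const newVersion: AgentVersion = { promptVersion: 'v2', model: 'model-a' };

versionedBaselines.set(oldVersion, 'responseLength', { mean: 400, stddev: 50 });

const oldBaseline = versionedBaselines.get(oldVersion, 'responseLength');
const newBaseline = versionedBaselines.get(newVersion, 'responseLength');
console.log({ oldBaseline, newBaseline });
```

After the prompt change, the lookup for the new version comes back empty, so drift checks stay quiet until a fresh baseline has been recorded.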

The Production Checklist

If you're running agents in production without drift monitoring, start here:

1. Track three dimensions minimum: response length, latency, and tool call count. These are cheap proxies that catch gross behavioral changes.

2. Set baselines from your last 7 days of production data, not from test runs. Test distributions don't match production.

3. Alert on 2.5 sigma deviations over rolling windows, not individual outliers.

4. Check context freshness before execution, not after. Stale context is the one drift type you can prevent rather than detect.

5. Sample hallucination checks at 5-10% unless you have specific high-risk outputs that warrant 100% coverage.
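Deriving a baseline from the last 7 days of production data and applying the 2.5 sigma rule can be sketched as follows. This assumes each observation is stored with a millisecond timestamp and uses the sample standard deviation. Observation, baselineFromRecent, isDrifted, and the sample values are hypothetical.

```typescript
interface Observation {
  timestamp: number; // ms since epoch
  value: number;
}

interface RecentBaseline {
  mean: number;
  stddev: number;
  sampleCount: number;
}

const ONE_DAY_MS = 24 * 60 * 60 * 1000;
const WEEK_MS = 7 * ONE_DAY_MS;

// Compute mean and sample standard deviation from observations that fall
// inside the trailing window; older data is ignored.
function baselineFromRecent(
  observations: Observation[],
  now: number,
  windowMs: number = WEEK_MS,
): RecentBaseline {
  const values = observations
    .filter(o => o.timestamp <= now && now - o.timestamp <= windowMs)
    .map(o => o.value);
  if (values.length < 2) throw new Error('need at least two observations in the window');
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const sumSquares = values.reduce((acc, v) => acc + (v - mean) * (v - mean), 0);
  const stddev = Math.sqrt(sumSquares / (values.length - 1));
  return { mean, stddev, sampleCount: values.length };
}

// True when a rolling-window mean sits more than `sigmas` standard deviations
// from the baseline mean.
function isDrifted(windowMean: number, baseline: RecentBaseline, sigmas: number = 2.5): boolean {
  return baseline.stddev > 0 && Math.abs(windowMean - baseline.mean) / baseline.stddev > sigmas;
}

const referenceNow = Date.UTC(2024, 5, 6);
const history: Observation[] = [
  { timestamp: referenceNow - 10 * ONE_DAY_MS, value: 5000 }, // outside the 7-day window
  { timestamp: referenceNow - 5 * ONE_DAY_MS, value: 380 },
  { timestamp: referenceNow - 4 * ONE_DAY_MS, value: 400 },
  { timestamp: referenceNow - 3 * ONE_DAY_MS, value: 420 },
  { timestamp: referenceNow - 2 * ONE_DAY_MS, value: 400 },
  { timestamp: referenceNow - 1 * ONE_DAY_MS, value: 400 },
];

// Mean is 400 and the sample variance is (400 + 0 + 400 + 0 + 0) / 4 = 200.
const productionBaseline = baselineFromRecent(history, referenceNow);
const mildShift = isDrifted(420, productionBaseline);  // about 1.4 sigma
const largeShift = isDrifted(440, productionBaseline); // about 2.8 sigma
console.log({ productionBaseline, mildShift, largeShift });
```

A real baseline would need far more than five observations. The tiny sample here only keeps the arithmetic easy to follow.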

The Uncomfortable Truth

Most teams discover drift through customer complaints. By the time a user says "your AI gave me wrong information," you've likely been serving degraded responses for days or weeks.

The gap between "my agent works" and "my agent works reliably" is entirely about what happens between your evaluation checkpoints. Continuous monitoring isn't optional — it's the difference between running a demo and running a product.

How are you detecting drift in your production agents? Or are you still finding out from users? I'm curious what signals have been most useful for early detection.

Top comments (1)

Andrii Krugliak • Jun 9

The silent part is what makes drift expensive. Passing CI and staging only proves the inputs you imagined, and drift shows up on the inputs you didn't. The thing I keep coming back to is that you can't catch this with more tests, only with a check that re-asks whether the output is still right at runtime.