DEV Community

The BookMaster
I Built a Self-Diagnosing AI Agent That Catches Its Own Mistakes Before They Matter

The Problem Nobody Talks About

Running AI agents in production is a different beast. They hallucinate, drift off-task, and sometimes quietly fail in ways you only notice when a customer emails you three days later.

The real issue? Agents don't know when they're about to fail.

I spent six months building agent infrastructure, and the breakthrough wasn't a better model — it was adding a self-diagnostic layer that flags uncertainty before it compounds.

The Architecture

Here's the core pattern I landed on. Every agent action gets scored against three signals:

  1. Confidence drift — Is the model's confidence dropping across sequential tokens?
  2. Context coherence — Does the output stay grounded in the system prompt and prior context?
  3. Action reversibility — Can we undo this if it turns out wrong?
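The first signal can be made concrete with per-token log probabilities. A minimal sketch, assuming the API exposes logprobs for each generated token (the helper name and input shape here are illustrative, not tied to any particular SDK):

```python
import math

def confidence_drift(token_logprobs, window=8):
    """Mean token probability at the start minus at the end;
    a large positive value means confidence is falling as
    generation proceeds."""
    if len(token_logprobs) < 2 * window:
        return 0.0  # too short to measure drift
    head = [math.exp(lp) for lp in token_logprobs[:window]]
    tail = [math.exp(lp) for lp in token_logprobs[-window:]]
    return sum(head) / window - sum(tail) / window

# Simulated run where per-token probability decays from 0.90 to 0.52
logprobs = [math.log(0.9 - 0.02 * i) for i in range(20)]
print(f"drift: {confidence_drift(logprobs):.2f}")  # drift: 0.24
```

Comparing a head window against a tail window is one simple way to score drift; a production version might use a rolling slope instead.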
```python
class AgentDiagnostics:
    def __init__(self, agent):
        self.agent = agent
        self.confidence_threshold = 0.72
        self.rollback_stack = []

    def run(self, task: str):
        result = self.agent.execute(task)

        # Check confidence before returning
        if result.confidence < self.confidence_threshold:
            # Flag for human review instead of failing silently
            self.flag_for_review(task, result)
            return self.agent.execute_fallback(task)

        # Track reversible actions so they can be undone later
        if result.is_reversible:
            self.rollback_stack.append(result)

        return result

    def flag_for_review(self, task, result):
        # Your alerting logic here (Slack, email, etc.)
        print(f"⚠️ Low confidence ({result.confidence:.2f}) on: {task[:50]}...")
```
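The rollback stack pairs naturally with an unwind step when a downstream check fails. A toy sketch, with a hypothetical `undo()` method standing in for real compensating actions (result objects and names are made up for illustration):

```python
class ReversibleResult:
    """Stand-in for an agent action result that knows how to undo itself."""
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def undo(self):
        # A real implementation would issue a compensating action here
        self.log.append(f"undid {self.name}")

log = []
rollback_stack = [ReversibleResult("step1", log), ReversibleResult("step2", log)]

# Unwind in reverse order: most recent action is undone first
while rollback_stack:
    rollback_stack.pop().undo()

print(log)  # ['undid step2', 'undid step1']
```

Last-in-first-out unwinding matters because later actions often depend on earlier ones.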

What This Caught That Nothing Else Did

After shipping this, I looked at the logs. About 14% of what I thought were "successful" agent runs had been silently degrading — not failing outright, but producing outputs that were 15-20% lower quality than baseline.

That's the invisible tax on agent systems. Not the obvious crashes. The slow drift.
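Catching that slow drift is what the coherence signal is for. A deliberately crude sketch that scores grounding by word overlap (a real system would compare embeddings; every string below is invented for the example):

```python
def coherence_score(output: str, context: str) -> float:
    """Fraction of output words that also appear in the context.
    Low scores suggest the output has drifted off its grounding."""
    out_words = set(output.lower().split())
    ctx_words = set(context.lower().split())
    if not out_words:
        return 0.0
    return len(out_words & ctx_words) / len(out_words)

context = "refund the customer within 30 days per policy"
grounded = "refund the customer within 30 days"
drifted = "escalate to legal and freeze the account"

print(coherence_score(grounded, context))  # 1.0
print(round(coherence_score(drifted, context), 2))
```

Thresholding this score per step is one cheap way to surface degrading runs before they reach a customer.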

The Tool Catalog

I've packaged these diagnostic patterns into ready-to-deploy agent tools:

→ Browse the full catalog here

The marketplace includes:

  • Confidence-scoring middleware for any OpenAI/Anthropic agent
  • Context coherence checker (catches drift before output)
  • Rollback-ready agent framework for production deployments

Stop letting your agents fail quietly. Catch it early.
