Nilofer 🚀

Posted on Apr 29

Agent Failure Classifier: Post-Hoc Root Cause Analysis for Failed LLM Agent Runs

#llm #opensource #machinelearning #agents

When an LLM agent fails, the trace is right there: the user turns, the tool calls, the responses, the final result. But knowing what happened and knowing why it failed are two different things. Most teams read traces manually, form a guess, and move on.

Agent Failure Classifier is a CLI tool and Python library for post-hoc root cause analysis of failed or low-quality LLM agent runs. Feed it any agent trace and it classifies the failure into one of eight named failure modes, identifies the first turn where things went wrong, and produces a structured report with actionable fixes.

The classifier combines eight fast rule-based detectors with an optional LLM-as-judge pass via OpenRouter. The rule-based layer is free, deterministic, and requires no network access. The LLM pass breaks ties and classifies traces the rules cannot resolve alone.

The Eight Failure Modes

The classifier recognises exactly eight failure modes, each with a precise definition:

HALLUCINATION: Agent stated facts or called tools that do not exist
TOOL_MISUSE: Agent called a real tool with wrong parameters or at the wrong time
CONTEXT_LOSS: Agent forgot earlier decisions or repeated already-completed steps
CIRCULAR_REASONING: Agent looped between the same 2-3 steps without making progress
GOAL_DRIFT: Agent started pursuing a sub-goal and forgot the original task
OVER_REFUSAL: Agent refused an action it was capable of and should have taken
SCHEMA_ERROR: Agent generated malformed JSON for a tool call or structured output
TIMEOUT_CASCADE: One slow tool call caused the agent to rush or skip subsequent steps

These are not fuzzy categories. Each one maps to a specific detector with specific signals. A hallucination is flagged when the agent asserts a factual claim without invoking any retrieval tool. A timeout cascade is flagged when a tool call exceeds a latency threshold and the subsequent agent turn is unusually short relative to the tool output.

How It Works

The classification pipeline runs in two layers.

The rule-based layer runs eight deterministic detectors over the trace. Each detector looks for specific structural signals: repeated tool calls with identical inputs, cycles in agent turn content, latency spikes followed by short responses, and malformed JSON in tool call outputs. This layer runs offline, requires no API key, and classifies all eight failure modes.
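Two of these structural signals are easy to illustrate. The sketch below is not the project's detector code. The function names and the dict-based turn layout (role, tool_name, tool_input, tool_output, turn_number) are assumptions modelled on the trace fields shown in this post.

```python
import json
from typing import Any, Dict, List, Optional

Turn = Dict[str, Any]


def first_malformed_json_turn(turns: List[Turn]) -> Optional[int]:
    """Return the turn_number of the first tool turn whose tool_output fails to parse as JSON."""
    for turn in turns:
        output = turn.get("tool_output")
        if turn.get("role") != "tool" or not isinstance(output, str):
            continue
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return turn["turn_number"]
    return None


def first_repeated_tool_call(turns: List[Turn]) -> Optional[int]:
    """Return the turn_number of the first tool call that repeats an earlier (tool_name, tool_input) pair."""
    seen = set()
    for turn in turns:
        if turn.get("role") != "tool":
            continue
        # Serialise the input with sorted keys so equal dicts compare equal.
        key = (turn.get("tool_name"), json.dumps(turn.get("tool_input"), sort_keys=True))
        if key in seen:
            return turn["turn_number"]
        seen.add(key)
    return None
```

Both helpers return None when the signal is absent, so a caller can treat any non-None value as a candidate first-failure turn.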

The LLM-as-judge layer is optional. When enabled, it receives traces the rule-based layer couldn't resolve with high confidence and breaks ties. The judge runs via OpenRouter and can be pointed at any OpenRouter model or a local OpenAI-compatible server (Ollama, vLLM, llama.cpp).

Every classification produces a structured report with the classified failure mode, a confidence score, the first turn where the failure was detected, a root cause summary, and a list of actionable fixes.
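The two-layer flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's orchestration code: the Finding dataclass, the classify function, the judge callback, and the 0.6 confidence threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Sequence

Trace = Dict[str, Any]


@dataclass
class Finding:
    """One verdict with the report fields described above (hypothetical shape)."""
    mode: str
    confidence: float
    first_failure_turn: int
    summary: str
    fixes: List[str] = field(default_factory=list)


Detector = Callable[[Trace], Optional[Finding]]
Judge = Callable[[Trace, Sequence[Finding]], Finding]


def classify(
    trace: Trace,
    detectors: Sequence[Detector],
    judge: Optional[Judge] = None,
    threshold: float = 0.6,  # assumed cut-off for "high confidence"
) -> Optional[Finding]:
    """Run every rule-based detector, then defer to the judge only when the rules are unsure."""
    findings = [f for f in (d(trace) for d in detectors) if f is not None]
    findings.sort(key=lambda f: f.confidence, reverse=True)
    best = findings[0] if findings else None
    tied = len(findings) > 1 and findings[0].confidence == findings[1].confidence
    unresolved = best is None or best.confidence < threshold or tied
    if unresolved and judge is not None:
        return judge(trace, findings)
    return best
```

Passing judge=None corresponds to the offline, rule-based-only mode: the best rule-based finding is returned even when it is a tie or below the threshold.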

Getting Started

Install

git clone https://github.com/dakshjain-1616/agent-failure-classifier
cd agent-failure-classifier
python3 -m venv venv
source venv/bin/activate
pip install -e .

Requires Python 3.8+. The only dependencies are pydantic, rich, click, and requests. The rule-based layer runs with no additional setup.

LLM Judge Setup (Optional)
To enable the LLM-judge pass, copy .env.example to .env and set your OpenRouter key:

cp .env.example .env
# edit .env and set OPENROUTER_API_KEY=sk-or-...

Without any key, pass --no-llm to every classify or batch call. The rule-based layer alone classifies all eight failure modes.

CLI

The CLI is exposed as both a console script (agent-failure-classifier) and an importable module (python -m agent_failure_classifier.cli).

Classify a single trace
The core command takes a trace JSON file and returns a structured report. --no-llm keeps the run offline: rule-based only, with no API call.

agent-failure-classifier classify --trace traces/hallucination_example.json --no-llm

Key flags used above: --trace takes the path to the trace JSON file, and --no-llm skips the LLM-judge pass.

Validate a trace
Before classifying, validate parses the trace and prints its structure: trace ID, goal, turn count, and a preview of each turn. Useful for confirming the trace loaded correctly before running classification.

agent-failure-classifier validate --trace traces/hallucination_example.json

Batch classification
batch runs classification over every *.json file in a directory and produces a failure-mode distribution table plus a per-trace summary.

agent-failure-classifier batch --traces-dir ./traces/ --no-llm
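
The failure-mode distribution that batch reports can be approximated in a few lines of standard-library Python. This is a sketch, not the CLI's implementation: failure_distribution is a hypothetical helper, and the classify argument stands in for whatever function maps a loaded trace dict to a failure-mode name.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Callable, Dict


def failure_distribution(traces_dir: str, classify: Callable[[dict], str]) -> Dict[str, float]:
    """Map each failure mode to its share of the *.json traces found in traces_dir."""
    counts: Counter = Counter()
    for path in sorted(Path(traces_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            counts[classify(json.load(fh))] += 1
    total = sum(counts.values())
    # An empty directory yields an empty distribution rather than a division by zero.
    return {mode: n / total for mode, n in counts.items()} if total else {}
```

Keeping the classifier injectable means the same helper works with the rule-based layer alone or with any wrapper around it.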

Worked Examples

Example 1 - Hallucination
The trace has a user asking for WWII death statistics. The agent responds directly with a factual claim, with no tool call and no retrieval step.

{
  "trace_id": "hallucination-001",
  "original_goal": "Get population statistics",
  "final_result": "70 million people died in WWII.",
  "is_successful": false,
  "turns": [
    {"turn_number": 0, "role": "user",  "content": "How many people died in WWII?"},
    {"turn_number": 1, "role": "agent", "content": "70 million people died in WWII."}
  ]
}

Classification: HALLUCINATION, confidence 75%, first failure at turn 1. The detector flags that the agent asserted a factual claim without invoking any retrieval tool. Recommended fixes include adding a fact-checking step, requiring tool verification for factual claims, and implementing retrieval-augmented generation.
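
A rule of this kind might look like the sketch below. It is an illustration only: first_unsupported_claim is a hypothetical name, and treating "contains a digit" as a proxy for "factual claim" is a deliberately crude assumption, not the project's actual heuristic.

```python
import re
from typing import Any, Dict, List, Optional

# Assumption: any agent message containing a digit counts as a factual claim.
_NUMERIC = re.compile(r"\d")


def first_unsupported_claim(turns: List[Dict[str, Any]]) -> Optional[int]:
    """Return the first agent turn that makes a numeric claim before any tool turn has run."""
    tool_seen = False
    for turn in turns:
        role = turn.get("role")
        if role == "tool":
            tool_seen = True
        elif role == "agent" and not tool_seen and _NUMERIC.search(turn.get("content", "")):
            return turn["turn_number"]
    return None


example_turns = [
    {"turn_number": 0, "role": "user", "content": "How many people died in WWII?"},
    {"turn_number": 1, "role": "agent", "content": "70 million people died in WWII."},
]
print(first_unsupported_claim(example_turns))  # prints 1
```

On the example trace the helper points at turn 1, the same first-failure turn reported above.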

Example 2 - Circular Reasoning
The trace contains four turns alternating between "Let me analyze this step by step." and "I need more information." The agent makes no progress across the entire trace.
Classification: CIRCULAR_REASONING, confidence 80%. The rule-based detector identifies a 2-step cycle repeating across agent turns and recommends a maximum-iteration limit plus state-change detection.
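
One simple way to spot such a loop is to test whether the sequence of agent messages is periodic. The sketch below assumes exact matching after whitespace and case normalisation. The detect_cycle name and its parameters are hypothetical, and the real detector may use a looser similarity measure.

```python
from typing import List, Optional


def detect_cycle(messages: List[str], max_period: int = 3, min_repeats: int = 2) -> Optional[int]:
    """Return the shortest period (1..max_period) with which the messages repeat, or None."""
    normalised = [" ".join(m.lower().split()) for m in messages]
    for period in range(1, max_period + 1):
        # Require at least min_repeats full passes through the cycle.
        if len(normalised) < period * min_repeats:
            continue
        if all(normalised[i] == normalised[i + period] for i in range(len(normalised) - period)):
            return period
    return None


agent_messages = [
    "Let me analyze this step by step.",
    "I need more information.",
    "Let me analyze this step by step.",
    "I need more information.",
]
print(detect_cycle(agent_messages))  # prints 2
```

A maximum-iteration limit, as recommended in the report, would stop the run once a check like this fires.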

Example 3 - Timeout Cascade
The trace contains a slow_api tool call with latency_ms: 6000, followed by a one-word agent response, "OK".
Classification: TIMEOUT_CASCADE, confidence 70%. The detector flags the latency breach and notes that the subsequent agent turn is a one-word response, less than half the length of the tool output, indicating the agent rushed through the remaining steps.
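
The latency-plus-short-reply rule can be sketched as below. The 5000 ms threshold is an assumption (the post only shows that 6000 ms triggers the detector), and first_timeout_cascade is a hypothetical helper. The 0.5 ratio follows the "less than half the length" wording above.

```python
from typing import Any, Dict, List, Optional


def first_timeout_cascade(
    turns: List[Dict[str, Any]],
    latency_threshold_ms: int = 5000,  # assumed threshold, not taken from the project
    length_ratio: float = 0.5,
) -> Optional[int]:
    """Return the agent turn that follows a slow tool call with a disproportionately short reply."""
    for current, following in zip(turns, turns[1:]):
        if current.get("role") != "tool" or following.get("role") != "agent":
            continue
        slow = current.get("latency_ms", 0) > latency_threshold_ms
        tool_len = len(current.get("tool_output") or "")
        reply_len = len(following.get("content") or "")
        if slow and tool_len > 0 and reply_len < length_ratio * tool_len:
            return following["turn_number"]
    return None
```

Comparing the reply length to the tool output, rather than to a fixed minimum, keeps short replies to short tool outputs from being flagged.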

Python API

Classify a trace programmatically

import json
from agent_failure_classifier.classifier import FailureClassifier
from agent_failure_classifier.models import AgentTrace

# Load the trace JSON and validate it against the AgentTrace model.
with open("traces/hallucination_example.json") as f:
    trace = AgentTrace(**json.load(f))
report = FailureClassifier(use_llm=False).classify(trace)

print(report.classified_failure_mode, report.confidence)
print(report.root_cause_summary)

Record a trace live with TraceRecorder
Rather than constructing trace JSON by hand, TraceRecorder is a context manager that captures an agent run as it executes and writes a trace file to disk on exit. The output is immediately compatible with the CLI and with FailureClassifier.

from agent_failure_classifier.recorder import TraceRecorder

with TraceRecorder(goal="Find Italian restaurants", output_dir="./traces") as r:
    r.add_turn(role="user", content="Find Italian restaurants near me")
    r.add_turn(role="agent", content="Searching...")
    r.add_turn(
        role="tool",
        content="results",
        tool_name="search",
        tool_input={"query": "italian restaurants"},
        tool_output='{"hits": ["Luigi Bistro", "Pasta Palace"]}',
        latency_ms=120,
    )
    r.add_turn(role="agent", content="I found Luigi Bistro and Pasta Palace.")
    r.set_final_result("Found Luigi Bistro and Pasta Palace", is_successful=True)

On exit the trace is saved to ./traces/trace_<id>_<timestamp>.json.

Parse traces from other frameworks
AutoParser auto-detects and normalises three input formats into the canonical AgentTrace model. No manual conversion needed regardless of where the trace came from.

from agent_failure_classifier.formats import AutoParser

trace = AutoParser().parse_file("path/to/trace.json")

The three supported formats are:

Native / generic: a dict with trace_id, original_goal, is_successful, and a turns list. This is the format emitted by TraceRecorder.
LangSmith run export: a dict with run_type, inputs, outputs, and optional child_runs. Tool child runs become TOOL turns; chain and LLM child runs become AGENT turns.
LangGraph state dict: a dict with thread_id and a state.messages list whose entries use type values human, ai, and tool. A minimal list-of-dicts ([{"role": "...", "content": "..."}, ...]) is also accepted by the generic parser.
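
Format auto-detection of this kind usually comes down to checking for each format's distinguishing keys. The sketch below is a guess at how that could work based on the descriptions above, not AutoParser's actual logic. detect_format, langgraph_role, and the role mapping are hypothetical.

```python
from typing import Any

# Assumed mapping from LangGraph message types to the roles used in native traces.
_LANGGRAPH_ROLES = {"human": "user", "ai": "agent", "tool": "tool"}


def detect_format(payload: Any) -> str:
    """Guess which of the three described formats a parsed JSON payload uses."""
    if isinstance(payload, list):
        # A bare list of {"role", "content"} dicts is handled by the native/generic parser.
        return "native"
    if not isinstance(payload, dict):
        raise ValueError("trace must be a JSON object or a list of messages")
    if "run_type" in payload:
        return "langsmith"
    state = payload.get("state")
    if "thread_id" in payload and isinstance(state, dict) and isinstance(state.get("messages"), list):
        return "langgraph"
    if "turns" in payload:
        return "native"
    raise ValueError("unrecognised trace format")


def langgraph_role(message_type: str) -> str:
    """Translate a LangGraph message type into a native role name."""
    return _LANGGRAPH_ROLES[message_type]
```

Checking the most specific keys first (run_type, then thread_id plus state.messages) keeps a payload that happens to carry a turns key from being misread as native.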

How I Built This Using NEO

This project was built using NEO, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.

The problem was defined at a high level: a tool that takes any agent trace, runs deterministic detectors over it, and classifies the failure into a named category with a structured report and actionable fixes. NEO generated the full implementation: the eight rule-based detectors, the FailureClassifier orchestration layer, the optional LLM-as-judge pass via OpenRouter, the TraceRecorder context manager, the AutoParser with support for native, LangSmith, and LangGraph formats, and the Click-based CLI with classify, validate, and batch commands.

How You Can Build Further With NEO

Use it as a CI/CD quality gate for your agent.
If you're shipping an LLM agent, you can integrate the classifier directly into your deployment pipeline. Record traces from your test suite with TraceRecorder, run batch classification on every pull request, and fail the build if a new failure mode appears or if the rate of a known one spikes. You get a systematic regression check on agent behaviour, not just on code.
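
A gate along these lines could compare the failure-mode rates of the current run with a stored baseline. The sketch below is one possible shape: gate_violations, the baseline dict, and the 10-percentage-point tolerance are assumptions, and the rates would come from whatever batch step your pipeline runs.

```python
from typing import Dict, List


def gate_violations(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_increase: float = 0.10,  # assumed tolerance, in absolute rate
) -> List[str]:
    """List the reasons the build should fail: new failure modes or rates that rose too far."""
    problems = []
    for mode, rate in sorted(current.items()):
        if mode not in baseline:
            problems.append(f"new failure mode: {mode}")
        elif rate - baseline[mode] > max_increase:
            problems.append(f"{mode} rate rose from {baseline[mode]:.0%} to {rate:.0%}")
    return problems


problems = gate_violations(
    baseline={"CONTEXT_LOSS": 0.20},
    current={"CONTEXT_LOSS": 0.40, "SCHEMA_ERROR": 0.05},
)
print(problems)
# ['CONTEXT_LOSS rate rose from 20% to 40%', 'new failure mode: SCHEMA_ERROR']
```

In a CI job, exiting with a non-zero status whenever the returned list is non-empty is enough to fail the build.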

Use it to understand where your agent breaks most.
Run batch classification across a directory of historical traces and look at the failure mode distribution. If CONTEXT_LOSS shows up in 40% of your traces, that's a signal about your agent's memory design, not a one-off bug. This turns debugging from reactive to diagnostic: you're looking at patterns across runs, not reading individual traces one by one.

Use it as a live monitoring layer in a multi-agent system.
The classifier runs as an A2A agent, which means it can sit as a node in a multi-agent pipeline. Any agent in the system can send its trace to the classifier after each run and get a structured failure report back. An orchestrator can use that signal to decide whether to retry, reroute, or escalate without any human in the loop.

Use it during agent development to catch regressions early.
Wrap TraceRecorder around your agent during development. Every run produces a trace. Feed those traces into the classifier after each session and you'll know immediately if a change introduced a new failure mode. It's the difference between finding out something broke in production versus finding out in your local environment.

Final Notes

Agent Failure Classifier turns trace debugging from a manual read-and-guess process into a systematic one. It combines eight named failure modes, a deterministic rule-based layer that runs offline, an optional LLM judge for ambiguous cases, and support for native, LangSmith, and LangGraph traces. Every run produces a structured report with the first failure turn and actionable fixes.

The code is at https://github.com/dakshjain-1616/agent-failure-classifier
You can also build with NEO in your IDE using the VS Code extension or Cursor.