Tuomo Nikulainen

Why Heuristic Detectors Beat LLMs at Finding Agent Failures

TL;DR: We built 20 core rule-based detectors that find failures in AI agent traces. On the TRAIL benchmark (Patronus AI), they achieve 60.1% accuracy vs. 11.9% for the best LLM. Zero false positives. Zero LLM cost. On Who&When (ICML 2025), combined with a single Sonnet call for attribution, they match GPT-5.4 Mini on agent identification (60.3%) and beat it on step localization (24.1% vs. 22.4%).

```bash
pip install pisama
```

The assumption everyone makes

When an AI agent fails in production (it hallucinates, gets stuck in a loop, ignores instructions, drops context), the standard approach is to throw another LLM at the problem. LLM-as-judge. Agent-as-judge. Feed the trace to GPT-4 and ask "what went wrong?"

We tested this assumption. The answer is surprising: for most agent failures, simple heuristics work better.

The benchmarks

TRAIL: Trace-level failure detection

Patronus AI's TRAIL benchmark contains 148 real agent execution traces with 841 human-labeled errors across 21 failure categories. It's the hardest agent failure detection benchmark available. The best frontier model (GPT-5.4) finds only 11.9% of failures. Claude Sonnet 4.6 finds 6.9%.

We ran Pisama's 20 core heuristic detectors on TRAIL:

| Method | Joint Accuracy | Precision | Cost | Latency |
| --- | --- | --- | --- | --- |
| GPT-5.4 | 11.9% | -- | $$$ | ~seconds |
| Gemini 3.1 Pro | 6.8% | -- | $$$ | ~seconds |
| Claude Sonnet 4.6 | 6.9% | -- | $$$ | ~seconds |
| Pisama (heuristic) | 60.1% | 100% | $0 | 21s total |

That's 60.1% joint accuracy with 100% precision across 481 detections on TRAIL: zero false positives and roughly 5x the SOTA joint accuracy, though roughly 40% of failures are missed by heuristics alone (the tiered pipeline escalates to LLM judges for better coverage). One caveat: on our internal calibration across 8,051 entries from external datasets, mean precision across 57 calibrated detectors is 0.81, so not every detector hits 100% precision outside the TRAIL dataset.

The per-category breakdown shows where heuristics dominate:

| Category | Pisama F1 | TRAIL SOTA |
| --- | --- | --- |
| Context Handling | 0.978 | 0.00 |
| Specification | 1.000 | N/A |
| Loop / Resource Abuse | 1.000 | ~0.30 |
| Tool Selection | 1.000 | ~0.57 |
| Hallucination (language) | 0.884 | 0.59 |
| Goal Deviation | 0.829 | 0.70 |

Context handling and task orchestration (categories where LLMs score literally 0.00) are where heuristic detectors excel.

Who&When: Multi-agent failure attribution

Who&When (ICML 2025 Spotlight) tests a harder question: in a multi-agent conversation that failed, which agent caused the failure and at which step?

Heuristic detectors alone can find when the failure happened (step accuracy: 16.8%, in the same range as GPT-5.4 Mini's 22.4%) but struggle with who is to blame (agent accuracy: 31.0% vs. GPT-5.4 Mini's 60.3%). Blame attribution requires reading comprehension: understanding that "WebSurfer clicked the wrong link" is different from "Orchestrator planned poorly."

But here's the key: you don't need to choose between heuristics and LLMs. You can tier them. Run heuristics first (free, fast), then use a single LLM call only for attribution:

| Method | Agent Accuracy | Step Accuracy |
| --- | --- | --- |
| Pisama heuristic-only | 31.0% | 16.8% |
| Pisama + Haiku 4.5 | 39.7% | 15.5% |
| Pisama + Sonnet 4 | 60.3% | 24.1% |
| GPT-5.4 Mini | 60.3% | 22.4% |
| Gemini 3.1 Flash-Lite | 50.0% | 19.0% |

Sonnet 4 at the attribution tier beats every baseline in the paper.

Why heuristics win at detection

Agent failures have structural signatures that don't require semantic understanding:

Loops are repeated state. A hash comparison catches them instantly. No need to "understand" that the agent is stuck. Pisama's loop detector counts consecutive tool repetitions and cyclic patterns. F1: 1.000 on TRAIL.
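
A minimal sketch of the idea (not Pisama's internals; the `tool`/`args` trace fields and the repeat threshold are illustrative assumptions):

```python
import hashlib

def detect_loop(steps, max_repeats=3):
    """Flag a run of identical tool calls by hashing (tool, args)."""
    def fingerprint(step):
        raw = step["tool"] + ":" + repr(sorted(step.get("args", {}).items()))
        return hashlib.sha256(raw.encode()).hexdigest()

    prev, run = None, 0
    for step in steps:
        h = fingerprint(step)
        run = run + 1 if h == prev else 1
        if run >= max_repeats:
            return True  # identical state repeated: likely stuck
        prev = h
    return False
```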

Context neglect is measurable overlap. If the input mentions specific dates, numbers, and names, and the output references none of them, the context was ignored. Pisama's context detector extracts weighted elements (numbers, dates, proper nouns, URLs) and measures utilization. F1: 0.978 on TRAIL.
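
A hedged sketch of that measurement (the regexes and weights are illustrative, not Pisama's calibrated ones):

```python
import re

def context_utilization(prompt, output):
    """Weighted fraction of high-signal input elements reused in the output."""
    patterns = [
        (r"https?://\S+", 2.0),                      # URLs
        (r"\b\d{4}-\d{2}-\d{2}\b", 2.0),             # ISO dates
        (r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", 1.5),  # multi-word proper nouns
        (r"\b\d[\d,.]*\b", 1.0),                     # numbers
    ]
    covered = total = 0.0
    for regex, weight in patterns:
        for element in set(re.findall(regex, prompt)):
            total += weight
            covered += weight if element in output else 0.0
    return covered / total if total else 1.0  # nothing concrete to neglect
```

A score near zero on a prompt full of dates, numbers, and names is the "context ignored" signature.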

Hallucination correlates with tool failure. When an agent claims it searched the web but the search tool returned an error, that's a fabricated result. Pisama's hallucination detector checks tool call success rates and source-output overlap. F1: 0.884 on TRAIL.
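
In sketch form (the `type`/`status` field names are assumptions about the trace schema; the real detector also measures source-output overlap):

```python
def tool_failure_rate(steps):
    """Fraction of tool calls that errored; a high rate plus a confident
    final answer is a fabricated-result signal."""
    calls = [s for s in steps if s.get("type") == "tool_call"]
    if not calls:
        return 0.0
    return sum(s.get("status") == "error" for s in calls) / len(calls)
```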

Specification mismatch is requirement coverage. If the user asked for "a REST API with JWT authentication and PostgreSQL" and the output describes an HTML contact form, keyword coverage is low. Pisama's specification detector extracts requirements and measures coverage with synonym and stem matching. F1: 1.000 on TRAIL.
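
A crude version of that coverage check (prefix matching stands in for real stemming; the stopword list is illustrative):

```python
import re

STOPWORDS = {"a", "an", "and", "the", "with", "for", "of", "to", "in"}

def spec_coverage(request, output):
    """Share of content words from the request that the output covers."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", request)]
    terms = {w for w in words if w not in STOPWORDS and len(w) > 2}
    out = output.lower()
    hits = sum(1 for t in terms if t[:5] in out)  # 5-char prefix ~ stem
    return hits / len(terms) if terms else 1.0
```

For "a REST API with JWT authentication and PostgreSQL" against an HTML contact form, coverage is 0.0: a clear specification mismatch.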

The pattern: agent failures leave measurable traces. LLMs try to reason about whether something went wrong. Heuristics directly measure the signatures of failure. When the signal is structural, a purpose-built pattern matcher extracts it more reliably than a general-purpose language model.

This echoes Gigerenzer's research on decision-making: in uncertain environments, simple rules that focus on the most diagnostic cue often outperform complex models that try to weight all available information. Agent failure detection is exactly this kind of problem: high-dimensional traces where a single diagnostic signal (state repetition, element coverage, tool success rate) carries most of the information.

Where LLMs are still needed

Heuristics can't do everything. Two things require semantic reasoning:

  1. Blame attribution in multi-agent systems. "WebSurfer clicked an irrelevant link" vs. "Orchestrator gave unclear instructions". Determining which agent caused a cascade requires understanding the causal chain. This is where Pisama's LLM judge tier ($0.02/case with Sonnet 4) adds value.

  2. Novel failure modes. Heuristic detectors match known patterns. A completely new type of failure that doesn't match any of the 20 core detectors will be missed. The LLM judge serves as a catch-all for out-of-distribution failures.

The right architecture isn't heuristics or LLMs. It's heuristics then LLMs. Cheap, fast pattern matching for 90%+ of detections, with LLM escalation for the cases that need semantic reasoning.
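
In code, the tiering is just sequencing (a sketch, not Pisama's actual API; the `needs_attribution` flag and the judge interface are illustrative):

```python
def analyze_tiered(trace, detectors, llm_judge):
    # Tier 1: free, fast heuristic detectors.
    issues = [hit for detect in detectors if (hit := detect(trace))]
    # Tier 2: a single paid LLM call, only for blame attribution or when
    # no heuristic fired (the out-of-distribution catch-all).
    if not issues or any(i.get("needs_attribution") for i in issues):
        issues.extend(llm_judge(trace))
    return issues
```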

Try it

```bash
pip install pisama
```

```python
from pisama import analyze

# Run every built-in detector over a saved trace file
result = analyze("trace.json")

for issue in result.issues:
    print(f"[{issue.type}] {issue.summary}")
    print(f"  Severity: {issue.severity}/100")
    print(f"  Fix: {issue.recommendation}")
```

CLI:

```bash
pisama analyze trace.json        # analyze a saved trace
pisama watch python my_agent.py  # run an agent under live monitoring
pisama detectors                 # list the available detectors
```

MCP server (Cursor / Claude Desktop):

```json
{
  "mcpServers": {
    "pisama": { "command": "pisama", "args": ["mcp-server"] }
  }
}
```

Source: github.com/tn-pisama/pisama

PyPI: pypi.org/project/pisama


What failure modes are you seeing in your agent systems? We'd love to hear what detectors we should add. Open an issue or reach out at team@pisama.ai.
