You ran an eval. The dashboard says 80% accuracy. Now what?
For most teams, the answer is surprisingly manual. Someone exports failures, copies a few examples into a document, writes some notes, maybe creates a ticket or two, and then moves on. By the next eval run, those notes are already stale. The failures have changed, new ones have appeared, and nobody remembers whether a particular issue is actually new or something that has been showing up for weeks.
The bottleneck is not running the eval. It is closing the feedback loop.
Without a structured path from failure → diagnosis → prompt improvement, evals become scoreboards rather than engineering tools.
Recently I ran into exactly this problem while working on a local MLX-based classifier that maps developer work sessions to Jira tickets.
The classifier is evaluated against a golden dataset of 40 hand-authored developer sessions. Each session targets a specific failure mode: hard decoys, overhead work, untracked activity, ambiguous evidence, and other edge cases that show up in real engineering environments.
After a few iterations, I had run the eval three times. I also had 62 failures.
What I did not have was a reliable way to answer basic questions:
- Which failures keep showing up?
- Which prompt changes helped?
- Are the failures random, or manifestations of the same underlying issue?
- What is the highest-leverage thing to fix next?
The traditional approach — maintaining notes manually — breaks down almost immediately. Notes become outdated after the second run. There is no consistent structure. Nobody tracks recurrence. And reviewing dozens of failures turns into a forensic exercise every time.
The key insight
The eval was already producing everything I needed. Every failure had structured evidence: a trace, a span, the model's prediction, the expected answer, the classifier's reasoning. The problem was not interpretation. The problem was extraction and organisation.
That is where I started using a Claude Code skill. A Claude Code skill is essentially a markdown file containing a repeatable workflow: some frontmatter, a procedure, and a set of allowed tools. In my case, the skill is invoked manually with:
/eval-feedback
It is not an autonomous agent and it does not run continuously. It is simply a repeatable post-eval analysis workflow — a perfect fit for evaluating classifier failures.
The data structure
The most important design decision was not the skill itself. It was the data structure it writes to: a machine-maintained file called FEEDBACK.json.
Machine-maintained is the important part. Humans are terrible at keeping failure logs up to date. Structured JSON does not have that problem. It can be queried, diffed, aggregated, and analysed across runs without anyone manually curating it.
The file contains three top-level arrays:
{
  "runs": [],
  "observations": [],
  "failure_classes": []
}
runs stores evaluation-level metadata and metrics. observations stores individual failure evidence. failure_classes stores named patterns that persist across multiple runs.
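To make the relationship between the three arrays concrete, here is a minimal sketch in Python. Only the three top-level keys and failure_class_id come from the actual file; the "id" fields and the helper names are assumptions made for illustration, not the real schema.

```python
def empty_feedback() -> dict:
    """Return the empty skeleton with the three top-level arrays."""
    return {"runs": [], "observations": [], "failure_classes": []}


def dangling_class_refs(doc: dict) -> list:
    """List the ids of observations whose failure_class_id points at no known class.

    Assumes (hypothetically) that observations and failure classes each carry an "id".
    """
    known = {fc["id"] for fc in doc["failure_classes"]}
    return [
        obs["id"]
        for obs in doc["observations"]
        if obs.get("failure_class_id") not in known
    ]
```

A check like this is cheap to run after every write, and it catches the one thing that would quietly break cross-run tracking: an observation that points at a class that does not exist.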
Three design decisions
First: a lean-read pattern. Instead of loading the entire file into context every time, the skill pulls only targeted slices using jq: recent runs, open failure classes, matching observations, summary statistics. This lets the file grow indefinitely without consuming large amounts of context. Based on current usage, the file grows by roughly 20 KB per run.
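The skill does this with jq, but the same idea can be sketched in Python: read the file, then hand back only a small summary instead of the whole document. The "status" and "id" fields and both function names below are assumptions for illustration, not the skill's actual queries.

```python
import json
from pathlib import Path


def lean_slice(doc: dict, last_n_runs: int = 2) -> dict:
    """Reduce the full document to the few slices worth putting into context."""
    # "status" is a hypothetical field marking a failure class as still open.
    open_classes = [fc for fc in doc["failure_classes"] if fc.get("status") == "open"]
    open_ids = {fc["id"] for fc in open_classes}
    return {
        "recent_runs": doc["runs"][-last_n_runs:],
        "open_classes": open_classes,
        "open_observation_count": sum(
            1 for obs in doc["observations"] if obs.get("failure_class_id") in open_ids
        ),
        "total_observations": len(doc["observations"]),
    }


def load_slice(path: str, last_n_runs: int = 2) -> dict:
    """Read the JSON file from disk and return only the lean slice."""
    doc = json.loads(Path(path).read_text(encoding="utf-8"))
    return lean_slice(doc, last_n_runs)
```

The full file is still parsed in memory here; what stays small is the part that ends up in the model's context.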
Second: updates go through a small Python append workflow (load → mutate → write). The skill never edits the JSON directly.
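A minimal sketch of what a load → mutate → write helper could look like. The function name and the callback style are assumptions, and the temp-file-plus-rename step is an extra safety measure added here, not something the original workflow is known to do.

```python
import json
import os
import tempfile
from pathlib import Path


def update_feedback(path, mutate) -> dict:
    """Load the file (or start empty), apply mutate(doc) in place, write it back."""
    target = Path(path)
    if target.exists():
        doc = json.loads(target.read_text(encoding="utf-8"))
    else:
        doc = {"runs": [], "observations": [], "failure_classes": []}

    mutate(doc)

    # Write to a temp file in the same directory, then swap it into place,
    # so an interrupted write cannot leave a half-written JSON file behind.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(doc, fh, indent=2)
        fh.write("\n")
    os.replace(tmp_name, target)
    return doc
```

Usage is a one-liner per change, for example `update_feedback("FEEDBACK.json", lambda doc: doc["runs"].append(new_run))`, where new_run is whatever run record the eval produced.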
Third, and most valuable: every observation contains a failure_class_id. That single field links individual failures to a persistent failure pattern. When the same pattern appears again, its occurrence count increases automatically. Recurring problems rise to the top without any manual prioritisation.
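The linking step can be sketched as follows. Apart from failure_class_id, every field name here ("id", "run_id", "occurrences", "run_ids") and both function names are hypothetical; the point is only to show how one field lets a counter do the prioritisation.

```python
def record_observation(doc: dict, observation: dict) -> None:
    """Append an observation and bump the occurrence count of its failure class."""
    class_id = observation["failure_class_id"]
    by_id = {fc["id"]: fc for fc in doc["failure_classes"]}

    failure_class = by_id.get(class_id)
    if failure_class is None:
        # First sighting of this pattern: create the class entry.
        failure_class = {"id": class_id, "occurrences": 0, "run_ids": []}
        doc["failure_classes"].append(failure_class)

    failure_class["occurrences"] += 1
    if observation["run_id"] not in failure_class["run_ids"]:
        failure_class["run_ids"].append(observation["run_id"])
    doc["observations"].append(observation)


def ranked_classes(doc: dict) -> list:
    """Return failure classes with the most frequent pattern first."""
    return sorted(doc["failure_classes"], key=lambda fc: fc["occurrences"], reverse=True)
```

Sorting by the counter is all the "prioritisation" there is: a class that keeps recurring ends up first in the list without anyone triaging it.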
What the taxonomy revealed
After three runs, the system identified 10 named failure classes across 62 observations. One class dominated everything else. I called it optimism-bias. It appeared 27 times across all three runs.
The pattern was consistent: whenever the classifier encountered any adjacent signal — a mention in a document, a related article, a matching keyword, or a topically similar file path — it tended to classify the session as belonging to the target task with high confidence.
Even more interesting was what happened during a context experiment. I removed a 2,500-character OCR truncation limit, expecting additional context to improve accuracy. The opposite happened. Performance got worse. The classifier became more confident in incorrect predictions because the additional context provided more opportunities to find loosely related evidence.
Without the structured cross-run view, I probably would have concluded the model needed more data. Instead, the evidence pointed somewhere else entirely: the issue was not data volume. It was prompt design.
The broader lesson
AI systems are often very good at analysing the failures of other AI systems. The classifier's reasoning output turned out to be the richest signal in the entire pipeline. It exposed exactly which evidence the model was over-weighting and why a prediction seemed reasonable from the model's perspective.
Reading the reasoning traces and clustering them into recurring failure modes is precisely the kind of task an LLM excels at. Once those patterns are captured in a structured format, the same system can generate prompt changes targeted at specific failure classes.
How to replicate this
You do not need much:
- An eval that emits structured traces (OpenTelemetry or similar)
- A golden dataset with expected outputs
- A place to persist observations across runs
The skill itself is only a couple hundred lines of markdown describing the workflow, schema, and guardrails. The underlying idea generalises well beyond classifiers. Any system where you are running repeated experiments can benefit from a persistent failure taxonomy that accumulates evidence over time.
Because the loop does not close when you hit 95% accuracy.
The loop closes when the failure taxonomy starts driving the next prompt revision.
Key findings
- 10 named failure classes across 62 observations
- optimism-bias accounted for 27 of them
- More context made accuracy worse, not better
- The fix was prompt design, not more data