More Context Made My Classifier Worse: Building a Machine-Maintained Failure Taxonomy

Akarsh hegde — Thu, 02 Jul 2026 13:16:40 +0000

You ran an eval. The dashboard says 80% accuracy. Now what?

For most teams, the answer is surprisingly manual. Someone exports failures, copies a few examples into a document, writes some notes, maybe creates a ticket or two, and then moves on. By the next eval run, those notes are already stale. The failures have changed, new ones have appeared, and nobody remembers whether a particular issue is actually new or something that has been showing up for weeks.

The bottleneck is not running the eval. It is closing the feedback loop.

Without a structured path from failure → diagnosis → prompt improvement, evals become scoreboards rather than engineering tools.

Recently I ran into exactly this problem while working on a local MLX-based classifier that maps developer work sessions to Jira tickets.

The classifier is evaluated against a golden dataset of 40 hand-authored developer sessions. Each session targets a specific failure mode: hard decoys, overhead work, untracked activity, ambiguous evidence, and other edge cases that show up in real engineering environments.

After a few iterations, I had run the eval three times. I also had 62 failures.

What I did not have was a reliable way to answer basic questions:

Which failures keep showing up?
Which prompt changes helped?
Are the failures random, or manifestations of the same underlying issue?
What is the highest-leverage thing to fix next?

The traditional approach — maintaining notes manually — breaks down almost immediately. Notes become outdated after the second run. There is no consistent structure. Nobody tracks recurrence. And reviewing dozens of failures turns into a forensic exercise every time.

The key insight

The eval was already producing everything I needed. Every failure had structured evidence: a trace, a span, the model's prediction, the expected answer, the classifier's reasoning. The problem was not interpretation. The problem was extraction and organisation.

That is where I started using a Claude Code skill. A Claude Code skill is essentially a markdown file containing a repeatable workflow: some frontmatter, a procedure, and a set of allowed tools. In my case, the skill is invoked manually with:

/eval-feedback

It is not an autonomous agent and it does not run continuously. It is simply a repeatable post-eval analysis workflow — a perfect fit for evaluating classifier failures.

The data structure

The most important design decision was not the skill itself. It was the data structure it writes to: a machine-maintained file called FEEDBACK.json.

Machine-maintained is the important part. Humans are terrible at keeping failure logs up to date. A structured JSON file that the skill updates does not have that problem. It can be queried, diffed, aggregated, and analysed across runs without anyone manually curating it.

The file contains three top-level arrays:

{
  "runs": [],
  "observations": [],
  "failure_classes": []
}

runs stores evaluation-level metadata and metrics. observations stores individual failure evidence. failure_classes stores named patterns that persist across multiple runs.
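To make the three arrays concrete, here is a minimal Python sketch of what one record in each might look like. Only the three top-level keys and failure_class_id come from the post; every other field name, the IDs, and the sample values are assumptions chosen for illustration, not the actual FEEDBACK.json schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Run:
    # Evaluation-level metadata and metrics (field names are hypothetical).
    run_id: str
    metrics: dict


@dataclass
class Observation:
    # One failure, with the evidence the eval already emits.
    run_id: str
    trace_id: str
    span_id: str
    predicted: str
    expected: str
    reasoning: str
    failure_class_id: str  # link to a persistent failure class


@dataclass
class FailureClass:
    # A named pattern that persists across runs.
    id: str
    name: str
    status: str = "open"
    occurrences: int = 0


# Illustrative values only; they are not results from the eval.
feedback = {"runs": [], "observations": [], "failure_classes": []}
feedback["runs"].append(asdict(Run("run-001", {"accuracy": 0.5})))
feedback["failure_classes"].append(
    asdict(FailureClass("optimism-bias", "optimism-bias", occurrences=1))
)
feedback["observations"].append(
    asdict(
        Observation(
            run_id="run-001",
            trace_id="trace-abc",
            span_id="span-1",
            predicted="PROJ-1",
            expected="PROJ-2",
            reasoning="File path mentions the target ticket's component.",
            failure_class_id="optimism-bias",
        )
    )
)

print(json.dumps(feedback, indent=2))
```

Keeping each record flat makes it easy to filter by run_id or failure_class_id later.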

Three design decisions

First: a lean-read pattern. Instead of loading the entire file into context every time, the skill pulls only targeted slices using jq: recent runs, open failure classes, matching observations, summary statistics. This lets the file grow indefinitely without consuming large amounts of context. Based on current usage, the file grows by roughly 20 KB per run.
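The skill does this with jq; the sketch below is a Python stand-in for the same idea, not the skill's actual queries. The helper names, the status field, and the sample data are all assumptions. The point is that only the small context_slice at the end would be handed to the model, never the whole file.

```python
from collections import Counter

# Hypothetical sample shaped like FEEDBACK.json; every field name inside the
# three arrays is an assumption made for this sketch.
feedback = {
    "runs": [{"run_id": f"run-{i:03d}"} for i in range(1, 6)],
    "failure_classes": [
        {"id": "optimism-bias", "status": "open"},
        {"id": "example-resolved-class", "status": "resolved"},
    ],
    "observations": [
        {"run_id": "run-004", "failure_class_id": "optimism-bias"},
        {"run_id": "run-005", "failure_class_id": "optimism-bias"},
        {"run_id": "run-005", "failure_class_id": "example-resolved-class"},
    ],
}


def recent_runs(data, n=3):
    # Only the last n runs, oldest first.
    return data["runs"][-n:]


def open_failure_classes(data):
    # Classes still marked open (the status field is assumed).
    return [fc for fc in data["failure_classes"] if fc.get("status") == "open"]


def observations_for(data, class_id, limit=5):
    # The most recent observations linked to one failure class.
    matches = [o for o in data["observations"] if o.get("failure_class_id") == class_id]
    return matches[-limit:]


def summary(data):
    # Counts only; no raw evidence is included.
    counts = Counter(o["failure_class_id"] for o in data["observations"])
    return {
        "runs": len(data["runs"]),
        "observations": len(data["observations"]),
        "by_class": dict(counts),
    }


context_slice = {
    "recent_runs": recent_runs(feedback),
    "open_classes": open_failure_classes(feedback),
    "summary": summary(feedback),
}
print(context_slice)
```

In jq, the first helper would correspond to something like `.runs[-3:]`.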

Second: updates go through a small Python append workflow (load → mutate → write). The skill never edits the JSON file directly.
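A load → mutate → write helper could look roughly like the sketch below. The function name update_feedback and the record fields are hypothetical. Writing to a temporary file and swapping it in with os.replace is my own addition, so an interrupted write cannot leave a half-written file behind; the post does not say how the real workflow writes the file.

```python
import json
import os
import tempfile
from pathlib import Path

EMPTY = {"runs": [], "observations": [], "failure_classes": []}


def update_feedback(path, mutate):
    """Load the file, apply mutate(data) in memory, then write it back."""
    path = Path(path)
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
    else:
        # First run: start from the three empty top-level arrays.
        data = json.loads(json.dumps(EMPTY))

    mutate(data)

    # Write to a temp file in the same directory, then replace the original
    # in one step (an assumption of this sketch, not the author's method).
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)
        fh.write("\n")
    os.replace(tmp_name, path)
    return data


# Usage: two appends against a throwaway file.
with tempfile.TemporaryDirectory() as tmpdir:
    target = Path(tmpdir) / "FEEDBACK.json"
    update_feedback(target, lambda d: d["runs"].append({"run_id": "run-001"}))
    update_feedback(target, lambda d: d["runs"].append({"run_id": "run-002"}))
    stored = json.loads(target.read_text(encoding="utf-8"))

run_ids = [r["run_id"] for r in stored["runs"]]
print(run_ids)
```

Because every change passes through one function, the file stays valid JSON after each update.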

Third, and most valuable: every observation contains a failure_class_id. That single field links individual failures to a persistent failure pattern. When the same pattern appears again, its occurrence count increases automatically. Recurring problems rise to the top without any manual prioritisation.
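The linking can be sketched as follows. Only failure_class_id comes from the post; record_observation, ranked_classes, the occurrences and run_ids fields, and the second class name are hypothetical, and the sample observations are made up.

```python
def record_observation(data, observation, class_name):
    """Append an observation and bump the count on its failure class."""
    class_id = observation["failure_class_id"]
    fc = next((c for c in data["failure_classes"] if c["id"] == class_id), None)
    if fc is None:
        # First sighting of this pattern: create the class.
        fc = {"id": class_id, "name": class_name, "status": "open",
              "occurrences": 0, "run_ids": []}
        data["failure_classes"].append(fc)

    fc["occurrences"] += 1
    if observation["run_id"] not in fc["run_ids"]:
        fc["run_ids"].append(observation["run_id"])
    data["observations"].append(observation)
    return fc


def ranked_classes(data):
    # Most frequent pattern first, so recurring problems surface on their own.
    return sorted(data["failure_classes"], key=lambda c: c["occurrences"], reverse=True)


# Made-up observations for illustration.
feedback = {"runs": [], "observations": [], "failure_classes": []}
record_observation(
    feedback, {"run_id": "run-001", "failure_class_id": "example-other-class"},
    "example-other-class")
record_observation(
    feedback, {"run_id": "run-001", "failure_class_id": "optimism-bias"},
    "optimism-bias")
record_observation(
    feedback, {"run_id": "run-002", "failure_class_id": "optimism-bias"},
    "optimism-bias")

top = ranked_classes(feedback)[0]
print(top["id"], top["occurrences"], top["run_ids"])
```

Tracking run_ids alongside the count separates a pattern that recurs across runs from one that spiked in a single run.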

What the taxonomy revealed

After three runs, the system identified 10 named failure classes across 62 observations. One class dominated everything else. I called it optimism-bias. It appeared 27 times across all three runs, about 44% of all observations.

The pattern was consistent: whenever the classifier encountered any adjacent signal — a mention in a document, a related article, a matching keyword, or a topically similar file path — it tended to classify the session as belonging to the target task with high confidence.

Even more interesting was what happened during a context experiment. I removed a 2,500-character OCR truncation limit, expecting additional context to improve accuracy. The opposite happened. Performance got worse. The classifier became more confident in incorrect predictions because the additional context provided more opportunities to find loosely related evidence.

Without the structured cross-run view, I probably would have concluded the model needed more data. Instead, the evidence pointed somewhere else entirely: the issue was not data volume. It was prompt design.

The broader lesson

AI systems are often very good at analysing the failures of other AI systems. The classifier's reasoning output turned out to be the richest signal in the entire pipeline. It exposed exactly which evidence the model was over-weighting and why a prediction seemed reasonable from the model's perspective.

Reading the reasoning traces and clustering them into recurring failure modes is precisely the kind of task an LLM excels at. Once those patterns are captured in a structured format, the same system can generate prompt changes targeted at specific failure classes.

How to replicate this

You do not need much:

An eval that emits structured traces (OpenTelemetry or similar)
A golden dataset with expected outputs
A place to persist observations across runs

The skill itself is only a couple of hundred lines of markdown describing the workflow, schema, and guardrails. The underlying idea generalises well beyond classifiers. Any system where you are running repeated experiments can benefit from a persistent failure taxonomy that accumulates evidence over time.

Because the loop does not close when you hit 95% accuracy.

The loop closes when the failure taxonomy starts driving the next prompt revision.

Key findings

10 named failure classes across 62 observations
optimism-bias accounted for 27 of them
More context made accuracy worse, not better
The fix was prompt design, not more data

Your velocity went up. Your visibility went down.

Akarsh hegde — Thu, 02 Jul 2026 11:55:23 +0000

AI tools made everyone faster. Coding agents, writing assistants, research tools, design copilots: across every domain, the same thing happened. People do more in less time. More tasks touched, more changes made, more ground covered per day.

That's the win everyone talks about. Here's the part nobody talks about: the faster you go, the less you can see about how you got there.

Velocity is throughput, not memory

When you did three things in a day, you could hold them in your head. What you tried, what worked, what the dead end was, why you went with this approach: it all fit in working memory and spilled naturally into a standup or a commit message.

Now you do fifteen things in a day, and a large share of the actual work was done by something else. The agent made a dozen micro-decisions. The AI tool chose an approach, you nodded, it moved on. By the evening you genuinely cannot reconstruct half of it, not because you weren't paying attention, but because there was too much, moving too fast, and a lot of it wasn't even your keystrokes.

The throughput went up. The trail did not. Those two things used to rise together; AI split them apart.

"More done" hides a question: done how?

Higher velocity blurs a distinction that used to be obvious: the difference between activity and progress. When work was slow and manual, you could feel which was which. At high speed, fifteen completed tasks all look the same on the board, whether they were solid or whether the agent took a shortcut you'd have rejected if you'd been watching closely.

The only way to tell them apart is to be able to look back at how each task was actually done, including what the AI did on your behalf. Not a vague memory. The actual sequence: what was attempted, what the tool decided, what you approved, what shipped. Without that, "I got a lot done today" is a feeling, not a fact you can check.

And it compounds. A month of high-velocity work with no trail is a month of outcomes you have to take on faith. The code runs, the tickets are closed, but ask how any specific thing came to be and the honest answer is a shrug.

The record has to keep pace with the work

The reflex answer is "write it down as you go." That never worked when work was slow, and it's hopeless now. You can't manually narrate fifteen fast, AI-assisted tasks in parallel; the documenting would cost more time than the AI saved. The whole velocity gain would go to bookkeeping.

So the capture has to be automatic, and it has to run at the same speed as the work:

Observe the sessions as they happen — the agent runs, the tool calls, the approvals — without the developer stopping to record anything.
Attribute each stretch of work to the task it belongs to, so the day's activity maps back onto real tickets instead of a blur.
Reconstruct what actually happened per task: what the AI did, what the human decided, what the outcome was.
Write it back where the record is supposed to live — the ticket, the worklog, the doc — kept current automatically instead of by hand.
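The four steps above can be sketched as a tiny pipeline. This is a toy illustration, not a description of any real tool. The Event fields, the ticket IDs, and the function names are hypothetical, and the attribution step is assumed to have already tagged each event with a ticket.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    ticket: str  # output of the attribution step (assumed already done)
    actor: str   # "ai" or "human"
    action: str  # what was observed


def reconstruct(events):
    """Group observed events per ticket, split by who acted."""
    per_ticket = defaultdict(lambda: {"ai": [], "human": []})
    for event in events:
        per_ticket[event.ticket][event.actor].append(event.action)
    return dict(per_ticket)


def worklog_entry(ticket, record):
    """Render the text that would be written back to the ticket."""
    ai_steps = "; ".join(record["ai"]) or "none"
    human_steps = "; ".join(record["human"]) or "none"
    return f"{ticket} | AI: {ai_steps} | Human: {human_steps}"


# Made-up session events for illustration.
events = [
    Event("PROJ-1", "ai", "edited parser module"),
    Event("PROJ-1", "human", "approved diff"),
    Event("PROJ-2", "ai", "ran test suite"),
]

records = reconstruct(events)
entries = [worklog_entry(ticket, record) for ticket, record in records.items()]
for entry in entries:
    print(entry)
```

The write-back step is left as a formatted string here; where it lands (ticket, worklog, doc) depends on the tools in use.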

The point isn't surveillance and it isn't process for its own sake. It's that you should be able to see your own work, how a task got done and what the AI did inside it, with the same clarity you had back when you were slow enough to remember.

The faster the tools get, the more this matters

Every gain in AI capability widens the gap between how much you produce and how much you can account for. That gap is fine right up until someone asks what happened: a review, an incident, a handoff, or just you trying to understand your own last month.

Speed without a trail isn't really progress you can stand behind. It's just motion you can no longer inspect.

DEV Community: Akarsh hegde