Austin Vance for Focused

Posted on May 21 • Originally published at focused.io

Agent Failures Should Open Tickets | Focused Labs

#ai #programming

Agent traces should create work.

When an AI agent workflow fails the same way twice, the second failure should create a ticket with an owner, linked evidence, and a regression check the team can run down the line. Without that step, traces and output reviews can be pretty and searchable and still be worthless for anything other than replaying the same series of mistakes over and over. I’ve begun to call this “replayable regret”: expensive, and painful to behold.

LangChain this week identified a critical gap in tooling: agent work can be traced and reviewed, but getting from an identified error to a merged fix is still manual and slow. Harrison Chase called this out too, noting that LangChain is building an “issue bench” and already uses it internally, though it is still early for this class of tooling.

That phrase matters because the unit of work changes. The trace stops being the artifact everyone stares at after the failure. The recurring failure becomes the artifact the team improves against.

Traces should not die in Slack

The common failure loop is quite dumb.

An agent fails in a workflow, someone opens the trace for that work item, and the team can see the failure in the trace steps. The tool call timed out. The planner picked a weird branch. Retrieval pulled stale context. The evaluator fired. A user left negative feedback. Then the trace link gets pasted into Slack with the word “interesting” on top of it, and that is where it dies. In a word, that’s vibes, not work.

Traces are good! I recently wrote about why traces from agent workflows should cross the MCP boundary, and making traces visible is the first half of the work here. A visible trace says this happened. But the team also needs a record that says this failure family is owned, fixed, covered, and blocked from coming back.

That second half is the AI agent workflow people keep skipping.

In the recent Engine thread, the team at LangChain identified the right inputs: tool call failures and timeouts, online eval failures, trace anomalies, negative feedback, and unusual behavior. Today those inputs are treated as interesting patterns to watch on dashboard widgets. They should instead feed a queue of work, where each signal is eligible to become a named issue with a severity, linked traces, a suspected boundary, and a release condition.

The trace is evidence. The issue loop turns it into engineering work.

The good version is pretty straightforward and mechanical. A trace anomaly turns into an issue, new traces get clustered with it, and it gets an owner. The owner makes a change and adds a new evaluator or updates an existing one. The change ships through a release gate that runs that evaluator. If a later change regresses, the issue reopens.

No ceremony. Just a loop.

The ticket needs a shape

An agent issue is not a Jira card with “LLM flaky” in the title. That card should be illegal (morally, at least). The issue needs the same hard edges as a production defect.

At minimum, an agent issue carries these fields:

Failure name: “refund flow calls payment API before policy check”
Workflow: refund, plan upgrade, incident triage, research synthesis
Severity: customer impact, data risk, financial risk, operational drag
Evidence: linked traces, failed eval runs, user feedback, tool responses
Boundary: prompt, tool contract, context source, model route, permission, downstream API
Owner: team or service owner, not “AI”
Fix status: proposed, merged, reverted, blocked
Regression coverage: benchmark eval, coverage eval, release gate
Reopen rule: the exact signal that opens it up again and puts it back in the queue
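One way to make that shape concrete is a small record type that refuses to be vague. This is a minimal sketch, not anyone's real schema: the class name `AgentIssue`, the `missing_fields` helper, and the example values (`payments-team`, `trace:example-1`, the evaluator name in the reopen rule) are all hypothetical and only mirror the fields listed above.

```python
from dataclasses import dataclass, field
from enum import Enum


class FixStatus(Enum):
    PROPOSED = "proposed"
    MERGED = "merged"
    REVERTED = "reverted"
    BLOCKED = "blocked"


@dataclass
class AgentIssue:
    """Hypothetical issue record mirroring the fields listed above."""
    failure_name: str
    workflow: str
    severity: str          # e.g. "financial risk"
    boundary: str          # e.g. "tool contract"
    owner: str             # a team or service owner, not "AI"
    reopen_rule: str       # the exact signal that puts it back in the queue
    fix_status: FixStatus = FixStatus.PROPOSED
    evidence: list[str] = field(default_factory=list)
    regression_coverage: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the fields that would make this ticket an 'LLM flaky' card."""
        problems = []
        if self.owner.strip().lower() in {"", "ai", "llm"}:
            problems.append("owner")
        if not self.evidence:
            problems.append("evidence")
        if not self.regression_coverage:
            problems.append("regression_coverage")
        if not self.reopen_rule.strip():
            problems.append("reopen_rule")
        return problems


issue = AgentIssue(
    failure_name="refund flow calls payment API before policy check",
    workflow="refund",
    severity="financial risk",
    boundary="tool contract",
    owner="payments-team",
    reopen_rule="coverage eval 'policy_check_before_payment' fails on a new trace",
    evidence=["trace:example-1"],
)
print(issue.missing_fields())  # ['regression_coverage']
```

The check is deliberately dumb. Its only job is to stop an issue from being treated as done while it still has no owner, no evidence, or no regression coverage.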

Agent failures are hard to test because they manifest differently based on the input, the branch under test, and the tools the agent touched before it failed. Without an issue name, every failure trace becomes a new issue to debug rather than another data point in the failure family that production tests are supposed to remove.

If LangSmith Engine emits issue.created and issue.trace.added events, then stable event IDs can handle dedupe, severity can travel with the event, and the shared request ID can group deliveries from the same upstream action. That’s all that’s required for this. No need for a religion. Use the existing webhook shape to get failures into queues, boards, and CI jobs.

The boring webhook handler should do four things:

Dedupe on event ID.
Group related deliveries by request ID.
Attach trace evidence to the existing issue when the cluster already exists.
Trigger the right owner workflow for the right reasons, meaning severity and recurrence justify the cost of that workflow.
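Those four steps fit in a few dozen lines. The sketch below keeps everything in memory and assumes a payload shape: the keys `event_id`, `request_id`, `issue_id`, `trace_id`, and `severity`, the severity labels, and the recurrence threshold are all illustrative guesses, not a documented LangSmith schema. A real handler would map them from whatever the webhook actually delivers and persist state somewhere durable.

```python
from collections import defaultdict

ESCALATE_SEVERITIES = {"critical", "high"}  # assumed labels
RECURRENCE_THRESHOLD = 3                    # assumed: traces before a quiet issue escalates


class IssueWebhookHandler:
    def __init__(self, notify_owner):
        self.notify_owner = notify_owner  # callable(issue_id, reason)
        self.seen_event_ids = set()
        self.deliveries_by_request = defaultdict(list)
        self.traces_by_issue = defaultdict(list)
        self.escalated_issues = set()

    def handle(self, event: dict) -> str:
        # 1. Dedupe on event ID so webhook retries are harmless.
        if event["event_id"] in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event["event_id"])

        # 2. Group related deliveries by request ID.
        self.deliveries_by_request[event["request_id"]].append(event["event_id"])

        # 3. Attach trace evidence to the issue cluster it belongs to.
        issue_id = event["issue_id"]
        trace_id = event.get("trace_id")
        if trace_id and trace_id not in self.traces_by_issue[issue_id]:
            self.traces_by_issue[issue_id].append(trace_id)

        # 4. Trigger the owner workflow only when severity or recurrence justify it.
        severe = event.get("severity") in ESCALATE_SEVERITIES
        recurring = len(self.traces_by_issue[issue_id]) >= RECURRENCE_THRESHOLD
        if (severe or recurring) and issue_id not in self.escalated_issues:
            self.escalated_issues.add(issue_id)
            self.notify_owner(issue_id, "severity" if severe else "recurrence")
            return "escalated"
        return "recorded"


notifications = []
handler = IssueWebhookHandler(lambda issue_id, reason: notifications.append((issue_id, reason)))

first = {"event_id": "evt-1", "request_id": "req-1", "issue_id": "iss-1",
         "trace_id": "tr-1", "severity": "low"}
events = [
    first,
    dict(first),  # a retried delivery of the same event
    {"event_id": "evt-2", "request_id": "req-1", "issue_id": "iss-1",
     "trace_id": "tr-2", "severity": "low"},
    {"event_id": "evt-3", "request_id": "req-2", "issue_id": "iss-1",
     "trace_id": "tr-3", "severity": "low"},
]
results = [handler.handle(e) for e in events]
print(results)        # ['recorded', 'duplicate', 'recorded', 'escalated']
print(notifications)  # [('iss-1', 'recurrence')]
```

The low-severity issue stays quiet until its third distinct trace arrives, then escalates exactly once. That is the "right reasons" rule in practice.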

This is a small piece of work. It is also how agent quality work avoids getting lost on Tuesday.

Benchmarks are pointing at the same problem

Long-horizon agent work fails in similar ways to engineering work. Rather than one incorrect result, the failure is a series of small errors that creep up over time, leaving a final result that is less than useful.

RoadmapBench exists to evaluate long-horizon software development tasks: 115 tasks spread across 17 repositories and 5 languages. The median task modified 3,700 lines of code in 51 files. For tasks at that size, the best model resolved 39.1% of them. The useful analysis is where the generated plan went wrong, which files inside the task became riskier, and which requirements got orphaned.

LongCLI-Bench scores long-horizon programming tasks on CLI projects three ways: fail-to-pass, pass-to-pass, and step-level progress. It reports pass rates below 20% for state-of-the-art agents. The step-level view is the useful part. There is a big difference between a late red X for failing to hit the ultimate task goal and an early red X that points to a tool loop, the wrong files, or a pass-to-fail regression.

Phoenix-bench found that oracle file-level localization on hardware tasks added only 1.4% to resolution. A single round of feedback from testbench logs increased the resolved rate from 42% to 45%. It turns out that pointing at the right general area is of limited value. Actionable feedback on the task under consideration is what helps.

This is the issue-loop argument dressed up in benchmark clothing. Better testing of AI agents requires more than a simple test suite. It requires a workflow that can expose issues, allow them to be fixed, and verify the fix inside the same workflow.

The eval suite should grow from resolved issues

Closed issues should feed the test suite.

LangSmith describes evaluators as workspace-level resources that can be attached to tracing projects and datasets in the same workspace. Engine can suggest them for detected issues, so a custom evaluator can be developed and tied back to the closed cluster of traces that caused the issue in the first place.

Brace Sproul’s distinction between benchmark evals and coverage evals maps onto this. One set of evaluators for fast benchmarks on known workflows. A second, more exhaustive set of evaluators for longer paths, product commitments, and stranger trajectories. Trying to use one suite for both ends turns evals into the tax nobody wants to pay.

Resolved issues should feed the right suite, not one giant eval blob.

Severity-0 resolved issues, like refund errors in critical workflows, should be evaluated with the fast benchmarks. A rare edge case in a long multi-hop research workflow is probably better served by the broader coverage suite, high cost and long run time included. Severity-0 policy violations may belong in both suites.
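That routing rule is small enough to write down. A minimal sketch, assuming a numeric severity where 0 is the most severe and two hypothetical flags on the resolved issue. The suite names and the default for everything else (send it to the coverage suite) are my assumptions, not something LangSmith defines.

```python
def suites_for(severity: int, is_policy_violation: bool, is_rare_long_path: bool) -> set[str]:
    """Pick the eval suite(s) a resolved issue should feed."""
    suites = set()
    if severity == 0:
        # Critical-workflow failures guard every release via the fast suite.
        suites.add("benchmark")
    if is_rare_long_path or (severity == 0 and is_policy_violation):
        # Long, strange trajectories and severity-0 policy violations
        # also earn a place in the slower, broader suite.
        suites.add("coverage")
    if not suites:
        # Assumed default: anything else goes to the broader suite.
        suites.add("coverage")
    return suites


print(sorted(suites_for(0, False, False)))  # ['benchmark']
print(sorted(suites_for(2, False, True)))   # ['coverage']
print(sorted(suites_for(0, True, False)))   # ['benchmark', 'coverage']
```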

However it gets cut, this is work. Every workflow change can introduce failure modes the system has not seen before. The test that proves a fix worked is different from the test that guards the same problem against a later regression. And then there is the matter of the gate.

The queue is where agency gets real

Capabilities are easy to demo. The discipline of building agents that improve is much harder to demonstrate. Agency was always about that discipline, and it shows up in a particular way: when something fails, the team can explain what happened next.

A good issue queue answers the questions a team asks while debugging a failure, questions that no single trace can answer:

Is this failure new or recurring?
Which workflow owns it?
Which traces point at the same root cause?
Did the fix land?
Which evaluator covers it now?
Which release gate blocks a regression?
Who gets paged if the issue comes back?
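The last few questions are the ones a release gate can answer mechanically. Here is a hedged sketch of that check. The function name `run_release_gate`, the input shapes, and the evaluator names are hypothetical, and a missing evaluator result is treated as a failure by assumption.

```python
def run_release_gate(resolved_issues: dict, eval_results: dict) -> dict:
    """Decide whether a release may ship.

    resolved_issues: issue ID -> names of the evaluators that cover it.
    eval_results: evaluator name -> True if it passed on this release candidate.
    """
    uncovered = []  # resolved issues that no evaluator guards
    reopen = []     # resolved issues whose evaluator failed or never ran
    for issue_id, evaluators in resolved_issues.items():
        if not evaluators:
            uncovered.append(issue_id)
        elif any(not eval_results.get(name, False) for name in evaluators):
            reopen.append(issue_id)
    return {
        "release_allowed": not uncovered and not reopen,
        "uncovered": uncovered,
        "reopen": reopen,
    }


resolved = {
    "iss-1": ["policy_check_before_payment"],
    "iss-2": ["stale_context"],
    "iss-3": [],
}
results = {"policy_check_before_payment": True, "stale_context": False}

gate = run_release_gate(resolved, results)
print(gate)
# {'release_allowed': False, 'uncovered': ['iss-3'], 'reopen': ['iss-2']}
```

The `reopen` list goes back into the queue, and the `uncovered` list is the honest answer to "which evaluator covers it now?" when the answer is "none."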

Again, this is normal software development, complicated by workflows that fail through complex, probabilistic, variable paths. Same defect, different costumes.

The LangChain survey of production AI agents found that 57.3% of respondents already have agents in production. The number one production blocker cited by respondents was quality, at 32%. This sits next to 89% observability adoption for production agents, far ahead of offline evaluation at 52.4% and online evaluation at 37.3%. There is already a sea of visibility for production agents. The work to convert that visibility into closed quality issues is still barely underway.

Honeycomb’s new investigation features for agent observability start to address the same problem, with Agent Timeline built to reconstruct complex multi-agent, multi-trace workflows. But reconnecting that path to specific owned work, and making sure the work is covered, is still the large gap.

That is where the issue queue comes in.

Own the loop

The AI agent workflow I want is not fancy.

Signal failure -> create issue -> add evidence -> assign owner -> propose fix -> add evaluator -> run release gate -> reopen regression.
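Written as a state machine, the loop is about as boring as it sounds. This is a sketch under assumptions: the state names are mine, and the extra edges (a failed gate sends the issue back to a proposed fix, a reopened issue collects new evidence first) are one reasonable reading of the loop, not a specification.

```python
ALLOWED_TRANSITIONS = {
    "signal": {"issue_created"},
    "issue_created": {"evidence_added"},
    "evidence_added": {"evidence_added", "owner_assigned"},
    "owner_assigned": {"fix_proposed"},
    "fix_proposed": {"evaluator_added"},
    "evaluator_added": {"gate_passed", "fix_proposed"},  # assumed: a failed gate means another fix
    "gate_passed": {"reopened"},                         # only a regression moves a shipped fix
    "reopened": {"evidence_added"},                      # assumed: reopening starts with the new trace
}


def advance(current: str, nxt: str) -> str:
    """Move an issue to the next state, refusing shortcuts."""
    if nxt not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {nxt}")
    return nxt


def replay(steps: list[str]) -> str:
    """Walk a sequence of states and return the final one."""
    state = steps[0]
    for nxt in steps[1:]:
        state = advance(state, nxt)
    return state


happy_path = [
    "signal", "issue_created", "evidence_added", "owner_assigned",
    "fix_proposed", "evaluator_added", "gate_passed",
]
print(replay(happy_path))                                   # gate_passed
print(replay(happy_path + ["reopened", "evidence_added"]))  # evidence_added

try:
    advance("issue_created", "gate_passed")
except ValueError as err:
    print(err)  # illegal transition: issue_created -> gate_passed
```

The useful property is what the table forbids: an issue cannot reach the gate without an owner, a fix, and an evaluator.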

This workflow looks less interesting week to week than announcing a new AI model. But to the buyer with an agent touching refunds, support tickets, infrastructure changes, or account data, this is the kind of work that matters week to week. Last week’s failure needs to become this week’s guardrail.

Agent failures should open tickets.

The ticket is where the trace becomes work. The work is where the system gets better.