DEV Community

JEONSEWON


My AI-agent waste detector scored zero false positives. Then I ran it on a real trace.

My detector passed every synthetic test with zero false positives. Then I pointed it at one real trace and found a crack.
This is the honest version of where I am. I'm building Clew — a tool that finds the redundant loops, re-queries, and handoffs that silently burn tokens when multiple AI agents work together. No crash, no error, just two agents quietly re-doing each other's work while the token bill climbs.
I build in public, and I publish the negatives. So here's the whole arc, including the part that isn't working yet.

First, I killed my own hypothesis

The original idea wasn't waste detection at all. It was failure prediction: watch the behavior between agents and forecast multi-agent failures before they happen. The differentiator was a single metric built on two signals — structural cycles in the inter-agent message graph, and the decay of novelty in embeddings.

Before I ran anything, I pre-registered the success bar: AUC ≥ 0.80. I numbered every change and kept the signal code physically separated from the labels so I couldn't leak my way to a good number. Then I ran it on MAST-Data, UC Berkeley's dataset of 1,600+ real multi-agent traces across 7 frameworks [(Cemri et al., arXiv:2503.13657)](url).

Result: AUC ≈ 0.455. A coin flip.

It got worse. The signal correlated with trace length at r ≈ 0.86 — it was mostly measuring how long a trace was, not whether it failed. Correcting for that dropped AUC to 0.42 and reversed the direction: successful traces actually showed more decay (p ≈ 0.013).
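To make the length confound concrete, here is a minimal sketch of one common way to check for it: compute AUC on the raw signal, regress trace length out of the signal, then compute AUC again on the residual. This is not necessarily the correction used above. The function names and the toy numbers are invented for illustration and have nothing to do with MAST-Data.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def auc(scores, labels):
    """Chance that a random failed trace (label 1) outscores a random successful one (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count as half
    return wins / (len(pos) * len(neg))

def residualize(signal, lengths):
    """Subtract the part of the signal that a straight-line fit on length explains."""
    mx, my = mean(lengths), mean(signal)
    slope = (sum((x - mx) * (y - my) for x, y in zip(lengths, signal))
             / sum((x - mx) ** 2 for x in lengths))
    return [y - (my + slope * (x - mx)) for x, y in zip(lengths, signal)]

# Toy numbers, invented for illustration only.
lengths = [10, 20, 30, 40, 50, 60]
signal = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8]
failed = [0, 1, 0, 1, 0, 1]

print("r(signal, length):", round(pearson(signal, lengths), 3))
print("raw AUC:", round(auc(signal, failed), 3))
print("length-corrected AUC:", round(auc(residualize(signal, lengths), failed), 3))
```

If the corrected AUC moves a lot, or flips to the other side of 0.5, the raw signal was leaning on length rather than on anything about failure.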

The honest read: not disproven, but unvalidated. On this implementation, on this data: negative. So I shut it down. And I counted it as a win, because I got a fast, honest answer in weeks instead of building a dashboard on a metric that secretly measures trace length. That experiment became the DNA of everything since: design the experiment that's allowed to kill the idea.

The pivot: from predicting failure to cutting waste

The intuition behind v1 — that you need structure and meaning — turned out to be right. The implementation was wrong. An external paper confirmed the shape of the fix: an unsupervised cycle-detection framework that runs structure first, then semantics [(George et al., IBM Research, arXiv:2511.10650)](url). On their benchmark of 1,575 LangGraph trajectories, the cascade hit F1 0.72 — versus 0.08 for structure alone and 0.28 for semantics alone.
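The difference between summing signals and cascading them is easier to see in code. The sketch below is a toy cascade, not the paper's algorithm and not Clew's implementation: stage one proposes candidate pairs from structure alone (same node visited twice), and stage two runs a similarity check only on those candidates. The step format, the `0.9` threshold, and the use of `difflib` as a stand-in for embedding similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def structural_candidates(steps):
    """Stage 1: pairs of steps that hit the same node. Cheap, structure only."""
    return [(i, j) for i, j in combinations(range(len(steps)), 2)
            if steps[i]["node"] == steps[j]["node"]]

def similarity(a, b):
    """Stand-in for embedding similarity (character-level ratio in [0, 1])."""
    return SequenceMatcher(None, a, b).ratio()

def cascade(steps, threshold=0.9):
    """Stage 2: keep only structural candidates whose outputs nearly match."""
    return [(i, j) for i, j in structural_candidates(steps)
            if similarity(steps[i]["output"], steps[j]["output"]) >= threshold]

# Hypothetical trace: the same search runs twice with the same result.
steps = [
    {"node": "search", "output": "Paris is the capital of France."},
    {"node": "summarize", "output": "Paris is the capital of France."},
    {"node": "search", "output": "Paris is the capital of France."},
    {"node": "search", "output": "Tokyo has about 14 million residents."},
]

print(cascade(steps))  # only the repeated search with a matching output survives
```

Steps 0 and 1 have identical text but different nodes, so structure filters that pair out before semantics ever sees it. Steps 0 and 3 share a node but not a result, so semantics filters that one.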

To be clear: that 0.72 is IBM's result on IBM's data, not mine. I keep that line bright in everything I publish. But it told me what I'd gotten wrong (I'd summed the signals instead of cascading them, and looked at global trends instead of local repeats), and it pointed at a sharper wedge: stop predicting failure, start detecting the redundant loops and handoffs that burn tokens. That's measurable. It speaks in dollars.

Building it so I couldn't fool myself

Before writing a line of detection logic, I built the thing that makes the validation trustworthy. Pre-registered GO/KILL criteria, frozen in git before I looked at any result. A leakage guard enforced not as a policy but as a failing test: the detection code physically cannot import or read the label files. Parameters chosen on a dev split only; the evaluation split touched exactly once, after freezing.
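A leakage guard written as a test could look roughly like the following. This is a sketch of the idea rather than Clew's actual guard: it statically scans detection source code for imports or string literals that mention the label location. The name `labels` and the example snippets are hypothetical.

```python
import ast

FORBIDDEN = ("labels",)  # hypothetical module name / path fragment for ground truth

def find_label_leaks(source, forbidden=FORBIDDEN):
    """Return sorted (line, text) pairs for imports or string literals naming a forbidden word."""
    leaks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""] + [alias.name for alias in node.names]
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            names = [node.value]  # catches hard-coded paths (and docstrings, by design)
        else:
            continue
        leaks.extend((node.lineno, name) for name in names
                     if any(word in name for word in forbidden))
    return sorted(leaks)

# Two made-up detector sources: one clean, one that reaches for the labels.
clean_source = "import json\n\ndef detect(trace):\n    return json.dumps(trace)\n"
leaky_source = "from eval.labels import load\npath = 'data/labels.csv'\n"

def test_detector_cannot_see_labels():
    assert find_label_leaks(clean_source) == []

test_detector_cannot_see_labels()
print(find_label_leaks(leaky_source))
```

A static scan like this cannot catch dynamic imports or paths assembled at runtime, so it would likely need to be paired with something stronger, such as keeping the label files out of the detector's environment entirely.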

Two moments from that build are worth telling, because they're the actual product.

The fix that passed but was a lie. During calibration, a clean case kept tripping the detector: two lookups with the same schema but different values (think customer A vs customer B). My first "fix" diversified the data so the numbers turned green — except that didn't solve the limitation, it deleted the case that exposed it. Shipped as-is, the detector would have flagged legitimate work as waste. The real fix routed the decision from the semantic layer (which can't tell those apart) to the structural layer (which can): a re-query only counts if the input is identical. Then I put the hard case back in to prove the gate worked.
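The identical-input gate can be sketched as a fingerprint over the call's inputs. The sketch assumes each call is a dict with a tool name and JSON-serializable arguments; the field names and the `lookup_customer` tool are hypothetical, and the real gate may be built differently.

```python
import hashlib
import json

def input_fingerprint(tool, args):
    """Hash the tool name plus canonicalized arguments, so key order does not matter."""
    canonical = json.dumps({"tool": tool, "args": args},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def redundant_requeries(calls):
    """Return indexes of calls whose exact input already appeared earlier in the trace."""
    seen, repeats = set(), []
    for index, call in enumerate(calls):
        fingerprint = input_fingerprint(call["tool"], call["args"])
        if fingerprint in seen:
            repeats.append(index)
        seen.add(fingerprint)
    return repeats

# Same schema throughout; only the third call repeats an earlier input.
calls = [
    {"tool": "lookup_customer", "args": {"id": "A", "fields": ["plan"]}},
    {"tool": "lookup_customer", "args": {"fields": ["plan"], "id": "B"}},
    {"tool": "lookup_customer", "args": {"fields": ["plan"], "id": "A"}},
]

print(redundant_requeries(calls))
```

Customer A versus customer B produce different fingerprints even though the schema is the same, which is exactly the distinction the semantic layer could not make.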

The audit that caught a blind spot before launch. A recall check right before freezing found that one of four waste patterns — regenerative handoffs — scored 0/10. Diagnosis: it's structurally identical to a normal handoff, so there's no candidate path for it. Rather than quietly drop it to protect the score, I explicitly de-scoped it, left it in the dataset, and let those 10 misses count against my aggregate F1.

Carrying that penalty, the single-shot evaluation came back GO: F1 0.857, zero false positives, 100% recall on the three in-scope patterns. The out-of-sample numbers reproduced the dev numbers exactly, so there is no sign of overfitting to my own dev split.
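For readers who want to see how ten misses drag the score down, here is the arithmetic under one assumption: that the aggregate is a single F1 over all planted cases. With zero false positives and 10 misses, an F1 of 0.857 implies about 30 correct detections. That count of 30 is inferred from the reported numbers, not stated anywhere above.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Inferred counts (assumption): 30 in-scope hits, 0 false positives,
# and the 10 de-scoped regenerative handoffs counted as misses.
with_penalty = f1_score(tp=30, fp=0, fn=10)

# What the score would have been if the 10 misses had been quietly dropped.
without_penalty = f1_score(tp=30, fp=0, fn=0)

print(round(with_penalty, 3), round(without_penalty, 3))
```

Precision stays at 1.0 either way. The entire gap comes from recall falling from 1.0 to 0.75 once the de-scoped pattern is left in the denominator.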

And every one of those numbers is synthetic, held-out, three patterns. That matters for what comes next.

Then reality showed up

A GO on synthetic data isn't a product. So I pointed the detector at real LangGraph instrumentation for the first time. The structural machinery held up better than I expected — but it surfaced three things my synthetic tests had never touched:

  • Spans collapse weirdly. Real LLM calls instrument under a single model-class name, so the detector saw "the same node three times" and cried wolf. Fixable: fold the LLM sub-spans into their parent node while preserving tool spans.

  • Routers fake repeats. LangGraph's conditional-edge functions instrument as spans too, and a repeating router looks like a repeating node. Fixable, and principled: a span that burns zero tokens is, by definition, not token waste — so exclude it.

Those two are confirmed engineering fixes. The third one I can't fix from my desk:

  • The threshold might not transfer. My similarity threshold was calibrated on synthetic data, where "redundant" and "distinct" separate cleanly. On real output, same-domain results cluster in a middle band that sits right on top of my threshold. Stripping the JSON scaffolding accounts for some of it (~0.2 of the similarity), but not all. With a sample size of three, this is a crack to take seriously — not a verdict. And there's exactly one way to find out.
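The first two fixes in the list above amount to a preprocessing pass over the spans before any repeat detection runs. Below is a rough sketch under assumed data shapes: each span is a dict with `id`, `parent_id`, `name`, `kind`, and `tokens`. Those field names, the `kind` values, and the span names are hypothetical and do not reflect LangGraph's real instrumentation schema. Keeping tool spans regardless of token count is also an interpretation, not something stated above.

```python
def preprocess(spans):
    """Fold LLM sub-spans into their parent node, then drop zero-token non-tool spans."""
    by_id = {span["id"]: dict(span) for span in spans}  # copies, so the input is untouched
    for span in spans:
        if span["kind"] == "llm" and span["parent_id"] in by_id:
            by_id[span["parent_id"]]["tokens"] += span["tokens"]  # parent absorbs the cost
            del by_id[span["id"]]
    # Assumption: tool spans are kept even at zero tokens; routers and other
    # zero-token spans are excluded because they cannot be token waste.
    return [s for s in by_id.values() if s["kind"] == "tool" or s["tokens"] > 0]

# Hypothetical trace: two nodes, each with an LLM call under the same class name,
# one tool call, and one conditional-edge router.
spans = [
    {"id": 1, "parent_id": None, "name": "researcher", "kind": "node", "tokens": 0},
    {"id": 2, "parent_id": 1, "name": "ChatModel", "kind": "llm", "tokens": 120},
    {"id": 3, "parent_id": 1, "name": "web_search", "kind": "tool", "tokens": 0},
    {"id": 4, "parent_id": None, "name": "route_next", "kind": "router", "tokens": 0},
    {"id": 5, "parent_id": None, "name": "writer", "kind": "node", "tokens": 0},
    {"id": 6, "parent_id": 5, "name": "ChatModel", "kind": "llm", "tokens": 300},
]

for span in preprocess(spans):
    print(span["name"], span["tokens"])
```

After this pass the two `ChatModel` spans no longer look like one node repeating, and the router has disappeared from the sequence. Nested LLM spans are not handled in this sketch.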

What's true, and what isn't

I think the honesty boundary is the whole point, so here it is plainly.

True today: a structure-then-semantics cascade catches three planted waste patterns in held-out synthetic traces with zero false positives, passing a pre-registered bar. On real instrumentation, the structural mechanism is confirmed (spans collapse correctly, genuine repeats still fire, router false-positives are removed). And there's a working tool: feed it a trace, get back a waste report in minutes.

Not true yet: that it works on real production traces. That it saves real tokens — I have zero measured savings. That the threshold generalizes — unverified, and currently my biggest open question. Zero users.

The tempting move is to nudge the threshold up by a hair so the borderline cases fall the right way. That's exactly the post-hoc overfit that killed v1. The threshold has to be re-derived honestly, from real distributions — which I don't have.

The ask

That's where you come in, and it's a genuine ask, not a pitch.

If you run a multi-agent system — LangGraph, CrewAI, AutoGen, something custom — and your token bill has been creeping up: send me a trace. I'll run it through Clew and send back a free report — where the redundant work is, and what it's costing you in tokens. If your data can't leave your environment, I'll send you the tool to run locally and you just share the numbers.

I'm not selling anything. The one question my synthetic tests genuinely cannot answer is whether this holds on real output distributions — and I'd rather find that out honestly than pretend it's already settled.

The most valuable thing I've built so far isn't a clever detector. It's a way of working that doesn't let me lie to myself about whether the detector is any good. If real traces break it, you'll read about that here too.

Top comments (2)

Mehmet Can Farsak

Really appreciate the rigorous methodology here — killing your own hypothesis before it becomes a sunk cost is refreshing. The waste detection angle is spot on.

I actually ran into a related problem with multi-agent setups: agents wasting tokens on execution when they should be in ideation mode. Put together a small hook plugin (Brainstorm-Mode by mehmetcanfarsak on GitHub) that prevents that execution drift via PreToolUse hooks — essentially adds a 'thinking mode' vs 'action mode' boundary. Three modes (divergent, actionable, academic) keep agents from burning tokens on premature tool calls during brainstorming phases.

JEONSEWON

This is great and I love that you came at it from the opposite side. The way I see it: your hook is prevention (block the premature tool call at PreToolUse, deterministically, every time), and Clew is detection (find the redundant work after the fact in the trace). Same enemy, different layer.
What's neat is they kind of need each other — a prevention tool like Brainstorm-Mode is hard to prove without measuring waste before and after, which is exactly what Clew does. If you're ever up for it, I'd genuinely love to run Clew over a before/after trace pair from a Brainstorm-Mode session: it'd quantify how much the divergent/actionable/academic gating actually saves, and honestly it'd also be a real-world test of whether my detector holds up (which I don't fully know yet). Either way — starring the repo, the mode-boundary idea is sharp.