<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JEONSEWON</title>
    <description>The latest articles on DEV Community by JEONSEWON (@jeonsewon).</description>
    <link>https://dev.to/jeonsewon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3781888%2Fe7d6377e-9f5c-4964-8938-bf0a1317f26e.jpg</url>
      <title>DEV Community: JEONSEWON</title>
      <link>https://dev.to/jeonsewon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeonsewon"/>
    <language>en</language>
    <item>
      <title>I almost shipped a parser that couldn't read a single real file</title>
      <dc:creator>JEONSEWON</dc:creator>
      <pubDate>Sat, 20 Jun 2026 11:32:58 +0000</pubDate>
      <link>https://dev.to/jeonsewon/i-almost-shipped-a-parser-that-couldnt-read-a-single-real-file-2p3m</link>
      <guid>https://dev.to/jeonsewon/i-almost-shipped-a-parser-that-couldnt-read-a-single-real-file-2p3m</guid>
      <description>&lt;p&gt;I needed my tool to accept trace files in a standard format (OTLP / OpenInference) so any framework could feed it. I read the spec, pictured the JSON structure, and was about to write the parser against that mental model. Then I made myself do one annoying thing first: dump the actual output and compare.&lt;br&gt;
Good thing I did. Almost every assumption was wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fojp68sqdzab48ociy88x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fojp68sqdzab48ociy88x.png" alt=" " width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd assumed uppercase-hex trace IDs, UnixNano string timestamps, and a nested resourceSpans structure — straight from the spec. The real SDK output had 0x-prefixed IDs, ISO datetimes, and a flat array. The "standard OTLP-JSON" I'd designed against turned out to be a spec example that no tool actually emits. If I'd written the parser against my mental model, it would have passed my own hand-made fixtures and then failed on every real file. The generalization would have been fake.&lt;/p&gt;

&lt;p&gt;The second dump taught me something bigger. I'd assumed users have trace files sitting on disk. They mostly don't — OpenInference's own examples ship spans to a dashboard over HTTP. So the premise of "send me your trace file" was itself off. The most valuable output of this whole task wasn't the parser — it was a 3-line snippet that lets a user produce a file in the first place. A parser with no "here's how to make the file" guide is a tool nobody can use.&lt;/p&gt;

&lt;p&gt;So the scope quietly shifted: support the one format real SDK output actually uses, reject the others with a clear "convert it like this" error instead of failing silently, and ship the copy-paste snippet — then verify that snippet actually runs end to end, not just the parser's unit tests.&lt;br&gt;
The discipline here is the same one I keep relearning: a result that's correct against data you imagined tells you nothing about data the world produces. Verify the real shape before you build for it.&lt;/p&gt;
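&lt;p&gt;A minimal sketch of that "reject with a clear message" door, assuming a flat JSON array of spans. The field name trace_id, the exception class, and the wording of the hints are illustrative assumptions, not Clew's actual code.&lt;/p&gt;

```python
import json


class UnsupportedTraceFormat(ValueError):
    """Raised with a conversion hint instead of failing silently."""


def load_spans(text):
    """Return the span list from a flat-array export, or raise with a hint.

    Assumed shape (hypothetical): a JSON array of span objects, each with a
    0x-prefixed "trace_id".
    """
    data = json.loads(text)
    # The nested spec-style layout is detected and named, not parsed.
    if isinstance(data, dict) and "resourceSpans" in data:
        raise UnsupportedTraceFormat(
            "nested resourceSpans export detected; convert it to the flat "
            "SDK span array first"
        )
    if not isinstance(data, list):
        raise UnsupportedTraceFormat("expected a JSON array of spans")
    for span in data:
        trace_id = str(span.get("trace_id", ""))
        if not trace_id.startswith("0x"):
            raise UnsupportedTraceFormat(
                "trace_id without 0x prefix: " + repr(trace_id)
            )
    return data
```

&lt;p&gt;The point of the sketch is the failure path: every unsupported shape ends in an error that names what was seen and what to do next.&lt;/p&gt;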

&lt;p&gt;(Caveat kept explicit: one export format supported today; dashboard/proto exports get a conversion message, not silence. The detection engine and its known limits are unchanged — this was about the door, not the detector.)&lt;br&gt;
Code: &lt;a href="https://github.com/JEONSEWON/Clew-by-Custos"&gt;github.com/JEONSEWON/Clew-by-Custos&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
    <item>
      <title>"Why don't you just use a better embedder?" — the most reasonable suggestion I keep refusing</title>
      <dc:creator>JEONSEWON</dc:creator>
      <pubDate>Fri, 19 Jun 2026 08:57:05 +0000</pubDate>
      <link>https://dev.to/jeonsewon/why-dont-you-just-use-a-better-embedder-the-most-reasonable-suggestion-i-keep-refusing-33c4</link>
      <guid>https://dev.to/jeonsewon/why-dont-you-just-use-a-better-embedder-the-most-reasonable-suggestion-i-keep-refusing-33c4</guid>
      <description>&lt;p&gt;Every time I show the E3 finding — that my semantic layer barely separates genuinely-distinct outputs on real, same-topic data — someone gives the same sensible advice: just swap in a stronger embedding model. It's a good instinct. I keep saying no, and the reason is more interesting than the suggestion.&lt;/p&gt;

&lt;p&gt;The semantic layer's job is to confirm whether two outputs are really redundant. On synthetic data it worked. On real same-topic text, outputs share so much vocabulary that almost everything scores "similar," and my threshold (φ=0.514) can't tell waste from normal progress. So yes — a different embedder might separate them better.&lt;/p&gt;

&lt;p&gt;Here's the trap. I have exactly three real pairs that expose this. If I start swapping embedders and tuning until those three separate cleanly, I haven't fixed the detector — I've fitted it to three examples. The number would look great and mean nothing, because I'd have chosen the instrument after seeing what makes the result pretty. That is the precise mistake that killed my first project: a signal that looked strong because it was quietly shaped to the data in front of me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5tg9gn5a2m4gcvaxyguf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5tg9gn5a2m4gcvaxyguf.png" alt=" " width="800" height="722"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Freezing the embedding model isn't stubbornness about this particular model. It's refusing to change the ruler after seeing the measurement. The honest fix for E3 isn't a better embedder chosen against three examples — it's real traces across many topics and domains, enough to see the actual distribution, and then a pre-registered recalibration I commit to before looking.&lt;/p&gt;

&lt;p&gt;Which is why the bottleneck isn't a model on Hugging Face. It's traces from people running real multi-agent systems. The better embedder might genuinely be the answer — but I only get to find that out honestly once, so I'm not spending it on n=3.&lt;br&gt;
Code, the E3 log, and the frozen params: &lt;br&gt;
&lt;a href="https://dev.tourl"&gt;github.com/JEONSEWON/Clew-by-Custos&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>machinelearning</category>
      <category>llmops</category>
    </item>
    <item>
      <title>The 4-step ritual I use so an AI coding agent can't hand me a green checkmark that lies</title>
      <dc:creator>JEONSEWON</dc:creator>
      <pubDate>Thu, 18 Jun 2026 15:51:38 +0000</pubDate>
      <link>https://dev.to/jeonsewon/the-4-step-ritual-i-use-so-an-ai-coding-agent-cant-hand-me-a-green-checkmark-that-lies-9pf</link>
      <guid>https://dev.to/jeonsewon/the-4-step-ritual-i-use-so-an-ai-coding-agent-cant-hand-me-a-green-checkmark-that-lies-9pf</guid>
      <description>&lt;p&gt;I built my whole product with an AI writing the code, and the scariest failure mode isn't bugs — it's a test suite that passes for the wrong reason. Here are all four steps, and the real moment each one saved me.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Pre-register the pass/fail criteria, frozen in git, before I see any result. If I define "success" after seeing the output, I'll unconsciously pick the definition the output already satisfies. My first project died exactly here: a signal that "passed" was secretly measuring trace length, not failure. Writing the bar down first is the cheapest insurance there is.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Commit the criteria and run the baseline tests. A known-good starting line the agent can't silently move. If a test goes green, I want to know it went green today, not that it was already green.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ask the agent for a PLAN, not code. Most "the AI wrecked my codebase" stories begin with approving a 400-line diff nobody read. A plan is reviewable in two minutes. This is also where I catch the dangerous "fixes" — once, my agent's fix made the numbers go green by quietly deleting the hard case that exposed a real weakness. I caught it in the plan, not in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Review the plan, push back, then approve, manually. The agent never writes code I haven't already read a plan for. Auto-approve is how you wake up to a system that's confidently wrong.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It feels slow. It's the opposite: an hour here saves a week of debugging something that looked correct the whole time. Code's public: github.com/JEONSEWON/Clew-by-Custos&lt;/p&gt;

&lt;p&gt;#BuildInPublic #AIAgents #ClaudeCode #DevTools&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F108ju9kzxjtl6xs2b2p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F108ju9kzxjtl6xs2b2p7.png" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>devtools</category>
      <category>llmops</category>
    </item>
    <item>
      <title>I spent a day teaching my tool to make the problem look smaller</title>
      <dc:creator>JEONSEWON</dc:creator>
      <pubDate>Wed, 17 Jun 2026 10:38:02 +0000</pubDate>
      <link>https://dev.to/jeonsewon/i-spent-a-day-teaching-my-tool-to-make-the-problem-look-smaller-1dg5</link>
      <guid>https://dev.to/jeonsewon/i-spent-a-day-teaching-my-tool-to-make-the-problem-look-smaller-1dg5</guid>
      <description>&lt;p&gt;Every "we found $X of waste!" tool has a quiet incentive: the bigger the number, the more impressive the demo. So when I built the report that turns a trace into a waste summary, I deliberately wired in four rules whose only job is to keep that number honest:&lt;/p&gt;

&lt;p&gt;1. Count only the re-run, never the original. If an agent does real work once and then redundantly repeats it, only the repeat is waste. Charging the legitimate first run to the "waste" column is the easiest way to double your headline number, and it's a lie.&lt;/p&gt;

&lt;p&gt;2. One row per wasted span (dedupe). When one repeat pairs with several earlier runs, the naive report lists it multiple times and the waste visually balloons. I assert in a test that rows = actual wasted spans.&lt;/p&gt;

&lt;p&gt;3. "unknown" instead of a guess. No token count captured? The report says unknown. It does not invent a plausible number to fill the cell.&lt;/p&gt;

&lt;p&gt;4. The report prints its own frozen parameters (φ, N, embedding model) in the header, so anyone reading it knows exactly which settings produced it.&lt;/p&gt;
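&lt;p&gt;The four rules fit in a few lines. This is a sketch under assumptions: the function name, the (original, repeat) pair shape, and the params dict are hypothetical, not the report code in the repo.&lt;/p&gt;

```python
def build_report(pairs, tokens_by_span, params):
    """Build a waste report from detector output.

    pairs: (original_span_id, repeat_span_id) tuples from a detector.
    tokens_by_span: span_id to token count; a span is absent if not captured.
    params: the frozen settings to print in the header.
    """
    # Rule 4: the header carries the exact settings that produced the report.
    header = "params: " + ", ".join(
        f"{key}={value}" for key, value in sorted(params.items())
    )
    # Rules 1 and 2: only the repeat is waste, and each repeat gets one row
    # even when it pairs with several earlier runs.
    wasted = sorted({repeat for _original, repeat in pairs})
    rows = []
    for span_id in wasted:
        tokens = tokens_by_span.get(span_id)
        # Rule 3: say "unknown" rather than inventing a number.
        rows.append((span_id, "unknown" if tokens is None else tokens))
    return header, rows
```

&lt;p&gt;With pairs (a, c), (b, c), (a, d) and a token count captured only for c, this yields two rows: c with its count and d as "unknown". The originals a and b never appear.&lt;/p&gt;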

&lt;p&gt;None of this makes the demo flashier. That's the point. In a category where every competitor is incentivized to inflate, a report that refuses to exaggerate is the differentiator — and it's enforced in tests, not in good intentions.&lt;/p&gt;

&lt;p&gt;Code's public: github.com/JEONSEWON/Clew-by-Custos&lt;/p&gt;

&lt;p&gt;#BuildInPublic #AIAgents #LLMOps #DevTools&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1ne4h152eaawf4w9l72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1ne4h152eaawf4w9l72.png" alt=" " width="800" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>observability</category>
      <category>devtools</category>
    </item>
    <item>
      <title>I passed all 5 of my real-AI tests. The most useful thing I found is that half my detector barely works.</title>
      <dc:creator>JEONSEWON</dc:creator>
      <pubDate>Tue, 16 Jun 2026 14:50:16 +0000</pubDate>
      <link>https://dev.to/jeonsewon/i-passed-all-5-of-my-real-ai-tests-the-most-useful-thing-i-found-is-that-half-my-detector-barely-2hin</link>
      <guid>https://dev.to/jeonsewon/i-passed-all-5-of-my-real-ai-tests-the-most-useful-thing-i-found-is-that-half-my-detector-barely-2hin</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoilfar4mc3mbuc4xea1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoilfar4mc3mbuc4xea1.png" alt=" " width="800" height="799"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I finally pointed my multi-agent waste detector at real AI output for the first time — five scenarios, real Claude traces, no synthetic data. Four of the five came back exactly as expected on the first run. The fifth (a "looks like a repeat but isn't" case) threw a false positive.&lt;/p&gt;

&lt;p&gt;The tempting move there is to nudge my similarity threshold until the false positive goes away. I didn't touch it. I diagnosed first — and the culprit turned out to be my test design, not the detector. I rebuilt the test properly; the second run passed 5/5, with the threshold (φ=0.514) untouched from start to finish.&lt;/p&gt;

&lt;p&gt;But passing 5/5 isn't the story. This is:&lt;br&gt;
I measured the similarity of output pairs that are genuinely not redundant — different content that should look distinct. On real, same-topic output, 100% of them scored above my "this is redundant" threshold. Not a near-miss band like my synthetic work predicted — everything, comfortably above the line.&lt;/p&gt;

&lt;p&gt;Here's what that means, and it's uncomfortable: my zero false positives right now are entirely the work of the structural layer (which simply isn't raising candidates). The semantic layer — the part that's supposed to confirm "yes, this really is redundant" — has almost no separating power on real same-topic text, because outputs on one topic share so much vocabulary that they're all similar by default. The moment the structural layer surfaces one borderline candidate, the semantic layer rubber-stamps it.&lt;/p&gt;
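&lt;p&gt;The measurement itself is simple to state in code. A sketch, assuming cosine similarity over embedding vectors (the post only says "similarity", so cosine is an assumption here) and the frozen φ=0.514:&lt;/p&gt;

```python
import math

PHI = 0.514  # the frozen threshold quoted in the post


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def fraction_above(pairs, threshold=PHI):
    """Share of embedding pairs whose similarity exceeds the threshold.

    pairs: (vector, vector) tuples for outputs known to be NOT redundant.
    A semantic layer with real separating power should keep this near 0.
    """
    scores = [cosine(u, v) for u, v in pairs]
    return sum(score > threshold for score in scores) / len(scores)
```

&lt;p&gt;A result of 1.0 on genuinely distinct pairs is the situation described above: the threshold confirms everything it is shown.&lt;/p&gt;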

&lt;p&gt;My synthetic mock-exam hid this completely. Clean separation between "planted near-identical" and "unrelated clean" is exactly the artificial sharpness real data doesn't have.&lt;/p&gt;

&lt;p&gt;The honest boundary, updated: four detection paths work on real traces with zero false positives from the structural layer. The semantic layer's separating power on real data is not demonstrated — the opposite is. This is one topic and five traces, so it's a strong signal to act on, not a verdict.&lt;/p&gt;

&lt;p&gt;What I'm explicitly NOT doing: redesigning the semantic layer to fix this now. Tuning it against one topic and five traces is just overfitting with extra steps — the same trap that killed my first project. The real fix needs real traces across several topics and domains, which only come from people running actual systems.&lt;/p&gt;

&lt;p&gt;The whole thing — code, the synthetic GO, the real-probe log, and this E3 limitation written out in full — is public: github.com/JEONSEWON/Clew-by-Custos. If you run a multi-agent system and can share a trace, that's the one input that turns this from a suspicion into an answer.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>observability</category>
      <category>llm</category>
    </item>
    <item>
      <title>"How do you build a test your detector can't game?"</title>
      <dc:creator>JEONSEWON</dc:creator>
      <pubDate>Sun, 14 Jun 2026 08:53:08 +0000</pubDate>
      <link>https://dev.to/jeonsewon/how-do-you-build-a-test-your-detector-cant-game-1d6</link>
      <guid>https://dev.to/jeonsewon/how-do-you-build-a-test-your-detector-cant-game-1d6</guid>
      <description>&lt;p&gt;If you build a waste detector and then build your own test set, there's a trap waiting: you can pass without your detector actually working.&lt;br&gt;
Here's how. My "waste" examples have repeated steps; my "clean" examples don't. Looks reasonable — until you realize a dumb detector that just flags any repeat now scores perfectly, and the semantic layer (the hard part, the part that decides if a repeat is genuinely redundant) never has to do anything. You'd ship a "0.85 F1" that's really measuring "did the trace repeat," which is the exact mistake that killed my first project.&lt;/p&gt;

&lt;p&gt;The fix is paired design: every clean example has the same structure as a waste example: same nodes, same repeats, same handoffs. The only difference is whether the repeated work is semantically redundant or real progress. That single constraint forces the semantic layer to be the actual judge, because structure alone can no longer tell the two apart.&lt;/p&gt;
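&lt;p&gt;One way to enforce paired design is to check it mechanically. A sketch, assuming each trace is a list of steps with a node name and an output (hypothetical field names): the structure-only baseline should land at exactly chance on a properly paired set.&lt;/p&gt;

```python
def structure(trace):
    """Structural signature: the ordered node names, ignoring content."""
    return tuple(step["node"] for step in trace)


def flags_any_repeat(trace):
    """The 'dumb' baseline: flag a trace if any node appears twice."""
    nodes = structure(trace)
    return len(set(nodes)) != len(nodes)


def check_paired(pairs):
    """pairs: (waste_trace, clean_trace) tuples. Fails if structure leaks."""
    for waste, clean in pairs:
        assert structure(waste) == structure(clean), "pair differs in structure"


def baseline_accuracy(pairs):
    """Accuracy of the structure-only baseline over all paired examples."""
    correct = 0
    for waste, clean in pairs:
        correct += flags_any_repeat(waste) is True
        correct += flags_any_repeat(clean) is False
    return correct / (2 * len(pairs))
```

&lt;p&gt;If the baseline scores above 0.5 on the paired set, structure is still leaking the label and the semantic layer is not yet the judge.&lt;/p&gt;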

&lt;p&gt;Building the test you can't cheat is harder than building the detector. It's also the only reason the detector's score means anything.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvu4iil4ad9ffzicx8qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvu4iil4ad9ffzicx8qr.png" alt=" " width="800" height="605"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>testing</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>My AI-agent waste detector scored zero false positives. Then I ran it on a real trace.</title>
      <dc:creator>JEONSEWON</dc:creator>
      <pubDate>Sat, 13 Jun 2026 06:27:34 +0000</pubDate>
      <link>https://dev.to/jeonsewon/my-ai-agent-waste-detector-scored-zero-false-positives-then-i-ran-it-on-a-real-trace-96o</link>
      <guid>https://dev.to/jeonsewon/my-ai-agent-waste-detector-scored-zero-false-positives-then-i-ran-it-on-a-real-trace-96o</guid>
      <description>&lt;p&gt;My detector passed every synthetic test with zero false positives. Then I pointed it at one real trace and found a crack.&lt;br&gt;
This is the honest version of where I am. I'm building Clew — a tool that finds the redundant loops, re-queries, and handoffs that silently burn tokens when multiple AI agents work together. No crash, no error, just two agents quietly re-doing each other's work while the token bill climbs.&lt;br&gt;
I build in public, and I publish the negatives. So here's the whole arc, including the part that isn't working yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, I killed my own hypothesis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original idea wasn't waste detection at all. It was failure prediction: watch the behavior between agents and forecast multi-agent failures before they happen. The differentiator was a single metric built on two signals — structural cycles in the inter-agent message graph, and the decay of novelty in embeddings.&lt;/p&gt;

&lt;p&gt;Before I ran anything, I pre-registered the success bar: AUC ≥ 0.80. I numbered every change and kept the signal code physically separated from the labels so I couldn't leak my way to a good number. Then I ran it on MAST-Data, UC Berkeley's dataset of 1,600+ real multi-agent traces across 7 frameworks (&lt;a href="https://arxiv.org/abs/2503.13657" rel="noopener noreferrer"&gt;Cemri et al., arXiv:2503.13657&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: AUC ≈ 0.455. A coin flip.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It got worse. The signal correlated with trace length at r ≈ 0.86 — it was mostly measuring how long a trace was, not whether it failed. Correcting for that dropped AUC to 0.42 and reversed the direction: successful traces actually showed more decay (p ≈ 0.013).&lt;/p&gt;
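&lt;p&gt;For readers who want to run the same sanity check on their own signal, here is a sketch of the three quantities involved. The post does not say how the length correction was done; subtracting a linear fit on trace length is one common choice and is an assumption here.&lt;/p&gt;

```python
import math


def auc(scores, labels):
    """Probability that a failed trace (label 1) outscores a successful one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(xs, ys):
    """Pearson correlation, e.g. between the signal and trace length."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def residualize(signal, length):
    """Subtract the least-squares linear fit of signal on length."""
    mx, my = sum(length) / len(length), sum(signal) / len(signal)
    slope = sum((x - mx) * (y - my) for x, y in zip(length, signal)) / sum(
        (x - mx) ** 2 for x in length
    )
    return [y - (my + slope * (x - mx)) for x, y in zip(length, signal)]
```

&lt;p&gt;The check is: compute pearson(signal, length) first, and if it is high, compare auc on the raw signal with auc on the residualized one. A signal that only encodes length has little left after the subtraction.&lt;/p&gt;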

&lt;p&gt;The honest read: not disproven, but unvalidated. On this implementation, on this data: negative. So I shut it down. And I counted it as a win, because I got a fast, honest answer in weeks instead of building a dashboard on a metric that secretly measures trace length. That experiment became the DNA of everything since: design the experiment that's allowed to kill the idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pivot: from predicting failure to cutting waste&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The intuition behind v1, that you need structure and meaning, turned out to be right. The implementation was wrong. An external paper confirmed the shape of the fix: an unsupervised cycle-detection framework that runs structure first, then semantics (&lt;a href="https://arxiv.org/abs/2511.10650" rel="noopener noreferrer"&gt;George et al., IBM Research, arXiv:2511.10650&lt;/a&gt;). On their benchmark of 1,575 LangGraph trajectories, the cascade hit F1 0.72, versus 0.08 for structure alone and 0.28 for semantics alone.&lt;/p&gt;

&lt;p&gt;To be clear: that 0.72 is IBM's result on IBM's data, not mine. I keep that line bright in everything I publish. But it told me what I'd gotten wrong (I'd summed the signals instead of cascading them, and looked at global trends instead of local repeats), and it pointed at a sharper wedge: stop predicting failure, start detecting the redundant loops and handoffs that burn tokens. That's measurable. It speaks in dollars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building it so I couldn't fool myself&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before writing a line of detection logic, I built the thing that makes the validation trustworthy. Pre-registered GO/KILL criteria, frozen in git before I looked at any result. A leakage guard enforced not as a policy but as a failing test: the detection code physically cannot import or read the label files. Parameters chosen on a dev split only; the evaluation split touched exactly once, after freezing.&lt;/p&gt;
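&lt;p&gt;A leakage guard written as a test might look like the sketch below. It covers only the import half of the rule, and the module name labels is a hypothetical stand-in for wherever the ground truth lives.&lt;/p&gt;

```python
import ast

FORBIDDEN = {"labels"}  # hypothetical module name holding ground truth


def imported_modules(source):
    """Top-level names of every module imported by the given source text."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found


def leaks(source):
    """Forbidden modules imported by detection code; empty means clean."""
    return sorted(imported_modules(source).intersection(FORBIDDEN))
```

&lt;p&gt;Run over every file in the detection package, a non-empty result fails the build, which turns the policy into something the code cannot quietly violate.&lt;/p&gt;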

&lt;p&gt;Two moments from that build are worth telling, because they're the actual product.&lt;/p&gt;

&lt;p&gt;The fix that passed but was a lie. During calibration, a clean case kept tripping the detector: two lookups with the same schema but different values (think customer A vs customer B). My first "fix" diversified the data so the numbers turned green — except that didn't solve the limitation, it deleted the case that exposed it. Shipped as-is, the detector would have flagged legitimate work as waste. The real fix routed the decision from the semantic layer (which can't tell those apart) to the structural layer (which can): a re-query only counts if the input is identical. Then I put the hard case back in to prove the gate worked.&lt;/p&gt;
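&lt;p&gt;The identical-input gate can be sketched as follows. The call tuple shape and function names are assumptions for illustration; the rule itself (a re-query counts only if the input is identical) is the one described above.&lt;/p&gt;

```python
import json


def canonical(payload):
    """Stable text form of a tool input so key order does not matter."""
    return json.dumps(payload, sort_keys=True)


def requery_candidates(calls):
    """Find exact-input repeats.

    calls: (span_id, tool_name, input_dict) tuples in execution order.
    Returns (first_span_id, repeat_span_id) for identical inputs only, so a
    lookup for customer A followed by customer B is never a candidate.
    """
    seen = {}
    candidates = []
    for span_id, tool, payload in calls:
        key = (tool, canonical(payload))
        if key in seen:
            candidates.append((seen[key], span_id))
        else:
            seen[key] = span_id
    return candidates
```

&lt;p&gt;Same schema with different values produces different keys, so the hard case stays in the test set and passes for the right reason.&lt;/p&gt;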

&lt;p&gt;The audit that caught a blind spot before launch. A recall check right before freezing found that one of four waste patterns — regenerative handoffs — scored 0/10. Diagnosis: it's structurally identical to a normal handoff, so there's no candidate path for it. Rather than quietly drop it to protect the score, I explicitly de-scoped it, left it in the dataset, and let those 10 misses count against my aggregate F1.&lt;/p&gt;

&lt;p&gt;Carrying that penalty, the single-shot evaluation came back GO: F1 0.857, zero false positives, 100% recall on the three in-scope patterns. The out-of-sample numbers reproduced the dev numbers exactly, which is evidence against overfitting to my own dev split.&lt;/p&gt;
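&lt;p&gt;The arithmetic behind that F1 can be checked by hand. The sketch assumes 10 planted cases per pattern; the post states 10 only for the de-scoped pattern, so the counts for the other three are an assumption that happens to reproduce the reported figure.&lt;/p&gt;

```python
def f1(tp, fp, fn):
    """F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumption: 3 in-scope patterns x 10 cases, all caught (tp=30);
# the de-scoped pattern's 10 cases all missed (fn=10); no false positives.
print(round(f1(tp=30, fp=0, fn=10), 3))  # prints 0.857
```

&lt;p&gt;Precision is 1.0 and recall is 30/40 = 0.75, so the 10 deliberate misses are exactly what holds the aggregate below 1.0.&lt;/p&gt;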

&lt;p&gt;And every one of those numbers is synthetic, held-out, three patterns. That matters for what comes next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then reality showed up&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A GO on synthetic data isn't a product. So I pointed the detector at real LangGraph instrumentation for the first time. The structural machinery held up better than I expected — but it surfaced three things my synthetic tests had never touched:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Spans collapse weirdly. Real LLM calls instrument under a single model-class name, so the detector saw "the same node three times" and cried wolf. Fixable: fold the LLM sub-spans into their parent node while preserving tool spans.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Routers fake repeats. LangGraph's conditional-edge functions instrument as spans too, and a repeating router looks like a repeating node. Fixable, and principled: a span that burns zero tokens is, by definition, not token waste — so exclude it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
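&lt;p&gt;Both fixes amount to a normalization pass before detection. A sketch, assuming spans are dicts with id, parent_id, kind, and tokens (hypothetical field names, not the actual instrumentation schema):&lt;/p&gt;

```python
def normalize(spans):
    """Fold LLM sub-spans into their parent node and drop zero-token spans.

    Tool spans are left as their own entries; only spans of kind "llm" are
    folded. Input dicts are copied, not mutated.
    """
    by_id = {span["id"]: dict(span) for span in spans}
    for span in spans:
        if span["kind"] == "llm" and span["parent_id"] in by_id:
            # Credit the LLM call's tokens to the node that made it.
            by_id[span["parent_id"]]["tokens"] += span["tokens"]
            del by_id[span["id"]]
    # A span that burned zero tokens (e.g. a router) cannot be token waste.
    return [span for span in by_id.values() if span["tokens"] > 0]
```

&lt;p&gt;After this pass, three LLM calls under one node read as one node with their combined tokens, and a repeating router no longer looks like a repeating node.&lt;/p&gt;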

&lt;p&gt;Those two are confirmed engineering fixes. The third one I can't fix from my desk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The threshold might not transfer. My similarity threshold was calibrated on synthetic data, where "redundant" and "distinct" separate cleanly. On real output, same-domain results cluster in a middle band that sits right on top of my threshold. Stripping the JSON scaffolding accounts for some of it (~0.2 of the similarity), but not all. With a sample size of three, this is a crack to take seriously — not a verdict. And there's exactly one way to find out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's true, and what isn't&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I think the honesty boundary is the whole point, so here it is plainly.&lt;/p&gt;

&lt;p&gt;True today: a structure-then-semantics cascade catches three planted waste patterns in held-out synthetic traces with zero false positives, passing a pre-registered bar. On real instrumentation, the structural mechanism is confirmed (spans collapse correctly, genuine repeats still fire, router false-positives are removed). And there's a working tool: feed it a trace, get back a waste report in minutes.&lt;/p&gt;

&lt;p&gt;Not true yet: that it works on real production traces. That it saves real tokens — I have zero measured savings. That the threshold generalizes — unverified, and currently my biggest open question. Zero users.&lt;/p&gt;

&lt;p&gt;The tempting move is to nudge the threshold up by a hair so the borderline cases fall the right way. That's exactly the post-hoc overfit that killed v1. The threshold has to be re-derived honestly, from real distributions — which I don't have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ask&lt;/strong&gt;&lt;br&gt;
That's where you come in, and it's a genuine ask, not a pitch.&lt;/p&gt;

&lt;p&gt;If you run a multi-agent system — LangGraph, CrewAI, AutoGen, something custom — and your token bill has been creeping up: send me a trace. I'll run it through Clew and send back a free report — where the redundant work is, and what it's costing you in tokens. If your data can't leave your environment, I'll send you the tool to run locally and you just share the numbers.&lt;/p&gt;

&lt;p&gt;I'm not selling anything. The one question my synthetic tests genuinely cannot answer is whether this holds on real output distributions — and I'd rather find that out honestly than pretend it's already settled.&lt;/p&gt;

&lt;p&gt;The most valuable thing I've built so far isn't a clever detector. It's a way of working that doesn't let me lie to myself about whether the detector is any good. If real traces break it, you'll read about that here too.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>buildinpublic</category>
      <category>observability</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
