I passed all 5 of my real-AI tests. The most useful thing I found is that half my detector barely works.

#ai #buildinpublic #observability #llm

I finally pointed my multi-agent waste detector at real AI output: five scenarios, real Claude traces, no synthetic data. Four of the five came back exactly as expected on the first run. The fifth (a "looks like a repeat but isn't" case) threw a false positive.

The tempting move there is to nudge my similarity threshold until the false positive goes away. I didn't touch it. I diagnosed first — and the culprit turned out to be my test design, not the detector. I rebuilt the test properly; the second run passed 5/5, with the threshold (φ=0.514) untouched from start to finish.

But passing 5/5 isn't the story. This is:
I measured the similarity of output pairs that are genuinely not redundant — different content that should look distinct. On real, same-topic output, 100% of them scored above my "this is redundant" threshold. Not a near-miss band like my synthetic work predicted — everything, comfortably above the line.
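A minimal sketch of that measurement, for anyone who wants to run the same check on their own traces. The similarity function here is a plain bag-of-words cosine, used as a stand-in because this post doesn't spell out the detector's actual metric; `bow_cosine` and `fraction_above` are hypothetical names, and only the 0.514 threshold comes from the text above.

```python
import math
import re
from collections import Counter
from itertools import combinations

THRESHOLD = 0.514  # the threshold value quoted in this post


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts.

    Assumption: a stand-in metric, not the detector's real one.
    """
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def fraction_above(texts, similarity=bow_cosine, threshold=THRESHOLD):
    """Share of distinct pairs whose similarity is strictly above the threshold.

    `texts` should hold outputs already known to be non-redundant,
    so every pair counted here is a would-be false positive.
    """
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if similarity(a, b) > threshold)
    return hits / len(pairs)
```

Feed `fraction_above` a list of outputs that you have hand-labelled as distinct. A result near 1.0 is the situation described above: the threshold is not separating anything on that topic.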

Here's what that means, and it's uncomfortable: my zero false positives right now are entirely the work of the structural layer (which simply isn't raising candidates). The semantic layer — the part that's supposed to confirm "yes, this really is redundant" — has almost no separating power on real same-topic text, because outputs on one topic share so much vocabulary that they're all similar by default. The moment the structural layer surfaces one borderline candidate, the semantic layer rubber-stamps it.
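To make the dependency concrete, here is a rough sketch of a two-layer gate, assuming the layers combine as a simple AND (structural raises a candidate, semantic confirms it). The function names and the 0.8 similarity scores are hypothetical illustrations, not values from the real probe log.

```python
THRESHOLD = 0.514  # the threshold value quoted in this post


def flag_redundant(structural_candidate: bool, similarity: float,
                   threshold: float = THRESHOLD) -> bool:
    """Flag a pair only if the structural layer raised it AND the
    semantic score is above the threshold (assumed AND-combination)."""
    return structural_candidate and similarity > threshold


def count_false_positives(pairs, threshold: float = THRESHOLD) -> int:
    """Count flags among pairs known to be non-redundant.

    `pairs` is an iterable of (structural_candidate, similarity) tuples.
    """
    return sum(flag_redundant(c, s, threshold) for c, s in pairs)


# Hypothetical non-redundant pairs that all score above the threshold.
# With no structural candidates, nothing is flagged.
quiet = [(False, 0.8), (False, 0.8), (False, 0.8)]

# Same scores, but the structural layer surfaces one borderline pair:
# the semantic check passes it straight through.
one_candidate = [(True, 0.8), (False, 0.8), (False, 0.8)]
```

When every non-redundant pair already clears the threshold, `count_false_positives` just counts how many candidates the structural layer raised. The semantic check contributes nothing to the result.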

My synthetic mock-exam hid this completely. Clean separation between "planted near-identical" and "unrelated clean" is exactly the artificial sharpness real data doesn't have.

The honest boundary, updated: four detection paths work on real traces with zero false positives from the structural layer. The semantic layer's separating power on real data is not demonstrated — the opposite is. This is one topic and five traces, so it's a strong signal to act on, not a verdict.

What I'm explicitly NOT doing: redesigning the semantic layer to fix this now. Tuning it against one topic and five traces is just overfitting with extra steps — the same trap that killed my first project. The real fix needs real traces across several topics and domains, which only come from people running actual systems.

The whole thing — code, the synthetic GO, the real-probe log, and this E3 limitation written out in full — is public: github.com/JEONSEWON/Clew-by-Custos. If you run a multi-agent system and can share a trace, that's the one input that turns this from a suspicion into an answer.
