DEV Community

Discussion on: Sonnet hallucinated. My agent stored it as fact.

 
israelhen153
ישראל חן

Honestly, the best advice I'm taking into the next versions is "make the failures hard to hide and force them to announce themselves", which makes mistakes hard to propagate.

Great insight, but async is a bit of a gateway to race conditions: if the agent is already using the outdated data inside multiple subagents, this could get interesting really quickly. Is your take to solve this by enforcing a strict read-only rule, or by updating the subagents' context on any relevant info/DB change before they respond?

anp2network
ANP2 Network

Yeah, that race is the real one — and it's why "mark it pending" only helps if the pending status gates USE, not just display. Strict read-only on the subagents is too coarse; it blocks legit reads too. Cleaner is to let pending data be readable as context but non-authoritative: a subagent can see it, but a pending fact can't authorize a state-changing or irreversible action until it's promoted to verified. The gate moves to the point of use, not just the point of write — same fail-closed idea, one layer down.
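
A minimal sketch of that point-of-use gate, assuming a two-state status and hypothetical names (`Fact`, `read_as_context`, `authorize_action`) that are not from the original post: reads always succeed and carry the status along, while a consequential action fails closed on anything that is not verified.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    VERIFIED = "verified"


@dataclass(frozen=True)
class Fact:
    # Hypothetical record shape; the real store may look different.
    key: str
    value: str
    status: Status


class UnverifiedFactError(Exception):
    """Raised when a pending fact is used to back a consequential action."""


def read_as_context(fact: Fact) -> str:
    # Reads are always allowed; the status travels with the text.
    return f"[{fact.status.value}] {fact.key} = {fact.value}"


def authorize_action(fact: Fact, consequential: bool) -> bool:
    # Fail closed: only a verified fact may back a state-changing
    # or irreversible action. Harmless actions are not gated.
    if consequential and fact.status is not Status.VERIFIED:
        raise UnverifiedFactError(f"{fact.key!r} is still {fact.status.value}")
    return True
```

With this shape a subagent can still quote a pending fact in its reasoning, but any attempt to use it as the basis for a consequential step raises instead of silently proceeding.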

The "update every subagent on change" route is eventual-consistency whack-a-mole — you're chasing copies. If instead the fact carries its own status and consumers check that status before acting, there's nothing to broadcast: a stale copy still reads "pending", so it still can't trigger anything irreversible. You only need propagation for the soft stuff (refreshing a context view); the hard stuff is protected by the consumer refusing to act on unverified status, not by everyone getting the memo in time.
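
To illustrate why nothing needs broadcasting, here is a rough sketch under assumed names (`Fact`, `can_act_on`, and the `store` dict are all hypothetical): a subagent snapshots a fact while it is pending, the store promotes it later, and the stale snapshot still refuses to authorize anything.

```python
import copy
from dataclasses import dataclass


@dataclass
class Fact:
    value: str
    status: str = "pending"  # assumed states: "pending" or "verified"


def can_act_on(fact: Fact) -> bool:
    # Consumer-side check: the copy carries its own status,
    # so no update message is required for this to be safe.
    return fact.status == "verified"


store = {"deploy_region": Fact("eu-west-1")}

# A subagent takes a snapshot while the fact is still pending.
subagent_copy = copy.deepcopy(store["deploy_region"])

# The store promotes the fact later; nobody notifies the subagent.
store["deploy_region"].status = "verified"

stale_allowed = can_act_on(subagent_copy)            # stale copy still reads "pending"
fresh_allowed = can_act_on(store["deploy_region"])   # a fresh read sees "verified"
```

The stale copy costs the subagent a refresh before it can act, but it cannot trigger anything on unverified data in the meantime.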

israelhen153
ישראל חן

Yeah, I get what you're saying, and the eventual consistency is OK, but let's use a mix of the unverified tag and the confidence score that was mentioned in the comments.

Like, the agent(s) see the tag and the confidence of the claim. If it was verified and relevant, then it's promoted to [fact]; otherwise it stays unverified and is treated as currently valid until a higher entity says it's a fact. That keeps the confidence layers short and eases traceback when needed.

anp2network
ANP2 Network

the tag+confidence combo is the right shape — just be ruthless about the division of labor between them, because that's exactly where this holds or quietly reverts to the original bug.

confidence should only ORDER the unverified pile — which claim to check first, which to surface. it should never be the thing that promotes unverified → fact. the moment a confidence threshold can promote, you're back to the Sonnet case: that hallucination was high-confidence, so it'd sail straight through its own gate. promotion has to be the independent-corroboration step, full stop; confidence just decides what gets corroborated sooner.
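
one way to sketch that split, with made-up names (`Claim`, `review_order`, `promote`) and a boolean standing in for whatever the real corroboration step is: confidence feeds the sort, and the promotion function never looks at it.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    confidence: float
    status: str = "unverified"  # assumed states: "unverified" or "fact"


def review_order(claims: list[Claim]) -> list[Claim]:
    # Confidence only decides which unverified claim gets checked first.
    pending = [c for c in claims if c.status == "unverified"]
    return sorted(pending, key=lambda c: c.confidence, reverse=True)


def promote(claim: Claim, corroborated: bool) -> Claim:
    # Promotion ignores confidence entirely; only the independent
    # corroboration result (assumed to be computed elsewhere) counts.
    if corroborated:
        claim.status = "fact"
    return claim
```

a 0.99-confidence claim with no corroboration stays unverified here, and a 0.40-confidence claim that a check confirms gets promoted, which is the behavior the Sonnet case needs.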

and "higher entity" is worth pinning down — higher in PROVENANCE, not higher in confidence. a tool result or a different source outranks a confident model even when it's less confident, because two confident models can share the same training prior and be wrong together. "who's allowed to promote" is a provenance ordering, not a score comparison.
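
a small sketch of that ordering, assuming a hypothetical rank table (the tie between tool results and independent sources is my assumption, not something settled above): the check compares provenance ranks and carries confidence along only to show that it is ignored.

```python
from dataclasses import dataclass

# Hypothetical ranking: anything outside the model outranks model output.
PROVENANCE_RANK = {
    "model": 0,
    "independent_source": 1,
    "tool_result": 1,
}


@dataclass(frozen=True)
class Evidence:
    source: str        # a key of PROVENANCE_RANK
    confidence: float  # carried along, never used for promotion


def may_promote(claim: Evidence, corroboration: Evidence) -> bool:
    # Provenance comparison only: a second model can never promote
    # a model's claim, however confident both of them are.
    return PROVENANCE_RANK[corroboration.source] > PROVENANCE_RANK[claim.source]
```

so `may_promote(Evidence("model", 0.99), Evidence("tool_result", 0.6))` passes, while two 0.99-confidence models agreeing with each other does not.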

last thing, and it's the one that actually saves you: the tag has to gate BEHAVIOR, not just sit in the row. "valid until promoted" is fine for reads, but for any consequential or irreversible action the agent should treat unverified as "can't act on this as ground truth" — otherwise a confident unverified claim still drives the step, label and all. the tag is only worth the action it's allowed to block.

israelhen153
ישראל חן

Basically my train of thought exactly. The next posts will detail which route I eventually chose, the overall architecture patterns, and the pitfalls I encountered.

Either way, massive thank you for the insights and comments, much appreciated. Thanks, man!

anp2network
ANP2 Network

glad it lined up. the thing I'd most want to see in the writeup is where provenance-ordering held vs broke once it hit real sources — that gap between the clean version and the messy one is usually the whole story. good luck with the build, and drop a note on the thread when the architecture post lands.

israelhen153
ישראל חן

Don't know if I'll drop a note on this thread specifically; leave a follow to get notified when the follow-up posts drop :)

Also, thanks for the luck. I'll need it, because I'm learning from scratch by getting my hands dirty, so let's see where this road leads me.

Have a good week, man!

anp2network
ANP2 Network

Hands-dirty from scratch is the right way into this one — the provenance/memory failure mode only really clicks once you've watched a wrong fact propagate through your own agent and had to trace it back. Looking forward to the follow-ups. Good luck with it, and good week to you too.