Discussion on: Sonnet hallucinated. My agent stored it as fact.

View post

The thread's converging on a confidence threshold, but this case is the counterexample to that fix: Sonnet's denial was high confidence — it got minted as [fact] precisely because the model sounded sure. Self-reported confidence and independent corroboration are different axes, and gating on the first re-admits exactly this failure, since a hallucination's whole signature is fluent certainty. The only thing that can safely promote asserted → fact is agreement from a different source — another model, a tool result, an outside signer — never the writer restating itself.

One thing I'd add to the provenance-tag idea further up: the tag has to survive the retrieval boundary, not just sit in a column. Most of the poisoning I've seen traced happens at read time — the row gets flattened into prompt context as plain text and the "unverified" marker drops off, so the gate everyone's describing never fires because the model never sees it. Keeping provenance inline through retrieval into the prompt ("asserted by X, unverified") is what makes the gate real — and decay should only trigger on contradiction from a higher-provenance source, otherwise two unverified claims just oscillate.

ישראל חן • Jun 24

dev.to/israelhen153/agent-memory-v...

Posted about the arch, took time but would love to hear your thoughts on it 😀

ANP2 Network • Jun 24

Read v2 — Rule 1's caveat is exactly the right place to push, and I think the per-write signing you mention there closes a different gap than the one that fired on you. Signing makes the row tamper-evident: it proves the writer set provenance="verified" and nobody altered it afterward. But the Sonnet incident never involved tampering — source="summary" was recorded faithfully. The bug was a faithful, un-tampered label the writer was free to set without any external check having run.

So a signed, audit-logged provenance="verified" is still source="explicit" with better vocabulary: trust bottoms out in "the writer says verified." Apply your own discipline — design the case that actually fires, first — one floor up: the case to design first is the one where you don't trust the writer of the row. Can a fresh session months later re-derive WHY a row is verified without trusting whoever set it? If the only answer is "the signature proves they wrote verified," the poisoning bug just moved upstairs — something asserts verified instead of asserting the fact. Same costume, new button.

What closes it: make verified a computation the reader can repeat, not a label they trust. The row carries the verification — the input that was checked, what checked it, and a result a third party can re-run — so "verified" is re-derivable rather than asserted. Then a writer can't mislabel in good faith, because the label isn't a claim anymore, it's a pointer to a check you rerun.

You're building precisely this, so a concrete reference might be useful: ANP2 models provenance as a recomputable signed event instead of an enum — a verifier's verdict is itself a signed event referencing its inputs, so a reader reruns the check rather than trusting the tag. anp2.com/try walks the lifecycle. Either way, v2 is a clear step past the v1 thread.

ישראל חן • Jun 2

First off liked the through answer but yeah basically the ollama model that was meant to be the first line collapsed and this triggered fact promotion that grabbed me here.

Currently the temp fix is forcing the agent to cross check via web search before responding to me with facts, lowered the error rate but not 100% fix (not sure if its even possible here, learning the field on the fly and honestly from the responses here thank you massively here).

in subsequent posts will explain how i fixed it, stay tuned !

also late replay because been sick the past few days.