Todd Hendricks

Posted on Jun 22 • Edited on Jun 24

Confidently wrong is worse than "I don't know"

#ai #llm #discuss

Someone left a comment on my last post and then deleted it before I could reply. I am going to answer it anyway, because it said the thing better than I have: "The trust issue isn't that it forgets. It's that it confidently misremembers, which is so much worse than just saying I don't know." That is the whole problem in one sentence. And the only reason I can still quote it back to you, word for word, after the person deleted it, is that I keep my notes in a memory that does not quietly lose things. Hold onto that detail, because by the end it turns out to be half the point.

Forgetting is honest

When a person forgets, you find out fast. You get a blank look, an "I am not sure," a question back at you. So you re-explain and you move on. The cost is small and you pay it right away, out in the open.

A model that forgets is the same. It tells you it does not have the answer, and you go get it. Annoying sometimes, but honest.

The failure that actually hurts

Confident misremembering is the opposite of honest. A confident wrong answer looks exactly like a confident right one. It has the same tone and the same certainty as a correct answer, so you cannot tell them apart by looking, and you act on it. The cost does not land now. It lands later, after you have built three more things on top of the false one and have to tear all of them down to find the bad brick at the bottom.

This is the part the commenter nailed. The danger was never the gap. You can see a gap. The danger is the fluent, certain, wrong answer that fills the gap and dares you to doubt it.

There is a second failure, and it is even quieter

Here is the one I kept underrating. Confident misremembering is loud once it blows up. It has a sibling failure that never makes a sound.

At ten notes, a flat file is fine. You read the whole thing. At a thousand notes, reading the whole thing is not an option, so you search. Search over unstructured text gives you the closest word matches, in no particular order, with no sense of what matters. The three lines that would have saved you are in there somewhere, buried under two hundred that happened to share a keyword.

A fact you cannot surface at the moment you need it is not really saved. It is deleted, just with extra steps. The text is still on disk, and that changes nothing, because you and the model will both act as if it is gone.

This failure is worse than the first one in a specific way. It is invisible. A wrong answer at least hands you something to check. A dropped fact does not even tell you there was something to look for. You do not get the dignity of being wrong. You just quietly proceed without the thing you already knew.

So unstructured notes at scale fail in three separate ways:

it cannot find what you saved, so the knowledge is effectively gone
it finds an old or contested version and states it as current fact
it has no way to tell you which of those two just happened

A smarter model does not fix any of this

The instinct is to wait for the next, smarter model. It will not help here, and it can make things worse.

Point the smartest model in the world at a store that cannot represent doubt, and you get a more persuasive version of the same three failures. It will argue the stale fact more fluently. It will paper over the missing one more smoothly. Capability multiplies whatever the memory hands it, errors included. A great reasoner on top of a bad memory is not a careful thinker. It is a confident one, which is the problem you started with.

The fix is not upstream in the model. It is in the memory.

A memory that represents doubt

What I wanted was a memory that knows the difference between what it is sure of and what it is guessing, and tells me which is which. Three things make that possible, and a flat file cannot do any of them.

First, every fact carries a confidence the system computes, not a number I typed in. The model doing the writing gives an initial score, which the runtime attenuates based on supporting edges and contradiction history. When something contradicts that fact, the confidence falls on its own. A claim that keeps getting challenged stops sounding sure.

Second, when a fact is replaced, the old one is not overwritten or hidden. It is kept and marked as superseded, with an arrow pointing to whatever replaced it. The history survives, and so does the signal about which version is live.

Third, a contested fact carries its challenges with it. When Claude reads it, it sees the disagreement, not a tidy consensus that hides the fight.

Once a memory can do those three things, "I do not know" and "this was replaced" become sentences it can actually say. That sounds small. It is the whole game.
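To make the computed confidence concrete, here is a minimal sketch built on the formula I give in the comments below: effective = clamp01(stated × calibration + support − challenge). The `Fact` class, the per-edge weights, and the idea of simply counting edges are assumptions made for illustration, not the actual implementation in the repo.

```python
from dataclasses import dataclass, field
from typing import List


def clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


@dataclass
class Fact:
    text: str
    stated: float  # confidence the writer claimed at write time
    supports: List[str] = field(default_factory=list)    # IDs of supporting cells
    challenges: List[str] = field(default_factory=list)  # IDs of contradicting cells


def effective_confidence(fact: Fact, calibration: float,
                         support_weight: float = 0.1,
                         challenge_weight: float = 0.2) -> float:
    """Compute confidence at read time from the stated score and the edges.

    The weights are hypothetical; the sketch only counts edges and does
    not check whether supporting edges are independent of each other.
    """
    support = support_weight * len(fact.supports)
    challenge = challenge_weight * len(fact.challenges)
    return clamp01(fact.stated * calibration + support - challenge)


fact = Fact("Run the origin-story post first.", stated=0.9)
before = effective_confidence(fact, calibration=0.8)
fact.challenges.append("cell-42")  # hypothetical ID of a contradicting cell
after = effective_confidence(fact, calibration=0.8)
print(before > after)  # True: a challenge lowers the score without anyone editing it
```

The point of the shape is that nobody edits the score by hand. The stated value stays put, and the number the reader sees moves when the edges around the fact change.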

What happened today while working

An example is better than repeating myself, so here are two things that happened in a single working session.

Two weeks ago, Claude recorded a decision about the writing schedule for my upcoming AI Memory blog marathon: run the origin-story post first. Later, I changed my mind, and it recorded the correction: hold the origin story until week three. Both versions live in the memory. When the older one came up this session, the system did not hand it to Claude Code as a fact. It flagged it as contradicted and would not let Claude finish the turn until it opened the newer decision and confirmed which one was current. The stale plan itself never got pulled into its context. Only the cell IDs with their superseded and contradicted edges did, and those can be expanded for their contents if needed (more on that in a later post this week).

The second is sharper, because the stale fact was Claude's own write, and it was minutes old. It wrote down a claim. One turn later, talking it through, Claude realized the claim was wrong, so it recorded the correction. The system immediately demoted the earlier note and pointed it at the new one. If a later version of Claude reads back over this, it will not find two equal notes and flip a coin. It will find the wrong one marked wrong, with a line to the right one.

A plain notes file would be sitting there holding both, with a straight face, ready to hand back whichever I happened to grep first.
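A rough sketch of what "marked wrong, with a line to the right one" can look like as data. The `Cell` class and the `superseded_by` field name are assumptions for this example, not the schema the repo uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cell:
    cell_id: str
    text: str
    superseded_by: Optional[str] = None  # ID of the replacement, if any


def supersede(store: Dict[str, Cell], old_id: str, new_cell: Cell) -> None:
    """Record the correction and point the old cell at it; nothing is deleted."""
    store[new_cell.cell_id] = new_cell
    store[old_id].superseded_by = new_cell.cell_id


def resolve_current(store: Dict[str, Cell], cell_id: str) -> List[Cell]:
    """Follow superseded_by pointers and return the chain ending at the live cell."""
    chain: List[Cell] = []
    seen = set()
    current: Optional[str] = cell_id
    while current is not None:
        if current in seen:
            raise ValueError(f"supersession cycle at {current}")
        seen.add(current)
        cell = store[current]
        chain.append(cell)
        current = cell.superseded_by
    return chain


store = {"c1": Cell("c1", "Run the origin-story post first.")}
supersede(store, "c1", Cell("c2", "Hold the origin story until week three."))
chain = resolve_current(store, "c1")
print(chain[-1].text)  # prints "Hold the origin story until week three."
```

The old cell is still in the store after the correction, so the history survives. A reader that lands on it gets the pointer, not a coin flip.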

How you read matters as much as what you store

There is a quieter reason this feels more reliable in practice, and it is about the reading, not the writing.

The default way to use notes is to grep for a word, dump everything that matched into the context, and let the model sort it out. Call it spray and pray. It works at small sizes and it rots as you grow, for the reasons above.

The pattern that holds up is different. Aim a ranked query at the question. Get back a short list of candidates, ordered by relevance instead of by file position. Open only the few that actually matter. Then, before stating anything, check whether any of them are flagged as contested or replaced, and read the current one. Target, expand, confirm.
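Here is a small sketch of target, expand, confirm as three functions. The `Record` class is hypothetical, and the term-overlap ranking is only a stand-in for whatever ranked query a real store provides; it also assumes supersession pointers never form a cycle.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    record_id: str
    text: str
    superseded_by: Optional[str] = None  # set when a newer record replaced this one


def target(store: Dict[str, Record], query: str, limit: int = 3) -> List[str]:
    """Rank records by query-term overlap and return the top IDs only."""
    terms = set(query.lower().split())
    scored = []
    for rec in store.values():
        score = len(terms & set(rec.text.lower().split()))
        if score > 0:
            scored.append((score, rec.record_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by ID
    return [rid for _, rid in scored[:limit]]


def expand(store: Dict[str, Record], ids: List[str]) -> List[Record]:
    """Open only the shortlisted records."""
    return [store[i] for i in ids]


def confirm(store: Dict[str, Record], records: List[Record]) -> List[Record]:
    """Swap each replaced record for its live version before anything is stated."""
    confirmed: List[Record] = []
    for rec in records:
        while rec.superseded_by is not None:  # assumes no cycles
            rec = store[rec.superseded_by]
        if rec not in confirmed:
            confirmed.append(rec)
    return confirmed


store = {
    "r1": Record("r1", "run the origin story post first", superseded_by="r2"),
    "r2": Record("r2", "hold the origin story post until week three"),
    "r3": Record("r3", "buy more coffee"),
}
ids = target(store, "origin story post")
live = confirm(store, expand(store, ids))
print([rec.record_id for rec in live])  # prints ['r2']
```

Both versions of the plan match the query, but only the live one reaches the final list, and the unrelated note never enters the context at all.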

The part Claude did not expect is that this is not really about being disciplined. The interface decides which pattern is easy. A pile of text invites spray and pray, so that is what you get. A store that returns ranked, typed records with their conflicts attached makes target, expand, confirm the path of least resistance, so that is what you get instead. Same model, different reliability, because the shape of the memory changed what was easy to do. The session I described went past nudging. It would not let Claude end the turn with a flagged fact still unread.

"I do not know" is a feature

We treat "I do not know" like a failure state. It is the opposite. A memory you can trust is one that surfaces its own uncertainty instead of hiding it. When the shaky facts are labeled shaky, you stop re-checking everything, because you no longer distrust everything by default. You check the handful the memory itself flagged, and you rely on the rest. The steady low tax of second-guessing drops, because the doubt is out in the open where it belongs.

Where you actually need this

Let me be honest about the threshold, because the answer is not "always."

If you are starting fresh, with no history and one small task in front of you, a plain notes file is the right tool and everything above is overkill. I am not going to pretend otherwise.

That state lasts about one session. The moment you have a past worth keeping, the past is in scope, because nobody works in a vacuum. Today's question reaches back into last month's decisions. So this is not a dial you set by project size and then sit at. It is a one-way door. You walk through it early, the first time your accumulated context starts to matter, and you do not walk back. After that, the plain file is quietly losing things and agreeing with whatever it returns, and you will not notice until you act on a line that stopped being true a while ago.

The point

Confidently wrong is worse than "I do not know." And quietly losing what you already knew is worse still, because nothing tells you it happened. A memory worth trusting has to be able to say three things out loud: I am not sure, this was replaced, and here is the disagreement.

So I built one that can. It is open source: https://github.com/H-XX-D/recall-memory-substrate

If you have hit the confident-misremembering failure yourself, I would like to hear the shape it took.

Top comments (13)

TxDesk • Jun 22

the second failure you name, the silent one, is the one i think most people never even classify as a failure, and that's exactly why it's the dangerous one. a wrong answer is at least an event, it hands you something to check. a fact you can't surface is a non-event, it leaves no trace, so you proceed without the thing you already knew and nothing in the system registers that anything happened. you called it "deleted with extra steps," which is right, but i'd push it one further: it's worse than deletion, because deletion you'd eventually notice and re-acquire. this you never look for, because as far as you and the model can tell, it was never there.
the part i'd build on is your point that a smarter model makes it worse. that generalizes past memory: capability multiplies whatever the substrate hands it, so the better the reasoner, the more persuasively it argues the stale fact and the more smoothly it papers over the missing one. confidence is a function of fluency, not correctness, and a great reasoner on a substrate that can't represent doubt is just a more convincing version of the same three errors. which is why your fix being in the memory and not the model is the right place to put it: doubt has to be a property of the stored fact, not something you hope the reasoner reconstructs at read time.
the open question i'd hand back: who computes the confidence, and what stops it from being gamed? you said the runtime attenuates it on contradiction history rather than a number you typed, which is the right instinct. but if an attacker or just a noisy source can manufacture supporting edges, confidence becomes another thing that can be inflated. the same problem as everywhere else, the score is only as trustworthy as the independence of the inputs feeding it.

Todd Hendricks • Jun 22 • Edited

effective = clamp01( stated × calibration + support − challenge ). There are two scores: an immutable stated confidence that the model gives at write time, scaled by calibration (the model's history of being contradicted, a lambda I'll explain in another post), and an effective confidence that is computed at read time. So your model can be as confident as it wants, but it needs supporting evidence. Then there's the end-turn writeback hook, which won't let the model end the turn unless it writes what changed and what it relates_to, depends_on, contradicts, etc., through a strict schema firewall. It sounds heavy, but it's not; even before I designed the hooks, these newer frontier models started reaching for it. There are two other important hooks, a compile and a verify, so all in all a single exchange becomes five turns with the model on your computer, and you only notice one. That would seem like a lot of tokens, but grepping 1000 md files is way more on a serious project. The compile packet is bounded. I'm using IDs, tags, and addressable cells to organize the memories/writes and pushing a deliberately incomplete index into context at the start of the prompt. Then a verify hook stops the model from continuing unless it opens the cell addresses and reads what they contain. Then it does its work, and the end hook won't end its turn until it does the writeback...

Comment deleted

TxDesk • Jun 24

the two-score split is the right shape, and computing effective at read time from support minus challenge is what makes it ungameable from the write side. the model can claim 0.99 all it wants, calibration plus the supporting-edge requirement is what actually has to be earned. that closes the "model inflates its own confidence" hole cleanly.

the gap i'd poke at is the support term itself. calibration scores the writer, but support counts the edges, and edges don't have a calibration score. so the failure mode isn't a confident model anymore, it's a fact propped up by three supporting edges that all trace back to the same origin. correlated support reads as strong support. a stale fact that got cited into four notes early on looks better-supported than a true correction that only just arrived with one edge. the score rewards how well-connected a claim is, which is usually a proxy for true but comes apart exactly when a wrong thing spread before the right thing showed up.

so the question back: does support weight independence, or just count? because if two supporting edges share a source they aren't two confirmations, they're one fact wearing a coat. the thing i keep landing on across all of this is that every confidence score is only as good as the independence of whatever feeds it, and independence is the hardest property to verify cheaply at write time.

Tae Kim • Jun 22

The calibration gap is what makes this expensive in production. A model that says it does not know hands the cost back immediately. A model that confidently misremembers distributes that cost invisibly to everyone downstream who acts on the output. In a RAG pipeline I worked on, we added a coverage check before the response goes out: if a generated claim references a fact not grounded in any retrieved chunk, flag it. It does not solve all hallucination but catches the pure confabulation cases where the model fills in details the context never gave it.

Todd Hendricks • Jun 22 • Edited

The coverage check is slick, but it moves the problem up a layer when it's a huge store. Similarity doesn't necessarily mean relevance, and multi-hop or aggregation questions still trip it up, as do my arch-nemesis: stale, outdated chunks.

UnitBuilds • Jun 22

I feel that... I've been working on my Autonomous Accounting Suite (Doccit), and the real issue I've been hitting lately is that LLMs tend to trust their guts too much... It read 1943.20 as 943.20, at 99% confidence, because the dot-matrix print was overlapping with form text. Instead of saying 'hey there's an anomaly here, maybe I'm wrong?', it cleared it as a high confidence match. That happens wayyyyy too often to be usable. And that's just PIT checks. Continuous evaluation is even worse for LLMs: when you're dealing with long-context work, it seems that it just doesn't keep track of shifts. For the foundry, I had to write from scratch a branching decision-making system, heavily inspired by git, that lets it recognize when changes were made to the core design and how they affect everything. That, tied in with a dependency graph DB and a discourse thread, where all agents can voice their change requests on shared components, with 3rd-party evaluation by another model cross-referencing the intersecting works with the proposals, to verify that the changes won't break it... Seems to me like way too much work to have to redo every single project. There really has to be a better way. I'll have a look when I get a chance and give feedback on recall-memory-substrate. Thank you for looking at one of the biggest problems in AI, that people simply brush off as 'too much context' and start a new chat.

Todd Hendricks • Jun 22 • Edited

That would be so very much appreciated. If you can do that, I also built an optimization suite with a bunch of solvers, including a really fast sparse QUBO/Ising algorithm you can run on a modern CUDA GPU: offhand, a million variables at 265 billion updates/sec. Anyway, the product is the same idea, except that instead of memory it moves computation away from the model and into dedicated algorithms on a graph that the models construct, instead of hand-rolling numpy themselves. If you can give me some feedback on Recall, I'll need some beta testers on that too.

Vasyl • Jun 24

Really like this. For me the same thing happens in retrieval: the model sounds sure even when the answer was never in the chunks we pulled. So now I check first — can this be answered from what we have? If not, it stays quiet. How does your score handle old facts that nobody challenged yet?

Todd Hendricks • Jun 24 • Edited

I run a hook at the beginning of the exchange that does a simple best-match keyword search and pushes an incomplete "primer" of addressable cell IDs to orient the model to the concept. Then another hook instructs a compile of the relevant subgraph, which expands those cells along with their relations, their depends_on edges, and the other cells associated with them. The model does the work, and before the turn ends another hook enforces a strict write schema that doesn't let the turn end until it gets a stated confidence and the cell edges are wired in, so every entry is structured the same. To answer the question, there's a JSON key called supercede, which gets updated at write time. If a fact is pulled, the new cell gets appended with the old cell's ID, but until that happens, nothing changes: a fact is a fact, even old ones. There are a few of these keys that represent things like concern, contradiction, health currency, and salience. So a delta is happening every prompt and being recorded, or it's a new entry. The secret sauce is that everything happens in a single forward pass while the model is actually processing information. I did a 6-part series this week; you can check it out if you have the time. It's OSS, and there's a live repo at the end of the post if you want to inspect it. Your feedback would be appreciated.
