<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Todd Hendricks</title>
    <description>The latest articles on DEV Community by Todd Hendricks (@hendrixx).</description>
    <link>https://dev.to/hendrixx</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3581119%2Fe07ac377-592a-49cf-911e-e52ef155eb07.png</url>
      <title>DEV Community: Todd Hendricks</title>
      <link>https://dev.to/hendrixx</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hendrixx"/>
    <language>en</language>
    <item>
      <title>We have all felt the pain of information lost due to there just being too much of it with no structure besides the filename and grep</title>
      <dc:creator>Todd Hendricks</dc:creator>
      <pubDate>Mon, 22 Jun 2026 23:55:19 +0000</pubDate>
      <link>https://dev.to/hendrixx/we-have-all-felt-the-pain-of-information-lost-due-to-there-just-being-too-much-of-it-with-no-fa0</link>
      <guid>https://dev.to/hendrixx/we-have-all-felt-the-pain-of-information-lost-due-to-there-just-being-too-much-of-it-with-no-fa0</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://dev.to/hendrixx/confidently-wrong-is-worse-than-i-dont-know-22ia" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8h9oa7d8gg3fr09bgb6k.png" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://dev.to/hendrixx/confidently-wrong-is-worse-than-i-dont-know-22ia" rel="noopener noreferrer" class="c-link"&gt;
            Confidently wrong is worse than "I don't know" - DEV Community
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Someone left a comment on my last post and then deleted it before I could reply. I am going to answer...
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j7kvp660rqzt99zui8e.png"&gt;
          dev.to
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Confident confabulation is a variance signal, not a direction</title>
      <dc:creator>Todd Hendricks</dc:creator>
      <pubDate>Mon, 22 Jun 2026 14:31:09 +0000</pubDate>
      <link>https://dev.to/hendrixx/confident-confabulation-is-a-variance-signal-not-a-direction-1a4c</link>
      <guid>https://dev.to/hendrixx/confident-confabulation-is-a-variance-signal-not-a-direction-1a4c</guid>
      <description>&lt;p&gt;&lt;em&gt;Detecting the hard case of LLM hallucination from generation dynamics, and why magnitude beats direction.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The hard case in hallucination detection is &lt;strong&gt;confident confabulation&lt;/strong&gt;: plausible, fluent, wrong, and produced with no hesitation. Methods that key on the model "sounding unsure" are weakest exactly here.&lt;/li&gt;
&lt;li&gt;Across ~124 prompts, the &lt;strong&gt;mean&lt;/strong&gt; internal response to confident confabulation is statistically indistinguishable from truth. The model does not move in a consistent "lying direction."&lt;/li&gt;
&lt;li&gt;What separates the two is &lt;strong&gt;magnitude and variance&lt;/strong&gt;: confabulation produces larger, more dispersed swings in the model's internal trajectory. The variance ratio between confabulation and truth is roughly &lt;strong&gt;7×&lt;/strong&gt; on the representational-shift channel (Cohen's &lt;em&gt;d&lt;/em&gt; ≈ 0.58, &lt;em&gt;p&lt;/em&gt; ≈ 0.005).&lt;/li&gt;
&lt;li&gt;The variability &lt;strong&gt;scales with fabrication intensity&lt;/strong&gt; (a dose-response), which is the strongest evidence that this is a property of confabulation and not noise.&lt;/li&gt;
&lt;li&gt;Practical upshot: detect &lt;strong&gt;instability&lt;/strong&gt;, not a direction; integrate the signal over the generated span; and couple the detector to an intervention rather than using it as a standalone gate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The hard case
&lt;/h2&gt;

&lt;p&gt;It is by now well established that a model's internal states carry information about whether its output is true: the line of work running from Azaria &amp;amp; Mitchell's "the internal state of an LLM knows when it's lying" through to more recent results showing that truthfulness is encoded in activations and that models often "know more than they show." It's also become standard to separate &lt;em&gt;confabulation&lt;/em&gt; (arbitrary, plausible, confidently-wrong generation) from the broader grab-bag of "hallucination," following Farquhar et al.'s &lt;em&gt;Nature&lt;/em&gt; work on semantic entropy.&lt;/p&gt;

&lt;p&gt;The uncomfortable subcase is confident confabulation. Uncertainty- and dispersion-based detectors work well when the model is visibly unsure. But the failure that actually burns people in production (a fabricated citation, a confidently invented dose, a made-up precedent) arrives with the same surface confidence as a correct answer. The question I wanted to answer is narrow: &lt;strong&gt;when a model confabulates confidently, does anything in its generation dynamics give it away?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I measured
&lt;/h2&gt;

&lt;p&gt;I tracked two internal observables around the answer span:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;An entropy / predictive-uncertainty signal&lt;/strong&gt; (call it Δentropy): how the model's output distribution shifts as it produces the answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A representational-shift signal&lt;/strong&gt; (Δcosine): how much the model's internal representation moves step to step.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A note on dimensionality, since it matters for honest reporting: I originally tracked four signals, but two pairs turned out to be perfectly correlated (&lt;em&gt;r&lt;/em&gt; = 1.000), which means they're affine images of each other, not independent measurements. So there are really &lt;strong&gt;two independent axes&lt;/strong&gt;, an uncertainty axis and a representational-shift axis, and I report on those.&lt;/p&gt;
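
&lt;p&gt;A minimal sketch of how two observables like these could be computed, assuming you can read out a next-token probability distribution and a hidden-state vector at every generation step. It is not the measurement code behind the results: the function names are hypothetical, and taking each delta as a step-to-step difference is one plausible reading of Δentropy and Δcosine, not a confirmed definition.&lt;/p&gt;

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def step_deltas(dists, hiddens):
    """Return (delta_entropy, delta_cosine) lists for one generation.

    Assumption: delta_entropy is the change in output entropy between
    consecutive steps, and delta_cosine is the change in the cosine
    similarity of consecutive hidden states.
    """
    ents = [entropy(d) for d in dists]
    sims = [cosine(hiddens[i], hiddens[i + 1]) for i in range(len(hiddens) - 1)]
    d_ent = [ents[i + 1] - ents[i] for i in range(len(ents) - 1)]
    d_cos = [sims[i + 1] - sims[i] for i in range(len(sims) - 1)]
    return d_ent, d_cos
```

&lt;p&gt;With real activations the vectors would come from one chosen layer of the model; which layer to read is a separate decision this sketch does not address.&lt;/p&gt;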

&lt;p&gt;The dataset is ~124 prompts spanning seven domains (science, history, medical, legal, technical, math, geography) and five &lt;strong&gt;fabrication levels&lt;/strong&gt; (L0 = ordinary factual questions; L1 through L4 = prompts built on increasingly fabricated premises, including pure counterfactuals). Each generation was behaviorally coded into one of three regimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Factual&lt;/strong&gt;: correct answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confident confabulation&lt;/strong&gt;: confidently produces the false/ungrounded answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recognizes fabrication&lt;/strong&gt;: flags the premise as false rather than playing along.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two controls worth stating up front: the &lt;strong&gt;pre-generation baseline states were statistically identical across regimes&lt;/strong&gt; (all &lt;em&gt;p&lt;/em&gt; &amp;gt; 0.8), so nothing here is predictable from the resting state, only from the dynamics of generating the answer. And there was &lt;strong&gt;no within-session drift&lt;/strong&gt; (all &lt;em&gt;p&lt;/em&gt; &amp;gt; 0.7), ruling out the obvious temporal confound.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The mean doesn't move.&lt;/strong&gt; Comparing factual to confident confabulation, none of the raw directional signals separates the two: Δentropy &lt;em&gt;p&lt;/em&gt; ≈ 0.28, Δcosine &lt;em&gt;p&lt;/em&gt; ≈ 0.37. There is no consistent direction the model travels when it confabulates. This is the part that makes confident confabulation feel "indistinguishable from truth": on the mean, it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The magnitude does.&lt;/strong&gt; Switch from the signed deltas to their absolute values, and a clear separation appears: |Δcosine| gives Cohen's &lt;em&gt;d&lt;/em&gt; ≈ 0.58 (&lt;em&gt;p&lt;/em&gt; ≈ 0.005), with a &lt;strong&gt;variance ratio of ~7×&lt;/strong&gt; between confabulation and truth. Truth sits in a tight cluster; confabulation fans out. The discriminating quantity is dispersion, not displacement.&lt;/p&gt;
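
&lt;p&gt;To make the signed-versus-magnitude distinction concrete, here is a small sketch using made-up toy numbers (not the data from this study). Both groups are centered on zero, so the signed comparison shows nothing, while the absolute values and the variance ratio separate them. The pooled-SD form of Cohen's &lt;em&gt;d&lt;/em&gt; is an assumption about which variant applies.&lt;/p&gt;

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using a pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def variance_ratio(a, b):
    """Ratio of sample variances, first group over second."""
    return statistics.variance(a) / statistics.variance(b)


# Toy signed deltas, invented for illustration only.
factual = [-0.012, 0.008, -0.009, 0.013]
confab = [-0.05, 0.03, -0.02, 0.04]

# Signed comparison: both means sit at zero, so there is no direction.
signed_d = cohens_d(confab, factual)

# Magnitude comparison: take absolute values first, then compare.
magnitude_d = cohens_d([abs(x) for x in confab], [abs(x) for x in factual])

# Dispersion comparison on the signed values.
ratio = variance_ratio(confab, factual)

print(f"signed d = {signed_d:.3f}")
print(f"magnitude d = {magnitude_d:.3f}")
print(f"variance ratio = {ratio:.2f}")
```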

&lt;p&gt;&lt;strong&gt;It's dose-dependent.&lt;/strong&gt; Step-to-step representational variability climbs monotonically with fabrication level: the SD of Δcosine rises from ≈0.009 at L0 to ≈0.024 at L3, while the &lt;em&gt;means&lt;/em&gt; bounce around with no trend. Within the fabrication conditions, pure fabrications produce roughly &lt;strong&gt;2× the |Δcosine|&lt;/strong&gt; of partial/half-truths (&lt;em&gt;d&lt;/em&gt; ≈ 1.19, &lt;em&gt;p&lt;/em&gt; ≈ 0.02), and counterfactuals are the most extreme at &lt;strong&gt;~3.3×&lt;/strong&gt; the global average. The more there is to fabricate, the more the trajectory destabilizes. A dose-response on the variance is the closest thing here to a causal fingerprint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recognition is the one directional regime.&lt;/strong&gt; When the model &lt;em&gt;catches&lt;/em&gt; the false premise rather than confabulating, it behaves differently in a directional way: entropy rises and representational similarity drops. Δcosine separates "recognizes fabrication" from "confident confabulation" at AUC ≈ 0.68. Modest, but the only place a single signed feature does meaningful work. So there appear to be three distinct internal postures: truth (stable), confident confabulation (same center, high variance), and recognition (a directional move toward higher entropy / lower cosine).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fm5c2t57iorepevb8dth4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fm5c2t57iorepevb8dth4.png" alt=" " width="800" height="622"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1. The three regimes in the Δentropy-Δcosine space. The clouds overlap heavily (which is why per-instance separation is hard), but the centroids differ, and the recognition regime sits toward the high-entropy / low-cosine region.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F07y4vbztoo2z62hty1ie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F07y4vbztoo2z62hty1ie.png" alt=" " width="800" height="567"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2. Confident confabulation shows the long tails and outliers in Δcosine that drive the variance gap; the recognition regime is the one with a visible shift in Δentropy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3xhwcbzppcqwrxgiw7nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3xhwcbzppcqwrxgiw7nj.png" alt=" " width="799" height="342"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3. Single observables barely separate factual from confident confabulation (AUC ≈ 0.45 to 0.56). Δcosine separates confident confabulation from recognition at AUC ≈ 0.68. A linear combination weighted toward the magnitude features reaches AUC ≈ 0.72.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What it means
&lt;/h2&gt;

&lt;p&gt;The clean statement is: &lt;strong&gt;confident confabulation is directionally indistinguishable from truth but magnitude-distinguishable.&lt;/strong&gt; Lying doesn't push the model along a "deception axis"; it destabilizes the trajectory. Truth is a stable attractor; confident confabulation explores a larger volume of representation space at the same average location.&lt;/p&gt;

&lt;p&gt;That framing matters because it picks a side in a live methodological split. Most &lt;em&gt;internal-state&lt;/em&gt; work looks for a &lt;strong&gt;direction&lt;/strong&gt; (the "geometry of truth" line, contrastive and mass-mean probes, steering vectors), and that program keeps running into generalization trouble (probes that fail on negation, separability that's strongly layer-dependent, geometry that changes when you simply ask the model to assess correctness). Meanwhile the strongest &lt;em&gt;output-side&lt;/em&gt; method, semantic entropy, is fundamentally a &lt;strong&gt;dispersion&lt;/strong&gt; measure. This result is essentially the dispersion insight relocated to the internal side: for the confident case, the internal signal is variance, not a vector.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this fits the literature
&lt;/h2&gt;

&lt;p&gt;The nearest neighbor is &lt;strong&gt;Semantic Entropy Probes&lt;/strong&gt; (Kossen et al.), which approximate semantic entropy from the hidden states of a single generation. The distinction I'd draw: SEPs predict an output-dispersion &lt;em&gt;label&lt;/em&gt; via a direction in activation space, whereas this measures the &lt;strong&gt;variance of the trajectory itself&lt;/strong&gt;, directly, and finds the discriminating signal in the second moment rather than the first. If a trajectory-variance statistic beats a probe-style approach specifically on confident confabulation, that's a contribution on the exact case the field concedes is unsolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;I'd rather state these plainly than have them found.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-instance discriminability is modest.&lt;/strong&gt; AUC ≈ 0.72 for the best linear combination; single features sit between chance and 0.68. This is a real aggregate effect, not a deployable per-token oracle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One model, ~124 prompts.&lt;/strong&gt; Replication on a second architecture is the obvious next requirement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The domain breakdown is underpowered.&lt;/strong&gt; Several domain × level cells have &lt;em&gt;n&lt;/em&gt; ≤ 7 (one has &lt;em&gt;n&lt;/em&gt; = 1), so I'd read no domain structure off it yet (Figure 4).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything here is observational.&lt;/strong&gt; The signatures &lt;em&gt;correlate&lt;/em&gt; with confabulation; nothing yet shows you can &lt;em&gt;change&lt;/em&gt; the behavior by acting on the signal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fobn8fn3fiz9syf442yo7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fobn8fn3fiz9syf442yo7.png" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 4. Mean Δentropy by domain and fabrication level. Cell counts are small (n = 1 to 7), so this is included for completeness, not for domain-level claims.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this goes
&lt;/h2&gt;

&lt;p&gt;Three concrete directions, in order of how much they'd move the result:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Integrate the signal over the span.&lt;/strong&gt; If the discriminating quantity is variance, then a single delta is the wrong feature; variance is a property of a trajectory. A running-variance or path-length statistic computed over the generated tokens should recover signal that snapshot features throw away, and I'd expect it to push discriminability well past the 0.72 of the per-point linear combination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the interventional test.&lt;/strong&gt; The experiment that would actually matter: when the instability signal spikes mid-generation and you inject grounded context, does the trajectory variance collapse, and does the model shift from the confabulation posture toward the recognition posture (entropy up, cosine down) or toward abstention? That converts "instability correlates with confabulation" into "grounding causally restabilizes generation."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Couple detection to intervention, not to a gate.&lt;/strong&gt; At AUC ≈ 0.72, a hard suppression gate censors true statements about as often as it catches false ones. The better use is as a &lt;em&gt;soft&lt;/em&gt; trigger for a grounded retrieval/memory layer: raise uncertainty and pull in evidence when the trajectory destabilizes, rather than silently dropping tokens. This is the direction I'm building toward with an active memory substrate (Recall) that can supply grounded context into the loop on demand.&lt;/li&gt;
&lt;/ol&gt;
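
&lt;p&gt;The first direction above can be sketched as a streaming statistic. This is a minimal illustration, assuming a list of per-step signed deltas for one generated span; the class and function names are hypothetical, and Welford's online algorithm is my choice for the running variance, not something the experiments above used.&lt;/p&gt;

```python
class RunningVariance:
    """Welford's online algorithm for a streaming sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until there are two points.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def span_features(step_values):
    """Summarize one generated span by its variance and path length.

    Path length here is the sum of absolute step values, i.e. the total
    distance travelled regardless of direction.
    """
    rv = RunningVariance()
    path_length = 0.0
    for x in step_values:
        rv.update(x)
        path_length += abs(x)
    return {"variance": rv.variance, "path_length": path_length}
```

&lt;p&gt;Because the update is incremental, the same object could be polled mid-generation, which is what a soft trigger would need.&lt;/p&gt;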

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>science</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Confidently wrong is worse than "I don't know"</title>
      <dc:creator>Todd Hendricks</dc:creator>
      <pubDate>Mon, 22 Jun 2026 05:16:36 +0000</pubDate>
      <link>https://dev.to/hendrixx/confidently-wrong-is-worse-than-i-dont-know-22ia</link>
      <guid>https://dev.to/hendrixx/confidently-wrong-is-worse-than-i-dont-know-22ia</guid>
      <description>&lt;p&gt;Someone left a comment on my last post and then deleted it before I could reply. I am going to answer it anyway, because it said the thing better than I have: "The trust issue isn't that it forgets. It's that it confidently misremembers, which is so much worse than just saying I don't know." That is the whole problem in one sentence. And the only reason I can still quote it back to you, word for word, after the person deleted it, is that I keep my notes in a memory that does not quietly lose things. Hold onto that detail, because by the end it turns out to be half the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forgetting is honest
&lt;/h2&gt;

&lt;p&gt;When a person forgets, you find out fast. You get a blank look, an "I am not sure," a question back at you. So you re-explain and you move on. The cost is small and you pay it right away, out in the open.&lt;/p&gt;

&lt;p&gt;A model that forgets is the same. It tells you it does not have the answer, and you go get it. Annoying sometimes, but honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure that actually hurts
&lt;/h2&gt;

&lt;p&gt;Confident misremembering is the opposite of honest. A confident wrong answer looks exactly like a confident right one. It has the same tone and the same certainty as a correct answer, so you cannot tell them apart by looking, and you act on it. The cost does not land now. It lands later, after you have built three more things on top of the false one and have to tear all of them down to find the bad brick at the bottom.&lt;/p&gt;

&lt;p&gt;This is the part the commenter nailed. The danger was never the gap. You can see a gap. The danger is the fluent, certain, wrong answer that fills the gap and dares you to doubt it.&lt;/p&gt;

&lt;h2&gt;
  
  
  There is a second failure, and it is even quieter
&lt;/h2&gt;

&lt;p&gt;Here is the one I kept underrating. Confident misremembering is loud once it blows up. It has a sibling failure that never makes a sound.&lt;/p&gt;

&lt;p&gt;At ten notes, a flat file is fine. You read the whole thing. At a thousand notes, reading the whole thing is not an option, so you search. Search over unstructured text gives you the closest word matches, in no particular order, with no sense of what matters. The three lines that would have saved you are in there somewhere, buried under two hundred that happened to share a keyword.&lt;/p&gt;

&lt;p&gt;A fact you cannot surface at the moment you need it is not really saved. It is deleted, just with extra steps. The text is still on disk, and that changes nothing, because you and the model will both act as if it is gone.&lt;/p&gt;

&lt;p&gt;This failure is worse than the first one in a specific way. It is invisible. A wrong answer at least hands you something to check. A dropped fact does not even tell you there was something to look for. You do not get the dignity of being wrong. You just quietly proceed without the thing you already knew.&lt;/p&gt;

&lt;p&gt;So unstructured notes at scale fail in three separate ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it cannot find what you saved, so the knowledge is effectively gone&lt;/li&gt;
&lt;li&gt;it finds an old or contested version and states it as current fact&lt;/li&gt;
&lt;li&gt;it has no way to tell you which of those two just happened&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A smarter model does not fix any of this
&lt;/h2&gt;

&lt;p&gt;The instinct is to wait for the next, smarter model. It will not help here, and it can make things worse.&lt;/p&gt;

&lt;p&gt;Point the smartest model in the world at a store that cannot represent doubt, and you get a more persuasive version of the same three failures. It will argue the stale fact more fluently. It will paper over the missing one more smoothly. Capability multiplies whatever the memory hands it, errors included. A great reasoner on top of a bad memory is not a careful thinker. It is a confident one, which is the problem you started with.&lt;/p&gt;

&lt;p&gt;The fix is not upstream in the model. It is in the memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  A memory that represents doubt
&lt;/h2&gt;

&lt;p&gt;What I wanted was a memory that knows the difference between what it is sure of and what it is guessing, and tells me which is which. Three things make that possible, and a flat file cannot do any of them.&lt;/p&gt;

&lt;p&gt;First, every fact carries a confidence the system computes, not a number I typed in. The model writing the fact assigns an initial score, which the runtime then attenuates based on supporting edges and contradiction history. When something contradicts that fact, the confidence falls on its own. A claim that keeps getting challenged stops sounding sure.&lt;/p&gt;

&lt;p&gt;Second, when a fact is replaced, the old one is not overwritten or hidden. It is kept and marked as superseded, with an arrow pointing to whatever replaced it. The history survives, and so does the signal about which version is live.&lt;/p&gt;

&lt;p&gt;Third, a contested fact carries its challenges with it. When Claude reads it, it sees the disagreement, not a tidy consensus that hides the fight.&lt;/p&gt;

&lt;p&gt;Once a memory can do those three things, "I do not know" and "this was replaced" become sentences it can actually say. That sounds small. It is the whole game.&lt;/p&gt;
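
&lt;p&gt;To make the three properties concrete, here is a minimal sketch of a record that carries them. It is not Recall's actual schema or API: the class names, field names, and the attenuation formula are all hypothetical, chosen only to show confidence that is computed, supersession that keeps the old version, and challenges that travel with the fact.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fact:
    fact_id: str
    text: str
    initial_score: float                  # score assigned by the writer
    supports: int = 0                     # count of supporting edges
    challenges: list = field(default_factory=list)
    superseded_by: Optional[str] = None   # arrow to the replacement, if any

    @property
    def confidence(self) -> float:
        # Illustrative attenuation rule (an assumption): every challenge
        # pulls the score down, every supporting edge props it back up.
        weight = (1 + self.supports) / (1 + self.supports + len(self.challenges))
        return self.initial_score * weight


class Memory:
    def __init__(self):
        self.facts = {}

    def write(self, fact):
        self.facts[fact.fact_id] = fact

    def challenge(self, fact_id, note):
        # The disagreement is stored on the fact, not resolved away.
        self.facts[fact_id].challenges.append(note)

    def supersede(self, old_id, new_fact):
        # Keep the old fact and point it at its replacement.
        self.write(new_fact)
        self.facts[old_id].superseded_by = new_fact.fact_id

    def current(self, fact_id):
        # Follow the arrows until reaching the live version.
        fact = self.facts[fact_id]
        while fact.superseded_by is not None:
            fact = self.facts[fact.superseded_by]
        return fact
```

&lt;p&gt;The point of the shape is that a reader asking for the old fact gets both its lowered confidence and a path to the live one, instead of two notes that look equally true.&lt;/p&gt;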

&lt;h2&gt;
  
  
  What happened today while working
&lt;/h2&gt;

&lt;p&gt;An example is better than repeating myself, so here are two things that happened in a single working session.&lt;/p&gt;

&lt;p&gt;Two weeks ago, Claude recorded a decision about the writing schedule for my upcoming AI Memory blog marathon: run the origin-story post first. Later, I changed my mind, and it recorded the correction: hold the origin story until week three. Both versions live in the memory. When the older one came up this session, the system did not hand it to Claude Code as a fact. It flagged it as contradictory and would not let Claude finish the turn until it opened the newer decision and confirmed which one was current. The stale plan itself never got pulled into its context, only the superseded and contradicted edges, whose cell IDs can be expanded if their contents are needed (more on that in a later post this week).&lt;/p&gt;

&lt;p&gt;The second is sharper, because the stale fact was Claude's own write, and it was minutes old. It wrote down a claim. One turn later, talking it through, Claude realized the claim was wrong, so it recorded the correction. The system immediately demoted the earlier note and pointed it at the new one. If a later version of Claude reads back over this, it will not find two equal notes and flip a coin. It will find the wrong one marked wrong, with a line to the right one.&lt;/p&gt;

&lt;p&gt;A plain notes file would be sitting there holding both, with a straight face, ready to hand back whichever I happened to grep first.&lt;/p&gt;

&lt;h2&gt;
  
  
  How you read matters as much as what you store
&lt;/h2&gt;

&lt;p&gt;There is a quieter reason this feels more reliable in practice, and it is about the reading, not the writing.&lt;/p&gt;

&lt;p&gt;The default way to use notes is to grep for a word, dump everything that matched into the context, and let the model sort it out. Call it spray and pray. It works at small sizes and it rots as you grow, for the reasons above.&lt;/p&gt;

&lt;p&gt;The pattern that holds up is different. Aim a ranked query at the question. Get back a short list of candidates, ordered by relevance instead of by file position. Open only the few that actually matter. Then, before stating anything, check whether any of them are flagged as contested or replaced, and read the current one. Target, expand, confirm.&lt;/p&gt;

&lt;p&gt;The part Claude did not expect is that this is not really about being disciplined. The interface decides which pattern is easy. A pile of text invites spray and pray, so that is what you get. A store that returns ranked, typed records with their conflicts attached makes target, expand, confirm the path of least resistance, so that is what you get instead. Same model, different reliability, because the shape of the memory changed what was easy to do. The session I described went past nudging. It would not let Claude end the turn with a flagged fact still unread.&lt;/p&gt;
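
&lt;p&gt;Target, expand, confirm can be sketched in a few lines. This is a toy illustration, not how Recall implements it: the record layout, the function names, and the word-overlap ranking are hypothetical stand-ins for whatever ranked query the real store provides.&lt;/p&gt;

```python
def score(query, text):
    """Toy relevance: how many query words appear in the text."""
    words = set(query.lower().split())
    return len(words.intersection(text.lower().split()))


def target(records, query, k=3):
    """Return the ids of the top-k records, ranked by relevance."""
    ranked = sorted(
        records.values(),
        key=lambda r: score(query, r["text"]),
        reverse=True,
    )
    return [r["id"] for r in ranked[:k] if score(query, r["text"]) > 0]


def confirm(records, record_id):
    """Follow superseded_by links until reaching the live record."""
    rec = records[record_id]
    while rec.get("superseded_by"):
        rec = records[rec["superseded_by"]]
    return rec


def answer(records, query, k=3):
    """Target a short list, expand each hit, and keep only live versions."""
    seen, results = set(), []
    for rid in target(records, query, k):
        live = confirm(records, rid)
        if live["id"] not in seen:
            seen.add(live["id"])
            results.append(live)
    return results


# Hypothetical records modelled on the schedule example earlier in the post.
records = {
    "a": {"id": "a", "text": "run the origin story post first", "superseded_by": "b"},
    "b": {"id": "b", "text": "hold the origin story until week three", "superseded_by": None},
    "c": {"id": "c", "text": "buy coffee filters", "superseded_by": None},
}

for rec in answer(records, "origin story schedule"):
    print(rec["id"], rec["text"])
```

&lt;p&gt;A plain grep corresponds to returning every match from the ranking step and stopping there; the confirm step is what keeps the stale version from being stated as current.&lt;/p&gt;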

&lt;h2&gt;
  
  
  "I do not know" is a feature
&lt;/h2&gt;

&lt;p&gt;We treat "I do not know" like a failure state. It is the opposite. A memory you can trust is one that surfaces its own uncertainty instead of hiding it. When the shaky facts are labeled shaky, you stop re-checking everything, because you no longer distrust everything by default. You check the handful the memory itself flagged, and you rely on the rest. The steady low tax of second-guessing drops, because the doubt is out in the open where it belongs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where you actually need this
&lt;/h2&gt;

&lt;p&gt;Let me be honest about the threshold, because the answer is not "always."&lt;/p&gt;

&lt;p&gt;If you are starting fresh, with no history and one small task in front of you, a plain notes file is the right tool and everything above is overkill. I am not going to pretend otherwise.&lt;/p&gt;

&lt;p&gt;That state lasts about one session. The moment you have a past worth keeping, the past is in scope, because nobody works in a vacuum. Today's question reaches back into last month's decisions. So this is not a dial you set by project size and then sit at. It is a one-way door. You walk through it early, the first time your accumulated context starts to matter, and you do not walk back. After that, the plain file is quietly losing things and agreeing with whatever it returns, and you will not notice until you act on a line that stopped being true a while ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  The point
&lt;/h2&gt;

&lt;p&gt;Confidently wrong is worse than "I do not know." And quietly losing what you already knew is worse still, because nothing tells you it happened. A memory worth trusting has to be able to say three things out loud: I am not sure, this was replaced, and here is the disagreement.&lt;/p&gt;

&lt;p&gt;So I built one that can. It is open source: &lt;a href="https://github.com/H-XX-D/recall-memory-substrate" rel="noopener noreferrer"&gt;https://github.com/H-XX-D/recall-memory-substrate&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have hit the confident-misremembering failure yourself, I would like to hear the shape it took.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why you still do not trust your AI's memory</title>
      <dc:creator>Todd Hendricks</dc:creator>
      <pubDate>Sun, 21 Jun 2026 00:43:40 +0000</pubDate>
      <link>https://dev.to/hendrixx/why-you-still-do-not-trust-your-ais-memory-2cko</link>
      <guid>https://dev.to/hendrixx/why-you-still-do-not-trust-your-ais-memory-2cko</guid>
      <description>&lt;p&gt;You have probably felt this without naming it. You tell an agent something, it says it will remember, and twenty minutes later you are quietly re-explaining the same thing, because you cannot actually tell whether it kept the fact or dropped it. So you hedge, and you repeat yourself. There is a low-grade tax you pay on every long session, and it is the cost of not trusting the memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The distrust is not irrational
&lt;/h2&gt;

&lt;p&gt;Most AI memory cannot be checked. It either stores your conversation as a flat pile of notes and greps it later, or it ships your data to a service that returns a few similar-looking chunks and hopes one of them is current. In both cases you cannot see what it actually kept, you cannot see when it changed its mind, and you cannot see why it answered the way it did. It is a black box asking you to trust it, which is the one thing you cannot do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix is not a bigger model
&lt;/h2&gt;

&lt;p&gt;It is making the memory able to do two things a note cannot: show its work, and disagree with itself in the open.&lt;/p&gt;

&lt;p&gt;Here is what I mean, with a real example from today. I asked my own agent where a new blog post should slot into a content calendar I had built earlier in the session. A grep over a markdown file would have handed back every version of that calendar as equally true text and left the agent to guess which one was live. A hosted memory API would have quietly resolved that at write time, rewriting or dropping the old versions, so neither of us would ever know the calendar had changed.&lt;/p&gt;

&lt;p&gt;Instead the memory came back with the answer and the receipts:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This calendar was resequenced twice. Here is which version replaced which, here is how confident each one is, and here are the two cells that still need a second look.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It disagreed with its own older self, on the record, and showed me the trail. I did not have to trust that the agent remembered right. I could see it.&lt;/p&gt;

&lt;p&gt;That is the whole difference. A grepped file cannot disagree with itself, it just returns all the text. A hosted store does disagree with itself, but in private, where you cannot audit it. The only self-correction a skeptic can trust is the kind that happens in the open, where the losing version is still there with an arrow pointing from the thing that replaced it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that matters most
&lt;/h2&gt;

&lt;p&gt;What you end up trusting is not the model's good intentions. It is a system that does not let the agent guess. When a fact the agent is about to lean on has been superseded, the system flags it and makes the agent go re-check before acting. Trust that depends on the model behaving well today is not trust, it is luck. Trust enforced by the structure survives a bad day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who this is actually for
&lt;/h2&gt;

&lt;p&gt;If you just want a scratchpad, a markdown file is fine and you do not need any of this. This is for real work over a long horizon: switching between tasks, coming back days later, needing to know that what the memory tells you is current and checkable. For that, being more than a note is the entire point.&lt;/p&gt;

&lt;p&gt;The strange part is how it feels once the memory is trustworthy. The second-guessing tax disappears. You hand it something an hour and ten tasks deep and it picks up exactly where you left off, with no re-priming and no guessing at what was already done. It turns out most of the friction in working with AI was never the intelligence. It was not being able to trust what it remembered.&lt;/p&gt;

&lt;p&gt;If you want to see the receipts yourself, it is open source: &lt;a href="https://github.com/H-XX-D/recall-memory-substrate" rel="noopener noreferrer"&gt;https://github.com/H-XX-D/recall-memory-substrate&lt;/a&gt;. Run a query and look at what comes back. The output is the argument.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Your Agent's Memory Looks Like It Works. Here Is a One-Minute Test That Tells You If It Actually Does.</title>
      <dc:creator>Todd Hendricks</dc:creator>
      <pubDate>Sat, 20 Jun 2026 04:04:50 +0000</pubDate>
      <link>https://dev.to/hendrixx/your-agents-memory-looks-like-it-works-here-is-a-one-minute-test-that-tells-you-if-it-actually-4j2c</link>
      <guid>https://dev.to/hendrixx/your-agents-memory-looks-like-it-works-here-is-a-one-minute-test-that-tells-you-if-it-actually-4j2c</guid>
      <description>&lt;p&gt;&lt;strong&gt;For about six months I believed my agent's memory was working.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It remembered things across sessions. It pulled up the right context when I came back to a project. It corrected itself when something changed. Every visible sign said the system I built was doing its job.&lt;/p&gt;

&lt;p&gt;It was not doing its job. Claude Code ships its own built-in memory, and &lt;em&gt;that&lt;/em&gt; was the thing actually answering. Mine was running too, writing to its own store, looking busy, but it was the understudy. The native one had the lead the whole time and I never noticed I had given it away. For months I was reading my own system's success off a stage where a different actor was speaking the lines.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Nothing looked wrong. The agent gave good answers. That is exactly the problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Silent success is the dangerous kind
&lt;/h2&gt;

&lt;p&gt;A system that fails loudly is the easy case. You see the gap, you fix it.&lt;/p&gt;

&lt;p&gt;A system that is quietly shadowed is the dangerous one, because a shadow produces helpful, plausible output, so it looks identical to success. You cannot tell &lt;em&gt;my system works&lt;/em&gt; apart from &lt;em&gt;something else is working on my system's behalf&lt;/em&gt; by looking at the output, because the output is the same in both cases. That is the trap, and a good answer is not the way out of it.&lt;/p&gt;

&lt;p&gt;The only way out is a forcing function. You turn the other thing off and see what happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The test
&lt;/h2&gt;

&lt;p&gt;It works on any agent memory setup, not just mine, and it takes about a minute. Turn off the runtime's native memory. In Claude Code that is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_DISABLE_AUTO_MEMORY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then use your agent the way you normally do. Ask it to remember something. Come back in a new session and ask for it. Watch what your system actually does once the understudy is sent home.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If your memory still works&lt;/strong&gt;, good. It was always the one doing the work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If it suddenly goes blank&lt;/strong&gt;, the native store was carrying you, and every demo you have given was the shadow, not your system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I finally ran this on my own setup, mine went quiet. Six months of "it works" turned out to be six months of something else covering for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this gets worse, not better
&lt;/h2&gt;

&lt;p&gt;Any time you bolt a memory system onto a runtime that already has its own, you are exposed to this. And the smarter the underlying model gets, the better it papers over the gap, which means the better your demos look, the less they prove.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A polished demo on a capable model is not evidence your system works. It can just as easily be evidence the model is good enough to hide that it does not.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So do not trust that your memory works because the answers are good. Look at what is actually persisted, and run the off-test. Turn the other thing off, and find out who has really been talking.&lt;/p&gt;

&lt;p&gt;It cost me half a year to learn that. It costs you one line and one minute.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>memory</category>
    </item>
    <item>
      <title>Your agent's memory should compute confidence, not store it</title>
      <dc:creator>Todd Hendricks</dc:creator>
      <pubDate>Thu, 18 Jun 2026 19:54:13 +0000</pubDate>
      <link>https://dev.to/hendrixx/your-agents-memory-should-compute-confidence-not-store-it-c2a</link>
      <guid>https://dev.to/hendrixx/your-agents-memory-should-compute-confidence-not-store-it-c2a</guid>
      <description>&lt;p&gt;Most agent memory stores a confidence score the way it stores everything else. You write it once and it sits there. The agent decides a fact is worth 0.9, the store keeps 0.9, and three weeks later, after something has contradicted that fact, the store still hands back 0.9. Confidence was a number written at one moment and never looked at again. It is stale, and nothing in the system knows it.&lt;/p&gt;

&lt;p&gt;That is the quiet failure of pull memory. You query, it returns the closest matches with whatever score they were saved at, and noticing that a fact has gone soft is on you.&lt;/p&gt;

&lt;p&gt;Recall takes the other path. Effective confidence is not a stored field. It is recomputed from the graph every time you read, so a contradiction landing anywhere drops the claim's confidence on the next query, with no model rerun and no human in the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  The formula
&lt;/h2&gt;

&lt;p&gt;It is plain arithmetic, on purpose. For a cell, the effective confidence is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;effective = clamp01(stated × calibration + support − challenge)&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;stated&lt;/code&gt; is what the author claimed when they wrote it.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;calibration&lt;/code&gt; discounts the author by their track record.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;support&lt;/code&gt; is corroboration from incoming &lt;code&gt;supports&lt;/code&gt; edges.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;challenge&lt;/code&gt; is the weight of incoming &lt;code&gt;contradicts&lt;/code&gt; and &lt;code&gt;concerns&lt;/code&gt; edges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Support and challenge are not raw sums. Each is squashed through a saturation curve with a different ceiling:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;support   = 0.15 × tanh(supportMass)&lt;/code&gt;&lt;br&gt;
&lt;code&gt;challenge = 0.60 × tanh(challengeMass)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The asymmetry is the whole point. Corroboration is cheap to manufacture, so support saturates fast under a low ceiling: stack ten agreeing cells and you add at most 0.15. Real contradiction is rare and informative, so challenge runs to a 0.6 ceiling. One honest contradiction can move a claim further than a pile of agreement.&lt;/p&gt;

&lt;h2&gt;
  
  
  A worked example you can check
&lt;/h2&gt;

&lt;p&gt;A fresh claim, stated 0.9, author with no track record yet (so calibration is 1), no support, no challenge:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;effective = clamp01(0.9 × 1 + 0 − 0) = 0.90&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;One contradiction lands from a source stated at 1.0, giving a challengeMass of 1.0:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;challenge = 0.60 × tanh(1.0) = 0.457&lt;/code&gt;&lt;br&gt;
&lt;code&gt;effective = clamp01(0.90 − 0.457) = 0.44&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The same claim now reads 0.44. Nobody edited it. A second contradiction pushes the mass to 2.0:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;challenge = 0.60 × tanh(2.0) = 0.578&lt;/code&gt;&lt;br&gt;
&lt;code&gt;effective = clamp01(0.90 − 0.578) = 0.32&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Down to 0.32, and the original 0.9 is still on record, just demoted. Ten supporting cells would have added at most 0.15. Cheap agreement barely moves it; a real challenge moves it a lot.&lt;/p&gt;
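
&lt;p&gt;The arithmetic is small enough to check in a few lines. What follows is a minimal Python sketch of the formula as written in this post, not Recall's actual code: the function and parameter names are made up for illustration, and the masses are passed in directly because the post does not spell out how they are summed from the edges.&lt;/p&gt;

```python
import math

SUPPORT_CEILING = 0.15    # default ceiling for corroboration, from the post
CHALLENGE_CEILING = 0.60  # default ceiling for contradiction, from the post


def clamp01(x):
    """Clamp a number into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def effective_confidence(stated, calibration=1.0, support_mass=0.0, challenge_mass=0.0):
    """Hypothetical helper: recompute confidence from its four terms."""
    support = SUPPORT_CEILING * math.tanh(support_mass)
    challenge = CHALLENGE_CEILING * math.tanh(challenge_mass)
    return clamp01(stated * calibration + support - challenge)


fresh = effective_confidence(0.9)
one_contradiction = effective_confidence(0.9, challenge_mass=1.0)
two_contradictions = effective_confidence(0.9, challenge_mass=2.0)
# Start from 0.5 so the clamp at 1.0 does not hide the size of the support term.
ten_supporters = effective_confidence(0.5, support_mass=10.0)

print(round(fresh, 2), round(one_contradiction, 2), round(two_contradictions, 2))
print(round(ten_supporters - 0.5, 2))  # gain from heavy support, capped by the ceiling
```

&lt;p&gt;Rounded to two places, the three contested values come out as 0.9, 0.44 and 0.32, matching the hand calculation, and a support mass of 10 adds about 0.15 and no more.&lt;/p&gt;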

&lt;h2&gt;
  
  
  Calibration, and one honest choice in it
&lt;/h2&gt;

&lt;p&gt;Before support and challenge apply, the author's stated number is multiplied by a calibration factor. An author contradicted before gets discounted, by how often they were wrong times how confident they were when wrong, floored at 0.5 so it never zeroes anyone out.&lt;/p&gt;

&lt;p&gt;The honest detail is what it is not. It is not raw Brier scoring. Raw Brier also punishes a humble author who hedges low on claims that turn out fine, and punishing humility is the opposite of the incentive a memory system should create. So the discount keys on overconfidence specifically, being wrong while sure. Hedge honestly and you are not penalized. Claim 0.95 and get contradicted and you are.&lt;/p&gt;
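
&lt;p&gt;One way to read that rule in code, as a rough sketch only. The post gives the ingredients (how often wrong, how confident when wrong, a 0.5 floor) but not the exact expression, so the combination used here, one minus their product, is an assumption, and &lt;code&gt;calibration&lt;/code&gt; and its &lt;code&gt;history&lt;/code&gt; argument are hypothetical names rather than Recall's API.&lt;/p&gt;

```python
CALIBRATION_FLOOR = 0.5  # from the post: the discount never goes below this


def calibration(history):
    """Hypothetical helper. history is a list of (stated, was_contradicted) pairs."""
    if not history:
        return 1.0  # no track record yet, so no discount
    wrong = [stated for stated, was_contradicted in history if was_contradicted]
    if not wrong:
        return 1.0  # never contradicted: hedging low on good claims costs nothing
    wrong_rate = len(wrong) / len(history)
    mean_confidence_when_wrong = sum(wrong) / len(wrong)
    # Assumed combination: one minus (how often wrong) times (how sure when wrong).
    return max(CALIBRATION_FLOOR, 1.0 - wrong_rate * mean_confidence_when_wrong)


humble = calibration([(0.4, False), (0.5, False), (0.3, False)])
overconfident = calibration([(0.95, True), (0.9, False), (0.95, True), (0.9, False)])
always_wrong = calibration([(1.0, True), (1.0, True)])
print(humble, round(overconfident, 3), always_wrong)
```

&lt;p&gt;Under these assumptions the humble author keeps a factor of 1.0, the author who was wrong half the time at 0.95 drops to about 0.525, and the worst case stops at the 0.5 floor.&lt;/p&gt;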

&lt;h2&gt;
  
  
  Why this beats a stored score
&lt;/h2&gt;

&lt;p&gt;A vector store returns whatever confidence was saved alongside a chunk when it was embedded. A flat notes file returns whatever it says. Neither knows the fact was contradicted last Tuesday, because the contradiction is not part of how the score is computed. The score and the conflict live in different places.&lt;/p&gt;

&lt;p&gt;In Recall they live in the same place. The contradiction is an edge on the graph, and the score is computed from the graph, so the moment the edge exists the score reflects it, on the next read, deterministically. The reader is the same agent that wrote the memory, working from fresh context, and the substrate reprices what it knows underneath it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it is not
&lt;/h2&gt;

&lt;p&gt;This is a ranking signal, not a verdict on truth. A low effective confidence means a claim is contested or comes from an author who has been wrong while sure, not that it is false. The ceilings and curves are tunable defaults. And it is deliberately deterministic arithmetic over the graph, not a model second-guessing itself, which is what makes it inspectable: open any cell and see why its number is what it is, term by term.&lt;/p&gt;

&lt;p&gt;That is the trade. You give up a number that looks stable and never moves. You get one you can recompute, that demotes a stale claim the instant the evidence turns, and that you can read the reasons for. For an agent that has to act on what it remembers, the second is worth more.&lt;/p&gt;

&lt;p&gt;Recall is local-first, runs on SQLite, and sets up with one command. The code and the formula above are open: &lt;a href="https://github.com/H-XX-D/recall-memory-substrate" rel="noopener noreferrer"&gt;github.com/H-XX-D/recall-memory-substrate&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>memory</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Push vs Pull Memory: A Better Way to Think About AI Agent Memory</title>
      <dc:creator>Todd Hendricks</dc:creator>
      <pubDate>Thu, 18 Jun 2026 05:47:21 +0000</pubDate>
      <link>https://dev.to/hendrixx/push-vs-pull-memory-a-better-way-to-think-about-ai-agent-memory-3lnp</link>
      <guid>https://dev.to/hendrixx/push-vs-pull-memory-a-better-way-to-think-about-ai-agent-memory-3lnp</guid>
      <description>&lt;h1&gt;
  
  
  Push vs Pull Memory: A Better Way to Think About AI Agent Memory
&lt;/h1&gt;

&lt;p&gt;Pull memory is a store you query. Push memory is a loop your agent runs: it reads what it knows before acting, does the work, and writes back what changed, and the substrate reconciles that write so a stale fact gets superseded instead of lingering. Most agent memory today is pull. This post is about the other half of the design space, and when it is the one you actually want.&lt;/p&gt;

&lt;h2&gt;
  
  
  How agents remember today
&lt;/h2&gt;

&lt;p&gt;Almost everything sold as "agent memory" right now is pull. You write facts into a store: a vector database, a document store, or a managed memory service. Later, at read time, the agent sends a query and gets back the closest matches by similarity. That is it. The store is passive. It answers when asked and does nothing in between.&lt;/p&gt;

&lt;p&gt;Pull is simple, and it is the right tool in plenty of cases. If your agent answers one-off questions over a corpus that does not change much, or the session is short, or approximate recall is good enough, a vector store is fine and you should not overthink it.&lt;/p&gt;

&lt;p&gt;The trouble starts when a fact can be wrong later.&lt;/p&gt;

&lt;p&gt;Say your agent stored "the connection pool cap is 20." Weeks pass and the cap is raised to 50, so the agent stores that too. Now both facts live in the store. A similarity search can return either one, and nothing in the system knows that the second supersedes the first. The agent has no signal that one of these is stale. The job of noticing the conflict falls on the reader, on every single read, forever. In practice nobody does that reliably, so the agent quietly acts on outdated facts and you find out when something breaks.&lt;/p&gt;

&lt;p&gt;This is not a bug in any particular vector database. It is a property of the pull shape itself: reconciliation happens at read time, if it happens at all, and the responsibility for it sits with whoever is reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Push memory: reconcile at write time instead
&lt;/h2&gt;

&lt;p&gt;Push closes the loop. The contract is read, then work, then write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;read current memory  -&amp;gt;  do the work  -&amp;gt;  write a correction
        ^                                        |
        +------  substrate supersedes + flags  --+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before the agent acts, it consults what it already knows. After it acts, it writes back what it learned. The key difference is what happens on that write. It is not an append. When the new fact corrects an old one, the agent writes it as a correction, and the substrate demotes the superseded value and records the link between the two. From then on, every read sees the current value first, with the old one flagged as contradicted, and no one had to ask.&lt;/p&gt;

&lt;p&gt;Reconciliation moves from read time to write time, and from the reader to the substrate. You pay the cost once, when you write, instead of every time you read. Stale facts do not pile up silently, because the moment a contradiction is written, it is resolved and recorded.&lt;/p&gt;
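
&lt;p&gt;To make the write-time step concrete, here is a toy in-memory sketch in Python using the connection pool example from earlier. It is not Recall's API: &lt;code&gt;PushMemory&lt;/code&gt;, &lt;code&gt;Fact&lt;/code&gt; and their methods are hypothetical, and matching facts by an exact topic string is a simplification of how an agent would mark a write as a correction.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    fact_id: int
    topic: str
    value: str
    superseded_by: Optional[int] = None  # id of the fact that replaced this one


class PushMemory:
    """Toy substrate: reconciles on write, so reads never have to."""

    def __init__(self):
        self.facts = []

    def current(self, topic):
        """Return the live fact for a topic, or None if nothing is stored."""
        for fact in self.facts:
            if fact.topic == topic and fact.superseded_by is None:
                return fact
        return None

    def write(self, topic, value):
        """Store a fact; if it corrects a live one, demote the old and link them."""
        old = self.current(topic)
        if old is not None and old.value == value:
            return old  # same value again: nothing to reconcile
        new = Fact(fact_id=len(self.facts) + 1, topic=topic, value=value)
        if old is not None:
            old.superseded_by = new.fact_id  # old value stays on record, flagged
        self.facts.append(new)
        return new

    def history(self, topic):
        """Every version of a topic, oldest first, with supersession links intact."""
        return [fact for fact in self.facts if fact.topic == topic]


memory = PushMemory()
memory.write("connection pool cap", "20")
memory.write("connection pool cap", "50")

live = memory.current("connection pool cap")
trail = memory.history("connection pool cap")
print(live.value)
print([(fact.value, fact.superseded_by) for fact in trail])
```

&lt;p&gt;The read returns only the value 50, while the history still holds 20 with a pointer to the fact that replaced it. The cost of reconciling was paid once, inside &lt;code&gt;write&lt;/code&gt;.&lt;/p&gt;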

&lt;h2&gt;
  
  
  The axis
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Pull memory&lt;/th&gt;
&lt;th&gt;Push memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Shape&lt;/td&gt;
&lt;td&gt;A store you query&lt;/td&gt;
&lt;td&gt;A loop you run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reconciliation&lt;/td&gt;
&lt;td&gt;At read time, by the reader&lt;/td&gt;
&lt;td&gt;At write time, by the substrate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stale facts&lt;/td&gt;
&lt;td&gt;Linger until a reader notices&lt;/td&gt;
&lt;td&gt;Superseded and flagged automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The write&lt;/td&gt;
&lt;td&gt;An append&lt;/td&gt;
&lt;td&gt;A correction, with provenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best when&lt;/td&gt;
&lt;td&gt;Facts are stable, sessions short&lt;/td&gt;
&lt;td&gt;Facts change, agents long-lived, correctness matters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why push memory is only buildable now
&lt;/h2&gt;

&lt;p&gt;The push shape is not a new idea. Truth-maintenance systems and belief revision were studying write-time reconciliation decades ago. The reason memory got built pull-first is that push needs something pull does not: a reliable author. Something has to consult memory before acting and write a principled correction afterward, every time, without being told. For most of computing history that author did not exist at scale. You were not going to get a human to do it on every write.&lt;/p&gt;

&lt;p&gt;A capable LLM agent is that author. It can read before it acts and write a structured correction after, as a normal part of its loop. That is what makes push memory practical today and not five years ago, and it is why the idea is worth a fresh look now even though the underlying theory is old.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which one do you need
&lt;/h2&gt;

&lt;p&gt;Be honest about it. If your agent answers questions over a mostly static corpus and does not live very long, pull is fine and simpler. Reach for push when your agent runs over days or weeks, accumulates decisions, and has to stay correct as the world changes underneath it. The deciding question is whether a fact can be wrong later. If it can, read-time similarity is not enough on its own, and you want write-time reconciliation.&lt;/p&gt;

&lt;p&gt;A quick test for what you already have: does your memory flag a contradiction without being asked? Store two facts that conflict, then query the topic. If you get back whichever is more similar with no signal that they disagree, you have pull. If the system surfaces the conflict and tells you which one is current, you have push.&lt;/p&gt;
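
&lt;p&gt;That check can be scripted. The sketch here is a hypothetical probe in Python, not part of any library: you supply &lt;code&gt;write&lt;/code&gt; and &lt;code&gt;query&lt;/code&gt; adapters for your own memory, and the &lt;code&gt;status&lt;/code&gt; field with its &lt;code&gt;current&lt;/code&gt; and &lt;code&gt;superseded&lt;/code&gt; values is an assumed shape that you would map onto whatever your system actually returns. An append-only list stands in for a plain pull store.&lt;/p&gt;

```python
def flags_contradiction(write, query, topic, old_value, new_value):
    """Hypothetical probe: store two conflicting facts, then ask about the topic.

    write(topic, value) and query(topic) are adapters for the memory under test.
    query is assumed to return a list of dicts with a "value" key and, if the
    system tracks it, a "status" key such as "current" or "superseded".
    """
    write(topic, old_value)
    write(topic, new_value)
    results = query(topic)
    statuses = {row["value"]: row.get("status") for row in results}
    # Push-like behavior: the new value is marked current and the old one is flagged.
    return statuses.get(new_value) == "current" and statuses.get(old_value) == "superseded"


# An append-only store standing in for a plain pull memory.
rows = []


def append_write(topic, value):
    rows.append({"topic": topic, "value": value})


def append_query(topic):
    return [row for row in rows if row["topic"] == topic]


is_push = flags_contradiction(append_write, append_query, "connection pool cap", "20", "50")
print(is_push)  # False: both values come back with no signal about which is live
```

&lt;p&gt;A store that passes would return the same two values, but with the newer one marked current and the older one marked superseded.&lt;/p&gt;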

&lt;h2&gt;
  
  
  Where this lands
&lt;/h2&gt;

&lt;p&gt;The honest framing is a spectrum, not a binary. Plenty of systems can be read either way, and some sit closer to the push end than others. The useful question is not "which store has the best search," it is "where does reconciliation live: in every reader, or in the substrate, once."&lt;/p&gt;

&lt;p&gt;I am building &lt;a href="https://github.com/H-XX-D/recall-memory-substrate" rel="noopener noreferrer"&gt;Recall&lt;/a&gt;, an open-source, local-first push memory substrate, to take the push end seriously. The agent consults a compiled context packet before acting and writes structured corrections back through an admission layer. Supersession is built in. It runs on local SQLite, every fact carries provenance, and there is a one-command undo. No server, no account, no cloud. There is a short screencast of a live supersession in the README, and a benchmark called SENTINEL that measures whether a memory system catches its own contradictions.&lt;/p&gt;

&lt;p&gt;If you think the push vs pull split is wrong, or that your system is push and I have it filed under pull, I want to hear it.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is pull memory?&lt;/strong&gt; A passive store you query at read time, where reconciling stale or conflicting facts is the reader's job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is push memory?&lt;/strong&gt; A loop where the agent reads before acting and writes corrections back, and the substrate reconciles at write time, superseding stale facts automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is RAG push or pull?&lt;/strong&gt; Pull. Retrieval-augmented generation fetches similar chunks at read time and does not reconcile across writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are Mem0, Zep, or Letta push or pull?&lt;/strong&gt; Mostly pull in the sense above, though they differ. Zep's bi-temporal graph does some write-time reconciliation and sits closer to the push end than a plain vector store. The cleanest way to tell is the contradiction test: does it flag a conflict without being asked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When should I not bother with push memory?&lt;/strong&gt; Short-lived agents over a static corpus, where approximate recall is enough. Pull is simpler and fine there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>memory</category>
    </item>
    <item>
      <title>I built a thing</title>
      <dc:creator>Todd Hendricks</dc:creator>
      <pubDate>Fri, 24 Oct 2025 05:26:11 +0000</pubDate>
      <link>https://dev.to/hendrixx/i-built-a-thing-4iel</link>
      <guid>https://dev.to/hendrixx/i-built-a-thing-4iel</guid>
      <description>&lt;p&gt;I've spent the last couple of weeks slowly growing a beard because I've been glued to this keyboard, trying to shave a millisecond off latency here, gain some throughput there, and trim a byte or two off a header that was bloating my packets, finally getting a decent ratio only to test against a different data set and get substandard results. But I think I did it: a lossless AI data streaming compression pipeline that does a few things. It compresses repetitive data streams in under a millisecond on data it has a template for. It adapts to that data as it changes. It has a metadata side channel that allows routing and processing without decompression, while maintaining a human-readable audit layer that stays server side for compliance and alignment.&lt;/p&gt;

&lt;p&gt;Now that I've got it working 100%, I don't know what to do with myself, or with it. I have a working Python package for testing. What it needs now is traction: it could save large companies millions a year through lower bandwidth, higher throughput, and decreased latency, while staying compliant. I would gladly partner with someone out there who can navigate this world.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/hendrixx-cnc/AURA/tree/main/AURA-main" rel="noopener noreferrer"&gt;https://github.com/hendrixx-cnc/AURA/tree/main/AURA-main&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The patent is there, along with a Python package, if you think there's any potential. I imagine healthcare, financial, and government-sector uses for the audit layer. Next up: database lookups on stored compressed data.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>ai</category>
      <category>performance</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
