<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mahendra Gurjar</title>
    <description>The latest articles on DEV Community by Mahendra Gurjar (@mahendra4).</description>
    <link>https://dev.to/mahendra4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3738814%2F035cedc2-5b90-40ed-abf9-57936f254cde.jpg</url>
      <title>DEV Community: Mahendra Gurjar</title>
      <link>https://dev.to/mahendra4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mahendra4"/>
    <language>en</language>
    <item>
      <title>GLM-5 IndexCache</title>
      <dc:creator>Mahendra Gurjar</dc:creator>
      <pubDate>Tue, 30 Jun 2026 03:42:43 +0000</pubDate>
      <link>https://dev.to/mahendra4/gml5-indexcache-10nf</link>
      <guid>https://dev.to/mahendra4/gml5-indexcache-10nf</guid>
      <description>&lt;h1&gt;
  
  
  IndexCache: Killing the Indexer's O(NL²) Bottleneck in DeepSeek Sparse Attention
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Notes from my notebook on GLM-5.2 / DeepSeek Sparse Attention (DSA), reconstructed from the IndexCache paper (Bai, Dong et al., Tsinghua + Z.ai, 2026) — the mechanism behind GLM-5.2's "IndexShare."&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Why this exists — the bottleneck nobody talks about
&lt;/h2&gt;

&lt;p&gt;DSA's whole pitch is: don't do full O(L²) attention, instead let a cheap &lt;strong&gt;lightning indexer&lt;/strong&gt; look at all preceding tokens and pick the top-k (k=2048) that actually matter, then do real attention only on those. That drops core attention from O(L²) → O(Lk).&lt;/p&gt;

&lt;p&gt;Great, except for something I missed the first time I read DSA: &lt;strong&gt;the indexer itself is still O(L²)&lt;/strong&gt;. It has to score every preceding token against the query to decide which ones make the top-k. So across N layers you've swapped N O(L²) attention passes for N O(L²) indexer passes: still O(NL²) in total, just with a much smaller constant. At long context this &lt;em&gt;indexer&lt;/em&gt; becomes the dominant cost, not the attention it was supposed to fix.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The indexer is what lets DSA kill its one real bottleneck (full attention), but in doing so it grows a bottleneck of its own. Each indexer call is cheap (few heads, low-rank, FP8), but it still runs at every single layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The fix the paper proposes isn't a smarter indexer — it's &lt;strong&gt;don't run it every layer at all.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The core insight: adjacent layers pick almost the same tokens
&lt;/h2&gt;

&lt;p&gt;If you measure pairwise overlap between the top-k token sets selected by each layer's indexer, &lt;strong&gt;adjacent layers share 70–100% of their picks.&lt;/strong&gt; The heatmap even shows block structure — clusters of layers (e.g. layers 3–5, 17–30, etc.) that all converge on roughly the same "important" tokens.&lt;/p&gt;

&lt;p&gt;So most of the O(NL²) indexer cost is &lt;em&gt;redundant computation of the same answer.&lt;/em&gt;&lt;/p&gt;
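
&lt;p&gt;For concreteness, here's a minimal sketch of how I'd measure that overlap for one pair of layers. The name &lt;code&gt;topk_overlap&lt;/code&gt; and the toy index lists are my own, not from the paper, and I'm assuming "overlap" means the fraction of positions two equal-size top-k sets have in common.&lt;/p&gt;

```python
def topk_overlap(selected_a, selected_b):
    """Fraction of layer A's selected token positions that layer B also selected."""
    set_a, set_b = set(selected_a), set(selected_b)
    if not set_a:
        return 0.0
    return len(set_a.intersection(set_b)) / len(set_a)


# Toy example with k = 4 (the post uses k = 2048).
layer_5 = [0, 3, 7, 9]
layer_6 = [0, 3, 7, 12]
overlap = topk_overlap(layer_5, layer_6)
print(overlap)  # 0.75
```

&lt;p&gt;Running this for every pair of layers and averaging over queries would give the kind of N×N heatmap described above.&lt;/p&gt;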

&lt;p&gt;This motivates &lt;strong&gt;IndexCache&lt;/strong&gt;: split the N layers into two roles —&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;F (Full)&lt;/strong&gt; layers — run their own indexer, compute fresh top-k, cache it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S (Shared)&lt;/strong&gt; layers — skip the indexer entirely, just reuse the nearest preceding F layer's cached top-k.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first layer is always F (has to seed the cache).&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference loop comparison
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Standard DSA:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nc"&gt;Indexer_l&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nc"&gt;SparseAttn_l&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nc"&gt;FFN_l&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# + norm, residual
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;IndexCache:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c_l&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nc"&gt;Indexer_l&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;T_cache&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# c_l == S
&lt;/span&gt;        &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;T_cache&lt;/span&gt;         &lt;span class="c1"&gt;# reuse
&lt;/span&gt;    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nc"&gt;SparseAttn_l&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="err"&gt;⁽&lt;/span&gt;&lt;span class="n"&gt;ˡ&lt;/span&gt;&lt;span class="err"&gt;⁾&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nc"&gt;FFN_l&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
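
&lt;p&gt;To make the routing concrete, here's the same loop as a small runnable Python sketch. It only tracks which index set each layer would use; the real attention and FFN are left as a comment. The pattern string, &lt;code&gt;run_layers&lt;/code&gt;, &lt;code&gt;top_k&lt;/code&gt; and the toy score table are my own illustrative assumptions, not code from the paper.&lt;/p&gt;

```python
import heapq


def top_k(scores, k):
    """Return the positions of the k largest scores, sorted by position."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


def run_layers(pattern, indexers, x, k):
    """Walk the layers; F layers run their indexer, S layers reuse the cache.

    pattern  : string such as "FSSF", one character per layer (assumed encoding)
    indexers : one scoring function per layer; only called on F layers
    Returns the index set each layer would pass to sparse attention, plus the
    number of indexer calls actually made.
    """
    if pattern[0] != "F":
        raise ValueError("the first layer must be F to seed the cache")
    t_cache = None
    used, indexer_calls = [], 0
    for role, indexer in zip(pattern, indexers):
        if role == "F":
            t_cache = top_k(indexer(x), k)  # fresh top-k overwrites the buffer
            indexer_calls += 1
        used.append(t_cache)  # S layers fall through and reuse t_cache
        # SparseAttn_l(x, t_cache) and FFN_l(x) would update x here.
    return used, indexer_calls


# Toy scores per layer; None marks S layers whose indexer must never run.
score_table = [
    [0.1, 0.9, 0.3, 0.8, 0.2],  # layer 1 (F)
    None,                       # layer 2 (S)
    None,                       # layer 3 (S)
    [0.7, 0.1, 0.6, 0.2, 0.3],  # layer 4 (F)
]
indexers = [lambda x, s=s: s for s in score_table]
used, calls = run_layers("FSSF", indexers, x="hidden states placeholder", k=2)
print(used)   # [[1, 3], [1, 3], [1, 3], [0, 2]]
print(calls)  # 2
```

&lt;p&gt;Four layers, two indexer calls: that ratio is the whole saving.&lt;/p&gt;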



&lt;p&gt;&lt;code&gt;T_cache&lt;/code&gt; is just a temp buffer holding the &lt;em&gt;current&lt;/em&gt; index tensor — it gets overwritten at every F layer, so it adds &lt;strong&gt;zero extra GPU memory&lt;/strong&gt; over standard DSA. The only real change to the loop is one if/else branch. That's the whole elegance of this method — no architecture surgery, just a routing decision.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Finding top-k (the indexer mechanics, cleaned up)
&lt;/h2&gt;

&lt;p&gt;This part is just DSA's own lightning indexer, for reference since it's what gets shared:&lt;/p&gt;

&lt;p&gt;Compatibility between query &lt;code&gt;q&lt;/code&gt; and each candidate position &lt;code&gt;i&lt;/code&gt;, per block/head:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;s_i = q · W_i + b_i&lt;/code&gt; — raw score for position i&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;g_i = max(0, s_i)&lt;/code&gt; — ReLU gate (this is the "lightning" part: cheap, no softmax needed before selection)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T = top-k_i(g_i)&lt;/code&gt; over all i: the indices of the k highest-scoring positions (a top-k selection, not a single argmax)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This sits &lt;em&gt;underneath&lt;/em&gt; MLA (Multi-head Latent Attention). The reason MLA matters here: instead of every head keeping its own full KV, MLA squeezes all heads' KV into one shared &lt;strong&gt;low-rank latent vector&lt;/strong&gt; — &lt;code&gt;latent = x·W^D&lt;/code&gt; (down-projection). The indexer scores against this compressed representation, which is part of why it's so much cheaper per-FLOP than the main attention.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Two ways to find the F/S pattern
&lt;/h2&gt;

&lt;p&gt;The question is: which layers do you keep as F? Two answers, training-free and training-aware — and notably, the "obvious" third answer (similarity-based) &lt;strong&gt;fails&lt;/strong&gt;. Order of discovery matters here, so I'm keeping it in the order the paper actually tried things.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 Why the naive static pattern fails
&lt;/h3&gt;

&lt;p&gt;The dumbest idea: just alternate uniformly, e.g. &lt;code&gt;F S S S F S S S ...&lt;/code&gt; (1 F every 4 layers). This &lt;strong&gt;doesn't work well&lt;/strong&gt;. Why: indexer "importance" is &lt;em&gt;not&lt;/em&gt; uniform across depth. Some layers — especially early/transitional ones — are way more sensitive to losing their own indexer than others. A fixed period can easily land an F on a redundant layer and an S on a critical one. You need the model (or data) to tell you which layers are safe to share.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Training-free IndexCache — greedy search
&lt;/h3&gt;

&lt;p&gt;No weight updates at all. Just:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with &lt;strong&gt;all layers = F&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Pool of candidate layers = &lt;code&gt;{2, 3, ..., N}&lt;/code&gt; (layer 1 is always F — has to seed the cache).&lt;/li&gt;
&lt;li&gt;Pick a small calibration dataset (cached batches from training data — same batches reused for every candidate evaluation, so loss differences come purely from the pattern, not data noise).&lt;/li&gt;
&lt;li&gt;For each step: try flipping every remaining F layer to S, one at a time, measure resulting LM loss on the calibration set, and &lt;strong&gt;commit whichever flip increases the loss the least.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Repeat for K steps, where K = target number of S layers (e.g. K = 3N/4 to keep only 1/4 of indexers).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is literally a greedy "convert layers one-by-one, always pick the one with minimum loss increase" search — full search is O(N²) forward passes, but if you've got pipeline-parallel stages (P of them), you can split layers into P blocks and search them in parallel, cutting total passes by roughly P×.&lt;/p&gt;
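
&lt;p&gt;The steps above can be sketched in a few lines of Python. This is my own minimal version, not the paper's code: &lt;code&gt;calib_loss&lt;/code&gt; is a hypothetical callback standing in for "run the model on the cached calibration batches with this pattern and return the LM loss", and the toy loss below uses made-up per-layer sensitivities just so the sketch runs.&lt;/p&gt;

```python
def greedy_pattern_search(num_layers, num_shared, calib_loss):
    """Greedily flip F layers to S, always committing the cheapest flip.

    calib_loss(pattern) is a hypothetical callback that returns the LM loss on
    the fixed calibration batches for a pattern such as ["F", "S", "F"].
    """
    pattern = ["F"] * num_layers
    candidates = set(range(1, num_layers))  # index 0 (layer 1) always stays F
    for _ in range(num_shared):
        best_layer, best_loss = None, float("inf")
        for layer in sorted(candidates):
            trial = list(pattern)
            trial[layer] = "S"
            loss = calib_loss(trial)  # one pass over the calibration set
            if best_loss > loss:
                best_layer, best_loss = layer, loss
        pattern[best_layer] = "S"
        candidates.remove(best_layer)
    return "".join(pattern)


# Toy stand-in for the real loss: sharing a layer adds its (made-up)
# sensitivity to a base loss of 2.0.
sensitivity = [9.0, 0.1, 0.5, 0.05, 0.3, 0.02, 0.8, 0.01]


def toy_loss(pattern):
    return 2.0 + sum(s for s, role in zip(sensitivity, pattern) if role == "S")


found = greedy_pattern_search(num_layers=8, num_shared=6, calib_loss=toy_loss)
print(found)  # FSSSSSFS
```

&lt;p&gt;With N = 8 and K = 6 the search keeps 1/4 of the indexers, and the surviving F layers are exactly the ones the toy loss marks as most sensitive. The real loss is not additive like this toy one, which is why the search has to re-evaluate every candidate at every step.&lt;/p&gt;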

&lt;p&gt;&lt;strong&gt;What you get out of this (empirically, from the paper's 30B DSA model + GLM-5):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The searched pattern reliably beats uniform interleaving at the same retention ratio.&lt;/li&gt;
&lt;li&gt;The per-step loss curve has a visible kink — first ~20 layers are "easy" (cheap to convert), the rest are "critical" (loss jumps fast). So there's a real ordering of indexer importance baked into the model, not noise.&lt;/li&gt;
&lt;li&gt;This ranking is stable across different calibration sets — it's an intrinsic property of the trained model, not a calibration artifact.&lt;/li&gt;
&lt;li&gt;Retaining only &lt;strong&gt;1/4 of indexers&lt;/strong&gt; (75% removed) with the searched pattern matches the original model's downstream performance almost exactly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.3 Training-aware IndexCache — multi-layer distillation
&lt;/h3&gt;

&lt;p&gt;If you're willing to retrain (continued pretraining, not from scratch), you can go further: &lt;strong&gt;force the indexer to actually learn to serve multiple layers&lt;/strong&gt;, instead of hoping a pattern search finds layers that happen to tolerate sharing.&lt;/p&gt;

&lt;p&gt;Standard DSA already trains each layer's indexer via KL-divergence distillation against that &lt;em&gt;same&lt;/em&gt; layer's aggregated attention distribution &lt;code&gt;p_t⁽ˡ⁾&lt;/code&gt;. The extension here: if layer &lt;code&gt;ℓ&lt;/code&gt; is F and serves S layers &lt;code&gt;ℓ+1, ..., ℓ+m&lt;/code&gt;, train its indexer against &lt;strong&gt;all of them jointly&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L_multi = Σ_{j=0}^{m} [ 1/(m+1) · Σ_t D_KL( p_t^(ℓ+j) || q_t^(ℓ) ) ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;q_t⁽ˡ⁾&lt;/code&gt; = indexer's own output distribution (softmax of its scores) at layer ℓ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;p_t⁽ˡ⁾&lt;/code&gt; = the real aggregated attention distribution at layer ℓ (averaged across heads)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1/(m+1)&lt;/code&gt; = just averaging over however many layers reuse this same index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important note (training detail I almost missed):&lt;/strong&gt; you don't do this from random init. A randomly initialized model's attention distribution has no real structure yet, so forcing the indexer to chase it just injects noise. This is always done as continued pretraining / fine-tuning on top of an already-trained DSA model, in two stages: a "dense warm-up" that freezes the main model and trains only the indexer, then a "sparse training" phase that activates top-k and trains everything jointly.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The proof: L_multi and L_avg give the exact same gradient
&lt;/h2&gt;

&lt;p&gt;This is the part of my notes that was the messiest, so here's the clean derivation.&lt;/p&gt;

&lt;p&gt;Define the &lt;strong&gt;averaged target distribution&lt;/strong&gt; across the m+1 served layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p̄_t = Σ_{j=0}^{m} [ 1/(m+1) · p_t^(ℓ+j) ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the single-target loss using that averaged target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L_avg = Σ_t D_KL( p̄_t || q_t^(ℓ) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claim:&lt;/strong&gt; &lt;code&gt;∇_θ L_multi = ∇_θ L_avg&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proof.&lt;/strong&gt; The key trick: in &lt;code&gt;D_KL(p || q)&lt;/code&gt;, only &lt;code&gt;q&lt;/code&gt; depends on the trainable parameters θ (p is just data — the real attention distribution, treated as a fixed target with &lt;code&gt;stop-gradient&lt;/code&gt;). So when you differentiate KL divergence w.r.t. θ, the entropy term of &lt;code&gt;p&lt;/code&gt; (which doesn't depend on θ) vanishes entirely. What's left is just the cross-entropy term:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;∇_θ D_KL(p || q_t^(ℓ)) = -∇_θ Σ_s p(s) · log q_t^(ℓ)(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the step I got stuck on in my notebook — I wasn't sure &lt;em&gt;why&lt;/em&gt; only the &lt;code&gt;log q&lt;/code&gt; term survives. The answer is straightforward once you write KL out fully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;D_KL(p || q) = Σ_s p(s) log p(s)  −  Σ_s p(s) log q(s)
                └──────┬──────┘     └───────┬───────┘
              entropy term of p      cross-entropy term
              (no θ dependence,        (only term with θ,
               gradient = 0)             via q = softmax(indexer))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now apply this to &lt;code&gt;L_multi&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;∇_θ L_multi = - Σ_{j=0}^{m} [1/(m+1)] Σ_t ∇_θ Σ_s p_t^(ℓ+j)(s) log q_t^(ℓ)(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both sums are finite and linear, so move the sum over j inside the sum over s and factor out &lt;code&gt;log q_t^(ℓ)(s)&lt;/code&gt;, which does not depend on j:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;            = - Σ_t ∇_θ Σ_s [ Σ_{j=0}^{m} (1/(m+1)) p_t^(ℓ+j)(s) ] · log q_t^(ℓ)(s)
                                  └──────────────────┬──────────────────┘
                                                    = p̄_t(s)

            = - Σ_t ∇_θ Σ_s p̄_t(s) log q_t^(ℓ)(s)
            = ∇_θ L_avg.   ∎
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
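
&lt;p&gt;A quick numerical sanity check of the claim, as a sketch: I take the gradient with respect to the indexer's logits at a single query (the chain rule carries the equality through to θ) using finite differences. The three target distributions and the logits are made-up numbers, and all function names here are my own.&lt;/p&gt;

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """D_KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def loss_multi(logits, targets):
    """Average of the per-layer KL terms (one query position)."""
    q = softmax(logits)
    return sum(kl(p, q) for p in targets) / len(targets)


def loss_avg(logits, targets):
    """Single KL term against the averaged target distribution."""
    q = softmax(logits)
    p_bar = [sum(col) / len(targets) for col in zip(*targets)]
    return kl(p_bar, q)


def numeric_grad(loss_fn, logits, targets, eps=1e-6):
    """Central finite-difference gradient with respect to each logit."""
    grad = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += eps
        down[i] -= eps
        grad.append((loss_fn(up, targets) - loss_fn(down, targets)) / (2 * eps))
    return grad


# Attention distributions of an F layer and the two S layers it serves (m = 2).
targets = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.1, 0.3]]
logits = [0.3, -1.0, 0.8]  # stand-in for the indexer's scores at one query

g_multi = numeric_grad(loss_multi, logits, targets)
g_avg = numeric_grad(loss_avg, logits, targets)
max_gap = max(abs(a - b) for a, b in zip(g_multi, g_avg))
value_gap = loss_multi(logits, targets) - loss_avg(logits, targets)
print(max_gap)    # tiny: the gradients agree up to finite-difference error
print(value_gap)  # clearly nonzero: the loss values themselves differ
```

&lt;p&gt;The two losses differ in value by a constant (the entropy terms that don't depend on θ), which is exactly why their gradients coincide.&lt;/p&gt;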



&lt;p&gt;So averaging &lt;em&gt;before&lt;/em&gt; taking KL and summing the KL terms &lt;em&gt;after&lt;/em&gt; are mathematically identical at the gradient level — the indexer ends up being pulled toward the &lt;strong&gt;centroid&lt;/strong&gt; of all the attention distributions it serves, not toward any one layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then why use L_multi in practice if they're equivalent?&lt;/strong&gt; Pure memory/engineering reason: with &lt;code&gt;L_multi&lt;/code&gt;, each served layer can compute its own KL term against the F layer's &lt;code&gt;q&lt;/code&gt; as soon as its attention distribution &lt;code&gt;p&lt;/code&gt; is available, and doesn't need to hold on to &lt;code&gt;p&lt;/code&gt; afterwards. With &lt;code&gt;L_avg&lt;/code&gt;, you'd have to keep the &lt;code&gt;p&lt;/code&gt; of &lt;em&gt;every&lt;/em&gt; served layer around until all of them exist, just to build the average first. That means extra memory and runtime for no actual gain, since the gradient comes out identical either way.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;My takeaway after sitting with this for a while: a lot of "novel" architecture papers ultimately reduce to "design the right loss function for what you want, and let the network figure out the rest." This derivation is a good concrete example — the multi-layer trick isn't a new optimization method, it's just an equivalent (and cheaper) way to write the same gradient.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Performance (30B DSA model, 200K context)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Standard DSA&lt;/th&gt;
&lt;th&gt;+ IndexCache (1/4 retained)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill latency&lt;/td&gt;
&lt;td&gt;19.5 s&lt;/td&gt;
&lt;td&gt;10.7 s (&lt;strong&gt;1.82× speedup&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode throughput (per request)&lt;/td&gt;
&lt;td&gt;58 tok/s&lt;/td&gt;
&lt;td&gt;86 tok/s (&lt;strong&gt;1.48× speedup&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why a uniform pattern works after training when it fails training-free:&lt;/strong&gt; without retraining, the greedy search has to &lt;em&gt;avoid&lt;/em&gt; sensitive layers because the model was never trained to tolerate sharing. Certain layers are tightly coupled to their own indexer's exact top-k, and feeding them another layer's indices causes a distribution shift that breaks things. Once you train with the multi-layer distillation loss, the S layers themselves &lt;em&gt;learn to adapt&lt;/em&gt; to inherited indices, and the F layer's indexer learns to produce a selection that generalizes across all the layers it serves. That joint adaptation is what makes even a dumb uniform pattern work fine after training: the layer-specific sensitivity just disappears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra structural note from the overlap heatmap:&lt;/strong&gt; the first layer is &lt;em&gt;always&lt;/em&gt; kept as a full F layer (it has to seed the index cache, and early layers attend to a fundamentally different token subset than later ones — overlap with deep layers is ≤0.4). The strongest, most similar index regions cluster near the diagonal — i.e., a layer's indexer output looks most like its &lt;em&gt;immediate&lt;/em&gt; neighbors, decaying as you move further away.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. The failure case — and why it's actually an important negative result
&lt;/h2&gt;

&lt;p&gt;Before landing on the greedy LM-loss search, the natural-seeming alternative was tried: &lt;strong&gt;pick the sharing pattern by directly maximizing cosine similarity&lt;/strong&gt; between attention outputs, since that's cheaper to compute than running full LM-loss evaluations.&lt;/p&gt;

&lt;p&gt;Build an N×N similarity matrix &lt;code&gt;S[i][j]&lt;/code&gt; = cosine similarity between layer i's attention output using its own indexer vs. using layer j's indexer instead. Then solve for the best F/S assignment with dynamic programming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dp[i][k] = max over j&amp;lt;i, c_j=F of:
              dp[j][k-1] + Σ_{m=j+1}^{i-1} S[m][j]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;— i.e., find the best previous F layer to "branch" from, accumulating similarity scores for every S layer that would reuse it. Solvable exactly by backtracking through the DP table.&lt;/p&gt;
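
&lt;p&gt;Here's how I'd write that DP as a runnable sketch. The exact setup is my assumption: a fixed budget of F layers, layer 0 always F, every S layer served by its nearest preceding F layer, and a virtual closing layer so trailing S layers get scored too. The similarity matrix is toy data, not numbers from the paper.&lt;/p&gt;

```python
def best_pattern_by_similarity(sim, num_full):
    """Pick num_full F layers maximising the summed similarity of the S layers.

    sim[m][j] is the similarity of layer m's attention output when it reuses
    layer j's indices (0-based layers; layer 0 is always F).
    """
    n = len(sim)
    neg = float("-inf")
    # dp[i][k]: best score with layer i as the k-th F layer. Index n is a
    # virtual closing layer so the trailing S layers are scored as well.
    dp = [[neg] * (num_full + 2) for _ in range(n + 1)]
    parent = [[None] * (num_full + 2) for _ in range(n + 1)]
    dp[0][1] = 0.0
    for i in range(1, n + 1):
        for k in range(2, num_full + 2):
            for j in range(i):
                if dp[j][k - 1] == neg:
                    continue
                # S layers j+1 .. i-1 all reuse the indices of F layer j.
                gain = sum(sim[m][j] for m in range(j + 1, i))
                if dp[j][k - 1] + gain > dp[i][k]:
                    dp[i][k] = dp[j][k - 1] + gain
                    parent[i][k] = j
    # Backtrack from the virtual layer to recover which layers are F.
    pattern = ["S"] * n
    i, k = n, num_full + 1
    while parent[i][k] is not None:
        i, k = parent[i][k], k - 1
        pattern[i] = "F"
    return "".join(pattern), dp[n][num_full + 1]


# Toy 4-layer similarity matrix (only entries with j below m matter).
sim = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.2, 0.8, 1.0, 0.0],
    [0.1, 0.7, 0.95, 1.0],
]
pattern, score = best_pattern_by_similarity(sim, num_full=2)
print(pattern)  # FSFS
```

&lt;p&gt;The DP is exact for this objective. The catch, as the next paragraphs explain, is that the objective itself is the wrong thing to optimize.&lt;/p&gt;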

&lt;p&gt;&lt;strong&gt;This failed.&lt;/strong&gt; The similarity-optimal pattern performed about the same as plain uniform interleaving — both clearly worse than the greedy LM-loss search. The reason is the core insight of the whole negative result:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cosine similarity is a &lt;strong&gt;local&lt;/strong&gt; metric — it only tells you how well-preserved a single layer's output is &lt;em&gt;in isolation&lt;/em&gt;. It can't see how small token-selection mismatches &lt;strong&gt;propagate and compound&lt;/strong&gt; through all the downstream layers. Two layers can have near-identical attention outputs (similarity ≈ 1) yet differ in exactly the handful of tokens that turn out to matter several layers later. Those subtle errors accumulate — and a layer-local similarity score has no way to predict that.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The LM-loss-based greedy search avoids this because it's a &lt;strong&gt;global, end-to-end&lt;/strong&gt; signal — it measures the actual downstream effect of a sharing decision on the whole model's output, not just on one layer's local activation. This is the real lesson: local geometric similarity is a tempting cheap proxy, but for anything where errors compound across depth, you need an end-to-end metric.&lt;/p&gt;




&lt;h2&gt;
  
  
  My summary of the idea in one line
&lt;/h2&gt;

&lt;p&gt;DSA's indexer recomputes "who matters" from scratch at every layer even though the answer barely changes between adjacent layers — IndexCache just caches that answer and reuses it, and the only real engineering question is &lt;em&gt;which&lt;/em&gt; layers are allowed to skip recomputation, which can be found either by greedy search (no training) or learned directly via a provably-equivalent averaged-KL loss (with training).&lt;/p&gt;

&lt;p&gt;If you find any mismatched detail in this post, or want to contribute to the write-up or to working code for IndexCache, please open an issue on&lt;br&gt;
&lt;a href="https://github.com/Mahendra1706/DecodeAI" rel="noopener noreferrer"&gt;github.link&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infrastructureascode</category>
      <category>programming</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why Your AI Agent Keeps Forgetting Things</title>
      <dc:creator>Mahendra Gurjar</dc:creator>
      <pubDate>Tue, 10 Feb 2026 04:27:27 +0000</pubDate>
      <link>https://dev.to/mahendra4/why-your-ai-agent-keeps-forgetting-things-gi3</link>
      <guid>https://dev.to/mahendra4/why-your-ai-agent-keeps-forgetting-things-gi3</guid>
      <description>&lt;p&gt;Ever built an AI agent that just... forgets stuff? You tell it something important, and 10 steps later, it's gone.&lt;/p&gt;

&lt;p&gt;I spent days debugging this exact problem, and it led me to build &lt;strong&gt;MemTrace&lt;/strong&gt; - a framework that automatically diagnoses why AI agents lose their memories.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Imagine you're building a personal assistant agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "My deadline is Friday"
Agent: "Got it! Your deadline is Friday"

[Agent does 15-20 other things]

You: "When's my deadline?"
Agent: "I don't have that information"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happened?&lt;/strong&gt; The agent forgot. But &lt;em&gt;why&lt;/em&gt;?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did it run out of memory space? (Eviction)&lt;/li&gt;
&lt;li&gt;Did it overwrite the deadline with something else? (Overwriting)&lt;/li&gt;
&lt;li&gt;Did the LLM just hallucinate a wrong answer? (Hallucination)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without proper diagnosis, you're just guessing. 🎲&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Event Sourcing for Memory
&lt;/h2&gt;

&lt;p&gt;Here's the key insight: &lt;strong&gt;Track every single memory operation as an immutable event.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of just storing data, we log:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Every WRITE (when data is stored)&lt;/li&gt;
&lt;li&gt;✅ Every READ (when data is retrieved)&lt;/li&gt;
&lt;li&gt;✅ Every UPDATE (when data is overwritten)&lt;/li&gt;
&lt;li&gt;✅ Every EVICT (when data is removed due to capacity)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like a flight recorder for your agent's brain. 🛩️&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ How MemTrace Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Log Everything
&lt;/h3&gt;

&lt;p&gt;Every memory operation creates an event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;MemoryEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WRITE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deadline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Friday&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;importance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# How critical is this data?
&lt;/span&gt;    &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1234567890&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Run Automated Tests
&lt;/h3&gt;

&lt;p&gt;Generate 1000+ random scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random memory operations (writes and reads)&lt;/li&gt;
&lt;li&gt;Different capacity constraints (what if memory is limited?)&lt;/li&gt;
&lt;li&gt;Varying importance levels (some data matters more)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Auto-Diagnose Failures
&lt;/h3&gt;

&lt;p&gt;For every READ operation, MemTrace automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finds the original WRITE event&lt;/li&gt;
&lt;li&gt;Compares expected vs actual value&lt;/li&gt;
&lt;li&gt;If they don't match, traces through the event log to find out why&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example Diagnosis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ FAILURE DETECTED
Key: "deadline"
Expected: "Friday"
Actual: None

🔍 ROOT CAUSE: Memory Evicted
Evidence:
  • Written at step 1 (importance: 0.9)
  • Evicted at step 15 (reason: capacity overflow)
  • Read attempted at step 20

⚠️ CRITICAL FAILURE: High-importance data lost!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
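
&lt;p&gt;Here's a minimal sketch of that replay logic, to show the idea rather than MemTrace's actual code. The field names follow the &lt;code&gt;MemoryEvent&lt;/code&gt; snippet above; &lt;code&gt;diagnose_read&lt;/code&gt;, its return strings, and the assumption that the log is ordered by step are mine.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen, so a logged event cannot be mutated later
class MemoryEvent:
    event_type: str  # "WRITE", "READ", "UPDATE" or "EVICT"
    step: int
    key: str
    value: Optional[str] = None
    importance: float = 0.0


def diagnose_read(event_log, read_event):
    """Explain a READ by replaying earlier events for the same key.

    Assumes event_log is ordered by step (hypothetical helper, not the
    MemTrace API).
    """
    history = [e for e in event_log
               if e.key == read_event.key and read_event.step > e.step]
    writes = [e for e in history if e.event_type == "WRITE"]
    if not writes:
        return "never written"
    expected = writes[0].value
    if read_event.value == expected:
        return "ok"
    # Walk backwards: the most recent change to the key explains the mismatch.
    for event in reversed(history):
        if event.event_type == "EVICT":
            return f"evicted at step {event.step}"
        if event.event_type == "UPDATE":
            return f"overwritten at step {event.step}"
    return "hallucination (memory intact, answer wrong)"


log = [
    MemoryEvent("WRITE", 1, "deadline", "Friday", importance=0.9),
    MemoryEvent("EVICT", 15, "deadline"),
    MemoryEvent("READ", 20, "deadline", None),
]
print(diagnose_read(log, log[-1]))  # evicted at step 15
```

&lt;p&gt;Because the events are frozen and the log is only ever appended to, the diagnosis can be re-run at any time and will give the same answer.&lt;/p&gt;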



&lt;h2&gt;
  
  
  📊 What I Learned
&lt;/h2&gt;

&lt;p&gt;After running 1000+ scenarios, here's what the data showed:&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding #1: Capacity Matters (Obviously)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Low capacity (5 slots):  21% success rate
High capacity (30 slots): 29% success rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But here's the surprise: &lt;strong&gt;Even with 30 slots, 71% of reads still failed!&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important Context&lt;/strong&gt;: These scenarios are &lt;em&gt;extremely random&lt;/em&gt; - agents write and read completely unrelated keys with no semantic connection. Real agents would perform much better because they use context and patterns. The low pass rate reveals the &lt;strong&gt;worst-case scenario&lt;/strong&gt; under chaotic conditions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Finding #2: Overwrites Are Sneaky
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memory evictions:   ~2200 failures
Memory overwrites:  ~240 failures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Overwrites happen regardless of capacity - they're about &lt;strong&gt;key reuse patterns&lt;/strong&gt;, not memory size.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding #3: Critical Failures Are Real
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total memory failures: 2501
Critical failures:     892 (35.7%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Nearly 36% of memory failures (892 of 2501) involved high-importance data.&lt;/strong&gt; That's your deadlines, user preferences, and key facts: the stuff that actually matters.&lt;/p&gt;
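
&lt;p&gt;As a quick sanity check on that share, here is the arithmetic using only the two totals reported above (the variable names are mine, not MemTrace's):&lt;/p&gt;

```python
# Totals copied from the Finding #3 output above.
memory_failures = 2501   # evictions + overwrites
critical = 892           # failures that involved high-importance data

critical_share = critical / memory_failures
print(f"{critical_share:.1%} of memory failures were critical")
```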

&lt;h2&gt;
  
  
  🎯 The Architecture (For the Experts)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐
│  User Command   │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│ StructuredAgent     │ ← Routes to STM or LTM
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ MemoryStore         │ ← Executes operation
│ (STM or LTM)        │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ MemoryEvent         │ ← Immutable event logged
│ (event_log)         │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ auto_evaluate_all() │ ← Finds all READ events
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ diagnose_failure()  │ ← Root cause analysis
└─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Design Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event Log = Ground Truth&lt;/strong&gt;: The event log is append-only and never modified. It's the single source of truth for diagnosis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Importance Tracking&lt;/strong&gt;: Each event has an importance score (0.0-1.0). Critical failures are flagged when high-importance data (≥0.7) is lost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Layer Memory&lt;/strong&gt;: Separate STM (capacity-limited) and LTM (unlimited) with automatic routing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero Manual Work&lt;/strong&gt;: &lt;code&gt;auto_evaluate_all()&lt;/code&gt; automatically finds every READ event and diagnoses failures without manual test case creation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
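
&lt;p&gt;To make decisions 1, 2 and 4 concrete, here is a minimal sketch of how a single pass over an append-only log could find failed reads and flag the critical ones. The &lt;code&gt;Event&lt;/code&gt; class and &lt;code&gt;evaluate_reads()&lt;/code&gt; are hypothetical stand-ins I wrote for illustration, not MemTrace's actual &lt;code&gt;MemoryEvent&lt;/code&gt; or &lt;code&gt;auto_evaluate_all()&lt;/code&gt;; only the 0.7 threshold comes from the list above.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

CRITICAL_THRESHOLD = 0.7  # from the post: importance >= 0.7 is high-importance


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified stand-in for a logged memory event."""
    step: int
    kind: str                     # "WRITE", "EVICT" or "READ"
    key: str
    value: Optional[str] = None
    importance: float = 0.0


def evaluate_reads(event_log):
    """Replay the log once and report every READ that returned nothing."""
    last_write = {}   # key -> most recent WRITE event
    evicted_at = {}   # key -> step of the most recent EVICT
    failures = []
    for event in event_log:
        if event.kind == "WRITE":
            last_write[event.key] = event
            evicted_at.pop(event.key, None)   # a fresh write clears the eviction
        elif event.kind == "EVICT":
            evicted_at[event.key] = event.step
        elif event.kind == "READ" and event.value is None:
            written = last_write.get(event.key)
            is_critical = (written is not None
                           and written.importance >= CRITICAL_THRESHOLD)
            failures.append({
                "key": event.key,
                "read_step": event.step,
                "evicted_step": evicted_at.get(event.key),
                "critical": is_critical,
            })
    return failures


log = [
    Event(1, "WRITE", "deadline", "Friday", importance=0.9),
    Event(15, "EVICT", "deadline"),
    Event(20, "READ", "deadline", None),
]
report = evaluate_reads(log)
print(report)
```

&lt;p&gt;Because the log is never modified, the same replay can be rerun at any time and will always reach the same diagnosis.&lt;/p&gt;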

&lt;h2&gt;
  
  
  🚀 Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;-b&lt;/span&gt; ltm https://github.com/Mahendra1706/MemTrace.git
&lt;span class="nb"&gt;cd &lt;/span&gt;MemTrace
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
python3 run.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============================================================
MEMTRACE RANDOM TESTING - 1000 Scenarios
============================================================

Total Reads: 4993
✅ Passed: 741 (14.8%)
❌ Failed: 4252 (85.2%)

Failure Breakdown:
  • Memory Evicted: 2261
  • Memory Overwritten: 240
  • Invalid Read: 1708

------------------------------------------------------------
CRITICAL FAILURES (High-Importance Data Loss)
------------------------------------------------------------
Total Critical Failures: 892
  • Critical Evictions: 798
  • Critical Overwrites: 94
============================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎓 What This Means for Your Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For Beginners:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test your memory system&lt;/strong&gt; before deploying&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know why failures happen&lt;/strong&gt;, don't just guess&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track what matters&lt;/strong&gt; with importance scores&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  For Experts:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event sourcing&lt;/strong&gt; enables complete audit trails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical testing&lt;/strong&gt; reveals failure patterns at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Importance-based diagnosis&lt;/strong&gt; separates critical from trivial failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-layer architecture&lt;/strong&gt; (STM/LTM) mirrors cognitive science models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🔮 What's Next?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Current Limitations:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Random scenarios&lt;/strong&gt;: Completely random read/write patterns (worst-case testing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No semantic understanding&lt;/strong&gt;: Simple key-value storage, no context awareness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured commands&lt;/strong&gt;: Not integrated with real LLM calls yet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-threaded&lt;/strong&gt;: Sequential execution only&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future Vision: Semantic Memory
&lt;/h3&gt;

&lt;p&gt;The next major upgrade will transform MemTrace from &lt;strong&gt;key-value storage&lt;/strong&gt; to &lt;strong&gt;semantic memory&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current (v1.1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deadline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Friday&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deadline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Must match exact key
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Future (v2.0 - Semantic Search):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My project deadline is Friday&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;importance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;when is my project due?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Semantic match!
# Returns: "Friday" (understands the question relates to deadline)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it will work:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings&lt;/strong&gt;: Convert memories to vector representations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity Search&lt;/strong&gt;: Find relevant memories based on meaning, not exact keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-Aware Retrieval&lt;/strong&gt;: Understand relationships between memories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realistic Behavior&lt;/strong&gt;: Mimic how human memory actually works&lt;/li&gt;
&lt;/ol&gt;
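
&lt;p&gt;The first two steps can be sketched without any external dependency. This is a toy, not the planned v2.0 design: &lt;code&gt;embed()&lt;/code&gt; is a bag-of-words counter standing in for a real embedding model, and the &lt;code&gt;SemanticMemory&lt;/code&gt; class and its methods are hypothetical names. It also returns the whole stored memory rather than extracting "Friday", and it only matches on shared words, whereas real embeddings would also connect "due" with "deadline".&lt;/p&gt;

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticMemory:
    """Hypothetical API for illustration only."""

    def __init__(self):
        self.items = []   # list of (text, vector, importance)

    def write(self, text, importance=0.5):
        self.items.append((text, embed(text), importance))

    def read(self, query):
        """Return the stored memory most similar to the query, or None."""
        query_vec = embed(query)
        best = max(self.items,
                   key=lambda item: cosine(query_vec, item[1]),
                   default=None)
        return best[0] if best else None


memory = SemanticMemory()
memory.write("My project deadline is Friday", importance=0.9)
memory.write("The user prefers dark mode", importance=0.4)
print(memory.read("when is my project due?"))
```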

&lt;p&gt;This will make agents perform &lt;strong&gt;much better&lt;/strong&gt; than the current 14.8% pass rate, because they'll retrieve memories based on &lt;strong&gt;semantic relevance&lt;/strong&gt; rather than exact key matches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Planned Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Integration with LangChain/AutoGPT&lt;/li&gt;
&lt;li&gt;Real-time monitoring dashboard&lt;/li&gt;
&lt;li&gt;Advanced eviction policies (LRU, LFU, importance-based)&lt;/li&gt;
&lt;li&gt;Consolidation logic (STM → LTM based on importance)&lt;/li&gt;
&lt;li&gt;Vector database integration (Pinecone, Weaviate)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Mahendra1706/MemTrace" rel="noopener noreferrer"&gt;Mahendra1706/MemTrace&lt;/a&gt; (see &lt;code&gt;ltm&lt;/code&gt; branch for latest)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inspiration&lt;/strong&gt;: Event sourcing patterns, MemGPT architecture, cognitive memory models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related Work&lt;/strong&gt;: LangChain memory modules, AutoGPT memory systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💬 Let's Discuss
&lt;/h2&gt;

&lt;p&gt;Have you dealt with memory failures in your AI agents? What strategies worked for you?&lt;/p&gt;




&lt;p&gt;This is my first decent project (or at least I think so), so I'd love it if you visited the repo or dropped a comment!&lt;/p&gt;




</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Tested 3000+ LLM Agent Memory Operations - Here's What I Found</title>
      <dc:creator>Mahendra Gurjar</dc:creator>
      <pubDate>Thu, 29 Jan 2026 06:19:01 +0000</pubDate>
      <link>https://dev.to/mahendra4/i-tested-3000-llm-agent-memory-operations-heres-what-i-found-17pc</link>
      <guid>https://dev.to/mahendra4/i-tested-3000-llm-agent-memory-operations-heres-what-i-found-17pc</guid>
      <description>&lt;h2&gt;
  
  
  🤔 The Problem
&lt;/h2&gt;

&lt;p&gt;If you've built LLM-based agents, you've probably noticed: &lt;strong&gt;they forget things&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A lot.&lt;/p&gt;

&lt;p&gt;Your agent remembers the user's name in message 1, forgets it by message 5, and then hallucinates a completely different name by message 10.&lt;/p&gt;

&lt;p&gt;But &lt;strong&gt;why&lt;/strong&gt; do agents forget? Is it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory capacity issues?&lt;/li&gt;
&lt;li&gt;Information getting overwritten?&lt;/li&gt;
&lt;li&gt;The LLM hallucinating?&lt;/li&gt;
&lt;li&gt;Something else?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Nobody had data.&lt;/strong&gt; Just anecdotes and frustration.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;MemTrace&lt;/strong&gt; to answer this question with actual statistics.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MemTrace&lt;/strong&gt; is a testing framework that tracks every memory operation an agent makes and diagnoses why recalls fail.&lt;br&gt;
Think of it as a "black box recorder" for agent memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core idea:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track every WRITE, READ, UPDATE, and EVICT operation&lt;/li&gt;
&lt;li&gt;Compare what the agent returns vs. what was originally stored&lt;/li&gt;
&lt;li&gt;Diagnose failures with evidence from the event log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I tested &lt;strong&gt;1000 random scenarios&lt;/strong&gt; with &lt;strong&gt;3030 memory operations&lt;/strong&gt; to find patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Finding 1: Agents Forget 60% of the Time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Valid Recall Rate: 39.6%&lt;/strong&gt;&lt;br&gt;
That means that when an agent tries to recall information it previously stored, it fails &lt;strong&gt;6 out of 10 times&lt;/strong&gt;.&lt;br&gt;
(This excludes "invalid reads", where the agent tries to read something that was never written. Those are test artifacts, not real failures.)&lt;/p&gt;
&lt;h3&gt;
  
  
  Finding 2: Evictions Dominate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Memory Evicted: 46.2% of all failures&lt;/strong&gt;&lt;br&gt;
Nearly half of all memory failures happen because the agent ran out of space and had to evict old information.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory Evicted&lt;/strong&gt;: 994 failures (46.2%)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Invalid Read&lt;/strong&gt;: 815 failures (37.9%)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Overwritten&lt;/strong&gt;: 343 failures (15.9%)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM Hallucination&lt;/strong&gt;: 0 failures (0.0%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(Note: No hallucinations in this test because I used a deterministic agent. Real LLMs would show hallucinations too.)&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Finding 3: Capacity Matters (Proven)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Validated Invariant: Capacity ↑ → Eviction ↓&lt;/strong&gt;&lt;br&gt;
I tested different memory capacities and found a clear pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low capacity (1-5): 1354 evictions, 21.3% pass rate&lt;/li&gt;
&lt;li&gt;Medium capacity (10-15): 1150 evictions, 25.6% pass rate&lt;/li&gt;
&lt;li&gt;High capacity (20-30): 994 evictions, 29.0% pass rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Higher capacity = fewer evictions = better recall.&lt;br&gt;
Seems obvious, but now we have data to back it up.&lt;/p&gt;
&lt;h3&gt;
  
  
  Finding 4: Overwrites Are Independent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Overwrites stay constant regardless of capacity.&lt;/strong&gt;&lt;br&gt;
Whether you have a capacity of 5 or 30, you get ~340 overwrites per 1000 scenarios.&lt;br&gt;
Why? Overwrites depend on how often you reuse the same keys, not how much memory you have.&lt;/p&gt;
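
&lt;p&gt;A small replay makes the reason visible. This is a sketch under two assumptions of mine: eviction is FIFO (oldest entry first), and an overwrite is counted from the write history, meaning any write to a key whose last written value was different. The &lt;code&gt;replay()&lt;/code&gt; helper is hypothetical and not part of MemTrace.&lt;/p&gt;

```python
from collections import OrderedDict


def replay(writes, capacity):
    """Replay (key, value) writes and count overwrites and evictions."""
    store = OrderedDict()   # insertion-ordered, oldest entry first
    last_written = {}       # write history view: key -> latest value written
    overwrites = 0
    evictions = 0
    for key, value in writes:
        # Overwrite: the key was written before with a different value.
        if key in last_written and last_written[key] != value:
            overwrites += 1
        last_written[key] = value
        # Eviction: a new key arrives while the store is already full.
        if key not in store and len(store) >= capacity:
            store.popitem(last=False)
            evictions += 1
        store[key] = value
    return overwrites, evictions


writes = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5), ("b", 6), ("e", 7)]
small = replay(writes, capacity=2)
large = replay(writes, capacity=5)
print("capacity 2:", small)   # (overwrites, evictions)
print("capacity 5:", large)
```

&lt;p&gt;The same write sequence produces the same number of overwrites at both capacities, while the eviction count changes with the store size.&lt;/p&gt;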
&lt;h2&gt;
  
  
  Event Sourcing
&lt;/h2&gt;

&lt;p&gt;Every memory operation creates an immutable event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MemoryEvent(
    event_id="uuid-1234",
    event_type=MemoryEventType.WRITE,
    memory_layer=MemoryLayer.STM,
    step=1,
    timestamp=1706345678.123,
    key="user_name",
    value="Alice",
    metadata={}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The event log becomes the single source of truth.&lt;/p&gt;
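
&lt;p&gt;One way to get that immutability in Python is a frozen dataclass plus a log that only exposes &lt;code&gt;append&lt;/code&gt;. The field names below follow the example above, but the enum values, the &lt;code&gt;EventLog&lt;/code&gt; class and the defaults are my own assumptions, not MemTrace's actual definitions.&lt;/p&gt;

```python
import time
import uuid
from dataclasses import dataclass, field, FrozenInstanceError
from enum import Enum
from typing import Any


class MemoryEventType(Enum):
    WRITE = "write"
    READ = "read"
    UPDATE = "update"
    EVICT = "evict"


class MemoryLayer(Enum):
    STM = "stm"
    LTM = "ltm"


@dataclass(frozen=True)   # frozen: attributes cannot be reassigned after creation
class MemoryEvent:
    event_type: MemoryEventType
    memory_layer: MemoryLayer
    step: int
    key: str
    value: Any = None
    metadata: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)


class EventLog:
    """Append-only: events can be added and read, never edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def events(self):
        return tuple(self._events)   # read-only snapshot of the log


log = EventLog()
log.append(MemoryEvent(MemoryEventType.WRITE, MemoryLayer.STM,
                       step=1, key="user_name", value="Alice"))

try:
    log.events()[0].value = "Bob"    # rejected by the frozen dataclass
except FrozenInstanceError:
    print("events are immutable")
```

&lt;p&gt;Note that &lt;code&gt;frozen=True&lt;/code&gt; only blocks attribute reassignment; the &lt;code&gt;metadata&lt;/code&gt; dict itself is still mutable unless you copy or wrap it.&lt;/p&gt;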

&lt;h2&gt;
  
  
  Automated Diagnosis
&lt;/h2&gt;

&lt;p&gt;When a read fails, MemTrace analyzes the event history:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Scenario: Agent tries to read "deadline" but gets None

# Event log shows:
WRITE deadline="Friday" (step 2)
EVICT deadline="Friday" (step 5, reason: capacity_overflow)
READ  deadline=None (step 7)

# Diagnosis: "memory_evicted"
# Evidence: "Key was written at step 2, evicted at step 5, recall attempted at step 7"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  4 Failure Types
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Memory Evicted - Removed due to capacity constraints&lt;/li&gt;
&lt;li&gt;Memory Overwritten - Updated with different value&lt;/li&gt;
&lt;li&gt;Invalid Read - Never written in the first place&lt;/li&gt;
&lt;li&gt;LLM Hallucination - Agent returns wrong value despite correct memory&lt;/li&gt;
&lt;/ul&gt;
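
&lt;p&gt;The four categories map naturally onto a small decision function over a key's history. The following is a sketch of that idea, not MemTrace's real &lt;code&gt;diagnose_failure()&lt;/code&gt;: the tuple layout, the function name and every label except &lt;code&gt;memory_evicted&lt;/code&gt; are assumptions of mine.&lt;/p&gt;

```python
def diagnose(history, key, expected, actual):
    """Classify one read.

    history: (step, op, key, value) tuples, oldest first, covering
             everything that happened before the read.
    """
    relevant = [(op, value) for _step, op, k, value in history
                if k == key and op in ("WRITE", "UPDATE", "EVICT")]
    written = [value for op, value in relevant if op != "EVICT"]

    if not written:
        return "invalid_read"         # never written in the first place
    if relevant[-1][0] == "EVICT":
        return "memory_evicted"       # removed due to capacity constraints
    if written[-1] != expected:
        return "memory_overwritten"   # updated with a different value
    if actual != expected:
        return "llm_hallucination"    # memory was correct, the answer was not
    return "ok"


evicted = diagnose(
    [(2, "WRITE", "deadline", "Friday"), (5, "EVICT", "deadline", "Friday")],
    "deadline", expected="Friday", actual=None)
overwritten = diagnose(
    [(1, "WRITE", "city", "Paris"), (3, "UPDATE", "city", "Rome")],
    "city", expected="Paris", actual="Rome")
invalid = diagnose([], "budget", expected="100", actual=None)
hallucinated = diagnose(
    [(1, "WRITE", "name", "Alice")],
    "name", expected="Alice", actual="Bob")

print(evicted, overwritten, invalid, hallucinated)
```

&lt;p&gt;The order of the checks matters: an evicted key is reported as evicted even if it had been overwritten earlier, because the eviction is what the read actually ran into.&lt;/p&gt;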

&lt;h2&gt;
  
  
  System Invariants (Validated)
&lt;/h2&gt;

&lt;p&gt;After testing 1000 scenarios, these patterns hold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capacity ↑ → Eviction ↓&lt;/li&gt;
&lt;li&gt;Overwrite ~ independent of capacity&lt;/li&gt;
&lt;li&gt;Invalid Read = scenario artifact&lt;/li&gt;
&lt;li&gt;Unknown = 0 always (100% failure categorization)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Event Sourcing Is Powerful
&lt;/h3&gt;

&lt;p&gt;Having a complete history of every operation makes debugging so much easier.&lt;/p&gt;

&lt;p&gt;Instead of guessing why something failed, you can trace back through the exact sequence of events.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Capacity Is Critical
&lt;/h3&gt;

&lt;p&gt;If your agent has limited memory, evictions will dominate your failures.&lt;/p&gt;

&lt;p&gt;The data shows a steady downward trend: each step up in capacity tier cut evictions by roughly 14-15% (1354 → 1150 → 994).&lt;/p&gt;
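
&lt;p&gt;To check the size of that effect, here is the arithmetic on the three eviction counts from Finding 3 (the tier labels are just strings for printing):&lt;/p&gt;

```python
# Eviction counts per capacity tier, copied from Finding 3.
evictions = [("low (1-5)", 1354), ("medium (10-15)", 1150), ("high (20-30)", 994)]

drops = []
for (prev_name, prev), (name, cur) in zip(evictions, evictions[1:]):
    drop = (prev - cur) / prev   # relative reduction vs. the previous tier
    drops.append(drop)
    print(f"{prev_name} -> {name}: {drop:.1%} fewer evictions")
```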

&lt;h3&gt;
  
  
  3. Overwrites Are Sneaky
&lt;/h3&gt;

&lt;p&gt;Overwrites happen when you reuse keys. They're independent of capacity, which means you can't solve them by just adding more memory.&lt;/p&gt;

&lt;p&gt;You need better key management or versioning.&lt;/p&gt;
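
&lt;p&gt;As one possible shape for versioning, here is a minimal sketch in which every write appends a new version instead of replacing the old value. &lt;code&gt;VersionedStore&lt;/code&gt; is a hypothetical class for illustration; MemTrace does not ship it.&lt;/p&gt;

```python
class VersionedStore:
    """Hypothetical sketch: each write appends a version instead of replacing."""

    def __init__(self):
        self._versions = {}   # key -> list of values, oldest first

    def write(self, key, value):
        self._versions.setdefault(key, []).append(value)

    def read(self, key, version=-1):
        """Return the latest value by default, or an earlier one by index."""
        history = self._versions.get(key)
        if not history:
            return None
        return history[version]

    def history(self, key):
        return list(self._versions.get(key, []))


store = VersionedStore()
store.write("deadline", "Friday")
store.write("deadline", "Monday")          # a plain dict would lose "Friday" here
print(store.read("deadline"))              # latest value
print(store.read("deadline", version=0))   # original value is still recoverable
```

&lt;p&gt;The trade-off is that old versions take up space, so this moves pressure from overwrites to capacity unless old versions are pruned or moved to long-term storage.&lt;/p&gt;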

&lt;h3&gt;
  
  
  4. Testing Reveals Patterns
&lt;/h3&gt;

&lt;p&gt;Before building MemTrace, I thought hallucinations would be the main issue.&lt;br&gt;
Nope. With a deterministic agent there were no hallucinations at all, and evictions were roughly 3x more common than overwrites (994 vs. 343).&lt;/p&gt;

&lt;p&gt;Real LLMs would show more hallucinations, but capacity is still a huge factor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback Welcome
&lt;/h2&gt;

&lt;p&gt;This is my first open-source project and I'm still learning!&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have ideas for improvements&lt;/li&gt;
&lt;li&gt;Found a bug&lt;/li&gt;
&lt;li&gt;Have questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Please reach out!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Mahendra1706/MemTrace" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>agents</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
