DEV Community: Mahendra Gurjar

GLM-5 IndexCache

Mahendra Gurjar — Tue, 30 Jun 2026 03:42:43 +0000

IndexCache: Killing the Indexer's O(NL²) Bottleneck in DeepSeek Sparse Attention

Notes from my notebook on GLM-5.2 / DeepSeek Sparse Attention (DSA), reconstructed from the IndexCache paper (Bai, Dong et al., Tsinghua + Z.ai, 2026) — the mechanism behind GLM-5.2's "IndexShare."

1. Why this exists — the bottleneck nobody talks about

DSA's whole pitch is: don't do full O(L²) attention, instead let a cheap lightning indexer look at all preceding tokens and pick the top-k (k=2048) that actually matter, then do real attention only on those. That drops core attention from O(L²) → O(Lk).

Great — except I missed this the first time I read DSA: the indexer itself is still O(L²). It has to score every preceding token against the query to decide who's in the top-k. So across N layers you've traded one O(L²) cost for N separate O(L²) costs — total O(NL²). At long context this indexer becomes the dominant cost, not the attention it was supposed to fix.

The indexer is what makes DSA work: it removes dense attention's one real bottleneck (the O(L²) core attention), but in doing so it grows a bottleneck of its own. The indexer is cheap per FLOP (few heads, low-rank, FP8), but it still runs at every single layer.

The fix the paper proposes isn't a smarter indexer. It's simpler than that: don't run the indexer at every layer.

2. The core insight: adjacent layers pick almost the same tokens

If you measure pairwise overlap between the top-k token sets selected by each layer's indexer, adjacent layers share 70–100% of their picks. The heatmap even shows block structure — clusters of layers (e.g. layers 3–5, 17–30, etc.) that all converge on roughly the same "important" tokens.
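To make "pairwise overlap" concrete, here is a minimal sketch of how I'd measure it. The scores are made-up toy numbers and the helper names (topk_indices, overlap) are my own, not from the paper.

```python
def topk_indices(scores, k):
    """Return the set of positions with the k highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def overlap(set_a, set_b):
    """Fraction of layer A's picks that layer B also picked (both sets have size k)."""
    return len(set_a & set_b) / len(set_a)


# Toy indexer scores for three layers over 8 token positions (made-up numbers).
layer_scores = [
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.0, 0.3, 0.6],
    [0.8, 0.2, 0.9, 0.1, 0.6, 0.0, 0.7, 0.3],
    [0.1, 0.9, 0.2, 0.8, 0.0, 0.7, 0.3, 0.6],
]
k = 4
picks = [topk_indices(s, k) for s in layer_scores]

# matrix[a][b] = share of layer a's top-k that layer b also selected
matrix = [[overlap(a, b) for b in picks] for a in picks]
for row in matrix:
    print(["%.2f" % v for v in row])
```

In this toy data layers 0 and 1 share 3 of their 4 picks (overlap 0.75), while layer 2 looks at a mostly different set. The real heatmap is the same computation over every pair of layers.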

So most of the O(NL²) indexer cost is redundant computation of the same answer.

This motivates IndexCache: split the N layers into two roles —

F (Full) layers — run their own indexer, compute fresh top-k, cache it.
S (Shared) layers — skip the indexer entirely, just reuse the nearest preceding F layer's cached top-k.

The first layer is always F (has to seed the cache).

Inference loop comparison

Standard DSA:

for l = 1 to N:
    I⁽ˡ⁾ ← Indexer_l(X)
    T⁽ˡ⁾ ← top-k(I⁽ˡ⁾)
    X ← SparseAttn_l(X, T⁽ˡ⁾)
    X ← FFN_l(X)  # + norm, residual

IndexCache:

for l = 1 to N:
    if c_l == F:
        I⁽ˡ⁾ ← Indexer_l(X)
        T⁽ˡ⁾ ← top-k(I⁽ˡ⁾)
        T_cache ← T⁽ˡ⁾
    else:  # c_l == S
        T⁽ˡ⁾ ← T_cache         # reuse
    X ← SparseAttn_l(X, T⁽ˡ⁾)
    X ← FFN_l(X)

T_cache is just a temp buffer holding the current index tensor — it gets overwritten at every F layer, so it adds zero extra GPU memory over standard DSA. The only real change to the loop is one if/else branch. That's the whole elegance of this method — no architecture surgery, just a routing decision.
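Here is a runnable toy version of that loop, just to show the routing. The indexer, attention and FFN functions are trivial stand-ins I made up (a real model operates on tensors, not Python lists), so treat it as a sketch of the control flow only.

```python
def run_layers(x, pattern, indexers, sparse_attn, ffn, k):
    """Forward pass where 'S' layers reuse the last 'F' layer's top-k indices."""
    assert pattern[0] == "F", "first layer must seed the cache"
    cache = None
    indexer_calls = 0
    for layer, role in enumerate(pattern):
        if role == "F":
            scores = indexers[layer](x)
            order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
            cache = order[:k]          # overwrite the single shared buffer
            indexer_calls += 1
        # S layers fall through and use whatever is in the cache
        x = sparse_attn(x, cache)
        x = ffn(x)
    return x, indexer_calls


# Toy stand-ins (assumptions, not the real modules).
def toy_indexer(x):
    return [abs(v) for v in x]

def toy_sparse_attn(x, idx):
    ctx = sum(x[j] for j in idx) / len(idx)
    return [v + ctx for v in x]

def toy_ffn(x):
    return [0.5 * v for v in x]

x0 = [0.1, -0.4, 0.3, 0.9, -0.2, 0.05]
pattern = list("FSSSFSSS")
indexers = [toy_indexer] * len(pattern)
out, calls = run_layers(x0, pattern, indexers, toy_sparse_attn, toy_ffn, k=3)
print(calls)  # 2 indexer calls instead of 8
```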

3. Finding top-k (the indexer mechanics, cleaned up)

This part is just DSA's own lightning indexer, for reference since it's what gets shared:

Compatibility between query q and each candidate position i, per block/head:

s_i = q · W_i + b_i — raw score for position i
g_i = max(0, s_i) — ReLU gate (this is the "lightning" part: cheap, no softmax needed before selection)
T = top-k_i(g_i) over all i (pick the k highest-scoring positions; the result is a set of k indices, not a single argmax)
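The three steps above, written out in plain Python using the simplified notation from my notes. The real lightning indexer is multi-head and runs in FP8. The vectors and the helper names here are made up for illustration.

```python
def gated_scores(q, W, b):
    """g_i = max(0, q . W_i + b_i) for every candidate position i."""
    return [
        max(0.0, sum(qd * wd for qd, wd in zip(q, W_i)) + b_i)
        for W_i, b_i in zip(W, b)
    ]


def select_topk(g, k):
    """Indices of the k largest gated scores (ties broken by lower index)."""
    order = sorted(range(len(g)), key=lambda i: (-g[i], i))
    return sorted(order[:k])


q = [1.0, -1.0]                                         # query vector
W = [[0.5, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, -1.0]]   # one row per position
b = [0.0, 0.0, 0.0, 0.5]

g = gated_scores(q, W, b)
print(g)                  # [0.5, 0.0, 1.0, 2.5]
print(select_topk(g, 2))  # [2, 3]
```

Position 1 has a negative raw score, so the ReLU gate zeroes it out before selection.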

This sits underneath MLA (Multi-head Latent Attention). The reason MLA matters here: instead of every head keeping its own full KV, MLA squeezes all heads' KV into one shared low-rank latent vector — latent = x·W^D (down-projection). The indexer scores against this compressed representation, which is part of why it's so much cheaper per-FLOP than the main attention.

4. Two ways to find the F/S pattern

The question is: which layers do you keep as F? Two answers, training-free and training-aware — and notably, the "obvious" third answer (similarity-based) fails. Order of discovery matters here, so I'm keeping it in the order the paper actually tried things.

4.1 Why the naive static pattern fails

The dumbest idea: just alternate uniformly, e.g. F S S S F S S S ... (1 F every 4 layers). This doesn't work well. Why: indexer "importance" is not uniform across depth. Some layers — especially early/transitional ones — are way more sensitive to losing their own indexer than others. A fixed period can easily land an F on a redundant layer and an S on a critical one. You need the model (or data) to tell you which layers are safe to share.

4.2 Training-free IndexCache — greedy search

No weight updates at all. Just:

Start with all layers = F.
Pool of candidate layers = {2, 3, ..., N} (layer 1 is always F — has to seed the cache).
Pick a small calibration dataset (cached batches from training data — same batches reused for every candidate evaluation, so loss differences come purely from the pattern, not data noise).
For each step: try flipping every remaining F layer to S, one at a time, measure resulting LM loss on the calibration set, and commit whichever flip increases the loss the least.
Repeat for K steps, where K = target number of S layers (e.g. K = 3N/4 to keep only 1/4 of indexers).

This is literally a greedy "convert layers one-by-one, always pick the one with minimum loss increase" search. The full search takes O(N²) forward passes (up to N−1 candidates at each of up to N−1 steps), but if you've got pipeline-parallel stages (P of them), you can split the layers into P blocks and search them in parallel, cutting total passes by roughly P×.
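A minimal sketch of that greedy loop. The loss function here is a toy I invented (each layer has a fixed "sensitivity" that gets added when it becomes S); in the real procedure loss_fn would be a forward pass over the cached calibration batches, and the loss is not additive like this.

```python
def greedy_pattern(num_layers, num_shared, loss_fn):
    """Greedily flip F layers to S, always committing the flip with the lowest loss."""
    pattern = ["F"] * num_layers
    history = []
    for _ in range(num_shared):
        best_layer, best_loss = None, float("inf")
        for layer in range(1, num_layers):      # layer 0 always stays F
            if pattern[layer] != "F":
                continue
            trial = pattern.copy()
            trial[layer] = "S"
            loss = loss_fn(trial)               # stands in for one calibration pass
            if loss < best_loss:
                best_layer, best_loss = layer, loss
        pattern[best_layer] = "S"
        history.append((best_layer, best_loss))
    return pattern, history


# Toy loss: hypothetical per-layer sensitivity, purely for illustration.
sensitivity = [9.0, 0.1, 0.5, 0.05, 2.0, 0.2]

def toy_loss(trial):
    return 2.0 + sum(sensitivity[i] for i, role in enumerate(trial) if role == "S")

pattern, history = greedy_pattern(num_layers=6, num_shared=3, loss_fn=toy_loss)
print("".join(pattern))                  # FSFSFS
print([layer for layer, _ in history])   # [3, 1, 5]
```

The history list is the per-step loss curve: plotting the second element of each tuple is how you'd see the "kink" between easy and critical layers.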

What you get out of this (empirically, from the paper's 30B DSA model + GLM-5):

The searched pattern reliably beats uniform interleaving at the same retention ratio.
The per-step loss curve has a visible kink — first ~20 layers are "easy" (cheap to convert), the rest are "critical" (loss jumps fast). So there's a real ordering of indexer importance baked into the model, not noise.
This ranking is stable across different calibration sets — it's an intrinsic property of the trained model, not a calibration artifact.
Retaining only 1/4 of indexers (75% removed) with the searched pattern matches the original model's downstream performance almost exactly.

4.3 Training-aware IndexCache — multi-layer distillation

If you're willing to retrain (continued pretraining, not from scratch), you can go further: force the indexer to actually learn to serve multiple layers, instead of hoping a pattern search finds layers that happen to tolerate sharing.

Standard DSA already trains each layer's indexer via KL-divergence distillation against that same layer's aggregated attention distribution p_t⁽ˡ⁾. The extension here: if layer ℓ is F and serves S layers ℓ+1, ..., ℓ+m, train its indexer against all of them jointly:

L_multi = Σ_{j=0}^{m} [ 1/(m+1) · Σ_t D_KL( p_t^(ℓ+j) || q_t^(ℓ) ) ]

where:

q_t⁽ˡ⁾ = indexer's own output distribution (softmax of its scores) at layer ℓ
p_t⁽ˡ⁾ = the real aggregated attention distribution at layer ℓ (averaged across heads)
1/(m+1) = just averaging over however many layers reuse this same index

Important note (training detail I almost missed): you don't do this from random init. A randomly initialized model's attention distribution has no real structure yet — forcing the indexer to chase an undefined target just injects noise. So this is always done as continued pretraining / fine-tuning on top of an already-trained DSA model, in two stages: a frozen "dense warm-up" that trains only the indexer, then a "sparse training" phase that activates top-k and trains everything jointly.

5. The proof: L_multi and L_avg give the exact same gradient

This is the part of my notes that was the messiest, so here's the clean derivation.

Define the averaged target distribution across the m+1 served layers:

p̄_t = Σ_{j=0}^{m} [ 1/(m+1) · p_t^(ℓ+j) ]

and the single-target loss using that averaged target:

L_avg = Σ_t D_KL( p̄_t || q_t^(ℓ) )

Claim: ∇_θ L_multi = ∇_θ L_avg.

Proof. The key trick: in D_KL(p || q), only q depends on the trainable parameters θ (p is just data — the real attention distribution, treated as a fixed target with stop-gradient). So when you differentiate KL divergence w.r.t. θ, the entropy term of p (which doesn't depend on θ) vanishes entirely. What's left is just the cross-entropy term:

∇_θ D_KL(p || q_t^(ℓ)) = -∇_θ Σ_s p(s) · log q_t^(ℓ)(s)

This is the step I got stuck on in my notebook — I wasn't sure why only the log q term survives. The answer is straightforward once you write KL out fully:

D_KL(p || q) = Σ_s p(s) log p(s)  −  Σ_s p(s) log q(s)
                └──────┬──────┘     └───────┬───────┘
              entropy term of p      cross-entropy term
              (no θ dependence,        (only term with θ,
               gradient = 0)             via q = softmax(indexer))

Now apply this to L_multi:

∇_θ L_multi = - Σ_{j=0}^{m} [1/(m+1)] Σ_t ∇_θ Σ_s p_t^(ℓ+j)(s) log q_t^(ℓ)(s)

The sums over j and s are both finite, so they can be swapped, and log q_t^(ℓ)(s) does not depend on j, so it factors out of the sum over j:

            = - Σ_t ∇_θ Σ_s [ Σ_{j=0}^{m} (1/(m+1)) p_t^(ℓ+j)(s) ] · log q_t^(ℓ)(s)
                                  └──────────────────┬──────────────────┘
                                                    = p̄_t(s)

            = - Σ_t ∇_θ Σ_s p̄_t(s) log q_t^(ℓ)(s)
            = ∇_θ L_avg.   ∎

So averaging before taking KL and summing the KL terms after are mathematically identical at the gradient level — the indexer ends up being pulled toward the centroid of all the attention distributions it serves, not toward any one layer.
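I also sanity-checked the claim numerically. This is a small sketch with one token, three served layers and made-up target distributions. The indexer "parameters" are just the logits z, and the gradient is taken by finite differences.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l_multi(z, targets):
    """Average of KL(p_j || q) over all served layers j."""
    q = softmax(z)
    return sum(kl(p, q) for p in targets) / len(targets)

def l_avg(z, targets):
    """KL(p_bar || q) against the averaged target."""
    q = softmax(z)
    p_bar = [sum(col) / len(targets) for col in zip(*targets)]
    return kl(p_bar, q)

def num_grad(f, z, eps=1e-6):
    """Central finite-difference gradient of f at z."""
    grad = []
    for i in range(len(z)):
        hi = z.copy(); hi[i] += eps
        lo = z.copy(); lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

targets = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]  # made-up p^(l+j)
z = [0.2, -0.5, 1.0]                                            # indexer logits

g_multi = num_grad(lambda v: l_multi(v, targets), z)
g_avg = num_grad(lambda v: l_avg(v, targets), z)
max_gap = max(abs(a - b) for a, b in zip(g_multi, g_avg))
print(max_gap < 1e-6)                            # gradients agree
print(l_multi(z, targets) > l_avg(z, targets))   # loss values differ by a constant
```

The two losses are not equal as numbers (L_multi is larger by a term that only depends on the targets), which is exactly why the equivalence holds at the gradient level and not at the loss level.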

Then why use L_multi in practice if they're equivalent? Pure memory/engineering reason, as far as I can tell: with L_multi, each served layer can add its own KL term against the F layer's q as soon as its attention distribution p is available, so nothing has to be buffered across layers. With L_avg, you'd have to hold on to p for every served layer until the average can be formed, which means extra memory overhead and extra runtime cost for no actual gain, since the gradient comes out identical either way.

My takeaway after sitting with this for a while: a lot of "novel" architecture papers ultimately reduce to "design the right loss function for what you want, and let the network figure out the rest." This derivation is a good concrete example — the multi-layer trick isn't a new optimization method, it's just an equivalent (and cheaper) way to write the same gradient.

6. Performance (30B DSA model, 200K context)

Metric                            Standard DSA   + IndexCache (1/4 retained)
Prefill latency                   19.5 s         10.7 s (1.82× speedup)
Decode throughput (per request)   58 tok/s       86 tok/s (1.48× speedup)

Why a uniform pattern works after training but not before: without retraining, the greedy search has to avoid sensitive layers because the model was never trained to tolerate sharing. Certain layers are tightly coupled to their own indexer's exact top-k, and feeding them someone else's indices causes a distribution shift that breaks things. Once you train with the multi-layer distillation loss, the S layers themselves learn to adapt to inherited indices, and the F layer's indexer learns to produce a selection that generalizes across all the layers it serves. That joint adaptation is what makes even a dumb uniform pattern work fine after training: the layer-specific sensitivity just disappears.

Extra structural note from the overlap heatmap: the first layer is always kept as a full F layer (it has to seed the index cache, and early layers attend to a fundamentally different token subset than later ones — overlap with deep layers is ≤0.4). The strongest, most similar index regions cluster near the diagonal — i.e., a layer's indexer output looks most like its immediate neighbors, decaying as you move further away.

7. The failure case — and why it's actually an important negative result

Before landing on the greedy LM-loss search, the natural-seeming alternative was tried: pick the sharing pattern by directly maximizing cosine similarity between attention outputs, since that's cheaper to compute than running full LM-loss evaluations.

Build an N×N similarity matrix S[i][j] = cosine similarity between layer i's attention output using its own indexer vs. using layer j's indexer instead. Then solve for the best F/S assignment with dynamic programming:

dp[i][k] = max over j<i, c_j=F of:
              dp[j][k-1] + Σ_{m=j+1}^{i-1} S[m][j]

— i.e., find the best previous F layer to "branch" from, accumulating similarity scores for every S layer that would reuse it. Solvable exactly by backtracking through the DP table.
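For reference, a small sketch of that DP as I understand it. sim[m][j] plays the role of S[m][j] above, layers are 0-indexed, and the number of F layers is fixed up front. The similarity values are invented, and the function name is mine.

```python
def best_pattern(sim, num_full):
    """Choose num_full F layers (layer 0 forced) maximizing summed S-layer similarity."""
    n = len(sim)
    NEG = float("-inf")

    def shared_gain(j, i):
        # similarity collected by S layers j+1 .. i-1, all reusing F layer j
        return sum(sim[m][j] for m in range(j + 1, i))

    # dp[i][k]: best score so far when layer i is the k-th F layer
    dp = [[NEG] * (num_full + 1) for _ in range(n)]
    parent = [[None] * (num_full + 1) for _ in range(n)]
    dp[0][1] = 0.0
    for i in range(1, n):
        for k in range(2, num_full + 1):
            for j in range(i):
                if dp[j][k - 1] == NEG:
                    continue
                cand = dp[j][k - 1] + shared_gain(j, i)
                if cand > dp[i][k]:
                    dp[i][k], parent[i][k] = cand, j

    # Layers after the last F layer also reuse it, so add that tail.
    best_last, best_score = None, NEG
    for i in range(n):
        if dp[i][num_full] == NEG:
            continue
        total = dp[i][num_full] + shared_gain(i, n)
        if total > best_score:
            best_last, best_score = i, total

    # Backtrack through the parent table to recover the F layers.
    pattern = ["S"] * n
    i, k = best_last, num_full
    while i is not None:
        pattern[i] = "F"
        i, k = parent[i][k], k - 1
    return pattern, best_score


# Made-up similarities: layer 1 copies layer 0 well, layer 3 copies layer 2 well.
sim = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.2, 0.3, 1.0, 0.0],
    [0.1, 0.2, 0.95, 1.0],
]
pattern, score = best_pattern(sim, num_full=2)
print("".join(pattern))  # FSFS
```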

This failed. The similarity-optimal pattern performed about the same as plain uniform interleaving — both clearly worse than the greedy LM-loss search. The reason is the core insight of the whole negative result:

Cosine similarity is a local metric — it only tells you how well-preserved a single layer's output is in isolation. It can't see how small token-selection mismatches propagate and compound through all the downstream layers. Two layers can have near-identical attention outputs (similarity ≈ 1) yet differ in exactly the handful of tokens that turn out to matter several layers later. Those subtle errors accumulate — and a layer-local similarity score has no way to predict that.

The LM-loss-based greedy search avoids this because it's a global, end-to-end signal — it measures the actual downstream effect of a sharing decision on the whole model's output, not just on one layer's local activation. This is the real lesson: local geometric similarity is a tempting cheap proxy, but for anything where errors compound across depth, you need an end-to-end metric.

My summary of the idea in one line

DSA's indexer recomputes "who matters" from scratch at every layer even though the answer barely changes between adjacent layers — IndexCache just caches that answer and reuses it, and the only real engineering question is which layers are allowed to skip recomputation, which can be found either by greedy search (no training) or learned directly via a provably-equivalent averaged-KL loss (with training).

If you find any mismatched detail in this post, or want to contribute to the paper notes or to working code for IndexCache, please open an issue on
github.link

Why Your AI Agent Keeps Forgetting Things

Mahendra Gurjar — Tue, 10 Feb 2026 04:27:27 +0000

Ever built an AI agent that just... forgets stuff? You tell it something important, and 10 steps later, it's gone.

I spent days debugging this exact problem, and it led me to build MemTrace - a framework that automatically diagnoses why AI agents lose their memories.

The Problem

Imagine you're building a personal assistant agent:

You: "My deadline is Friday"
Agent: "Got it! Your deadline is Friday"

[Agent does 15-20 other things]

You: "When's my deadline?"
Agent: "I don't have that information"

What happened? The agent forgot. But why?

Did it run out of memory space? (Eviction)
Did it overwrite the deadline with something else? (Overwriting)
Did the LLM just hallucinate a wrong answer? (Hallucination)

Without proper diagnosis, you're just guessing. 🎲

The Solution: Event Sourcing for Memory

Here's the key insight: Track every single memory operation as an immutable event.

Instead of just storing data, we log:

✅ Every WRITE (when data is stored)
✅ Every READ (when data is retrieved)
✅ Every UPDATE (when data is overwritten)
✅ Every EVICT (when data is removed due to capacity)

Think of it like a flight recorder for your agent's brain. 🛩️

🏗️ How MemTrace Works

Step 1: Log Everything

Every memory operation creates an event:

MemoryEvent(
    event_type="WRITE",
    step=1,
    key="deadline",
    value="Friday",
    importance=0.9,  # How critical is this data?
    timestamp=1234567890
)
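One way to get the "immutable" part for free in Python is a frozen dataclass. This is a simplified sketch with the fields from the example above, not necessarily the exact class in the repo.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Any, Optional


@dataclass(frozen=True)
class MemoryEvent:
    event_type: str               # "WRITE", "READ", "UPDATE" or "EVICT"
    step: int
    key: str
    value: Optional[Any] = None
    importance: float = 0.0       # how critical is this data?
    timestamp: float = 0.0


event_log = []                    # append-only: events are added, never edited
event = MemoryEvent("WRITE", step=1, key="deadline", value="Friday",
                    importance=0.9, timestamp=1234567890)
event_log.append(event)

try:
    event.value = "Monday"        # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
print(mutated)  # False
```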

Step 2: Run Automated Tests

Generate 1000+ random scenarios:

Random memory operations (writes and reads)
Different capacity constraints (what if memory is limited?)
Varying importance levels (some data matters more)

Step 3: Auto-Diagnose Failures

For every READ operation, MemTrace automatically:

Finds the original WRITE event
Compares expected vs actual value
If they don't match, traces through the event log to find out why

Example Diagnosis:

❌ FAILURE DETECTED
Key: "deadline"
Expected: "Friday"
Actual: None

🔍 ROOT CAUSE: Memory Evicted
Evidence:
  • Written at step 1 (importance: 0.9)
  • Evicted at step 15 (reason: capacity overflow)
  • Read attempted at step 20

⚠️ CRITICAL FAILURE: High-importance data lost!
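The diagnosis logic boils down to replaying the log up to the failed READ. Here is a stripped-down sketch using plain dicts instead of the real event class. The function name, the cause labels and the 0.7 threshold default are illustrative.

```python
def diagnose_read(event_log, read_index, critical_threshold=0.7):
    """Explain one READ event by replaying everything logged before it."""
    read = event_log[read_index]
    original, last_change = None, None
    for ev in event_log[:read_index]:
        if ev["key"] != read["key"]:
            continue
        if ev["type"] == "WRITE":
            original, last_change = ev, None      # fresh write resets history
        elif ev["type"] in ("UPDATE", "EVICT"):
            last_change = ev

    if original is None:
        return {"cause": "invalid_read", "critical": False}
    if read["value"] == original["value"]:
        return {"cause": None, "critical": False}  # read passed

    if last_change is None:
        cause = "llm_hallucination"                # memory intact, answer wrong
    elif last_change["type"] == "EVICT":
        cause = "memory_evicted"
    else:
        cause = "memory_overwritten"
    critical = original.get("importance", 0.0) >= critical_threshold
    return {"cause": cause, "critical": critical,
            "written_at": original["step"], "changed_at":
            last_change["step"] if last_change else None}


log = [
    {"type": "WRITE", "step": 1, "key": "deadline", "value": "Friday", "importance": 0.9},
    {"type": "EVICT", "step": 15, "key": "deadline", "value": "Friday"},
    {"type": "READ", "step": 20, "key": "deadline", "value": None},
]
result = diagnose_read(log, 2)
print(result["cause"], result["critical"])  # memory_evicted True
```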

📊 What I Learned

After running 1000+ scenarios, here's what the data showed:

Finding #1: Capacity Matters (Obviously)

Low capacity (5 slots):  21% success rate
High capacity (30 slots): 29% success rate

But here's the surprise: Even with 30 slots, 71% of reads still failed!

Important Context: These scenarios are extremely random - agents write and read completely unrelated keys with no semantic connection. Real agents would perform much better because they use context and patterns. The low pass rate reveals the worst-case scenario under chaotic conditions.

Finding #2: Overwrites Are Sneaky

Memory evictions:   ~2200 failures
Memory overwrites:  ~240 failures

Overwrites happen regardless of capacity - they're about key reuse patterns, not memory size.

Finding #3: Critical Failures Are Real

Total memory failures: 2501
Critical failures:     892 (35.7%)

About 36% of memory failures (892 of 2501) involved high-importance data. That's your deadlines, user preferences, and key facts: the stuff that actually matters.

🎯 The Architecture (For the Experts)

┌─────────────────┐
│  User Command   │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│ StructuredAgent     │ ← Routes to STM or LTM
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ MemoryStore         │ ← Executes operation
│ (STM or LTM)        │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ MemoryEvent         │ ← Immutable event logged
│ (event_log)         │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ auto_evaluate_all() │ ← Finds all READ events
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ diagnose_failure()  │ ← Root cause analysis
└─────────────────────┘

Key Design Decisions:

Event Log = Ground Truth: The event log is append-only and never modified. It's the single source of truth for diagnosis.
Importance Tracking: Each event has an importance score (0.0-1.0). Critical failures are flagged when high-importance data (≥0.7) is lost.
Multi-Layer Memory: Separate STM (capacity-limited) and LTM (unlimited) with automatic routing.
Zero Manual Work: auto_evaluate_all() automatically finds every READ event and diagnoses failures without manual test case creation.
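To show how the store and the event log fit together, here is a toy capacity-limited STM that logs every operation. The oldest-first eviction policy is an assumption for this sketch, and the class is a simplified stand-in rather than the actual MemoryStore.

```python
from collections import OrderedDict


class ShortTermMemory:
    """Capacity-limited key-value store that logs every operation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # key -> (value, importance), oldest first
        self.event_log = []         # append-only
        self.step = 0

    def _log(self, event_type, key, value, importance):
        self.event_log.append({"type": event_type, "step": self.step,
                               "key": key, "value": value,
                               "importance": importance})

    def write(self, key, value, importance=0.0):
        self.step += 1
        if key in self.data:
            self._log("UPDATE", key, value, importance)   # key reuse = overwrite
        else:
            if len(self.data) >= self.capacity:
                # assumed policy: drop the oldest entry to make room
                old_key, (old_value, old_imp) = self.data.popitem(last=False)
                self._log("EVICT", old_key, old_value, old_imp)
            self._log("WRITE", key, value, importance)
        self.data[key] = (value, importance)

    def read(self, key):
        self.step += 1
        value, importance = self.data.get(key, (None, 0.0))
        self._log("READ", key, value, importance)
        return value


stm = ShortTermMemory(capacity=2)
stm.write("deadline", "Friday", importance=0.9)
stm.write("city", "Pune")
stm.write("lang", "Python")          # store is full, so "deadline" is evicted
print(stm.read("deadline"))          # None
print([e["type"] for e in stm.event_log])
```

The EVICT event carries the importance of the lost entry, which is what lets a later pass flag it as critical (importance ≥ 0.7).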

🚀 Try It Yourself

git clone -b ltm https://github.com/Mahendra1706/MemTrace.git
cd MemTrace
pip install -e .
python3 run.py

You'll see output like:

============================================================
MEMTRACE RANDOM TESTING - 1000 Scenarios
============================================================

Total Reads: 4993
✅ Passed: 741 (14.8%)
❌ Failed: 4252 (85.2%)

Failure Breakdown:
  • Memory Evicted: 2261
  • Memory Overwritten: 240
  • Invalid Read: 1708

------------------------------------------------------------
CRITICAL FAILURES (High-Importance Data Loss)
------------------------------------------------------------
Total Critical Failures: 892
  • Critical Evictions: 798
  • Critical Overwrites: 94
============================================================

🎓 What This Means for Your Agents

For Beginners:

Test your memory system before deploying
Know why failures happen, don't just guess
Track what matters with importance scores

For Experts:

Event sourcing enables complete audit trails
Statistical testing reveals failure patterns at scale
Importance-based diagnosis separates critical from trivial failures
Multi-layer architecture (STM/LTM) mirrors cognitive science models

🔮 What's Next?

Current Limitations:

Random scenarios: Completely random read/write patterns (worst-case testing)
No semantic understanding: Simple key-value storage, no context awareness
Structured commands: Not integrated with real LLM calls yet
Single-threaded: Sequential execution only

Future Vision: Semantic Memory

The next major upgrade will transform MemTrace from key-value storage to semantic memory:

Current (v1.1):

write("deadline", "Friday")
read("deadline")  # Must match exact key

Future (v2.0 - Semantic Search):

write("My project deadline is Friday", importance=0.9)
read("when is my project due?")  # Semantic match!
# Returns: "Friday" (understands the question relates to deadline)

How it will work:

Embeddings: Convert memories to vector representations
Similarity Search: Find relevant memories based on meaning, not exact keys
Context-Aware Retrieval: Understand relationships between memories
Realistic Behavior: Mimic how human memory actually works
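A rough idea of what the similarity-search step could look like. This sketch uses a bag-of-words vector as a toy stand-in for real embeddings (v2.0 would use an actual embedding model), and it returns the whole stored memory rather than extracting "Friday". All names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_read(query, memories):
    """Return the stored memory most similar to the query."""
    q = embed(query)
    return max(memories, key=lambda m: cosine(q, embed(m)))


memories = [
    "My project deadline is Friday",
    "Alice prefers tea over coffee",
    "The server password was rotated",
]
print(semantic_read("when is my project due?", memories))
# My project deadline is Friday
```

Word overlap only works here because the query happens to share words with the memory. Real embeddings are what would let "due" match "deadline".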

This will make agents perform much better than the current 14.8% pass rate, because they'll retrieve memories based on semantic relevance rather than exact key matches.

Other Planned Features:

Integration with LangChain/AutoGPT
Real-time monitoring dashboard
Advanced eviction policies (LRU, LFU, importance-based)
Consolidation logic (STM → LTM based on importance)
Vector database integration (Pinecone, Weaviate)

📚 References

GitHub: Mahendra1706/MemTrace (see ltm branch for latest)
Inspiration: Event sourcing patterns, MemGPT architecture, cognitive memory models
Related Work: LangChain memory modules, AutoGPT memory systems

💬 Let's Discuss

Have you dealt with memory failures in your AI agents? What strategies worked for you?

It's my first decent project (or that's what I think), so I'd love it if you visited the repo or dropped a comment!

I Tested 3000+ LLM Agent Memory Operations - Here's What I Found

Mahendra Gurjar — Thu, 29 Jan 2026 06:19:01 +0000

🤔 The Problem

If you've built LLM-based agents, you've probably noticed: they forget things.

A lot.

Your agent remembers the user's name in message 1, forgets it by message 5, and then hallucinates a completely different name by message 10.

But why do agents forget? Is it:

Memory capacity issues?
Information getting overwritten?
The LLM hallucinating?
Something else?

Nobody had data. Just anecdotes and frustration. So I built MemTrace to answer this question with actual statistics.

What I Built

MemTrace is a testing framework that tracks every single memory operation an agent makes and diagnoses why recalls fail.
Think of it like a "black box recorder" for agent memory.
Core idea:

Track every WRITE, READ, UPDATE, and EVICT operation
Compare what the agent returns vs. what was originally stored
Diagnose failures with evidence from the event log

I tested 1000 random scenarios with 3030 memory operations to find patterns.

Key Findings

Finding 1: Agents Forget 60% of the Time

Valid Recall Rate: 39.6%
That means when an agent tries to recall information it previously stored, it fails 6 out of 10 times.
(This excludes "invalid reads" where the agent tries to read something that was never written - those are test artifacts, not real failures)

Finding 2: Evictions Dominate

Memory Evicted: 46.2% of all failures
Nearly half of all memory failures happen because the agent ran out of space and had to evict old information.

Breakdown:

Memory Evicted: 994 failures (46.2%)
Invalid Read: 815 failures (37.9%)
Memory Overwritten: 343 failures (15.9%)
LLM Hallucination: 0 failures (0.0%) (Note: no hallucinations in this test because I used a deterministic agent. Real LLMs would show hallucinations too.)

Finding 3: Capacity Matters (Proven)

Validated Invariant: Capacity ↑ → Eviction ↓
I tested different memory capacities and found a clear pattern:

Capacity          Evictions   Pass Rate
Low (1-5)         1354        21.3%
Medium (10-15)    1150        25.6%
High (20-30)      994         29.0%

Higher capacity = fewer evictions = better recall.
Seems obvious, but now we have data to prove it.
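If you want to see how big the effect is, here is the arithmetic on the three capacity tiers, using only the eviction counts reported above.

```python
def eviction_drops(rows):
    """Percent reduction in evictions between each pair of consecutive tiers."""
    drops = []
    for (name_a, ev_a), (name_b, ev_b) in zip(rows, rows[1:]):
        drops.append((name_a, name_b, 100 * (ev_a - ev_b) / ev_a))
    return drops


rows = [("Low (1-5)", 1354), ("Medium (10-15)", 1150), ("High (20-30)", 994)]
drops = eviction_drops(rows)
for name_a, name_b, pct in drops:
    print(f"{name_a} -> {name_b}: evictions down {pct:.1f}%")
# Low (1-5) -> Medium (10-15): evictions down 15.1%
# Medium (10-15) -> High (20-30): evictions down 13.6%
```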

Finding 4: Overwrites Are Independent

Overwrites stay constant regardless of capacity.
Whether you have a capacity of 5 or 30, you get ~340 overwrites per 1000 scenarios.
Why? Overwrites depend on how often you reuse the same keys, not how much memory you have.

Event Sourcing

Every memory operation creates an immutable event:

MemoryEvent(
    event_id="uuid-1234",
    event_type=MemoryEventType.WRITE,
    memory_layer=MemoryLayer.STM,
    step=1,
    timestamp=1706345678.123,
    key="user_name",
    value="Alice",
    metadata={}
)

The event log becomes the single source of truth.

Automated Diagnosis

When a read fails, MemTrace analyzes the event history:

Scenario: Agent tries to read "deadline" but gets None

Event log shows:

WRITE deadline="Friday" (step 2)
EVICT deadline="Friday" (step 5, reason: capacity_overflow)
READ deadline=None (step 7)

Diagnosis: "memory_evicted"

Evidence: "Key was written at step 2, evicted at step 5, recall attempted at step 7"

4 Failure Types

Memory Evicted - Removed due to capacity constraints
Memory Overwritten - Updated with different value
Invalid Read - Never written in the first place
LLM Hallucination - Agent returns wrong value despite correct memory

System Invariants (Validated)

After testing 1000 scenarios, these patterns hold:

Capacity ↑ → Eviction ↓
Overwrite ~ independent of capacity
Invalid Read = scenario artifact
Unknown = 0 always (100% failure categorization)

What I Learned

1. Event Sourcing Is Powerful

Having a complete history of every operation makes debugging so much easier.

Instead of guessing why something failed, you can trace back through the exact sequence of events.

2. Capacity Is Critical

If your agent has limited memory, evictions will dominate your failures.

The data shows a clear downward trend: each step up in capacity tier cut evictions by roughly 14-15% (1354 → 1150 → 994).

3. Overwrites Are Sneaky

Overwrites happen when you reuse keys. They're independent of capacity, which means you can't solve them by just adding more memory.

You need better key management or versioning.

4. Testing Reveals Patterns

Before building MemTrace, I thought hallucinations would be the main issue.
Nope. Evictions were the top failure cause, roughly 3x more common than overwrites (994 vs. 343), and my deterministic test agent produced no hallucinations at all.

Real LLMs would show more hallucinations, but capacity is still a huge factor.

Feedback Welcome

This is my first open-source project and I'm still learning!

If you:

Have ideas for improvements
Found a bug
Have questions

Please reach out!

GitHub