DEV Community: Swapnanil Saha

Embedding Dilution: Why Semantic Code Search Misses the Answer

Swapnanil Saha — Mon, 20 Jul 2026 18:12:46 +0000

I had a query that should have been boring. "Get a single object from the database." In Django, that is QuerySet.get — the method whose entire job is to fetch exactly one row matching your lookup, or raise. Its docstring reads, almost verbatim: "Perform the query and return a single object matching the given keyword arguments." That is not a loose match to my query. It is nearly a paraphrase of it.

The chunk was indexed. I checked. The embedding was computed and stored like every other chunk in the corpus. And when I ran the search, the method was not the top result and not at rank fifty: it was not in the top two hundred candidates at all. It never made it far enough into the pipeline to be judged. Two hundred other chunks, none of which described getting a single object from the database, beat it into the pool.

That failure bothered me enough to take apart the whole retrieval path and measure where the answer died. This post is the post-mortem. The system under test is vectr, a semantic code-search and working-memory tool I build; the corpus is Django, used purely as a public witness. But the mechanism I found is not specific to code, and it is not specific to my tool. It is a property of how a single embedding vector has to summarize a long, mixed passage — and it quietly limits recall in a lot of retrieval systems that look like they are working. If you have never read the embeddings foundations, my earlier complete guide to text embeddings and RAG is the primer; this is the failure that guide's happy path hides.

A note on terms. An embedding is a list of numbers (a vector) that captures a passage's meaning, so passages with similar meaning get similar numbers. A chunk is one indexed piece of the codebase, roughly one method plus a little context. A docstring is the documentation written inside a function. Cosine similarity is the cosine of the angle between two vectors: 1.0 means identical direction, 0 means orthogonal (unrelated).

Part 1 · The Miss

The Pipeline and the Query That Broke It

Before the failure makes sense, you need the shape of the pipeline it happened in, because the shape is where the whole story turns. At measurement time, vectr retrieved in the way most production semantic-search systems do — a two-stage funnel.

Stage one is hybrid retrieval. It runs two searches in parallel and merges them. One leg is dense retrieval: encode the query into a vector, compare it against the precomputed vector of every chunk, and rank chunks by cosine similarity, so it can match on meaning even with no shared words. The other leg is BM25, a keyword-scoring function that rewards exact term overlap. Together they produce a candidate pool of the top 200 chunks.

Stage two reranks that pool. A cross-encoder reranker, bge-reranker-base, reads the query together with each of the 200 pool members and scores every query-chunk pair jointly, followed by a quality pass. That reranker is the smart part of the system. It is also the expensive part, which is exactly why it only ever sees 200 candidates instead of all 40,538.

The one number that decides everything. The reranker, the importance priors, the quality scores — every clever thing downstream operates only on the 200 chunks in the pool. A chunk that is not in the pool is invisible to all of it. So the first question for any retrieval miss is never "why did the reranker score it low." It is "was it even in the pool to be scored." Recall gates everything after it.

Now the pieces that matter for the failure. The dense embedder was snowflake-arctic-embed-m-v1.5 — a general text embedder, not a code-specialized one. Hold onto that; it is not the villain, but it shapes the numbers. The corpus was a Django checkout from June 2026: 4,129 files, 40,538 indexed chunks. And the chunk for QuerySet.get was, by any reasonable standard, ideal. It carried a class marker (QuerySet), the full method signature, and that near-perfect docstring — followed by roughly thirty lines of mechanical implementation.

That last clause is the whole problem in embryo. But to see why, we first have to pin down where the miss happened, because the fix depends entirely on that.

The Miss Is at Pool Entry, Not Ranking

My first instinct was the wrong one, and it is probably yours too: the reranker must have mis-scored it. Bump the reranker, add an importance prior for well-known symbols, tune the quality pass. Every one of those fixes operates on the pool. So I checked the pool directly, leg by leg, and the reranker turned out to be innocent — it never got the chance to be guilty.

The dense leg. For every natural-language phrasing I tried — "get a single object from the database," "fetch one row matching criteria," "retrieve a single record by lookup" — QuerySet.get was absent from the top 200 dense results. Not low-ranked. Absent. The only phrasing that got it into the dense pool at all was a deliberately ORM-flavored control, written in Django's own vocabulary, and even that reached only #123 of 200 — barely inside a pool it should have topped.

The keyword leg. BM25 was all over the place, which is its nature: it lives and dies on exact term overlap. Phrase the query as "return exactly one matching object or raise…" — words that literally appear near the method — and BM25 ranked it #1. Phrase it as "fetch one row matching criteria" and the same method fell to #127. Other phrasings missed entirely. BM25 wasn't a safety net; it was a coin whose bias depended on whether I happened to echo the source text.

Where the answer actually died. The reranker never saw QuerySet.get for the natural-language queries, because QuerySet.get was never in the 200 it was handed. This is the structural point the rest of the post builds on: if the right chunk never enters the pool, nothing downstream can save it. A brilliant reranker on an incomplete pool is a brilliant answer to the wrong question.

This is also the moment most retrieval dashboards lie to you by omission. They display the reranked top-k — the final, polished output — which looks fine because the reranker did a competent job on the pool it received. The miss is one layer up, invisible on that screen. You have to instrument pool entry itself to see it.

Trace the chunk's fate through the funnel (three real scenarios from this run):

Phrasing	Dense top-200	BM25 top-200	Fused pool	Outcome
A · "get a single object from the database"	absent (>200)	missed	not in pool	miss
B · ORM-vocabulary control	#123 of 200	—	enters (weak)	reaches reranker
C · "return exactly one matching object or raise…"	—	#1	dropped	miss

Scenario C is the case where BM25 ranked the target first, yet the dense-dominated fusion dropped it before the returned top-60 — more on that in Part 4. A dash means that leg's rank was not separately recorded for that phrasing.
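A trace like the one above can be produced by recording the target's rank in each leg and in the fused pool. The following is a minimal sketch, not vectr's instrumentation: the function names, the ranked-list-of-ids shape, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def rank_of(target: str, ranked_ids: List[str]) -> Optional[int]:
    """Return the 1-based rank of target in a ranked list, or None if absent."""
    try:
        return ranked_ids.index(target) + 1
    except ValueError:
        return None


def trace_pool_entry(target: str, legs: Dict[str, List[str]],
                     fused_pool: List[str], pool_size: int = 200) -> Dict[str, Optional[int]]:
    """Report where the target sits in each leg and in the fused pool.

    A rank of None means the target never reached that stage within pool_size.
    """
    report = {name: rank_of(target, ids[:pool_size]) for name, ids in legs.items()}
    report["fused_pool"] = rank_of(target, fused_pool[:pool_size])
    return report


# Toy data shaped like scenario A: the target is in neither leg nor the pool.
dense = [f"chunk_{i}" for i in range(200)]
bm25 = [f"chunk_{i}" for i in range(100, 300)]
fused = dense[:100] + bm25[:100]
legs = {"dense": dense, "bm25": bm25}

report = trace_pool_entry("QuerySet.get", legs, fused)
print(report)
```

Logging this report for a handful of known-answer queries is enough to tell a pool-entry miss (every value is None) apart from a ranking miss (the target is in the pool but scored low).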

Part 2 · The Cause

Dilution, Measured

So the dense leg failed to rank a chunk whose docstring paraphrases the query. Why? The lazy answer is "the embedder isn't good enough, throw a bigger model at it." That answer is wrong, and I can show it is wrong with one micro-experiment.

I took the exact same embedder and embedded two things. First, the full chunk: class marker, signature, perfect docstring, plus the ~30 lines of body — the combinator handling, the _chain() call, the select_for_update checks, the NotSupportedError raises. Second, just the signature and docstring alone — call it the "purpose-only" version, the part that says what this is for with none of the machinery that says how it does it. Then I measured cosine similarity from each version to four query phrasings.

Query	Full chunk	Purpose-only	Delta
get a single object from the database	0.601	0.706	+0.105
fetch one row matching criteria	0.511	0.606	+0.095
retrieve a single record by lookup	0.529	0.625	+0.096
return exactly one matching object or raise…	0.617	0.678	+0.061

Purpose-only is closer to the query on every single phrasing, by +0.06 to +0.10 cosine. Same embedder, same docstring, same query. The only thing I removed was the implementation body — and the chunk got measurably more relevant to the thing it does. The signal that answers the query was in the chunk the whole time. The body was burying it.

Why a longer chunk drifts away from its own purpose. An encoder turns a passage into one fixed-size vector. Cosine similarity then compares directions:
cos(q, d) = (q · d) / (‖q‖ · ‖d‖)
The catch is d. Whether the model builds it by literally averaging its token vectors (mean pooling) or by a summary token that attends across all tokens, the result is one point that must stand in for the whole passage. For the literal mean-pooling case it is just an average:
d ≈ (1/N) · Σ eᵢ
A CLS- or attention-pooled encoder weights the tokens unevenly instead of averaging them flat, but the consequence is the same. Add thirty lines of body and you pour in dozens of token vectors pointing toward "loop, chain, raise, check." They pull d toward the body's center of mass and away from the docstring's direction. The docstring's contribution does not vanish — it gets outvoted. That is dilution: not a missing signal, a drowned one.
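The averaging argument can be checked on toy numbers. This sketch assumes literal mean pooling and invented 2-D token vectors, where the first axis stands for "purpose" and the second for "mechanics"; a real encoder is far higher-dimensional and weights tokens unevenly, so only the direction of the effect carries over.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(token_vectors: List[Sequence[float]]) -> List[float]:
    """Average token vectors into one passage vector (flat mean pooling)."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Invented toy vectors: docstring tokens lean toward "purpose",
# body tokens lean toward "mechanics".
docstring_tokens = [(0.9, 0.1)] * 10
body_tokens = [(0.1, 0.9)] * 30
query = (1.0, 0.0)  # a purely purpose-shaped query

purpose_only = mean_pool(docstring_tokens)
full_chunk = mean_pool(docstring_tokens + body_tokens)

print("purpose-only:", round(cosine(query, purpose_only), 3))
print("full chunk:  ", round(cosine(query, full_chunk), 3))
```

The docstring tokens are identical in both versions; the full-chunk vector scores lower only because thirty body tokens outnumber ten docstring tokens in the average.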

Now the calibration that turns this from a curiosity into a recall failure. In this embedding space, for the "get a single object" query, the weakest chunk that made it into the 200-deep pool sat at a cosine of about 0.697. Purpose-only scored 0.706 — over the line, into the pool, in front of the reranker. The full chunk scored 0.601 — under the line, out of the pool, invisible. The entire difference between "the reranker gets a shot at the right answer" and "the right answer is never considered" is that +0.10 of cosine the body ate.

A caveat on that floor: 0.697 was calibrated on the first query only, and the pool floor is a per-query quantity — treat it as a reference for that query, not a universal threshold. The general lesson is the recovered delta, which is positive for all four.
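The pool-floor comparison is simple enough to write down. In this sketch the floor is taken as the score of the weakest chunk that still makes a 200-deep pool for one query; the competitor scores are synthetic and built so the floor lands on the 0.697 reported above, and the helper names are hypothetical.

```python
from typing import Sequence


def pool_floor(scores: Sequence[float], pool_size: int = 200) -> float:
    """Score of the weakest chunk that still enters the pool for this query."""
    if not scores:
        raise ValueError("need at least one score")
    ranked = sorted(scores, reverse=True)
    return ranked[min(pool_size, len(ranked)) - 1]


def clears_floor(score: float, floor: float) -> bool:
    """True if a chunk with this score would displace the weakest pool member."""
    return score > floor


# Synthetic competitor scores for one query: 200 strong chunks, 1,000 weak ones.
competitor_scores = [0.697 + 0.001 * i for i in range(200)] + [0.5] * 1000
floor = pool_floor(competitor_scores, pool_size=200)

print("floor:", floor)
print("purpose-only enters pool:", clears_floor(0.706, floor))
print("full chunk enters pool:  ", clears_floor(0.601, floor))
```

Because the floor is recomputed from each query's own score distribution, it should be measured per query rather than hard-coded.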

The instinct this kills: "just enrich the input." The reflex when recall is bad is to feed the embedder more context — add the class body, the surrounding file, richer metadata. Here that makes it worse. The purpose signal is already present; enriching the chunk only adds more body tokens to average against it. You cannot fix a drowning by adding water. The problem is the pooling of a long, mixed chunk into one vector — so the fix has to change what gets pooled, not what gets added.

Phrasing Doesn't Rescue It — and the Symbol Index Proves Why

There is an obvious objection here: maybe I just phrased the queries badly. Maybe the right words would have pulled the chunk in. So I ran a 60-query sweep — 10 topics, 6 phrasings each — to give phrasing every chance to matter.

It didn't. Rephrasing shuffled which wrong answers came back; it did not surface the right ones. Ask for a "signal dispatcher implementation" and the top results were 1-to-6-line re-export stubs — the little shim modules that just re-expose a name — while the real Signal class, the thing that actually implements dispatch, was absent from the top three. Across the sweep, conceptual queries kept returning wrong or incomplete top-5 sets no matter how I said them. Phrasing is a knob on the query side. Dilution is a problem on the document side. Turning the query knob cannot un-average a document vector.

Then came the control that settled it. vectr also keeps a deterministic symbol graph: a plain lookup from a name to its definition site, no embeddings involved. For every canonical symbol that semantic search had just missed, the deterministic lookup resolved it exactly and instantly — Signal, BaseCache, Query, SQLCompiler, QuerySet.get, each to the correct file and line.

This is the control that localizes the bug. Same corpus, same index build. The symbol table resolves the target perfectly; semantic search cannot find it. That gap is not the parser's fault, not the chunker's fault, not a missing document. The index and the symbol table were correct. The failure is purely in the embedding and search layer. When two views of the same index disagree this cleanly, the broken one tells you exactly where to look.
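For readers who want to reproduce this kind of control, a deterministic name-to-definition lookup needs nothing beyond a parser. The sketch below uses Python's standard ast module on a toy source string; it is an assumption about the general shape of such a table, not a description of how vectr builds its symbol graph, and the file name is invented.

```python
import ast
from typing import Dict, Tuple


def build_symbol_table(source: str, filename: str) -> Dict[str, Tuple[str, int]]:
    """Map each qualified class or function name to (filename, line number)."""
    table: Dict[str, Tuple[str, int]] = {}

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                qualified = prefix + child.name
                table[qualified] = (filename, child.lineno)
                # Only descend into classes, so methods get "Class.method" names.
                if isinstance(child, ast.ClassDef):
                    visit(child, qualified + ".")

    visit(ast.parse(source), "")
    return table


# Toy module, not Django's real source.
SAMPLE = "class QuerySet:\n    def get(self):\n        pass\n\ndef reverse(name):\n    pass\n"
symbols = build_symbol_table(SAMPLE, "toy_module.py")
print(symbols["QuerySet.get"])
```

A lookup like this either resolves a name exactly or fails loudly, which is what makes it useful as a control against a probabilistic search layer.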

One more result, and it is the one that made me stop trusting cold semantic search on its own. I ran a round of "famous symbols" — targets every Django developer knows by heart. Cold semantic search was roughly a coin flip even there. get_object_or_404, QuerySet.get, and reverse were all absent from the top-5; ForeignKey came in at #2 and Paginator at #4, each sitting behind look-alikes with more generic wording. If a system cannot reliably surface get_object_or_404, the failure is not exotic. It is the common case wearing a docstring.

Part 3 · A Trap in the Scores

Normalized Scores Lie About Confidence

This one is a short aside, but it burned me while I was debugging the above, so it earns its place. While hunting the dilution bug I ran control queries for concepts that do not exist in Django core — CORS handling, for instance, which Django leaves to middleware and third-party packages. A search for something absent should come back empty, or at least visibly unsure.

It came back with five hits scored between 0.77 and 1.0, looking every bit as confident as a real match. Nothing in the corpus answered the query, and the system reported near-certainty anyway.

The reason is a modeling choice that is easy to make and easy to forget. The displayed score was the reranker's output after per-query normalization — rescaled so the best result of this query becomes ≈1.0. That rescaling throws away the only thing you needed: how good the top match is in absolute terms.

The top score is always ≈1.0 by construction. A per-query-normalized score can't tell you "nothing here matches," because it is defined to make the best available result look like a perfect one — even when the best available result is garbage. The number describes rank within the query, not relevance to the world. If you surface it as confidence, your UI will radiate certainty at the exact moment it has found nothing.

The practical rule that fell out of this: never show a per-query-normalized score as if it were confidence. Keep a non-normalized signal alongside it, such as a raw cosine or a BM25 floor, so the system retains an honest way to say "nothing here is actually close." Recall failures like the dilution one are already invisible enough; a score that reads 0.99 when nothing in the corpus matches makes them worse, because it converts a silent miss into a confident wrong answer.
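The effect is easy to reproduce. This sketch assumes the simplest form of per-query normalization, dividing by the top score, and assumes raw scores are positive; the raw values and the 0.5 floor are invented for illustration and are not vectr's numbers.

```python
from typing import List, Sequence


def normalize_per_query(raw_scores: Sequence[float]) -> List[float]:
    """Rescale so the best result of this query becomes 1.0 (assumes positive scores)."""
    top = max(raw_scores)
    return [score / top for score in raw_scores]


def has_real_match(raw_scores: Sequence[float], floor: float) -> bool:
    """Use the raw, un-normalized top score to decide whether anything is close."""
    return max(raw_scores) >= floor


RAW_FLOOR = 0.5  # hypothetical calibration point, corpus-specific in practice

real_hit = [0.92, 0.80, 0.71]  # a query the corpus actually answers
no_match = [0.21, 0.18, 0.16]  # a query about something absent from the corpus

for name, raw in (("real_hit", real_hit), ("no_match", no_match)):
    shown = [round(s, 2) for s in normalize_per_query(raw)]
    print(name, "normalized:", shown, "real match:", has_real_match(raw, RAW_FLOOR))
```

Both queries display a top score of 1.0 after normalization; only the raw signal separates them.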

Part 4 · The Fix That Shipped

The Fix I Shipped: Dual-Vector Indexing

The measurement points at its own fix. If purpose-only embeddings score +0.06 to +0.10 higher — enough, in the calibrated case, to clear the pool floor — then the answer is not to throw away the body vector. It is to also keep a purpose vector, and let a query match whichever one fits it.

That is dual-vector indexing. At index time, store two vectors per symbol: a purpose vector built from the qualified signature and docstring with the body stripped out, alongside the existing full-body vector. At query time, retrieve over both, and blend or take the max of the two similarities for pool entry. Nothing else in the pipeline changes.
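As a rough illustration of the query-time half, here is a sketch that scores each symbol by the max of its two similarities. The index layout, the function names, and the 2-D toy vectors are assumptions; the text above allows either blending or taking the max, and this sketch picks the max.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def dual_score(query: Vector, purpose: Vector, body: Vector) -> float:
    """Pool-entry score: whichever of the two vectors fits the query better."""
    return max(cosine(query, purpose), cosine(query, body))


def search(query: Vector, index: Dict[str, Tuple[Vector, Vector]], k: int) -> List[str]:
    """Rank symbols by dual score and return the top k names."""
    ranked = sorted(index, key=lambda name: dual_score(query, *index[name]), reverse=True)
    return ranked[:k]


# Toy index: symbol -> (purpose vector, full-body vector). Values are invented.
index = {
    "QuerySet.get": ((0.9, 0.1), (0.3, 0.7)),
    "QuerySet._chain": ((0.2, 0.8), (0.1, 0.9)),
}
intent_query = (1.0, 0.0)  # "what is this for" shaped
detail_query = (0.0, 1.0)  # "how does it work" shaped

print(search(intent_query, index, k=1))
print(search(detail_query, index, k=1))
```

Taking the max means a symbol is never scored lower than it was with the body vector alone, so the change can only add candidates to the pool, not remove the ones that already qualified on body similarity.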

Analogy · The book and its spine. A full-body embedding is like shelving a book by blending every word in it into one average color. Two books with very different covers but similar bulk end up the same muddy shade, and you can't find either by its subject. The purpose vector is the printed spine: title and one-line description, nothing else. You keep the whole book on the shelf — you just also write a legible spine, so someone looking for the subject can find it without reading all 300 pages first. Dual-vector indexing shelves every symbol with both: the full text, and a spine.

What I like about this shape is that it does not privilege one kind of query. Intent-shaped queries — "get a single object from the database" — land on the purpose vector, where the docstring is undiluted. Implementation-detail queries — "where is select_for_update checked" — still land on the body vector, because that string only exists in the body. You are not trading one failure mode for its mirror image; you are giving each query the surface it needs.

Two properties make it safe to apply blindly across a whole corpus:

Undocumented symbols degrade gracefully. No docstring? The purpose vector is just the qualified signature. It never gets worse than the name itself, and the name is often enough.
It is a uniform structural transform. The same body-stripping rule applies to every symbol, at index time, with no query-side special-casing — no keyword lists, no "if the query looks conceptual, reroute it." Query-side heuristics are the thing I have spent a long time deleting from this system; they are brittle, they compound, and they never generalize. A transform on the index side has none of those failure modes.
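The index-time half, building the text that feeds the purpose vector, can be sketched with the standard ast module. This is one possible body-stripping rule for Python sources, assuming Python 3.9+ for ast.unparse; the function name and the sample class body are hypothetical, and vectr's own transform may differ.

```python
import ast
from typing import Dict


def purpose_texts(source: str) -> Dict[str, str]:
    """Map each function's qualified name to 'signature + docstring', body stripped."""
    texts: Dict[str, str] = {}

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                visit(child, prefix + child.name + ".")
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                qualified = prefix + child.name
                signature = f"def {qualified}({ast.unparse(child.args)})"
                docstring = ast.get_docstring(child)
                # Undocumented symbols fall back to the qualified signature alone.
                texts[qualified] = signature if docstring is None else f"{signature}\n{docstring}"

    visit(ast.parse(source), "")
    return texts


# Toy source: the docstring is QuerySet.get's, the bodies are placeholders.
SAMPLE = '''
class QuerySet:
    def get(self, *args, **kwargs):
        """Perform the query and return a single object matching the given keyword arguments."""
        clone = self._chain()
        return clone

    def _chain(self):
        return self
'''

texts = purpose_texts(SAMPLE)
print(texts["QuerySet.get"])
print(texts["QuerySet._chain"])
```

The same rule runs over every function with no per-symbol decisions, which is the property that makes it safe to apply blindly: a documented method yields signature plus docstring, an undocumented one yields the signature only.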

It is not free, though, and I would rather name the costs than let you find them. Storing a second vector per symbol roughly doubles the number of vectors in the index — reason enough to confirm dilution is actually your problem before you spend on it. And it does not stand alone: dual-vector composes with structural ranking signals like symbol importance rather than replacing them. A diluted docstring and an under-weighted call graph are different failure classes; fixing one leaves the other exactly where it was.

The honest sequence, stated plainly: I measured the cause first, then shipped the fix. The spike proved the mechanism — purpose-only embeddings clear the pool floor where full-chunk embeddings do not — and that measurement, not a hunch, is why dual-vector indexing shipped in vectr v1.0.0 on 8 July 2026. If someone tells you a retrieval change "should work," ask them for the cosine table. This one had one before a line of the fix was written.

The fusion bug hiding underneath. While validating the direction I tripped over a second, separate problem worth its own paragraph, because it will bite anyone running hybrid retrieval. Remember that BM25 ranked the target #1 for one phrasing. You would assume a #1 in either leg guarantees pool entry. It did not: the fused final top-60 did not contain the target even though BM25 had ranked it first. The fusion was dense-dominated, and the dense leg's absence outvoted the keyword leg's #1.

Check that your fusion can't discard a leg's #1. Hybrid retrieval is supposed to be a safety net: if one leg misses, the other catches. That promise only holds if your fusion actually lets a strong single-leg result survive. A dense-dominated blend can throw away the exact result BM25 nailed. Before you trust hybrid search, feed it a query where you know one leg ranks the answer #1, and confirm the answer is still in the fused output. Mine wasn't.
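That check can be automated. The sketch below uses a weighted reciprocal-rank fusion purely as a stand-in, since the fusion formula used in vectr is not described here; the weights, the constant k=60, the function names, and the toy rankings are all assumptions.

```python
from typing import Dict, List


def fuse(legs: Dict[str, List[str]], weights: Dict[str, float], k: int = 60) -> List[str]:
    """Weighted reciprocal-rank fusion: sum weight / (k + rank) over the legs."""
    scores: Dict[str, float] = {}
    for leg, ranked in legs.items():
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[leg] / (k + rank)
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def dropped_leg_winners(legs: Dict[str, List[str]], fused: List[str], top_n: int) -> List[str]:
    """Names of legs whose #1 result is missing from the fused top_n."""
    kept = set(fused[:top_n])
    return [leg for leg, ranked in legs.items() if ranked and ranked[0] not in kept]


# Toy rankings: the target tops BM25 and is absent from the dense leg.
legs = {
    "dense": [f"d{i}" for i in range(200)],
    "bm25": ["QuerySet.get"] + [f"b{i}" for i in range(199)],
}

dense_dominated = fuse(legs, {"dense": 0.9, "bm25": 0.1})
balanced = fuse(legs, {"dense": 0.5, "bm25": 0.5})

print("dense-dominated drops:", dropped_leg_winners(legs, dense_dominated, top_n=60))
print("balanced drops:       ", dropped_leg_winners(legs, balanced, top_n=60))
```

Running dropped_leg_winners over a small set of known-answer queries as a regression test makes a fusion that silently discards a leg's best result fail in CI rather than in production.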

The Same Dilution Shows Up Far Beyond Code

I found this in code search, but nothing about the mechanism is about code. Dilution appears in any corpus where a document mixes "what this is for" with "how it works" or with plain boilerplate. The purpose is a small fraction of the tokens; the pooled vector drifts toward the bulk; a query written in terms of purpose lands short.

You have almost certainly hit it without naming it:

API reference pages where a one-line summary sits on top of exhaustive parameter tables and examples. Search for what the endpoint does and the parameter soup dominates the vector.
Legal clauses buried inside pages of recitals and boilerplate. The operative sentence is three lines; the surrounding scaffolding is three hundred.
Product descriptions embedded in spec sheets, where the one line a buyer would search for is outweighed by dimensions, SKUs, and compliance notices.

The two mitigations generalize as cleanly as the problem does. First, embed a purpose or summary field separately from the full text, and retrieve over both — the dual-vector idea, minus the word "symbol." A short, curated summary vector per document is often the single highest-leverage change you can make to recall, precisely because it is immune to dilution by construction. Second, and this is the cheaper habit to build: audit recall at the pool level, not just the final ranking. Most RAG dashboards show you the reranked top-k and nothing else, which means a pool-entry miss is completely invisible on the screen you are staring at. The failure that started this whole post would never have shown up on a top-k view. I only found it because I went looking one layer up.

Choosing the embedder is a real lever — measure it, don't assume it. These deltas came from one general text embedder on one corpus. A code-specialized model would move the numbers; the CoIR benchmark evaluates nine retrieval models across ten code datasets and eight tasks and finds even state-of-the-art systems struggle with code retrieval, which is exactly why the embedder is not a detail. But a better embedder does not repeal dilution — it raises the whole curve, floor included, and a long mixed chunk still averages its purpose away. Dual-vector composes with a better model; it does not compete with one.

Part 5 · Takeaways

What To Actually Do With This

Go back to the opening. A function whose docstring paraphrased the query ranked below two hundred chunks that didn't. You now know the chain underneath that sentence. The docstring's signal was real and present. The thirty lines of body around it pulled the pooled vector away, enough to cost about 0.10 of cosine similarity. That 0.10 was the difference between clearing the pool floor and never entering the pool at all, between reaching the reranker and never being considered. The reranker never failed, because the reranker never saw it. And a normalized score would have happily reported confidence over whatever wrong answers did make the pool.

If you build retrieval, here is the short version to carry out of this:

Measure recall at pool entry, not at the reranked top-k. The reranker can only be as good as its pool. The miss that matters most is the one that never reaches the screen you monitor.
When recall is bad, test purpose-only against full-chunk. Embed a summary or signature alone, measure its cosine to the query, and compare. A large positive delta names your problem: dilution, not a weak model.
Index a separate purpose vector, and retrieve over both. Keep the full text for detail queries; add a body-stripped summary vector for intent queries. It is a structural transform on the index, so it needs no query-side heuristics to work.
Confirm your fusion can't drop a leg's #1, and never surface a per-query-normalized score as confidence. Keep a raw, absolute signal for an honest no-match.

The dual-vector fix shipped in vectr v1.0.0 on the strength of the cosine table above rather than a hunch. The boring query that started this — get a single object from the database — is exactly the case it was built to catch: the method whose docstring says precisely that, given a surface where its own body can no longer outvote it. The signal was never missing. It just needed somewhere to be read on its own.

Links & Further Reading

Every external claim in this post was confirmed against the source it points to; where a source did not confirm a specific number, that number is not stated here.

vectr — the semantic code-search and working-memory tool used as the system under test, and the instrument that produced these measurements.
The Complete Guide to Text Embeddings, Vector Databases & LLMs — the primer this post assumes: tokenization, pooling, cosine similarity, and how a RAG pipeline fits together.
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models — a benchmark of ten code datasets across eight retrieval tasks and seven domains; it evaluates nine retrieval models and finds significant difficulty with code retrieval even for state-of-the-art systems. arXiv:2407.02883
Structural Code Search using Natural Language Queries — reports that a natural-language-driven structural search outperforms baselines based on semantic code search by up to 57% F1; embeddings alone under-serve structural queries. arXiv:2507.02107

Numbers here are from one embedder (arctic-embed-m-v1.5) on one corpus (Django, June 2026 checkout). The deltas are model-specific; the mechanism is general. Cosine thresholds like the ~0.697 pool floor are corpus- and index-specific calibration points, not universal constants.

The Four Families of Context Relief for LLM Coding Agents

Swapnanil Saha — Fri, 17 Jul 2026 13:50:58 +0000

Run a coding agent on anything bigger than a toy repo and you hit the same wall. The context window fills up. Not with the answer — with the search for the answer. Twelve file reads, four grep results, a stack trace, the output of a test run that failed for an unrelated reason. By the time the agent is ready to write the fix, half its working memory is archaeology it will never look at again.

I've spent the last few months building a semantic-search-plus-working-memory MCP server (I'll call it vectr throughout — it's the running example, not the point of the post), and the single most clarifying thing I did early on was stop treating "the context is full" as one problem. It's four problems wearing a trench coat. Each has its own mechanism, its own cost, its own failure mode, and — this is the part people miss — they only work when you compose them correctly. Get the composition wrong and you don't get relief; you get a subtle new class of bug where the agent confidently reasons over information it no longer has.

So here's the map I wish someone had handed me. Four families of context relief: what each one actually buys you, where each one bites, and how they fit together.

A few definitions first, because the jargon is dense. A token is the unit an LLM reads and bills by — roughly three-quarters of a word. The context window is the fixed number of tokens the model can attend to at once (a million, on the current Claude models). Prompt caching lets you pay a reduced rate to re-send an identical prefix instead of reprocessing it. And MCP (Model Context Protocol) is the standard interface an agent uses to call external tools. Keep those four in your head and the rest follows.

Family	What it does	Direction	Fails when…
1 · Eviction	Delete stale context, leave a placeholder	Removes	You threw away what you can't cheaply restore
2 · Offload & recall	Write findings to a durable store, fetch on demand	Restores	Recall isn't automatic — the model forgets to look
3 · Retrieval	Fetch the exact function, not the whole file	Restores	Used on structural questions (call graphs)
4 · Subagents	Burn messy work in a separate window	Removes	No shared memory — children re-derive everything

Family 1 — Eviction: throw it away, but keep a receipt

Eviction is the most literal answer to a full context: delete the stale stuff. The tool result from twenty turns ago, the file you read and already edited, the thinking block from a reasoning step that's now resolved — drop it out of the live window so the model stops paying to carry it.

The harness can do this for you, and increasingly it does. Claude Code calls this compaction: as a session approaches its context limit, it clears the oldest tool outputs first and only summarizes the rest of the conversation if that isn't enough on its own. Recent tool results stay inline so you can keep reasoning over them; older ones get cleared first.

At the API level there's a more configurable version. Anthropic's context editing feature exposes a strategy with the delightfully machine-generated name clear_tool_uses_20250919. You turn it on with a beta header (anthropic-beta: context-management-2025-06-27) and it watches your accumulating tool results. Once input tokens cross a threshold — the default trigger is 100,000 input tokens — it clears the oldest tool results in chronological order, keeping the most recent few (keep defaults to 3 tool uses).

Here's the detail that matters more than any of the parameters: each cleared result is replaced with placeholder text so the model knows it was removed. The agent doesn't silently lose a tool result and then hallucinate what was in it. It sees a tombstone — "this tool result was cleared" — which is a very different thing from a gap.
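To make the tombstone idea concrete, here is a local simulation of clearing old tool results while keeping the most recent few. It is not Anthropic's implementation and does not call the API: the message shape, the placeholder wording, and the function name are hypothetical.

```python
from typing import Dict, List

# Hypothetical placeholder wording; the real feature chooses its own text.
PLACEHOLDER = "[tool result cleared to save context]"


def clear_old_tool_results(messages: List[Dict[str, str]], keep: int = 3) -> List[Dict[str, str]]:
    """Replace all but the `keep` most recent tool results with a placeholder.

    Returns a new list; message count and order are unchanged, so the model
    still sees that a tool call happened at that point in the conversation.
    """
    tool_positions = [i for i, msg in enumerate(messages) if msg["role"] == "tool"]
    cutoff = max(len(tool_positions) - keep, 0)
    to_clear = set(tool_positions[:cutoff])  # oldest first
    return [
        {**msg, "content": PLACEHOLDER} if i in to_clear else msg
        for i, msg in enumerate(messages)
    ]


history = [{"role": "user", "content": "fix the bug"}]
history += [{"role": "tool", "content": f"result {n}"} for n in range(5)]

trimmed = clear_old_tool_results(history, keep=3)
for msg in trimmed:
    print(msg["role"], "->", msg["content"])
```

The two oldest results become placeholders and the three newest stay inline, which mirrors the keep-the-most-recent behaviour described above.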

The config below overrides the defaults to make the behaviour easy to see — I've set trigger to 30,000 tokens rather than the stock 100,000 so it fires early, and pinned clear_at_least so each pass removes a real chunk. In production you'd leave trigger higher.

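context_management={
    "edits": [
        {
            "type": "clear_tool_uses_20250919",
            # Start clearing once the prompt exceeds 30,000 input tokens (default: 100,000).
            "trigger": {"type": "input_tokens", "value": 30000},
            # Always keep the 3 most recent tool uses inline.
            "keep": {"type": "tool_uses", "value": 3},
            # Each pass must free at least 5,000 tokens so the cache invalidation is worth it.
            "clear_at_least": {"type": "input_tokens", "value": 5000},
            # Results from these tools are never cleared.
            "exclude_tools": ["web_search"],
        }
    ]
}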

What it saves

Straightforwardly, input tokens — though the headline figure Anthropic has published for this feature is a task-performance number, not a token count: on an internal agentic-search evaluation, context editing alone lifted performance 29% over baseline, rising to 39% when paired with a memory tool. Hold onto that second number. It's the whole thesis of this post hiding in a benchmark.

What it costs

Two things, and the second is non-obvious. The first is the risk that you evict something the model actually needed — mitigated by the placeholder tombstone and by exclude_tools, which lets you mark, say, your search tool's results as never-clearable. The second cost is about prompt caching, and it's where a lot of naive eviction setups quietly lose money.

The cache-invalidation math. Prompt caching bills a cached prefix at 0.1× the base input rate on a read, but a cache write costs 1.25× the base rate for the default 5-minute TTL (2× for the 1-hour TTL). The catch is invalidation: the cache key is a cumulative hash of everything up to and including your cache breakpoint, so changing any block at or before the breakpoint produces a different hash and a full cache miss.

Read those two facts together and the tension jumps out. Eviction edits the middle of your conversation. Every time it fires, it changes the prefix, which invalidates the cache from that point forward, which means your next request pays the 1.25× write cost to re-cache the new prefix. Evict a little bit, often, and you can spend more on cache churn than you saved on the evicted tokens. This is exactly why the clear_at_least parameter exists: it forces each clearing pass to remove a worthwhile chunk of tokens so the cache invalidation is amortized against a real saving, not a rounding error. If you take one operational lesson from this whole family, make it that one — evict in big, infrequent passes, never in a trickle.
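A back-of-the-envelope model shows why trickle eviction loses. The sketch assumes the 0.1× read and 1.25× write multipliers quoted above, a cache that stays warm, and a deliberately simplified conversation: everything before the cleared span stays cached, everything after it must be re-written once, and later turns read the shorter prefix. New content added each turn and output tokens are ignored, so treat the result as an order-of-magnitude guide, not a billing calculation.

```python
import math
from fractions import Fraction

CACHE_READ = Fraction(1, 10)     # 0.1x base input rate on a cache read
CACHE_WRITE_5M = Fraction(5, 4)  # 1.25x base input rate on a 5-minute-TTL write


def breakeven_turns(evicted_tokens: int, tokens_after_edit: int) -> int:
    """Later requests needed before one eviction pass pays for itself.

    One-time penalty: tokens after the cleared span are re-written at 1.25x
    instead of read at 0.1x, minus the evicted tokens no longer read.
    Per-turn saving: the evicted tokens are no longer read at 0.1x.
    """
    if evicted_tokens <= 0:
        raise ValueError("evicted_tokens must be positive")
    penalty = (CACHE_WRITE_5M - CACHE_READ) * tokens_after_edit - CACHE_READ * evicted_tokens
    saving_per_turn = CACHE_READ * evicted_tokens
    return max(0, math.ceil(penalty / saving_per_turn))


# Hypothetical session: 20,000 tokens sit after the cleared span.
print("trickle (500 tokens):    ", breakeven_turns(500, 20_000), "turns")
print("big pass (20,000 tokens):", breakeven_turns(20_000, 20_000), "turns")
```

Under this model a 500-token trickle needs 459 later turns to break even, while a 20,000-token pass needs 11, which is the arithmetic behind setting clear_at_least high.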

When it fails

Eviction fails the moment the model needs something you threw away and can't cheaply get it back. A tombstone that says "tool result cleared" is honest, but honesty doesn't reconstruct the file. If the only copy of that information lived in the evicted tool result, the agent is now stuck: it has to re-run the tool, re-read the file, re-derive the thing. You've converted a token cost into a latency-and-tool-call cost — and if the underlying state has changed in the meantime, possibly into a correctness bug.

The governing rule of eviction. You can only safely evict what you can cheaply restore. Eviction on its own is not a memory strategy — it's a bet that restoration is cheap. That bet is only good if something else in your system guarantees it. Which is the entire reason the other three families exist.

Family 2 — Offload & recall: write it down where it survives

If eviction is "throw it away and hope you don't need it," offload-and-recall is "write it down somewhere durable before you throw it away, and fetch it back on demand." The agent, mid-session, notices it has learned something worth keeping — a function signature, a gotcha, a decision, a partial result — and commits it to an external store. Later, instead of carrying that finding in the context window the whole time, it recalls it in a single cheap call exactly when it's relevant.

This is the working-memory pattern, and it's the half of vectr I care about most. When the agent discovers that, say, a workspace lock is acquired at resolver.rs:214 and released on scope exit, it doesn't keep that fact parked in context for forty turns. It stores a note. The note sits in a local store keyed to the workspace, and a recall call pulls it back — in my case in under 50 milliseconds — whenever the agent's current task touches locking.

The reason this is a distinct family, and not just "eviction with extra steps," is what it survives. A finding in the live context window dies three deaths. It costs tokens the entire time it sits there. It gets mangled or dropped when the conversation is compacted into a summary — compaction preserves the gist and loses the exact line number. And it vanishes completely when the session ends. A note in an external store survives all three. It's there after /compact. It's there in tomorrow's session. It costs nothing until you ask for it.
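To make the pattern concrete, here is a minimal sketch of a workspace-keyed note store. It is not vectr's schema or API: the `NoteStore` class, its table layout, and the word-overlap scoring (a crude stand-in for semantic recall) are all assumptions for illustration.

```python
import sqlite3
import time


class NoteStore:
    """Hypothetical workspace-keyed note store (illustrative, not vectr's)."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to survive across sessions.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS notes "
            "(workspace TEXT, body TEXT, created REAL)"
        )

    def store(self, workspace: str, body: str) -> None:
        self.db.execute(
            "INSERT INTO notes VALUES (?, ?, ?)", (workspace, body, time.time())
        )
        self.db.commit()

    def recall(self, workspace: str, query: str, limit: int = 3) -> list[str]:
        """Return notes that share the most words with the query."""
        terms = set(query.lower().split())
        rows = self.db.execute(
            "SELECT body FROM notes WHERE workspace = ?", (workspace,)
        ).fetchall()
        scored = [(len(terms & set(body.lower().split())), body) for (body,) in rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [body for score, body in scored if score > 0][:limit]


store = NoteStore()
store.store(
    "vectr-repo",
    "workspace lock acquired at resolver.rs:214 and released on scope exit",
)
print(store.recall("vectr-repo", "where is the workspace lock taken"))
```

Pointing `path` at a file on disk is what lets the note outlive compaction and the session itself; the in-memory default is only there to keep the example self-contained.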

Anthropic's own memory tool works on this principle, and it's the reason for that 39%-versus-29% gap I told you to hold onto. Context editing alone improves performance 29% over baseline on Anthropic's internal agentic-search evaluation. Context editing plus a memory tool gets you 39% — because the agent writes the important bits to memory before the eviction pass clears them, so clearing becomes safe instead of lossy. That extra ten points isn't a second independent optimization stacked on the first. It's the same optimization made safe to run harder, because family 2 backs it up.

What it saves, what it costs

It saves the standing token cost of carrying a finding you only need occasionally. And, less measurably but more importantly, it saves re-derivation — the agent doesn't have to re-read the file and re-reason to the same conclusion next time.

Against that, three real costs. The agent has to decide what's worth remembering, which is a judgment call it will sometimes get wrong — store noise and your recall gets diluted. The recall has to actually be relevant when it fires, which is a retrieval-quality problem in miniature. And there's a token cost to the recall itself, a fact I had to make peace with: store terse one-line notes and recall is cheap but thin; store full code blocks and recall is rich but heavier. There's no free lunch. There's a dial.

When it fails

It fails when recall isn't automatic. If your architecture depends on the model choosing, of its own accord, to call the recall tool at the right moment, it will frequently just… not. The model has no reliable sense of what it stored three sessions ago. The fix is to stop relying on the model's initiative and inject the relevant notes into context deterministically — which, in Claude Code, means hooks, and which is a whole post of its own. The short version: an offload store the agent forgets to read is a filing cabinet in a locked room.

The note-taking engineer. Working memory is the difference between an engineer who takes notes and one who doesn't. The note-taker doesn't hold the whole system in their head at once — they hold a pointer to where they wrote it down, and the act of writing it down is cheap insurance against the cost of re-discovering it. The catch is the same for both: a note you never look at again is just slower forgetting.

Family 3 — Retrieval over stuffing: fetch the 40 lines, not the file

The third family attacks a different waste. The first two are about getting rid of information you already loaded. This one is about never over-loading it in the first place.

The default way an agent explores an unfamiliar codebase is grep-and-read. Grep for a likely keyword, get forty hits, read the six files that look plausible, discard five of them. Every one of those reads lands the entire file in context — a 400-line module of which the agent needed one function. The signal-to-noise ratio is brutal, and unlike a human skimming, the model pays full token price for every line whether it was useful or not.

Retrieval-over-stuffing replaces the blunt read with a targeted fetch. Instead of loading whole files and letting the model sift, you run a ranked retrieval (semantic search over the codebase, ideally chunked at function and class boundaries so each result is a self-contained unit of meaning) and hand back the forty lines that actually match the query. "JWT validation logic" returns the verify_token function directly, even though none of the query's words appears in it, and it returns that function, not the 400-line file it lives in.
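Chunking at function and class boundaries can be sketched with Python's standard `ast` module. This covers Python sources only and is an illustration of the idea, not a description of how vectr chunks; the `chunk_by_definition` name and the sample source are hypothetical.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    # Exact text of this definition, nothing else from the file.
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks


SAMPLE = '''
import hmac

def verify_token(token, secret):
    """Check the signature on a bearer token."""
    return hmac.compare_digest(token, secret)

class Session:
    def touch(self):
        pass
'''

for chunk in chunk_by_definition(SAMPLE):
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk is what would be embedded and returned, so a hit on `verify_token` hands back three lines rather than the whole module.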

This is the search half of vectr, and the payoff on unfamiliar code is large: on a big Java codebase in my own benchmarks, ranked retrieval cut the read-and-grep calls before the first edit by roughly three-quarters compared to the grep-and-read baseline. The mechanism is boring — embeddings plus a keyword index, merged — but the discipline is the point: the unit you put in context should be the unit of meaning, not the unit of storage. A file is a storage unit. A function is a meaning unit.
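The text says only that the embedding and keyword results are merged, not how. One common way to merge two ranked lists is reciprocal rank fusion, sketched below as an assumption rather than as vectr's method; the chunk names and the constant `k = 60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list adds 1 / (k + rank) to an item's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results for the query "JWT validation logic".
embedding_hits = ["verify_token", "decode_header", "refresh_session"]
keyword_hits = ["jwt_settings", "verify_token"]

print(reciprocal_rank_fusion([embedding_hits, keyword_hits]))
```

A chunk that both retrievers like rises to the top even if neither ranked it first on its own, which is the practical reason for merging instead of picking one index.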

What it saves, what it costs

It saves the bulk of exploratory token spend, and the turns that go with it. A search that returns the right function in one call replaces a grep-plus-four-reads sequence. In exchange, you need an index — which means an indexing step and the machinery to keep it fresh as files change — and retrieval quality becomes a first-class concern.

Confident-wrong retrieval is worse than a miss. A search that returns the wrong forty lines is more dangerous than a grep that returns nothing, because the agent trusts it more. I've watched an agent build a wrong mental model off a top-ranked result that was subtly off-topic, then reason confidently from that bad premise for a dozen turns. An honest empty result would have sent it looking again; a plausible wrong one didn't.

When it fails

It fails on questions retrieval is the wrong tool for. "Who calls this function?" is not a similarity question — the callers don't contain the callee's body, they contain a reference to it by name. That's a graph traversal, not a search. Reach for semantic retrieval there and you'll get plausible-looking garbage. Part of doing this family well is knowing which questions are retrieval questions (concepts, patterns, "how does X work") and which are structural ones (definitions, call graphs) that want an exact lookup instead.
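The structural side can be sketched as an exact lookup over the syntax tree. The example below handles a single Python file and matches calls by bare name only, so it is a toy call-graph query rather than a real one; `find_callers` and the sample source are hypothetical.

```python
import ast


def find_callers(source: str, callee: str) -> list[str]:
    """Names of top-level functions whose bodies call `callee` by name."""
    callers = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for inner in ast.walk(node):
            if not isinstance(inner, ast.Call):
                continue
            func = inner.func
            # Plain call `f(x)` is ast.Name; method call `obj.f(x)` is ast.Attribute.
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name == callee:
                callers.append(node.name)
                break
    return callers


SAMPLE = """
def verify_token(token):
    return bool(token)

def login(token):
    return verify_token(token)

def logout():
    pass
"""

print(find_callers(SAMPLE, "verify_token"))
```

Nothing here measures similarity: `login` is found because it references the name, which is exactly the information an embedding of `verify_token`'s body does not carry.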

Family 4 — Subagent isolation: burn the tokens in someone else's window

The fourth family is the cleverest and the easiest to get subtly wrong. The idea: when a subtask is going to generate a pile of context you'll never reference again — a research spike, a log-diving expedition, a broad search — you don't do it in your main conversation. You spawn a subagent, let it do the messy work in its own context window, and take back only the distilled answer.

Claude Code's subagents work exactly this way. Each one runs in its own context window with a custom system prompt, does its work independently, and returns only the result — the docs frame it as keeping exploration and implementation out of your main conversation. The parent agent spends, say, 800 tokens receiving a clean summary of an investigation that cost the subagent 40,000 tokens of reading and reasoning. Those 40,000 tokens are burned in a window that gets discarded. The parent's context stays clean.

There's a nice secondary benefit. Because a subagent has its own tool permissions and its own system prompt, you can also use it to constrain work — a read-only research agent that literally cannot write files — and to route cheap work to a cheaper, faster model. Context isolation and cost control fall out of the same mechanism.

What it saves, what it costs

It saves the largest single chunk of exploratory context there is. A well-scoped subagent is the difference between your main window holding a conclusion and holding the entire messy derivation of that conclusion. The cost is a framing-and-parsing tax at the boundary: you have to specify the subtask well enough that the subagent can run without hand-holding, and you have to trust the summary it returns without seeing its work. If the summary is lossy in exactly the way that matters, the parent proceeds on a bad abstraction — and it can't tell, because the detail that would have flagged the problem got left behind in the discarded window.

When it fails

Here's the failure mode nobody warns you about. Subagent isolation with no shared memory means every subagent starts cold. It re-derives context the parent already had and the last subagent already found. Spawn three subagents to investigate three corners of the same system and, without a shared store between them, each one re-reads the same core files, re-learns the same architecture, and re-discovers the same gotcha — three times, in three separate windows, at full price each. You've isolated the context so well that you've also isolated the learning. The isolation that saves the parent's window quietly taxes every child.

Isolation without a shared bus is a false economy. Subagent isolation without a shared memory store is a false economy at scale. You save the parent's context by making the children re-derive everything from scratch. The fix is to give the subagents the same durable store from family 2 — so the first subagent's findings are recalled by the next instead of rediscovered. Isolation controls what flows up; shared memory controls what flows sideways. You want both.
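A toy model of the sideways flow, with a plain dictionary standing in for the shared store and a counter standing in for token spend. The function names are hypothetical and no real subagent API is involved.

```python
from typing import Callable


def run_subagent(
    topic: str,
    shared_notes: dict[str, str],
    investigate: Callable[[str], str],
) -> str:
    """Return the finding for `topic`, deriving it only if no sibling has."""
    if topic not in shared_notes:
        # Expensive path: read files, reason, then write the finding down.
        shared_notes[topic] = investigate(topic)
    return shared_notes[topic]


calls: list[str] = []


def expensive_investigation(topic: str) -> str:
    calls.append(topic)  # record every time the costly path actually runs
    return f"finding about {topic}"


shared: dict[str, str] = {}
for _ in range(3):  # three subagents touching the same corner of the system
    run_subagent("core architecture", shared, expensive_investigation)

print(len(calls))
```

With the shared dictionary the expensive investigation runs once for three subagents; give each subagent its own empty dictionary and it runs three times.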

The families compose — that's the whole point

I've been dropping the composition hints deliberately, so let me make them explicit, because treating these four as a menu you pick one item from is the mistake I most want to talk you out of.

Eviction (1) is only safe on top of offload (2) or retrieval (3). This is the load-bearing relationship. You can only throw information away cheaply if you can get it back cheaply — and "getting it back cheaply" is precisely what families 2 and 3 provide. Evict a tool result whose contents you already wrote to a memory note: safe, because recall restores it. Evict a file you can re-fetch with one targeted search: safe, because retrieval restores it. Evict something that exists nowhere else and you've just planted a bug that will surface three turns later as confident nonsense. The 29% → 39% jump from adding a memory tool to context editing is this relationship, quantified — the memory tool is what makes the eviction safe to run harder.
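The load-bearing relationship reduces to a small predicate: an item is evictable only if something can restore it. The sketch below is illustrative; the `ContextItem` fields and labels are hypothetical and do not correspond to any harness's real data model.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    label: str
    tokens: int
    noted: bool = False        # contents already written to a memory note (family 2)
    refetchable: bool = False  # one targeted search can bring it back (family 3)


def plan_eviction(items: list[ContextItem]) -> tuple[list[ContextItem], list[ContextItem]]:
    """Split items into (safe to evict, must keep) by whether they can be restored."""
    evict = [item for item in items if item.noted or item.refetchable]
    keep = [item for item in items if not (item.noted or item.refetchable)]
    return evict, keep


window = [
    ContextItem("resolver.rs read", 4_000, refetchable=True),
    ContextItem("lock analysis", 1_200, noted=True),
    ContextItem("one-off shell output", 900),
]
evict, keep = plan_eviction(window)
print([item.label for item in evict], [item.label for item in keep])
```

The one-off shell output stays because it exists nowhere else; clearing it is the case that surfaces later as confident nonsense.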

Retrieval (3) keeps the working set small enough that eviction (1) rarely has to fire. If you never stuffed the whole file in, there's less to evict later. The two attack the same waste from opposite ends — one at load time, one at cleanup time — and a system with good retrieval needs less aggressive eviction.

Subagents (4) need a shared store (2) or they re-derive context. Covered above, but it's the composition people skip most often, because subagents feel self-contained. They're self-contained in their context, not in their knowledge. Wire them to the same working-memory store and the isolation stops being a re-derivation tax.

The unifying idea is almost embarrassingly simple once you see it: cheap restoration is the license to be aggressive about relief. Every family is either a way to remove context (1, 4) or a way to make removal safe by guaranteeing you can get the important parts back (2, 3). Build only the removal half and you get an agent that forgets things it needed. Build only the restoration half and you get an agent that never frees anything and grinds to a halt at the context limit. You need the pair.

This is also why I stopped thinking of vectr as "a search tool" or "a memory tool." It's families 2 and 3 in one MCP server, deliberately, because on their own each is half a solution. Search without memory re-explores every session. Memory without search has nothing good to store. And both of them exist, in the end, to make the harness's eviction — family 1, which I don't even own — safe to run.

A short field guide

If you operate a coding agent and want to actually apply this, here's the compressed version I'd give a colleague over coffee.

Start with retrieval (3), because it's the one that prevents the mess instead of cleaning it up, and it pays off immediately on any codebase you don't have memorized. Add offload-and-recall (2) next, and make the recall automatic rather than something the model has to remember to do — a store the agent forgets to read is worthless. Let the harness handle eviction (1), but check that it's evicting in big infrequent passes (mind the cache-invalidation math) and that everything it evicts is backed by 2 or 3. Reach for subagent isolation (4) on genuinely large exploratory subtasks, and if you use more than one subagent on related work, give them a shared memory bus or accept that each is paying full freight to learn what the last one already knew.

None of these is exotic. The compaction and context-editing pieces ship in the tools already. The retrieval and memory pieces are a weekend to prototype — I wrote up the honest numbers on how far mine actually got if you want the unvarnished version. What's rare is treating them as one system with a single governing rule — restore-ability licenses removal — instead of four disconnected tricks. Get the rule right and the context window stops being the thing you fight and starts being the thing you manage.

Four problems in a trench coat, then — not one. And once you've split them apart, the thing that surprised me is how little the individual tricks matter next to the relationship between them. Any single family, run on its own, either forgets something it needed or refuses to let go of anything. What actually works is the pair: a way to remove context sitting on top of a guarantee that you can get the important parts back. Cheap restoration is what buys you the right to be ruthless. Wire that in and the context window quietly changes from the wall you keep hitting into a budget you spend on purpose.

Sources

Claude Code — Context window and compaction. https://code.claude.com/docs/en/context-window (accessed 2026-07-07)
Anthropic — Context editing (clear_tool_uses_20250919, clear_at_least, keep, trigger, exclude_tools). https://platform.claude.com/docs/en/build-with-claude/context-editing (accessed 2026-07-07)
Anthropic — Managing context on the Claude Developer Platform (29% / 39% performance figures for context editing alone vs. context editing plus the memory tool). https://www.anthropic.com/news/context-management (accessed 2026-07-07)
Anthropic — Prompt caching (5-minute / 1-hour TTL, 1.25× / 2× write, 0.1× read, cumulative-hash invalidation). https://platform.claude.com/docs/en/build-with-claude/prompt-caching (accessed 2026-07-07)
Claude Code — Subagents (per-subagent context window, returns only the summary). https://code.claude.com/docs/en/sub-agents (accessed 2026-07-07)

Claude Code Hooks: A Practical Deep-Dive on Deterministic Agent Behavior

Swapnanil Saha — Sun, 12 Jul 2026 22:36:47 +0000

Here's a thing that took me embarrassingly long to accept about coding agents: you cannot instruct your way to reliability.

I had a working-memory system — a semantic-search-plus-notes MCP (Model Context Protocol) server I've been building, and it's the case study for this whole post — and it worked beautifully in demos. The agent would discover something, store a note, recall it later, save itself a re-read. Then I'd watch a real session and the agent would just... not recall. It had notes sitting right there, one tool call away, verbatim, and it would instead re-read the same file it had already read two sessions ago, because nothing made it check. My CLAUDE.md said "call recall at the start of every task." The model read that instruction and ignored it, the way it ignores roughly anything that competes with the task actually in front of it.

The lesson generalizes past my project. Any behavior you need to happen every single time — inject context, run a linter, block a dangerous command, snapshot state before it's destroyed — cannot depend on the model deciding to do it. The model is a probabilistic thing optimizing for the current turn. You need something outside the model, in the harness, that fires deterministically. In Claude Code, that thing is hooks.

This is the practical guide I wanted when I started: what the events are and what each is genuinely good for, the exact configuration and I/O contract (which is fiddlier than the docs make it look), and then a real production hook pipeline — mine — walked through end to end, including the design calls and the parts that bit me.

The problem hooks actually solve

Before the plumbing, the point. There is a whole class of things you want an agent to do that instructions are simply the wrong tool for. Not because the instruction is badly worded — because instructions target the model, and the model is the part of the system you don't control.

Think about what "the model complies with an instruction" actually means. On any given turn there's some probability the behavior happens, and that probability is well short of 1. It drops when the task gets absorbing, when the context is long, when an unusual prompt pulls attention elsewhere. That's fine for a preference — "prefer functional style," "keep commits small." It is a disaster for a guarantee. If the only thing standing between your agent and an rm -rf on the wrong directory is a politely worded line in a config file, you don't have a control. You have a hope.

Hooks move the decision out of the model and into the harness. The harness is deterministic: it runs code on a schedule, whether or not the model would have thought to. That single relocation — from "the model should" to "the harness will" — is the entire idea, and everything below is mechanics in service of it.

The core distinction. CLAUDE.md is where you put things the model should tend to do. Hooks are where you put things that must deterministically happen. Confusing the two — trying to instruction-engineer a guarantee — is how you end up with a system that works in the demo and flakes in production.

What a hook actually is

A hook is a shell command — or, increasingly, an HTTP call or an MCP tool invocation — that Claude Code runs automatically when a specific event fires in the session lifecycle. The event hands your command a JSON blob on stdin describing what's happening. Your command does whatever it wants and communicates back through two channels: its exit code and its stdout. That's the entire model. It's Unix-plumbing simple, which is exactly why it's reliable — there's no LLM in the loop deciding whether to honor it.

The events cover the session from birth to death. When I first wrote my pipeline the list was short; by mid-2026 it has grown considerably — the reference now documents around thirty event types spanning session lifecycle, per-turn, per-tool-call, permissions, subagents, worktrees, and MCP elicitation. Most of them you'll never touch. The workhorses — the ones worth learning cold — are these:

Event	Fires	What it's for
`SessionStart`	Session begins or resumes (matchers: `startup`, `resume`, `clear`, `compact`)	Inject state the agent needs before turn 1 — branch info, environment, recalled memory
`UserPromptSubmit`	Before Claude processes each user prompt	Inject per-turn context keyed to what the user just asked; can also block the prompt
`PreToolUse`	Before a tool call runs (matches on tool name)	Block dangerous calls, rewrite arguments, or surface a warning tied to the specific action
`PostToolUse`	After a tool call succeeds	React to results, replace tool output, add follow-up context
`PreCompact`	Before context compaction (matchers: `manual`, `auto`)	Persist anything that's about to be summarized away
`Stop` / `SubagentStop`	When the agent (or a subagent) finishes a turn	Enforce "you're not done yet" — block the stop and send it back to work
`SessionEnd`	Session terminates	Cleanup, flush, teardown
`Notification`	Claude Code emits a notification (permission prompt, idle, etc.)	Route notifications to your own channels

The mental split that helps: session-scoped events (SessionStart, SessionEnd) bracket the whole thing; per-turn events (UserPromptSubmit, Stop) fire once per user exchange; per-tool events (PreToolUse, PostToolUse) fire around individual tool calls, potentially dozens of times a turn. Match the cadence of your hook to the cadence of the thing it's reacting to, or you'll either miss events or fire far too often.

The configuration surface

Hooks live in settings.json. There are three tiers, and the tier decides who the hook applies to and whether it's shared:

~/.claude/settings.json — all your projects, never checked in.
.claude/settings.json — one project, checked in and shared with the team.
.claude/settings.local.json — one project, gitignored, personal.

The shape is nested and, honestly, a little awkward until it clicks. Under a top-level hooks key, each event name maps to a list of groups. Each group has an optional matcher and a list of hooks to run:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "${CLAUDE_PROJECT_DIR}/.claude/hooks/guard.sh",
            "timeout": 30
          }
        ]
      }
    ]
  }
}

The matcher is the filter. For tool events it matches against the tool name — "Bash", "Edit|Write" for either, or a regex like "mcp__memory__.*" for a whole MCP server's tools. For non-tool events it matches against the event's reason: SessionStart takes startup|resume|clear|compact, PreCompact takes manual|auto. Omit the matcher (or use "*") to fire on everything.

The type used to be implicitly "command." It's now explicit and there are several: command (shell), http (POST to a URL), mcp_tool (invoke an already-connected MCP tool), and two LLM-in-the-loop types (prompt, and the experimental agent) that let a hook ask a model to make a yes/no call. For anything latency-sensitive you want command, because it's a local process with no network round trip. The useful knobs on a command hook are timeout (in seconds; the default is generous, but UserPromptSubmit is capped lower because it's on the critical path of every turn) and the ${CLAUDE_PROJECT_DIR} placeholder, which keeps your command path valid when the user's working directory changes.

The I/O contract, which is where people trip

This is the part the quickstart glosses and the part that determines whether your hook works. A hook talks back through exit code and stdout, and the two are read differently depending on the code.

Exit code 0 — success. Claude Code parses stdout looking for JSON. If it's JSON, the fields are honored; if it's not, no decision is taken and the session proceeds normally. (Two events, SessionStart and UserPromptSubmit, are more generous: they also fold plain, non-JSON stdout straight into the context. For every other event, if you want to inject something you emit the JSON form.) This is the channel you use to inject context.

Exit code 2 — blocking error. stdout's JSON is ignored; instead, stderr is read as an error message, and for blockable events (PreToolUse, UserPromptSubmit, Stop/SubagentStop) the action is blocked. This is how a PreToolUse hook vetoes an rm -rf. (For PreToolUse specifically there's now a cleaner path too: emit permissionDecision — allow, deny, or ask — inside hookSpecificOutput on exit 0, which is more expressive than the blunt exit-2 veto and lets you attach a reason the model reads. I still reach for exit 2 when I just want a hard no.)

Any other exit code — non-blocking error. The session continues; the failure is surfaced in the transcript and logged, but nothing is blocked.
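The exit-2 veto can be sketched as a pure function that maps the stdin payload to an exit code and a stderr message. This is a minimal illustration: the field names `tool_name` and `tool_input.command` are assumed to match the PreToolUse payload, the regex is deliberately crude, and `guard` is a hypothetical name.

```python
import json
import re

# Crude illustration only: matches `rm` with both -r and -f in one flag group.
DANGEROUS = re.compile(r"\brm\s+-\w*(r\w*f|f\w*r)", re.IGNORECASE)


def guard(raw_stdin: str) -> tuple[int, str]:
    """Return (exit_code, stderr_message) for a PreToolUse payload."""
    try:
        event = json.loads(raw_stdin)
    except json.JSONDecodeError:
        return 0, ""  # unreadable input: stay out of the way
    if not isinstance(event, dict):
        return 0, ""
    tool_input = event.get("tool_input")
    command = tool_input.get("command", "") if isinstance(tool_input, dict) else ""
    if event.get("tool_name") == "Bash" and DANGEROUS.search(str(command)):
        # Exit 2 blocks the call; the message on stderr is shown as the reason.
        return 2, "Blocked: recursive force-delete is not allowed here."
    return 0, ""


# Wiring inside the actual hook script would look like:
#   code, message = guard(sys.stdin.read())
#   print(message, file=sys.stderr)
#   sys.exit(code)
payload = json.dumps({"tool_name": "Bash", "tool_input": {"command": "rm -rf build"}})
print(guard(payload))
```

Keeping the decision in a function that returns values, with the `sys.exit` call at the very edge, makes the veto logic testable without spawning Claude Code.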

The JSON you emit on stdout (with exit 0) has a couple of shapes. There's a set of top-level universal fields — continue (set false to stop Claude entirely), stopReason, suppressOutput, systemMessage. And there's an event-specific envelope, hookSpecificOutput, which is where the good stuff lives. The single field I use most is additionalContext: a string that Claude Code injects into the model's context at the point the hook fired.

{
  "hookSpecificOutput": {
    "hookEventName": "SessionStart",
    "additionalContext": "Current branch: main\nUncommitted: auth.ts, config.py"
  }
}

That's the whole trick behind hook-injected memory. additionalContext on a SessionStart hook lands at the start of the conversation. The same field on UserPromptSubmit lands next to the prompt the user just submitted. On PreToolUse/PostToolUse it lands next to the tool result. The docs render it as a system reminder and advise writing it as plain factual statements rather than instructions — the model treats "the deployment target is production" better than "remember to be careful about production." There's a cap on how much you can push through it (on the order of ten thousand characters), which is less a limit than a hint: injection is not a place to dump files.
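Building that envelope from a hook written in Python might look like the sketch below. The 10,000-character constant is an assumption based on the "on the order of ten thousand characters" figure above, and `context_envelope` is a hypothetical helper name.

```python
import json

# Assumed cap; the text above gives only an order of magnitude.
MAX_CONTEXT_CHARS = 10_000


def context_envelope(event_name: str, facts: list[str]) -> str:
    """Build the exit-0 stdout payload that injects `facts` as additionalContext."""
    text = "\n".join(facts)[:MAX_CONTEXT_CHARS]
    if not text:
        return ""  # nothing to inject: emit nothing rather than an empty envelope
    return json.dumps(
        {
            "hookSpecificOutput": {
                "hookEventName": event_name,
                "additionalContext": text,
            }
        }
    )


print(context_envelope("SessionStart", ["Current branch: main", "Uncommitted: auth.ts"]))
```

The facts are written as plain statements, in line with the advice that the model handles factual context better than instructions.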

The one rule that fails silently. If you emit JSON on stdout with exit 0, stdout must contain only that JSON. A stray echo from your shell profile, a debug print, a warning from a Python import — any of it corrupts the JSON and your injection silently does nothing. More than one of my early hooks failed for exactly this reason and gave no error, because a malformed stdout on exit 0 just means "no decision," not "error." Nothing tells you. The session simply proceeds as if the hook weren't there.
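One defensive pattern for a Python hook is to divert anything printed while the payload is being built to stderr, then write the JSON to stdout in a single final step. The sketch below uses the standard `contextlib.redirect_stdout`; it only catches Python-level prints, not output from a shell profile or a child process. The helper names are hypothetical.

```python
import contextlib
import json
import sys


def emit_only_json(build_payload) -> str:
    """Run `build_payload` with stray prints diverted to stderr; return clean JSON."""
    with contextlib.redirect_stdout(sys.stderr):
        payload = build_payload()  # any print() inside lands on stderr
    return json.dumps(payload)


def noisy_builder() -> dict:
    print("debug: loaded 3 notes")  # would corrupt stdout if not diverted
    return {
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            "additionalContext": "Current branch: main",
        }
    }


clean = emit_only_json(noisy_builder)
# The hook's single write to stdout would then be: sys.stdout.write(clean)
```

Because the debug line goes to stderr, stdout carries exactly one JSON document and the injection is parsed instead of silently dropped.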

In the interactive version of this post there's a small explorer where you pick an event, an exit code, and a stdout shape and see exactly what Claude Code does — including how the "stray echo" case turns an injection into a silent no-op. It's on swapnanilsaha.com.

A real pipeline: injecting working memory

Now the case study. My tool is a working-memory MCP server: the agent stores notes during a session and recalls them later. The problem from the intro was that recall is a tool the model has to choose to call, and it wouldn't, reliably. Hooks are how I took the choice away from the model and made recall happen deterministically.

When you run vectr init --hooks, the tool writes four hook groups into the project's .claude/settings.json. Every one of them calls back into the same CLI — vectr hook <event> — which owns the output contract so the settings file stays a thin, stable pointer. Here's the shape it writes (the install code is idempotent — re-running never duplicates entries and leaves any hooks you added yourself untouched):

{
  "hooks": {
    "SessionStart": [
      { "matcher": "startup|resume|clear|compact",
        "hooks": [{ "type": "command", "command": "vectr hook session-start" }] }
    ],
    "UserPromptSubmit": [
      { "hooks": [{ "type": "command", "command": "vectr hook user-prompt-submit" }] }
    ],
    "PreToolUse": [
      { "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "vectr hook pre-tool-use" }] }
    ],
    "PreCompact": [
      { "matcher": "manual|auto",
        "hooks": [{ "type": "command", "command": "vectr hook pre-compact" }] }
    ]
  }
}
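An idempotent install of this kind can be sketched as a merge that appends a hook group only when an identical one is not already present. This is an illustration of the property described above, not vectr's install code; `merge_hook_groups` and the sample settings are hypothetical.

```python
import copy


def merge_hook_groups(settings: dict, additions: dict) -> dict:
    """Add hook groups to `settings` in place, skipping exact duplicates."""
    hooks = settings.setdefault("hooks", {})
    for event, groups in additions.items():
        existing = hooks.setdefault(event, [])
        for group in groups:
            if group not in existing:  # re-running never duplicates an entry
                existing.append(group)
    return settings


# A user-owned hook that must survive the install untouched.
settings = {
    "hooks": {
        "PreToolUse": [
            {"matcher": "Bash", "hooks": [{"type": "command", "command": "./guard.sh"}]}
        ]
    }
}
additions = {
    "PreToolUse": [
        {"matcher": "Edit|Write",
         "hooks": [{"type": "command", "command": "vectr hook pre-tool-use"}]}
    ],
    "UserPromptSubmit": [
        {"hooks": [{"type": "command", "command": "vectr hook user-prompt-submit"}]}
    ],
}

first = copy.deepcopy(merge_hook_groups(settings, additions))
second = merge_hook_groups(settings, additions)
print(first == second)
```

Comparing whole groups by equality is the simplest duplicate test; a real installer might instead key on the command string so that a changed timeout does not create a second entry.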

Four events, four jobs. Walk through why each one is the event it is.

SessionStart — the boot set

Before the agent's first turn, vectr hook session-start fires. It resolves which of my running daemons serves this workspace, asks it for the boot set — the must-see notes, meaning standing directives plus high-priority task context — and emits them as additionalContext. This is the MEMORY.md-equivalent: the handful of things that should be true in the agent's head from turn one, present with zero model agency. The matcher startup|resume|clear|compact means it fires not just on a fresh start but also after a /compact and after a /clear — precisely the moments when the agent has just lost its context and most needs the boot set re-injected.

UserPromptSubmit — per-turn recall

This is the one that fixed the original problem. Every time the user submits a prompt, vectr hook user-prompt-submit reads the prompt text off stdin, runs a semantic recall against the note store keyed to that specific prompt, and injects the top matches next to the prompt before the model ever sees it. Ask about workspace locking and the locking notes are already there. The agent doesn't decide to recall — recall already happened, invisibly, on the way in.

The tuning here matters because this hook is on the hot path of every single turn. I cap it hard: at most 3 notes, with a relevance floor (a minimum similarity of 0.35) so an off-topic prompt injects nothing rather than dragging in vaguely-related noise. And it injects the terse one-line index form of each note, not the full body — enough for the model to know the note exists and decide whether to expand it, without spending a paragraph of tokens on every turn. An injection that fires every turn has to be miserly or it becomes the context bloat it was meant to prevent.
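The cap and the floor amount to a short filter. A minimal sketch, assuming recall has already produced (similarity, one-line index) pairs; the `select_notes` name and the sample scores are hypothetical, while the limits of 3 notes and 0.35 come from the text.

```python
def select_notes(
    scored: list[tuple[float, str]],
    max_notes: int = 3,
    floor: float = 0.35,
) -> list[str]:
    """Keep at most `max_notes` index lines whose similarity clears `floor`."""
    relevant = [(score, line) for score, line in scored if score >= floor]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [line for _, line in relevant[:max_notes]]


candidates = [
    (0.82, "lock acquired at resolver.rs:214"),
    (0.20, "CI uses Node 20"),
    (0.50, "lock released on scope exit"),
    (0.40, "resolver retries twice"),
    (0.36, "workspace ids are UUIDs"),
]
print(select_notes(candidates))
print(select_notes([(0.10, "unrelated note")]))
```

The second call returns an empty list, which is the intended behavior for an off-topic prompt: the hook injects nothing at all.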

PreToolUse (Edit|Write) — the gotcha at the moment of the edit

This one I'm quietly proud of. When the agent is about to edit or write a file, vectr hook pre-tool-use pulls the file_path out of the tool input and recalls any gotcha recorded against that exact file — then injects it right there, at the instant of the edit. "This file's config is regenerated; edit schema.ts instead." "This function looks unrelated but changing it breaks the lock invariant." Static path-scoped rules can't do this, because the gotcha is something an earlier session learned and wrote down, and it surfaces exactly when it's actionable rather than sitting in a rules file the agent skimmed once.

PreCompact — save it before it's gone

/compact replaces the conversation with a summary and, in doing so, throws away exact detail. So right before it runs, vectr hook pre-compact snapshots the working-memory store — sealing the current notes as a named checkpoint. Notably this hook injects nothing into context; compaction is about to discard context anyway, so there's no point. Its whole job is the side effect of persisting state, and the boot set gets re-injected on the other side by the SessionStart compact matcher. The two hooks are a matched pair around the compaction event: one saves, the other restores.

Why hook-injected memory beats a recall tool

Let me make the central design argument sharp, because it's the reason the pipeline exists in this shape.

A recall tool and a recall hook retrieve the exact same notes from the exact same store. The only difference is who pulls the trigger. With a tool, the model decides — and "the model decides" means a probability, well short of 1, that it happens on any given turn, dropping further as the task gets absorbing. With a hook, the harness decides, and the harness is deterministic. It fires every time, on schedule, whether or not the model would have thought to.

For a capability whose entire value proposition is reliability across sessions, a probabilistic trigger is a contradiction in terms. Working memory you recall 60% of the time isn't 60% as good as working memory you recall always — it's worse than that, because the times it fails are unpredictable and the agent has no way to know it's operating on a stale or empty picture. Moving the trigger from the model into the harness is the difference between a feature that demos well and one that holds up.

The arithmetic makes it concrete. If a recall tool fires with probability p on each turn, the chance it fires on every turn of an N-turn session is p^N. At p = 0.60 over 20 turns that's about 0.004% — a clean session is essentially impossible. Even a very obedient p = 0.95 gives you only about 36%. A hook is p = 1, so p^N = 1, every session, forever.
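The same arithmetic in a few lines, assuming each turn is an independent trial with the same probability. The function name is hypothetical.

```python
def clean_session_probability(p: float, turns: int) -> float:
    """Chance that a per-turn event with probability p fires on every turn."""
    return p ** turns


for p in (0.60, 0.95, 1.0):
    print(f"p = {p:.2f}: {clean_session_probability(p, 20):.4%}")
```

Over 20 turns this prints roughly 0.0037%, 35.85%, and 100%, matching the figures above.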

The general principle. Anything that must happen every time belongs in a hook, not in an instruction. A guarantee cannot be prompt-engineered, because the thing you'd be prompting is the exact thing you don't control. Relocate the trigger, not the words.

There's a subtle second-order bug that falls out of doing this, and it's worth telling because it's the kind of thing you only find in real transcripts. Once SessionStart and UserPromptSubmit are auto-injecting notes, the model — which has also been told in CLAUDE.md to recall notes — will sometimes call the recall tool on top of the injection, paying for the same memory twice. I caught this in an eval transcript: the agent got its notes injected by the hook and then immediately called recall for the same thing. The fix is a one-line notice prepended to the injected context: "Your working-memory notes are auto-injected below — do not call recall to re-fetch them; call it only for something not shown here." It resolves the double-dip cleanly, but I'd never have known to write it without watching the failure happen.

The three things that will hurt you

Hooks are simple to write and easy to write dangerously. Three concerns dominate, and they're all about what happens when a hook misbehaves.

1. A hook must never break the session

This is the rule I hold most rigidly, and it shapes every line of my hook code. A hook runs on the critical path — UserPromptSubmit fires before every prompt the user sends. If that hook throws, hangs, or crashes, it degrades or breaks the user's session. So the hook code is paranoid by construction:

The recall function that feeds the injection catches every exception and returns an empty string on any failure. Daemon down, slow, error, malformed response — doesn't matter, it yields nothing and the session proceeds.
The top-level hook handler wraps its entire body in a try/except and always exits 0. There is no code path where my hook returns a non-zero exit and accidentally blocks a prompt or a tool call.
If there's no memory to inject — a brand-new workspace with zero notes — the hook emits nothing at all, not an empty JSON envelope. A fresh project should feel exactly like no hook is installed.

The design stance is that the memory injection is a bonus, never a dependency. The session must work identically whether the daemon is up, down, or on fire. If your hook can make the agent worse when it fails, you've built a liability, not a feature.
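A minimal sketch of that fail-safe shape, not the actual vectr source. It assumes the UserPromptSubmit event JSON carries a "prompt" field and that injected context goes out as hookSpecificOutput.additionalContext, which is my reading of the hooks reference; the recall callable and the notice wording are placeholders.

```python
import json
import sys
from typing import Callable

# Placeholder wording; the real notice text may differ.
NOTICE = "Your working-memory notes are auto-injected below; call recall only for something not shown here."


def build_hook_output(stdin_text: str, recall: Callable[[str], str]) -> str:
    """Return the text to print on stdout, or "" when there is nothing to inject."""
    try:
        event = json.loads(stdin_text)
        notes = recall(event.get("prompt", ""))
        if not notes:
            return ""  # fresh workspace: emit nothing, not an empty envelope
        return json.dumps({
            "hookSpecificOutput": {
                "hookEventName": "UserPromptSubmit",
                "additionalContext": f"{NOTICE}\n{notes}",
            }
        })
    except Exception:
        return ""  # daemon down, malformed input, anything else: stay silent


def main(recall: Callable[[str], str]) -> None:
    """Entry point: print the envelope if there is one, then always exit 0."""
    try:
        out = build_hook_output(sys.stdin.read(), recall)
        if out:
            print(out)
    except Exception:
        pass  # even a failing print must not surface as a hook error
    sys.exit(0)
```

Keeping the logic in a pure function that takes the recall callable as an argument makes the failure paths easy to exercise without a running daemon.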

2. Latency is a tax on every turn

UserPromptSubmit sits between the user hitting enter and the model starting to think. Whatever your hook spends there, the user waits. Claude Code caps this event's hook timeout lower than others for exactly this reason, but a timeout is a backstop, not a budget — you want to be nowhere near it. My recall is designed to return in well under 50 milliseconds, and the hook does the absolute minimum: read stdin, one local HTTP call to an already-running daemon, print, exit. No model loading, no indexing, no network beyond localhost. If your per-turn hook does anything that can take a second, move it off the hot path — make it async, or attach it to a less frequent event.
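One way to keep the hot path bounded is to put a hard timeout on the single localhost call and treat every failure as "no notes". This is a sketch using only the standard library; the URL shape and the 50 ms default are assumptions, not the daemon's real interface.

```python
import urllib.request


def fetch_notes(url: str, timeout_s: float = 0.05) -> str:
    """Do one local HTTP GET with a hard timeout; any failure yields ""."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        # Refused connection, timeout, bad URL, non-2xx status: all mean "inject nothing".
        return ""
```

The timeout here applies to each blocking socket operation rather than to the call as a whole, so it bounds stalls but is not a strict wall-clock budget. If you need one, measure the whole hook from the outside.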

3. Hooks run arbitrary shell — treat them as such

The official docs are blunt about this and they're right: hooks execute arbitrary shell commands with your full user permissions, automatically. They can read your files and your environment variables. A malicious or careless hook in a shared .claude/settings.json is a genuine attack surface — someone commits a hook, you pull the repo, and now their command runs on your machine the next time you start a session.

Practical defenses: review hook configs before committing to shared repos, exactly as you'd review a Makefile or a git hook; keep personal hooks in the gitignored settings.local.json so they can't leak; and know that enterprise setups can lock this down with allowManagedHooksOnly. In my own design I lean on a smaller mitigation — the settings file never contains logic, only vectr hook <event>, a call into a versioned, inspectable CLI. There's no shell one-liner in the JSON to audit; the behavior lives in code you can read. And the CLI only ever talks to a localhost daemon, so a hook firing in the wrong directory can't reach across to another workspace's memory. That last point is deliberate: the resolver walks up from the current directory to find the daemon that serves this workspace and refuses to fall back to a default — because a default port could belong to an unrelated project and leak its notes into your session.
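The walk-up-and-refuse behavior can be sketched in a few lines. The marker name `.vectr` below is hypothetical, chosen only for illustration; the point is that the function returns None instead of guessing a default.

```python
from pathlib import Path
from typing import Optional

MARKER = ".vectr"  # hypothetical marker; substitute whatever identifies a workspace


def find_workspace_root(start: Path, marker: str = MARKER) -> Optional[Path]:
    """Walk up from `start` and return the first directory containing `marker`.

    Returns None when no ancestor has the marker. There is deliberately no
    fallback to a default location or port.
    """
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    return None
```

A caller that gets None should skip injection entirely, which matches the "emit nothing" rule for hooks that have nothing safe to say.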

The honest limitations

A few things I've hit that the enthusiastic tutorials leave out.

Injected context is still context. Every note a hook injects costs tokens, every turn, forever. The UserPromptSubmit hook is genuinely helpful because it's disciplined — 3 notes, relevance floor, terse index form. An undisciplined version that injected ten full notes per turn would reintroduce the exact context bloat the memory system exists to fight. Hook injection is a budget you're spending; spend it like one.
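The discipline described above (a cap of 3 notes, a relevance floor, a terse index form) fits in one small function. This is a sketch: the floor value of 0.35 and the `[id] title` line format are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredNote:
    note_id: str
    title: str
    score: float  # similarity to the current prompt, higher is better


def select_for_injection(notes: list, max_notes: int = 3, floor: float = 0.35) -> list:
    """Keep at most `max_notes` notes scoring at or above `floor`, as one-line index entries."""
    relevant = [n for n in notes if n.score >= floor]
    relevant.sort(key=lambda n: n.score, reverse=True)
    return [f"[{n.note_id}] {n.title}" for n in relevant[:max_notes]]
```

When nothing clears the floor the list is empty, and the hook should then print nothing at all.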

Debugging is opaque by design. Because a malformed stdout on exit 0 means "no decision" rather than "error," a broken hook fails silently. The session just proceeds as if the hook weren't there. When an injection isn't landing, my first move is always to run the exact command by hand, pipe a sample event JSON into its stdin, and stare at stdout for the one stray character breaking the JSON. There's no substitute; the harness won't tell you.
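That manual check can be scripted. The sketch below pipes a sample event into a hook command and reports whether stdout parses as JSON. It only applies to hooks that intend to emit a JSON envelope, and the event fields you pass in are whatever your hook expects.

```python
import json
import subprocess


def probe_hook(command: list, event: dict, timeout_s: float = 5.0) -> tuple:
    """Run `command` with `event` as JSON on stdin; return (ok, detail)."""
    proc = subprocess.run(
        command,
        input=json.dumps(event),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}: {proc.stderr.strip()}"
    if not proc.stdout.strip():
        return True, "empty stdout (nothing injected)"
    try:
        json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        # repr() exposes stray characters that are invisible in a terminal.
        return False, f"stdout is not valid JSON: {exc}; raw={proc.stdout!r}"
    return True, "stdout is valid JSON"
```

The repr of the raw output is the useful part: a stray log line or byte-order mark shows up immediately instead of hiding in terminal rendering.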

It's Claude Code-shaped. This whole mechanism is specific to one harness. My pipeline's determinism comes from Claude Code's hook system, and other agent environments have different injection points, or none. If you want the same deterministic-injection behavior elsewhere, you're re-implementing against a different (or absent) surface, and in the worst case you fall back to the very thing hooks let you escape — hoping the model calls the tool. That portability gap is real and I don't have a clean answer to it yet.

Compaction timing isn't fully in your hands. PreCompact fires before compaction, which is great, but auto-compaction triggers on the harness's schedule, near the context limit — not necessarily at a clean task boundary. My snapshot is a safety net, not a substitute for the agent proactively writing important findings to memory as it goes. The hook catches what the agent forgot to save; it works best when there's little to catch.

What hooks are really for

Strip away the specifics and hooks are one idea: a place to put behavior that must not depend on the model's cooperation. Injection, enforcement, persistence, cleanup — anything where "usually" isn't good enough and you need "always." The model is brilliant at the open-ended, judgment-heavy work in the middle of a turn. It is not the thing you want deciding whether the guardrail runs.

For my working-memory tool, hooks are what turned a good idea that demoed well into something that actually holds across sessions. The notes were always there. What was missing was a guarantee that the agent would look — and that guarantee cannot come from the agent. It comes from four small shell commands wired into the right four events, each one paranoid about never breaking the session, each one doing exactly one deterministic job. That's the whole art of it: not clever hooks, but reliable ones.

Companion posts

This post is part of a series on the working-memory tool the pipeline is built on.

Sources

Claude Code — Hooks reference (event list, settings.json structure, matchers, exit-code protocol, hookSpecificOutput/additionalContext, security warning, allowManagedHooksOnly). Accessed 2026-07-07.
Claude Code — Automate actions with hooks (worked configuration examples).
Claude Code — Context window, compaction, and /compact. Accessed 2026-07-07.

Building Vectr, Part 2: What /compact Destroys and How to Survive It

Swapnanil Saha — Tue, 16 Jun 2026 13:28:51 +0000

Session three of a bug hunt in CPython's garbage collector. Two sessions in, I had what felt like a solid map: the exact call chain from PyObject_GC_Del through the generational collector, the non-obvious invariant around finalizer ordering, the three files where the relevant logic lived. Then /compact fired.

The summary said something like: "we were investigating CPython's garbage collector, specifically the interaction between finalizers and the generational GC." Accurate. Useless. The exact function signatures were gone. The specific line numbers were gone. The invariant that took two sessions to understand — compressed to one sentence that had lost all the nuance. The next 20 minutes: re-reading files to rebuild what I already knew.

This post is about what I learned from that, and from the working memory system I built to prevent it. Part 1 covered the indexing layer — how Vectr finds things in a codebase semantically. This part covers what happens after you find something: how to keep the knowledge alive across session boundaries, why my initial design was wrong in a fundamental way, and what actually works.

Part 1: The Problem With /compact

What /compact Actually Destroys

Most people treat /compact as "clear the context to keep going." That framing is roughly correct but understates the damage. The issue isn't just that context gets shorter — it's that the compression is lossy in exactly the cases where being wrong is most expensive.

/compact works by asking the AI to summarize the current conversation, then replacing the full history with that summary. Token count drops from (say) 180,000 to 12,000. Here's what the summary doesn't preserve:

Exact function signatures. A summary might say "the function takes a path and a flag." The conversation had def process_workspace_changes(path: Path, db: Database, *, force: bool = False) -> list[ChangeResult]. The difference between those two descriptions is the difference between a valid call site and a runtime error.

Specific line numbers. "The resolver module" and /src/workspace/resolver.rs:214 are not the same precision. You can reconstruct the file path, but it costs you a tool call.

Non-obvious behavioral invariants. If you spent three turns establishing that acquire_lock() must be called before touching workspace metadata because there's a race condition with the filesystem watcher, that three-turn understanding might survive as "be careful with locking." The exact invariant — the one that matters when you're writing the code — is gone.

The reasoning chain. Sometimes the value of an exploration session isn't the final answer but the chain of observations that produced it. Summaries discard chains. They keep endpoints.

Key insight: Summaries are fine for preserving topics and general direction. They fail specifically at exact signatures, line numbers, and subtle behavioral invariants — which is also where being wrong is most expensive. A summary of "be careful with locking" covers the topic. It doesn't tell you which function must be called first, or why, or what breaks if you get it wrong.

In the CPython scenario, re-establishing the finalizer ordering invariant from scratch means re-reading several files and re-following a non-obvious call chain — roughly 15–20 minutes of work that was already done. A note stored at the end of session two takes about a minute to write and ten milliseconds to retrieve.

Why You Can't Tell the AI to Just Forget Things

When I started building Vectr's memory layer, I had a clean model: the AI finds something useful, stores it with vectr_remember, then drops the file from its context window. The note is 50 tokens. The file was 800 tokens. Net gain: 750 tokens freed for new content. I called this "context offload."

I built it this way. I wrote documentation describing it this way. I designed vectr_evict_hint entirely around it.

It doesn't work.

The KV cache is append-only. Think of the transformer's memory as a lookup table it builds as it reads each token. For each token it processes, it computes a key-value representation that gets stored at each attention layer. Every subsequent token attends back to every previous token through these cached representations — that's how earlier context influences later output.

Once a token's representation is computed and cached, it stays until the context is cleared. There is no mechanism to evict specific tokens by instruction. "You can drop chunk X from your context window" is itself processed as tokens — added to the cache, not used to remove other entries from it.

A subtlety worth naming: the KV cache is maintained server-side by the inference provider. What you see as "context window usage" is a count of tokens in the current conversation, not a direct readout of GPU memory. The principle holds regardless: every token in the conversation occupies a slot in the cache, and you cannot remove individual tokens from a running session without ending or compressing the whole thing.

The KV cache memory cost formula:

KV cache size = 2 × L × n_heads × d_head × T × bytes_per_float

For a representative mid-size model: L=32 layers, n_heads=32, d_head=128, T=50,000 tokens at fp16 (2 bytes):

2 × 32 × 32 × 128 × 50,000 × 2 = 26,214,400,000 bytes ≈ 26.2 GB

The cache grows linearly with sequence length T. No selective removal. The operations that genuinely reduce context are: end the session (total loss), use /compact (precision loss), or rely on provider-side prefix caching — which stores stable prefix representations like system prompts to avoid recomputing them, but doesn't remove anything from your active context budget.
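To evaluate the formula for other model shapes, here is a small helper. It is a sketch that assumes standard multi-head attention with one key and one value vector per head per layer; models using grouped-query or multi-query attention cache fewer heads and come out smaller.

```python
def kv_cache_bytes(layers: int, n_heads: int, d_head: int, tokens: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size in bytes: keys and values (x2) for every layer, head, and token."""
    return 2 * layers * n_heads * d_head * tokens * bytes_per_value


if __name__ == "__main__":
    size = kv_cache_bytes(layers=32, n_heads=32, d_head=128, tokens=50_000)
    per_token = kv_cache_bytes(layers=32, n_heads=32, d_head=128, tokens=1)
    print(f"total: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
    print(f"per token: {per_token / 1024:.0f} KiB")
```

The leading factor of 2 (keys plus values) and the 2 bytes per fp16 value are separate multipliers, which is easy to lose track of when doing the product by hand.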

I measured context window usage before and after sequences of vectr_remember + vectr_evict_hint calls: essentially unchanged. The hint was adding tokens to the cache while accomplishing nothing at the context management level. In some cases it made things marginally worse.

Warning: Any tool or documentation claiming "store to external memory to free context budget" is describing something the system cannot deliver. Tokens in a live context window cannot be selectively evicted. Working memory tools are genuinely valuable — but not for freeing active context. Building around that claim confuses your benchmarks and misleads anyone using the tool.

Part 2: What Working Memory Actually Does

Three Tiers of Value

Once I dropped the context-offload framing, the actual value of vectr_remember became clear. It operates on three time horizons:

Tier 1 — In-session re-read avoidance. Within a single session, before any /compact: recalling a stored note costs ~50 tokens instead of re-reading the original file at ~600 tokens. Real savings, but the file is still sitting in your context window anyway. Genuinely useful, but not the reason to build this.

Tier 2 — /compact survival. When /compact compresses the conversation, notes stored on disk (SQLite + ChromaDB) are untouched. Exact signatures and behavioral invariants survive verbatim. The session resumes from actual precision. This is where the system earns its cost.

Tier 3 — Cross-session persistence. Between separate sessions — the editor closed and reopened — the AI starts with nothing. Notes survive. A new session calling vectr_status() + vectr_recall() recovers findings from sessions ago without re-reading a single file. Each session builds on the ones before it.

Analogy — The surgeon's notes: A surgeon takes detailed notes before starting a complex procedure. Halfway through, an emergency calls them away for two hours. When they return: (a) their notes are on the desk — exact measurements, named vessels, where they left off; or (b) a colleague wrote a summary: "patient is partially through a vascular procedure, some complications noted." Option (b) is dangerous. Option (a) lets you continue precisely. vectr_remember is option (a). /compact without notes is option (b).

Tier 3 compounds in a way that's easy to underestimate. The first session on a complex codebase pays the discovery cost. The second benefits from the first session's notes. By the tenth session, a well-maintained note store is a persistent model of the codebase that makes every session faster.

What to Store and How

Don't store file pointers. "See resolver.rs:214 for the lock implementation" is a bad note. File paths change during refactoring. Line numbers drift with every edit. A pointer hasn't captured what you learned — it's a reference. When you recall it, you still have to read the file.

Store the finding itself:

WorkspaceLock: defined at resolver.rs:214 (as of 2026-06-08)
- acquire(): blocks if .vectr_lock exists; writes current PID + timestamp
- release(): validates PID match before deleting lock file
  (returns Err if mismatch — this is intentional, not a bug)
- CRITICAL: acquire() must be called BEFORE touching workspace
  metadata. The filesystem watcher reads metadata; touching it
  without holding the lock fires an invalid re-index.
  This caused the race condition in issue #1247.

Key callsites: workspace.rs:89 (init), daemon.rs:203 (shutdown)

This note is ~120 tokens. Reading the relevant files to reconstruct this knowledge would cost 600+ tokens plus two turns. The note captures the actual insight — the non-obvious invariant about lock order — not just a pointer.

Priority and tags are not cosmetic. priority affects recall ordering: high-priority notes rank higher when multiple notes match a query with similar scores. tags enable filtered recall — vectr_recall(query="locking", tags=["concurrency"]) returns only notes tagged with "concurrency" that semantically match the query. In a large note store accumulated over months, filtering by subsystem makes recall precise.
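One plausible way to implement "priority breaks ties among similar scores" is to bucket the similarity scores and sort by bucket first, priority second. This is a sketch of that idea, not Vectr's actual ranking code; the 0.05 tolerance and the tuple layout are assumptions.

```python
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def order_matches(matches: list, tolerance: float = 0.05) -> list:
    """Order (note_id, score, priority) tuples: best score bucket first, then priority.

    Scores within roughly `tolerance` of each other fall into the same bucket,
    so priority decides between them; the raw score breaks any remaining ties.
    """
    def sort_key(match):
        _, score, priority = match
        bucket = round(score / tolerance)
        return (-bucket, PRIORITY_RANK[priority], -score)

    return sorted(matches, key=sort_key)
```

A clearly better semantic match still wins outright; priority only reorders notes the search could not meaningfully separate.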

Part 3: The Bugs That Shaped the Design

The B9 Bug: When Recall Doesn't Recall

For several early benchmark runs, vectr_recall was firing in implementation sessions but returning nothing useful — 0 relevant results across 5 separate sessions on CPython tasks, even though the research session had stored detailed notes about exactly the functions being modified.

Root cause: recall was using SQL LIKE queries, not semantic search.

# The broken implementation (pre-B9)
def recall(query: str) -> list[Note]:
    return db.execute(
        "SELECT * FROM notes WHERE content LIKE ? LIMIT 20",
        (f"%{query}%",)
    ).fetchall()

SQL LIKE is substring matching. vectr_recall("garbage collector finalizer ordering") would only return notes containing that exact string. A note about PyObject_GC_Del describing finalizer behavior — stored with different wording in a different session — wouldn't match.
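The failure is easy to reproduce with an in-memory SQLite database. In this sketch the note contents are placeholder text written for the demo, not real findings about CPython.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO notes (content) VALUES (?)",
    [
        # Placeholder notes: same topic, different wording.
        ("PyObject_GC_Del: notes on the order in which finalizers run",),
        ("garbage collector finalizer ordering, summary from session two",),
    ],
)


def like_recall(query: str) -> list:
    """Substring recall: a note matches only if it contains `query` verbatim."""
    rows = conn.execute(
        "SELECT content FROM notes WHERE content LIKE ? LIMIT 20",
        (f"%{query}%",),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(like_recall("garbage collector finalizer ordering"))  # misses the first note
    print(like_recall("finalizer"))                             # a single keyword finds both
```

The natural-language query misses the first note even though it is about the same thing, which is exactly the case an agent hits when the storing session and the recalling session phrase things differently.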

The fix: use the ChromaDB vector store for recall. Notes are embedded when stored, retrieved by semantic similarity when recalled.

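# The correct implementation (post-B9)
def recall(query: str, tags: list[str] | None = None) -> list[Note]:
    results = chroma_collection.query(
        query_texts=[query],
        n_results=10,
        # Assumes each note stores one scalar "tags" value; $in matches it against the list.
        where={"tags": {"$in": tags}} if tags else None,
    )
    # query() returns a dict of parallel lists with one inner list per query text,
    # so take index 0 and zip the columns back into one row per note.
    rows = zip(results["ids"][0], results["documents"][0], results["metadatas"][0])
    return [Note.from_chroma(row) for row in rows]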

Impact was immediate: vectr_recall fired with relevant results in 4 of 6 implementation sessions in the CPython re-run, compared to 0 of 6 before. This bug sat undetected because the initial benchmark design didn't make empty recalls visible. Per-tool logging — "vectr_recall called 5 times, 5 empty responses" — made it obvious.

Warning: SQL LIKE requires the query string to be a literal substring of the stored content. For anything more than exact-match lookup, it's not just suboptimal — it's functionally broken for most real queries.

vectr_evict_hint: What It Actually Does After the Reframe

After fixing the context-offload misconception, I kept vectr_evict_hint but reframed it completely. What it actually does: it tracks the cumulative token cost of all code chunks Vectr has retrieved in the current session. When this cost crosses a threshold (40K tokens or 20 tool calls — whichever fires first), it appends a hint:

[vectr_evict_hint] You've retrieved ~42,000 tokens of indexed chunks
this session. The following chunks are fully indexed and re-retrievable
in <50ms — no need to re-read these files later:

  - resolver.rs:214  WorkspaceLock::acquire  (retrieved 8 turns ago)
  - resolver.rs:267  WorkspaceLock::release  (retrieved 8 turns ago)
  - workspace.rs:89  init call site          (retrieved 5 turns ago)

Consider calling vectr_remember now if you have key findings you
haven't stored yet.

The word is "re-retrievable," not "droppable." The hint doesn't claim to free tokens. It tells the AI: these files are in the index, you can get them back in under 50ms if you need them — don't re-read out of caution when you already have what you need or could re-search instantly. It's a behavioral nudge, not a memory management operation.

The thresholds are modeled on MemGPT (arXiv:2310.08560), which issues a memory-pressure warning once the context reaches roughly 70% of the window, and are set so the hint fires before retrieved content drifts into the "lost in the middle" zone described by Liu et al. (arXiv:2307.03172). Using a disjunction (whichever threshold is reached first triggers the hint) keeps it from firing too late on sessions that accumulate few large files but many small searches.
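The two-threshold trigger can be sketched as a small counter. The class and method names are invented for illustration, and treating each retrieval as one tool call is an assumption; only the 40K-token and 20-call limits come from the description above.

```python
from dataclasses import dataclass


@dataclass
class RetrievalBudget:
    token_limit: int = 40_000
    call_limit: int = 20
    tokens: int = 0
    calls: int = 0

    def record(self, chunk_tokens: int) -> bool:
        """Record one retrieval; return True once either limit has been reached."""
        self.tokens += chunk_tokens
        self.calls += 1
        return self.tokens >= self.token_limit or self.calls >= self.call_limit
```

Twenty small searches trip the call limit long before they reach 40K tokens, and two very large retrievals trip the token limit long before the twentieth call. Whether the hint should repeat after the first trigger or fire only once is a separate policy decision this sketch leaves open.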

Lost in the middle: LLM performance on retrieval tasks follows a U-shaped curve over context position — accuracy highest at the beginning and end, degrading for content in the middle. The evict_hint threshold is set to fire before relevant information drifts into that degraded zone.

Part 4: The Mechanics of Actually Using It

The Save-Moment Problem

Knowing notes are valuable doesn't make the AI store them. In early sessions, vectr_remember call rates were low — not because the AI couldn't see the tool, but because there was no clear trigger for "now is the moment to save this."

Saving notes is a habit humans develop from experiencing loss. An AI editor in session 1 has never lost anything to /compact here — it's optimizing for the task in front of it, not a compression event that might happen three hours from now.

The solution: making the save-moment explicit and concrete in the CLAUDE.md template that Vectr writes into a workspace.

**The moment you find a key definition, pattern, or non-obvious detail:**
call vectr_remember(content, tags=[...], priority="high"|"medium"|"low")
— store the actual code block or finding, not a file pointer.

Treat every vectr_search or vectr_locate call as a **pair**: search,
then immediately save the key finding before your next retrieval.

If /compact runs later, the conversation summary loses exact signatures
and line numbers — your note does not.

"Pair every search with a save" turned out to be the most effective framing. Not "save when it feels important" (too vague), but "pair every retrieval with a note" (concrete, immediate trigger). Sessions that stored the most notes also had the lowest re-discovery costs in subsequent tasks.

When not to search: the SR-RAG finding. The pair pattern addresses when to save. There's a complementary question that ended up in the same CLAUDE.md template: when to search at all. Before calling vectr_search on a well-known API or framework, the AI should first write out what it already knows and only search if genuine gaps remain.

This comes from SR-RAG (arXiv:2504.01018). The finding: models often retrieve information already baked in from training, adding token cost without improving answer quality. Writing out what you already know before searching reduces unnecessary calls by 26–40% on familiar codebases. On an unfamiliar codebase, the AI's training knowledge rarely applies — every search turns up something new. On well-known frameworks, training knowledge is often more accurate than indexed documentation. The verbalization step surfaces which situation you're actually in.

Snapshots: Checkpointing an Investigation

Beyond individual notes, there's a use case for checkpointing entire session states. vectr_snapshot("lock-subsystem-mapped") seals the current note set under a named label with a timestamp. vectr_snapshot_list() at session start shows all checkpoints.

Typical multi-session workflow:

Exploration sessions: explore, call vectr_remember on each key finding. Pair every search with a save.
Exploration complete: vectr_snapshot("exploration-complete"). Seals the note state for this phase.
Implementation sessions: vectr_status() → vectr_recall(query) → build on the snapshot.
Implementation done: vectr_snapshot("implementation-done"). Two named checkpoints marking the arc.
Revisiting months later: vectr_snapshot_list() shows the investigation history. The snapshot timestamp tells you which notes were established before a given change.
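Conceptually, a snapshot is an immutable record of which notes existed under a label at a point in time. The sketch below illustrates that idea only; the class names and the in-memory storage are assumptions, not how vectr_snapshot is implemented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    label: str
    created_at: datetime
    note_ids: tuple  # copied at seal time, so later edits cannot change it


class SnapshotLog:
    def __init__(self) -> None:
        self._snapshots = []

    def seal(self, label: str, note_ids) -> Snapshot:
        """Freeze the current note ids under `label` with a UTC timestamp."""
        snapshot = Snapshot(label, datetime.now(timezone.utc), tuple(note_ids))
        self._snapshots.append(snapshot)
        return snapshot

    def history(self) -> list:
        """Return all snapshots, oldest first."""
        return list(self._snapshots)
```

Copying the ids into a tuple is what makes the checkpoint meaningful: notes added or forgotten afterwards do not rewrite what "exploration-complete" contained.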

When Notes Are Wrong: vectr_forget

Notes can be wrong. A note about function behavior written before a refactor may describe the old behavior. Stale notes are worse than no notes — false confidence in outdated information.

vectr_forget(note_id) deletes it. Every vectr_recall response includes note IDs alongside the content so you can act on them inline. The workflow: recall → verify against current code → forget the stale note → store the updated one.

Vectr also appends a [STALE] marker automatically when a file path extracted from a note's content no longer exists in the workspace. The extraction is a regex scan for path-like strings — when those paths disappear from the file tree, the note gets flagged. It only catches path-level staleness, not behavioral changes in files that kept their names.

Warning: The [STALE] marker fires when a referenced file path disappears. It does NOT fire when file content changes. A note about function behavior after a refactor that renamed the file gets flagged; a note about function behavior after a refactor that changed the logic without renaming gets no warning. Always verify behavioral notes against current code before acting on them for implementation work.
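A path-level staleness check of this kind might look like the following. It is a sketch under two stated assumptions: the regex and its extension list are hypothetical, and paths in notes are taken to be relative to the workspace root.

```python
import re
from pathlib import Path

# Hypothetical pattern: path-like tokens ending in a known source extension.
PATH_PATTERN = re.compile(r"[\w./-]+\.(?:py|rs|c|h|java|ts|go)\b")


def stale_paths(note: str, workspace: Path) -> list:
    """Return path-like strings from `note` that no longer exist under `workspace`."""
    found = set(PATH_PATTERN.findall(note))
    return sorted(p for p in found if not (workspace / p).exists())


def mark_if_stale(note: str, workspace: Path) -> str:
    """Prefix the note with [STALE] when any referenced path is missing."""
    return f"[STALE] {note}" if stale_paths(note, workspace) else note
```

Nothing in this check opens a file, which is why it cannot notice a function whose logic changed while its file kept the same name.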

The Design Principle I'd Rephrase

Looking back at the original Vectr documentation for working memory, almost every sentence led with the wrong framing. "Store to vectr, then drop from context." "Offload findings to free context budget." "Context offload layer." Every one of these is technically false, and I shipped all of them.

The correct version is shorter: store findings now so you can recall them precisely later. Through /compact. Through a new session. Through however many turns separate the discovery from the moment you need to use it. The value is in the later. The storing is cheap. The recalling is where you get the hours back.

If I were writing the documentation from scratch, I'd lead with the /compact scenario: the specific moment when a detailed understanding of a complex system compresses into a three-sentence summary that can't be acted on. That's the moment where a stored note is worth far more than the minute it took to write.

What's Next

The part I haven't answered yet: does any of this actually save time? Not in the abstract — in real benchmarks, on real codebases, compared against an AI editor with no indexing and no memory. The number I care about is not total session cost (which includes upfront research overhead that inflates the naive comparison) but re-discovery cost per task across repeated sessions on the same codebase.

Part 3 covers that measurement — including why the total sprint cost comparison is almost exactly the wrong metric to report, and what the data from CPython, Django, and Apache Camel actually showed once I separated research overhead from implementation savings.

If you want to try Vectr now, the tool page has setup instructions. The full working memory layer — vectr_remember, vectr_recall, vectr_snapshot, vectr_forget — is in the current release alongside the semantic search tools from Part 1.

References

Packer et al., MemGPT: Towards LLMs as Operating Systems, arXiv:2310.08560, 2023
Liu et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172, 2023
Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization, arXiv:2504.01018, 2025
Vaswani et al., Attention Is All You Need, NeurIPS 2017
A Survey on LLM Acceleration Based on KV Cache Management, arXiv:2412.19442, 2024

Building Vectr, Part 1: Why grep Fails When You Don't Know the Keywords

Swapnanil Saha — Tue, 09 Jun 2026 14:10:06 +0000

This is Part 1 of the Building Vectr series (1 of 3).

You get dropped into an unfamiliar codebase. Not a toy project — real production code, 8,000 files, three years of accumulated complexity and clever abstractions. Your job is to fix a bug in the request validation pipeline. What does an AI code editor do next?

This post is about a problem I kept running into, a tax I kept paying, and the indexing system I built to eliminate it. It covers the technical decisions behind Vectr's search layer: why naive chunking produces bad embeddings, how tree-sitter solves the code-parsing problem, what BM25 does that vector search can't, and why you need a symbol graph for questions that text search cannot answer at all.

Part 1 — The Problem

The Re-discovery Tax

If you're a human engineer navigating an unfamiliar codebase, here's what you probably do: you ask someone who knows it, or you grep for the error message, or you open the entry point and follow imports until you find the thing. Your brain does semantic compression the whole way — building a model of the system, discarding noise, following intuitions about where complexity tends to live. By the time you've read 20 files, you have a rough map that persists across days and sessions.

An AI code editor has the same tools — read files, run shell commands, grep — but completely different economics. Every Read call costs tokens. Every Bash call for grep costs a turn. Unlike a human who can skim-read at 1,000 words per minute and discard irrelevant content almost for free, an AI editor pays full price for every character it reads: it sits in the context window whether or not it was useful. Read the wrong 500-line file and you've burned context that could have held the answer.

The result, on unfamiliar codebases, is what I started calling the re-discovery tax: a cluster of navigation calls at the start of every session, before any actual implementation begins, spent on figuring out where things are. And because AI editors have no persistent memory between sessions, they pay this tax again and again — every session, on the same codebase.

In benchmarks I ran against real open-source codebases (more detail in Part 3), the re-discovery tax on CPython internals ranged from 6 to 23 tool calls per task before the first file write. Some sessions spent more turns navigating than implementing.

Key observation: The re-discovery tax is paid every session, not once. A human engineer's mental map of a codebase accumulates and compounds. An AI editor's map is fully rebuilt from scratch at the start of each session. The economic gap widens as the codebase grows.

Why grep Fails at the Boundary of Your Knowledge

Before explaining what I built, I want to be precise about where grep breaks down — because "just use grep" is the natural reaction, and it's not obviously wrong until you try to use it systematically on unfamiliar code.

grep is a brilliant tool for confirming hypotheses you already have. If you know what you're looking for, it's nearly perfect. The problem is the case that isn't really an edge case: you don't know what you're looking for.

Say you're trying to understand how a Django application validates incoming JSON payloads before they hit the ORM layer. You might grep for validate. You'll get 200 results across 40 files — field validators, form validators, configuration validators, test fixtures. None of them are obviously the thing you want. You grep for json.loads. You get 30 results. You grep for request.data. That gets you closer, maybe. But you spent four greps and 15 minutes before you found the right file.

The deeper problem: grep requires you to already have a mental model of the codebase's naming conventions. An AI editor running on an unfamiliar codebase doesn't know whether payload validation is called validate_payload, check_request, parse_input, or _pre_process.

Analogy: Think of keyword search as asking for directions by street name in a city you've never visited. "Where is Maple Street?" gets a precise answer. But "where is the street with the good coffee shop near the park?" — keyword search has nothing to offer. You need a different kind of index: one that understands what places are for, not just what they're called.

Semantic search inverts this. It maps your query and every code chunk into the same high-dimensional vector space, then finds the chunks closest to your query by meaning, regardless of whether they share any words. "JWT validation logic" finds verify_token even if none of those three words appears in the function body.

Part 2 — Building the Index

The Chunking Problem: Why Line Windows Break on Code

Prose text has a natural unit of meaning: the paragraph. You can split a Wikipedia article into 200-word chunks, embed each one, and get a reasonable search system. Code doesn't work this way.

The standard naive approach for code indexing is the same line-window strategy borrowed from document search: take a sliding window of N lines with M lines of overlap, create a chunk, embed it, move the window. A common default might be 150-line windows with 50 lines of overlap. Simple, language-agnostic, works on any file format.

The problem is what happens at the window boundaries. Consider this function:

def process_workspace_changes(
    path: Path, db: Database, *, force: bool = False
) -> list[ChangeResult]:
    """Process all pending changes in a workspace, optionally forcing re-indexing."""
    pending = db.get_pending_changes(path)
    if not pending and not force:
        return []

    results = []
    for change in pending:
        if change.kind == ChangeKind.DELETED:
            db.remove_chunks_for_file(change.file)
            results.append(ChangeResult(file=change.file, status="removed"))
        elif change.kind in (ChangeKind.MODIFIED, ChangeKind.CREATED):
            chunks = chunk_file(change.file, db.language_for(change.file))
            db.upsert_chunks(chunks)
            results.append(ChangeResult(
                file=change.file, status="indexed", chunk_count=len(chunks)
            ))

    db.mark_changes_processed(path)
    return results

If a 150-line window happens to cut through this function, neither resulting chunk is independently meaningful. The chunk with just the body is missing the parameter names and return type. The chunk with just the signature has no implementation context. The embedding of a half-function is significantly worse than the embedding of the complete thing.
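To make the boundary problem concrete, here is a minimal sketch of the line-window strategy with the 150/50 defaults mentioned above. The function names and the example line ranges are illustrative assumptions, not code from any particular indexer.

```python
def line_window_chunks(lines, window=150, overlap=50):
    """Return (start, end) line ranges, end exclusive, for a sliding window."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    ranges = []
    start = 0
    while start < len(lines):
        end = min(start + window, len(lines))
        ranges.append((start, end))
        if end == len(lines):
            break
        start += step
    return ranges


def windows_cutting(ranges, func_start, func_end):
    """Windows that hold only part of a function spanning [func_start, func_end)."""
    return [
        (s, e)
        for s, e in ranges
        if s < func_end and e > func_start            # window overlaps the function
        and not (s <= func_start and e >= func_end)   # but does not contain all of it
    ]


# A hypothetical 400-line file and a 120-line function at lines 140..259.
ranges = line_window_chunks(["..."] * 400)
fragments = windows_cutting(ranges, 140, 260)
```

In this example every window that touches the function holds only a fragment of it, so no chunk ever sees the whole function. Overlap helps short functions, but it cannot help one that is longer than the overlap and straddles a boundary.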

The fix: split at semantic boundaries. Functions should be complete units. Classes should contain their methods, or each method should be its own chunk with the class header prepended for context.

Why completeness matters: An embedding model compresses everything in its context into a single fixed-size vector. A complete function gives the model everything it needs to capture the function's purpose, parameters, return behavior, and side effects in that vector. A half-function forces the model to compress an ambiguous fragment — the resulting vector is a blurred average of possible interpretations.

Parsing Code with tree-sitter

tree-sitter is a parser library that produces concrete syntax trees for source code — every construct in the language has a named node with exact byte boundaries in the source. Unlike a regex-based approach, tree-sitter actually parses the grammar and handles edge cases correctly: nested functions, decorators on multiple lines, multiline function signatures, arrow functions in JavaScript, generic bounds in Rust.

For Python, the tree-sitter query:

(function_definition
  name: (identifier) @name
  parameters: (parameters) @params
  body: (block) @body) @function

(class_definition
  name: (identifier) @name
  superclasses: (argument_list)? @bases
  body: (block) @body) @class

This matches any function or class definition anywhere in the file and captures the name, parameters, and body as named nodes with precise byte-range positions. You can then slice the original source file at those byte positions to extract complete, syntactically valid chunks.

For classes, Vectr attaches the full class signature — including the base class list captured by @bases — as a header to each method chunk. So the chunk for WorkspaceLock.acquire() includes its inheritance context. A method of AuthenticatedView(LoginRequiredMixin, View) has a meaningfully different semantic context than a method of a plain View.
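The same idea can be sketched without tree-sitter. The example below swaps in Python's standard-library ast module, which only parses Python, to show the two rules described above: one chunk per complete function, and the class header prepended to each method chunk. It is an illustration under those assumptions, not Vectr's implementation, and it assumes the class header fits on one line.

```python
import ast


def _source_of(node, lines):
    # Start at the first decorator if there is one; lineno is 1-based and
    # end_lineno is inclusive.
    start = min([d.lineno for d in node.decorator_list] + [node.lineno])
    return "\n".join(lines[start - 1 : node.end_lineno])


def chunk_python_source(source):
    """One chunk per top-level function and one per method (with class header)."""
    lines = source.splitlines()
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            chunks.append({"name": node.name, "text": _source_of(node, lines)})
        elif isinstance(node, ast.ClassDef):
            # Assumption: the `class Name(Bases):` header is a single line.
            header = lines[node.lineno - 1].strip()
            for item in node.body:
                if isinstance(item, funcs):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "text": header + "\n" + _source_of(item, lines),
                    })
    return chunks


SAMPLE = '''
class AuthenticatedView(LoginRequiredMixin, View):
    def get(self, request):
        return render(request)

def helper(x):
    return x + 1
'''

chunks = chunk_python_source(SAMPLE)
```

Each method chunk now starts with its class header, so the embedding sees the base classes alongside the method body.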

A subtlety: very large functions. AST-aware chunking breaks down for functions that are genuinely enormous — 500+ lines. Vectr handles this by further splitting large functions at their major control-flow boundaries (default threshold: 200 lines). The resulting sub-chunks each include the function signature as a header to preserve context. Their embedding quality is better than one giant embedding, though still lower than a naturally small function.

Code-Specific Embeddings Running Locally

Not all embedding models are equally good at code. Models trained primarily on prose text have learned representations of natural language semantics. Code has different regularities: symbol names, type signatures, control flow patterns, API call chains. Code-aware models routinely outperform general-purpose models by 10–20% on tasks like "find the function that handles X."

Vectr uses Snowflake/snowflake-arctic-embed-m-v1.5, a 110-million-parameter model that produces 768-dimensional embedding vectors and runs in under 100ms per batch on a modern laptop CPU.

Why local inference instead of an API? Two practical constraints:

Cost: a tool that fires 20–50 search calls per session would accumulate non-trivial API costs quickly. Local inference is free at query time after the one-time model download.
Data privacy: many codebases cannot be sent to third-party APIs. Internal tools, proprietary algorithms, customer data models — many organizations have policies or contractual obligations that prohibit sending source code to external services.

The tradeoff: the model weighs roughly 440MB and needs to be downloaded on first run. This is a real friction point.

One critical detail: queries and chunks are embedded with different input prefixes. Queries use "Represent this query for searching relevant code:" and chunks use "Represent this code snippet:". arctic-embed-m is a single encoder, but it was trained with different prefixes for query-side and document-side inputs. Using the wrong prefix reduces the cosine similarity between semantically related query-chunk pairs: the vectors for "user authentication" and verify_token end up further apart in embedding space than they should be. Getting this wrong costs 5–15% in retrieval quality.

Part 3 — The Search Layer

Hybrid Search: Why BM25 and Vector Search Need Each Other

Vector search handles concept queries well. But if you search for _handle_workspace_lock_conflict — an exact function name — a vector search might not rank it first. The embedding is just one point in a crowded neighborhood of similar-looking function names. BM25, on the other hand, will find it immediately: exact string matches get the highest possible score.

The inverse is also true: BM25 cannot find "retry logic with exponential backoff" if the function is called _schedule_attempt_with_delay and its docstring says nothing about backoff. Zero keyword overlap means zero BM25 score. Vector search finds it because the semantic cluster it belongs to is close to the query in embedding space.

The right system uses both. Every query in Vectr runs both a vector search and a BM25 search in parallel, then combines the two ranked lists using a weighted formula.

BM25 scoring formula:

score(D, Q) = Σᵢ IDF(qᵢ) · [ tf(qᵢ, D) · (k₁ + 1) ] / [ tf(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl) ]

IDF(qᵢ) = log( (N − nᵢ + 0.5) / (nᵢ + 0.5) )

Where:

tf(qᵢ, D) — term frequency of qᵢ in document D
N — total documents; nᵢ — documents containing qᵢ
|D| — document length in tokens; avgdl — average document length
k₁ = 1.5 (term-frequency saturation), b = 0.75 (length normalization)

This is the Robertson–Sparck Jones variant. Some implementations add +1 inside the IDF log to prevent negative values for very common terms.
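As a sanity check on the formula, here is a small pure-Python sketch of it with the same k₁ and b defaults. It assumes documents are already tokenized into lists of strings. It is not the rank-bm25 implementation, just the equations above written out.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # n_i: number of documents containing each term
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            if tf[q] == 0:
                continue
            n_i = doc_freq[q]
            idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5))
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["acquire", "workspace", "lock"],
    ["release", "workspace", "lock"],
    ["parse", "json", "payload"],
    ["render", "html", "template"],
]
scores = bm25_scores(["acquire", "lock"], docs)
```

In this toy corpus "lock" appears in exactly half the documents, so its IDF is log(1) = 0 and only "acquire" contributes. A term in more than half the documents would get a negative IDF, which is the case the +1 variant guards against.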

The weight assigned to each approach depends on codebase familiarity:

Situation	BM25 weight	Vector weight
Large unfamiliar codebase	0.2	0.8
Small familiar codebase	0.7	0.3
Explicit symbol name in query	0.8	0.2
Natural language concept query	0.2	0.8

These weights are the actual values used in Vectr's implementation, tuned against the benchmark dataset.
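A weighted merge of the two ranked lists might look like the sketch below. The min-max normalization step is my assumption for illustration; the text above only says that normalized scores are combined with these weights, and the chunk ids are made up.

```python
def min_max(scores):
    """Rescale a {chunk_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_merge(bm25, vector, bm25_weight, vector_weight, top_n=5):
    """Combine two score mappings; a chunk missing from one list scores 0 there."""
    nb, nv = min_max(bm25), min_max(vector)
    combined = {
        cid: bm25_weight * nb.get(cid, 0.0) + vector_weight * nv.get(cid, 0.0)
        for cid in set(nb) | set(nv)
    }
    # Highest combined score first; ties broken by chunk id for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


bm25_hits = {"a": 10.0, "b": 2.0}
vector_hits = {"b": 0.9, "c": 0.5, "a": 0.1}

concept_query = hybrid_merge(bm25_hits, vector_hits, 0.2, 0.8)
symbol_query = hybrid_merge(bm25_hits, vector_hits, 0.8, 0.2)
```

The same two candidate lists produce different orderings under the concept-query weights and the symbol-name weights, which is the whole point of making the weights depend on the query.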

The benchmark on Apache Camel (58,000+ Java files) showed a 73% reduction in Read+Bash navigation calls compared to the baseline AI editor with no index.

The Symbol Graph: What Text Search Cannot Answer

Semantic search and BM25 handle "find me the code for this concept" well. But there's a different navigation pattern that neither handles: "find me everything that calls this function."

Vectr builds a symbol graph during indexing. For each file, tree-sitter extracts:

Definitions — every function, class, method, and module-level constant with name and line number
Call edges — every call site, mapping callee name to the calling function's context
Import edges — every import statement, mapping the imported symbol to its likely source module
HTTP routes — Flask/FastAPI @router.get(), Express app.post(), Spring @GetMapping — extracted as named symbols

The resulting graph enables exact lookups. vectr_locate("WorkspaceLock") returns a file path and line number in under 10ms — no embedding, no ranking, pure symbol table lookup. vectr_trace("acquire_lock") returns all callers and all callees in one round-trip. These are not search results — they are graph traversals, and they produce exact answers rather than relevance rankings.

Text search vs. graph traversal: These are not competing approaches — they answer different questions. "Find code that does X" is a search problem. "Find who calls Y" or "find where Z is defined" is a graph traversal problem. Relying only on text search for definition lookups is like looking up a phone number by describing the person rather than looking them up by name.
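To show why "who calls Y" is a lookup rather than a search, here is a toy call graph built with Python's standard-library ast module standing in for tree-sitter. It resolves calls by bare name only, which is a deliberate simplification; the function and sample names are hypothetical.

```python
import ast
from collections import defaultdict


def build_call_graph(source):
    """Map each top-level function name to the set of names it calls."""
    callees = defaultdict(set)
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call):
                    func = sub.func
                    if isinstance(func, ast.Name):         # plain call: foo()
                        callees[node.name].add(func.id)
                    elif isinstance(func, ast.Attribute):  # method call: obj.foo()
                        callees[node.name].add(func.attr)
    return dict(callees)


def callers_of(graph, name):
    """Every function whose body calls `name`, in sorted order."""
    return sorted(caller for caller, called in graph.items() if name in called)


SOURCE = '''
def acquire_lock(path):
    return open_lock_file(path)

def with_lock(path):
    lock = acquire_lock(path)
    lock.release()

def reindex(path):
    acquire_lock(path)
'''

graph = build_call_graph(SOURCE)
```

callers_of(graph, "acquire_lock") is an exact answer with no ranking involved. A real symbol graph would also need scoping and import resolution to tell apart two functions that share a name.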

Six Fallback Strategies in vectr_locate

vectr_locate runs six fallback strategies in sequence, stopping at the first match:

Exact match — direct lookup in the symbol table. Sub-millisecond. Highest confidence.
Suffix match — the query matches the end of a symbol name: Lock matches WorkspaceLock and AcquireLock.
Same-module priority — if a caller file is provided, search definitions within the same module first.
Unique name — if there is exactly one symbol across the entire codebase whose name contains your query string, return it.
Import chain follow — follow import statements from a given file to find where the name likely comes from.
Fuzzy (Levenshtein ≤ 2) — edit distance ≤ 2 across all symbol names. Catches typos. Lowest confidence.

Each strategy produces a LocateResult with a resolution_strategy field. An exact match means you can act on the result immediately. A fuzzy match with edit distance 2 means you should verify before relying on it. A silent wrong navigation is worse than no navigation at all.
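The cascade can be sketched as a chain of early returns. This simplified version covers four of the six strategies (exact, suffix, unique name, fuzzy) because the same-module and import-chain steps need file context. The tuple return value and all names here are illustrative, not Vectr's actual LocateResult.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def locate(query, symbols):
    """Return (matches, resolution_strategy), stopping at the first strategy that hits."""
    if query in symbols:
        return [query], "exact"
    suffix = sorted(s for s in symbols if s.endswith(query))
    if suffix:
        return suffix, "suffix"
    containing = [s for s in symbols if query in s]
    if len(containing) == 1:
        return containing, "unique_name"
    fuzzy = sorted(s for s in symbols if levenshtein(query, s) <= 2)
    if fuzzy:
        return fuzzy, "fuzzy"
    return [], "not_found"


symbols = {"WorkspaceLock", "AcquireLock", "LockManager", "chunk_file"}
```

With this symbol set, locate("Lock", symbols) resolves by suffix and the misspelled locate("WorkspaceLokc", symbols) falls all the way through to the fuzzy step. Returning the strategy name with the match is what lets the caller decide how much to trust it.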

Part 4 — The Runtime Layer

mtime Cache and Incremental Re-indexing

The first time you run vectr start on a large codebase, indexing takes time. CPython's 4,000+ files: about 8 minutes. Django's ~1,800 Python files: about 2 minutes. Apache Camel's 58,000+ Java files: closer to 45 minutes.

During initial indexing, Vectr writes a file at ~/.cache/vectr/{hash}/index_cache.json that stores the modification timestamp of every indexed file. The {hash} is a short SHA-256 hash of the absolute workspace root path. On subsequent runs, only files whose mtime has changed are re-indexed. On a typical active session where you've modified 5–10 files, subsequent re-indexing takes under 5 seconds.

Handling deletions: Vectr also stores the complete set of indexed file paths. At startup, it diffs this set against the current file tree and removes all chunks belonging to deleted files before re-indexing modified ones. Process deletions first, then updates, then new files — this prevents a renamed file from leaving orphaned chunks in the index.
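The startup diff reduces to comparing two path-to-mtime mappings. Here is a minimal sketch, assuming the cache is a plain dict loaded from JSON; the function names and the restriction to .py files are my own choices for the example.

```python
from pathlib import Path


def scan_mtimes(root, pattern="*.py"):
    """Current {relative_path: mtime} for files under `root` (pattern is illustrative)."""
    base = Path(root)
    return {
        str(p.relative_to(base)): p.stat().st_mtime
        for p in base.rglob(pattern)
        if p.is_file()
    }


def plan_reindex(cached, current):
    """Split paths into (deleted, modified, created), to be processed in that order."""
    deleted = sorted(p for p in cached if p not in current)
    modified = sorted(p for p in current if p in cached and current[p] != cached[p])
    created = sorted(p for p in current if p not in cached)
    return deleted, modified, created


cached = {"a.py": 1.0, "b.py": 2.0, "old_name.py": 3.0}
current = {"a.py": 1.0, "b.py": 2.5, "new_name.py": 3.0}
deleted, modified, created = plan_reindex(cached, current)
```

A rename shows up as one deletion plus one creation, which is why handling deletions first keeps the old path's chunks from lingering in the index.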

The watchdog listener: During an active session, Vectr runs a watchdog filesystem listener on the workspace root. When a file is saved, the listener queues it for re-indexing in the background. Events are debounced at 300ms — only the last write in a burst counts. Without debouncing, a single save in a project using aggressive auto-formatting would trigger 3–5 redundant re-index operations.
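Debouncing is easiest to reason about as a pure function over a list of timestamped events. The sketch below is not tied to the watchdog API; it assumes events arrive sorted by time, uses integer milliseconds, and reports when a re-index would fire for each path.

```python
def debounce(events, quiet_ms=300):
    """events: (time_ms, path) pairs sorted by time.

    Returns (fire_time_ms, path) pairs: one re-index per burst, fired
    quiet_ms after the last write in that burst.
    """
    last_seen = {}
    fired = []
    for t, path in events:
        # The previous burst for this path went quiet before this event arrived.
        if path in last_seen and t - last_seen[path] >= quiet_ms:
            fired.append((last_seen[path] + quiet_ms, path))
        last_seen[path] = t
    # Flush the final burst for every path.
    for path, t in last_seen.items():
        fired.append((t + quiet_ms, path))
    return sorted(fired)


# Three rapid writes from a formatter, then one later save, then another file.
events = [(0, "a.py"), (50, "a.py"), (120, "a.py"), (1000, "a.py"), (1010, "b.py")]
reindex_jobs = debounce(events)
```

Five filesystem events collapse into three re-index jobs here: the first burst of three writes to a.py triggers a single job.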

.vectrignore: Keeping the Index Clean

Vectr reads a .vectrignore file from the workspace root using glob patterns. The syntax follows .gitignore conventions — trailing slash for directories, * for single-level wildcard, ** for recursive match (via Python's pathlib.Path.match()) — but Vectr does not implement the full gitignore specification: the ! negation prefix is not supported.

vendor/
node_modules/
dist/
*.pb.go        # generated protobuf Go files
*.min.js       # minified JavaScript
__pycache__/
.venv/
coverage/
*.snap         # Jest snapshots
migrations/    # Django database migrations

A codebase with node_modules/ will typically contain 5–20x more code from installed packages than from the project itself. Excluding vendor directories before the initial index run is the single most impactful configuration change most users can make.

What Actually Happens When You Call vectr_search

1. Query string is embedded using arctic-embed-m with query prefix
   → 768-dimensional float vector, ~15ms on CPU

2. Vector similarity search against ChromaDB store
   → Top-20 chunks by cosine similarity, with scores

3. Same query runs through BM25 index (rank-bm25, in-memory)
   → Top-20 chunks by BM25 score, with scores

4. Two ranked lists are merged
   → Weight BM25/vector based on codebase characterization
   → Normalized scores combined; top-N results selected (default N=5)

5. Symbol names in the query are detected (camelCase, snake_case, PascalCase)
   → If found: also run vectr_locate as a side channel
   → Merge symbol lookup results into final output if relevant

6. Final top-N chunks returned with:
   file path, start line, end line, matched text, search method

Result for vectr_search("workspace lock acquisition and release"):

[1] resolver.rs:214 — WorkspaceLock::acquire()
    Acquires the workspace-scoped lock. Blocks if another process holds it.

[2] resolver.rs:267 — WorkspaceLock::release()
    Releases the workspace-scoped lock. Validates that the current process
    holds the lock before releasing (returns Err if not held).

[3] workspace.py:89 — _acquire_workspace_lock(path)
    Context manager: acquires, yields, releases on exit.

Instead of reading 15 files to find these three functions, the AI editor reads one search result.

Part 5 — Design Decisions I'd Make Differently

The Python 3.14 requirement. The codebase uses match/case pattern matching extensively and some asyncio patterns that behave differently in earlier versions. In retrospect, 3.11 would probably work with a few hours of refactoring. The 3.14 requirement has been the single biggest adoption friction.

ChromaDB as the vector store. A vector store handles embedding persistence and similarity search. ChromaDB works, but the full HNSW index with persistence, the Python client layer, and the inter-process communication overhead add about 200ms to startup. That figure is ChromaDB's contribution alone; total Vectr startup is around 280ms, including mtime diffing and watchdog initialization. For v2, I'd consider a lighter in-process option.

The BM25 implementation. The rank-bm25 library is pure Python and fast enough for codebases under 50,000 chunks. Beyond that, it starts to show latency. The right long-term solution is integrating BM25 scoring directly into the vector store query pipeline. For current use cases (most codebases are under 20K chunks), it's fine.

Conclusion

The indexing layer is the foundation, not the product. What it enables is an AI code editor that can navigate a large unfamiliar codebase as efficiently as a human engineer who has worked in it for months — finding the right functions in one or two calls instead of fifteen.

But the index tells you where things are. It doesn't tell you why things are the way they are — the non-obvious invariants, the patterns that emerge from reading 50 files, the bugs that were fixed by changing two lines in a place that looks completely unrelated.

That's what Part 2 addresses: a note store where an AI editor can save findings in structured, tagged form — "the lock acquisition logic is at resolver.rs:214, and it acquires an exclusive file lock using fcntl.flock, not a threading primitive" — and retrieve them in under 50ms at the start of any future session. When /compact runs and replaces the conversation with a summary, exact signatures and line numbers evaporate — but notes don't. The indexer tells you where to look. The working memory layer tells you what you already know about what you found.

Summary of core decisions

Decision	Rationale
AST-aware chunking via tree-sitter	Complete functions as the unit of meaning. Biggest quality improvement over naive line windows.
Local embeddings (arctic-embed-m)	No API cost, no data leaving the machine. One-time 440MB download.
Hybrid BM25 + vector search	Both run on every query. Concept queries weight the vector score higher; exact symbol names weight BM25 higher.
Symbol graph	Definitions, call edges, import edges, HTTP routes — exact graph traversal for questions text search cannot answer.
Six fallback strategies in vectr_locate	Exact → suffix → same_module → unique_name → import_chain → fuzzy. Each result carries its resolution strategy.
mtime cache + watchdog	Sub-5-second re-indexing on subsequent runs. In-session saves trigger background re-indexing automatically.

LLM Context Window Token Budget: Why Your Window Fills Up Fast

Swapnanil Saha — Tue, 26 May 2026 19:59:18 +0000

You build something with GPT-4o. The model supports 128,000 tokens. You think: that's enough for a full novel. Then, four or five conversation turns in, the model starts forgetting things that were said earlier. Eight turns in, you hit an error. You check the token count — you've used over 100,000 tokens, and you've typed maybe 400 words.

This isn't a bug. It's the predictable consequence of not accounting for where those tokens actually go. A context window isn't blank space waiting to be filled with your words. By the time the first user message arrives, it is already partially consumed — by system instructions, by tool definitions, by retrieved documents, by the tokens the model itself generated in earlier turns. In a production AI agent, 30–60% of the context window is gone before a user types anything.

What follows is a precise accounting of where those tokens go — the four layers that consume the window before users say anything, why the effective limit is substantially lower than the advertised one, what happens to response quality as the window approaches capacity, and which engineering patterns actually manage it at production scale.

Part 1: The Problem

1. The Illusion of Abundance

GPT-4o supports 128K tokens. Claude 3.5 supports 200K. Gemini 1.5 Pro has been demonstrated at a million tokens — roughly 750,000 words, about ten average novels. The numbers sound absurdly generous. How could you possibly run out?

Start with a calibration exercise. What is 128,000 tokens, actually?

In English prose, one token is roughly four characters — about three-quarters of a word. A 1,000-word article runs to around 1,300 tokens, so 128K tokens can hold close to 96,000 words of clean text. That genuinely is a lot.

But text in an LLM application is rarely clean English prose. It is JSON payloads from tool calls. It is API responses full of structured data. It is code. It is URLs. It is conversation history with speaker labels, timestamps, and formatting. All of these serialize into tokens at rates much higher than 4 characters per token.

Then there is the question of performance. The advertised number represents a technical limit — the longest sequence the model can physically process. It does not represent the length at which the model operates at peak accuracy. Research has repeatedly found a significant gap between the two. Long-context benchmarks like RULER (2024) and HELMET (2024) found that in adversarial multi-document tasks, most frontier LLMs showed accuracy drops well before 32K tokens — GPT-4o fell from near-perfect baseline scores to the high-60s percentage range at 32K in some configurations. The technical limit says 128K. The accuracy cliff arrives much earlier.

The Effective Limit Is Not the Advertised Limit
Models claiming 200K context windows show measurable quality degradation around 130K tokens in practice. Treating the advertised number as your operating budget is how production systems quietly degrade without triggering any explicit error.

Cost is the third angle. Every token in the context is a token billed. At GPT-4o's list pricing (a few dollars per million input tokens), a full 128K-token input costs tens of cents per call, and agents often make dozens of calls per session, each with the full accumulated context. The monthly bill from a badly-managed context window can surprise you well before any error appears in the logs.

2. How Tokens Are Counted — and Why the Count Surprises You

An LLM does not read text. It reads a sequence of integers. Before any word reaches the model, it passes through a tokenizer that converts characters into integer IDs from a vocabulary of roughly 50,000–200,000 entries. The tokenizer used by GPT-4 is called cl100k_base; it has about 100,000 vocabulary entries. GPT-4o and OpenAI's newer models use o200k_base, with about 200,000.

The vocabulary is built using BPE — Byte Pair Encoding. The name comes from the construction: you start with individual characters, then repeatedly merge the pair of adjacent symbols that appears most often in your training corpus, replacing each occurrence of that pair with a new combined token. Do this enough times and common English words end up as single tokens. The algorithm learns what to merge entirely from what was common in the training text — mostly English prose on the internet. That's why "the", "is", "running" each become a single token, while "tokenization" becomes ["token", "ization"] — less common as a whole word, so BPE never fully merged it. Characters and raw bytes are the fallback for anything the vocabulary doesn't cover. The consequence is simple: anything that wasn't well-represented in training data — JSON brackets, URL slashes, code indentation — never got merged aggressively, so those sequences remain expensive in tokens relative to the characters they contain.
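The merge loop is short enough to write out. This is a toy character-level sketch of BPE training on a list of words, with a deterministic tie-break that I chose for the example. Production tokenizers work on bytes, pre-split text with a regex, and train on vastly larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules, most frequent adjacent pair first."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Tie-break on the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["the"] * 10 + ["then"] * 3 + ["zoo"]
merges = train_bpe(corpus, num_merges=2)
```

After two merges the frequent word "the" encodes as a single token, while the rare "zoo" still costs one token per character. That is the same frequency effect that makes JSON punctuation and URL fragments expensive.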

The rule-of-thumb of 1 token ≈ 4 characters holds for clean English prose — decent enough for napkin estimates. It falls apart under several conditions that appear constantly in real applications:

Numbers tokenize unexpectedly. Older BPE vocabularies learn number tokens purely from frequency, so a common number like "2023" can end up as a single token while a rare one like "19847" is split into fragments. Newer tokenizers such as cl100k_base pre-split digit runs into groups of at most three digits, so any number longer than three digits costs at least two tokens. Either way, the price "USD 1,234,567.89" produces approximately 10–12 tokens, because the commas, period, digit groups, and currency symbol may each claim separate tokens.

URLs are disproportionately expensive. A URL like https://api.example.com/v2/users/12345 looks compact — 38 characters, which by the prose rule should be about 9–10 tokens. In practice it is closer to 15–20 tokens. Slashes, dots, hyphens, underscores, and alphanumeric path segments each claim their own tokens or merge into small fragments, because URLs are structurally uncommon in prose.

JSON and structured data can cost up to roughly 2x the tokens of plain text carrying the same information. Consider:

Plain text: The user's name is Alice, she is 28 years old, and her account is active.
JSON:       {"user": {"name": "Alice", "age": 28, "status": "active"}}

The plain text version: approximately 18 tokens. The JSON version: approximately 22 tokens — and this is a trivially small object. Real API responses with deeply nested keys, repeated field names, and verbose formatting can be far more expensive. Every brace, colon, and comma is a token or part of a token. A 500-word JSON payload can use 800+ tokens.

Code tokenizes inefficiently in some languages. Research found that Python uses roughly 46% more tokens than equivalent Haskell to express the same computational idea. Python's indentation-based structure requires whitespace tokens, and Python's identifiers were less densely represented in the pre-GPT-4 training corpora.

Analogy: The Luggage Space Problem
Think of the context window as a suitcase with a size limit, not a weight limit. A suitcase of tightly folded sweaters holds far more clothing than one whose volume is mostly foam packing material. Plain prose is the folded sweaters: you pack a lot of meaning into few tokens. JSON, URLs, and code are the foam: structurally bulky, meaning-sparse, yet they count toward the same limit.

Part 2: The Consumers

3. The Four Layers That Eat Your Context Window

Every LLM API call is a full context payload assembled from four distinct layers. Most developers think about only one: the user's current message. The other three arrive already loaded — silent costs that accumulate before the user types anything.

Layer 1: The System Prompt

The system prompt is the foundational layer. It is always present, on every API call. A minimal system prompt — "You are a helpful assistant" — costs about 7 tokens. But real production system prompts are not minimal.

A typical customer-facing chatbot system prompt contains: the model's persona and tone guidelines, a list of topics it should and should not address, response format instructions, domain-specific knowledge, and legal disclaimers. Measured in practice, these range from 800 to 2,500 tokens. They are charged on every single API call. A 1,500-token system prompt running 1,000 calls per day costs you 1.5 million input tokens per day before a user says anything.

Layer 2: Tool Schemas

When you give an LLM access to external tools, you must describe each tool to the model in the context window. These descriptions are written in JSON and can be verbose. A single moderately documented tool schema costs roughly 200 tokens. An agent with five tools carries around 1,000 tokens of tool descriptions on every call, before any user input. The JSON structure alone — all those braces, colons, and quoted keys — is part of why the token cost is higher than reading the description would suggest.

Layer 3: Retrieved Context (RAG)

Many production LLM applications retrieve relevant documents from a database and inject them as supporting material. A typical RAG retrieval returns 3–8 document chunks, each 300–600 tokens. Three chunks at 400 tokens each: 1,200 tokens. Eight chunks at 500 tokens each: 4,000 tokens. In a research assistant with a generous retrieval budget, you might inject 8,000–12,000 tokens of context per query.

The Hidden Fixed Cost
System prompt + tool schemas is your fixed cost floor. It doesn't change turn-to-turn. It can easily reach 2,000–4,000 tokens in a real agent — charged on every single API call in your fleet.

Layer 4: Conversation History

The model has no persistent memory. You create the illusion of memory by re-sending the full conversation history on every API call. Every turn appends two new entries (a user message and a model response) to a history that is re-sent in its entirety. Model responses can be long — a detailed answer with a code snippet might be 600–800 tokens. After ten exchanges, the conversation history alone can be 8,000–12,000 tokens.

4. Context Creep — Watching the Window Fill

The process by which a context window fills over a conversation has a name in production systems: context creep. Consider a realistic customer support agent: 1,200-token system prompt, three tool schemas totaling 600 tokens, RAG retrieval returning two chunks (~800 tokens per turn), user messages averaging 60 tokens, model responses averaging 350 tokens.

Context budget:
  Fixed overhead: 1,200 + 600 = 1,800 tokens
  Per-turn RAG:   800 tokens
  Per-turn history growth: 60 (user) + 350 (model) = 410 tokens

  Turns until 80% of 128K:
    (1,800 + n × 800 + n × 410) ≥ 102,400
    n × 1,210 ≥ 100,600
    n ≈ 84 turns

  If model reply averages 800 tokens instead:
    Per-turn growth: 60 + 800 = 860
    n × 1,660 ≥ 100,600
    n ≈ 61 turns

Change the model reply length to 800 tokens (a detailed-answer agent) and the window hits 80% around turn 61 rather than 84. Quality degradation begins before you hit the hard limit.
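The budget arithmetic is easy to turn into a function you can rerun with your own numbers. This sketch follows the same model as the calculation above, including its assumption that retrieved chunks stay in the history and accumulate every turn. The function name is mine, and it rounds up to the first whole turn that crosses the threshold.

```python
def turns_until(threshold_tokens, fixed_tokens, per_turn_tokens):
    """Smallest whole n with fixed_tokens + n * per_turn_tokens >= threshold_tokens."""
    remaining = threshold_tokens - fixed_tokens
    if remaining <= 0:
        return 0
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-remaining // per_turn_tokens)


WINDOW = 128_000
THRESHOLD = WINDOW * 80 // 100    # 80% of the window = 102,400 tokens
FIXED = 1_200 + 600               # system prompt + tool schemas
RAG_PER_TURN = 800

short_replies = turns_until(THRESHOLD, FIXED, RAG_PER_TURN + 60 + 350)
long_replies = turns_until(THRESHOLD, FIXED, RAG_PER_TURN + 60 + 800)
```

Here short_replies comes out to 84 and long_replies to 61 (the 60.6 from the division, rounded up). Plugging in a larger tool-schema budget or a more generous retrieval size shows how quickly those numbers shrink.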

Part 3: The Physics

5. KV Cache Memory — Why Context Has a Physical Cost

The context window limit is not an arbitrary policy. It is enforced by physics — GPU memory.

The transformer's attention mechanism works by comparing every token in the context with every other token. For each token, the model creates a query ("what am I looking for?"), and every other token offers a key ("what do I contain?"). A third vector — the value — carries the actual information that gets passed when attention is high. Assembled across all tokens, these become the matrices Q, K, and V:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The QKᵀ product is an n × n matrix where n is the sequence length. Doubling n quadruples this computation.

There are two distinct computational phases in LLM inference. Prefill processes the entire input prompt at once — O(n²) per attention layer. Implementations like FlashAttention reduce the memory bandwidth pressure dramatically via tiled computation, but the asymptotic complexity doesn't change. Decode generates one token at a time, attending only to the current token against the cached history — O(n) per step with the KV cache. Without caching, decode would also be O(n²). The KV cache converts decode from O(n²) to O(n) at the cost of memory.

KV Cache Memory Formula (Multi-Head Attention):

KV_memory = 2 × n_layers × n_heads × d_head × seq_len × bytes_per_param

For a 7B-parameter model with standard MHA (32 layers, 32 heads, head_dim 128) at bfloat16 (2 bytes):

KV_memory per token ≈ 2 × 32 × 32 × 128 × 1 × 2 = 524,288 bytes ≈ 0.5 MB

At 128K context: 0.5 MB × 128,000 = 64 GB of KV cache alone — more than the model weights at bfloat16 (~14 GB).

Note on GQA and MLA: Most modern open-weight models (Llama 3, Mistral) use Grouped-Query Attention (GQA), which reduces the KV cache by sharing key-value heads across groups of query heads. A model with 32 query heads and 8 KV heads (4× reduction) brings the per-token cache from ~0.5 MB to ~0.125 MB, or about 16 GB at 128K context. That is still the dominant memory consumer at long contexts. DeepSeek-class models use Multi-head Latent Attention (MLA), which compresses the K and V projections into a low-rank latent space before storing them, achieving 5–10× memory reduction over standard MHA.

A 70B MHA model (80 layers, 64 heads, head_dim 128, bfloat16) runs to roughly 2.5 MB per token: 2 × 80 × 64 × 128 × 2 bytes = 2,621,440 bytes. At 128K context that's ~320 GB — which is why providers either cap context length aggressively for large models, or charge steeply for long-context calls. GQA with 8 KV heads drops it to ~40 GB, still substantial.
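These figures can be reproduced with a few lines of arithmetic. The sketch below implements the formula above with the number of KV heads as a parameter, so the same function covers MHA and GQA. To match the rounded numbers in the text it treats 128K as 131,072 tokens and reports binary gigabytes (GiB); both are assumptions of the example.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes of K and V stored per token (the leading 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value


def kv_cache_gib(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Total KV cache size in GiB for one sequence of seq_len tokens."""
    total = kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value) * seq_len
    return total / 2**30


CONTEXT = 128 * 1024  # "128K" taken as 131,072 tokens

mha_7b = kv_cache_gib(32, 32, 128, CONTEXT)    # 32 layers, 32 KV heads
gqa_7b = kv_cache_gib(32, 8, 128, CONTEXT)     # same model with 8 KV heads
mha_70b = kv_cache_gib(80, 64, 128, CONTEXT)   # 80 layers, 64 KV heads
gqa_70b = kv_cache_gib(80, 8, 128, CONTEXT)    # same model with 8 KV heads
```

Under those assumptions the four values come out to 64, 16, 320, and 40 GiB, matching the figures quoted above. Changing bytes_per_value to 1 models int8 KV cache quantization and halves each number.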

Prompt caching (available from OpenAI, Anthropic, Google) caches the computed KV activations for repeated prompt prefixes. Subsequent calls beginning with the same prefix pay substantially less for those cached tokens (discounts of roughly 50–90%, depending on the provider) and benefit from lower latency because the prefill phase for the cached portion is skipped. A stable system prompt is an ideal caching candidate. One practical constraint: both OpenAI and Anthropic require a minimum prefix length of at least 1,024 tokens before caching activates. A 200-token system prompt won't benefit, which is another reason to consolidate instructions into one substantial block rather than spreading them across multiple small messages.

KV cache quantization is an active area of production optimization: storing the K and V tensors in lower-precision formats (int8 or int4) cuts KV cache memory by 2–4× with modest accuracy penalties. Research like KVQuant explores going to 2-bit precision for certain layers while targeting 10M-token contexts on commodity hardware.

6. Lost in the Middle — Why Performance Collapses Before You Hit the Limit

Memory is the first constraint. Attention quality is the second — and it bites you even when your window is half-empty.

In 2023, researchers at Stanford and UC Berkeley published "Lost in the Middle." They gave LLMs a task requiring them to find a specific document from a set of twenty documents, all injected into the context window. The position of the relevant document was varied systematically.

When the relevant document was first or last, models retrieved it accurately. When it was in the middle positions, accuracy dropped by more than 30%. Newer models — Claude 3.5, GPT-4o — have partially mitigated this bias through long-context fine-tuning. "Partially" is doing a lot of work there: independent evaluations continue to find meaningful position-dependent performance gaps in all current models, even at lengths well within their advertised limits.

Analogy: The Lecture Hall Effect
Students reliably remember a lecture's opening and closing. What happened in the middle of hour one is murky. LLMs have an analogous concentration pattern: strong attention to the beginning and end of the context, with a trough in the middle.

The mechanism is structural. RoPE (Rotary Position Embedding), used in most modern architectures, encodes position as a rotation applied to query and key vectors. The mathematical property of this rotation is that the similarity score between two vectors tends to decrease as the distance between their positions increases. At short contexts, the decay is a feature. At long contexts, it becomes a bug: tokens in the middle of a 100K-token window are tens of thousands of positions away from both the beginning and from where the model is currently generating, so their similarity scores are systematically suppressed.

A separate effect, context dilution, compounds this: longer surrounding irrelevant context degrades performance even when the relevant content is guaranteed present. The model's attention distributes across noise, reducing effective attention for the signal — like finding one red marble in a bag of ten thousand, even knowing it's there.

A Subtle RAG Bug
If your RAG system retrieves 8 documents and inserts them in the middle of a long conversation history, the most relevant chunks may be in the attention trough. The model generates a response, you see no error, but the answer doesn't reflect those documents. The failure is silent.

Part 4: Solutions

7. Token Budget Math — Calculating Your Real Available Space

Every LLM application needs an explicit token budget with five zones:

Zone	Typical token range	Fixed or variable?
System Prompt	500–2,500	Fixed per application
Tool Schemas	200–400 per tool	Fixed per agent
RAG Context	0–12,000	Variable per turn
Conversation History	0 → grows	Grows each turn
Generation Reserve	500–2,000	Reserved explicitly

The generation reserve has to be set aside explicitly: if your prompt consumes the entire window, the model either generates nothing or truncates its response.

A worked example. Customer support agent, GPT-4o (128K):

Total window:          128,000 tokens
System prompt:          -1,400 tokens  (measured)
Tool schemas (4 tools):   -800 tokens  (measured)
Generation reserve:     -1,500 tokens  (set by us)
─────────────────────────────────────────
Available for dynamic:  124,300 tokens

  Of that:
    RAG budget:           20,000 tokens  (5 chunks × 4,000 avg)
    History budget:       ~104,300 tokens (fills over time)

  ─────────────────────────────────────────
  Turns until 80% full:
    80% of 128K = 102,400 prompt tokens
    Fixed overhead = 1,400 + 800 = 2,200
    Per-turn RAG = 800
    Per-turn growth = user avg (60) + model avg (350) = 410
    Turns until (2,200 + n × 800 + n × 410) ≥ 102,400
    n × 1,210 ≥ 100,200
    n ≈ 83 turns

83 turns sounds comfortable. But this assumes constant 350-token model replies. A user who triggers several detailed answers can double the history growth rate to 820 tokens per turn, cutting that to ~62 turns before the 80% threshold.
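The same estimate can be kept as a small function so it is easy to rerun when the per-turn assumptions change. This is a sketch only: the function and parameter names are hypothetical, and it uses the same simplified model as the worked example (fixed overhead plus a constant number of tokens added per turn). It rounds up to the first whole turn that reaches the threshold.

```python
import math


def turns_until_threshold(context_limit, fixed_overhead, per_turn_growth, threshold_pct=80):
    """First whole turn at which the prompt reaches threshold_pct of the window."""
    target = context_limit * threshold_pct // 100
    return math.ceil((target - fixed_overhead) / per_turn_growth)


# Baseline from the worked example: 800 RAG tokens + 410 history tokens per turn.
baseline = turns_until_threshold(128_000, 2_200, 800 + 410)
# Same setup, but history grows twice as fast (820 tokens per turn).
verbose = turns_until_threshold(128_000, 2_200, 800 + 820)

print(baseline, verbose)
```

Feeding in measured averages from your own logs, rather than the illustrative numbers here, is what makes the estimate useful.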

Measure, Don't Estimate
The system prompt and tool schema token counts must be measured with the actual tokenizer, not estimated from character counts. Log prompt_tokens and completion_tokens from every API response. The distribution of prompt_tokens over time is your context growth curve.

8. Four Strategies for Managing Context Window Limits

Strategy 1: Sliding Window

Keep only the most recent turns of conversation verbatim. In production, truncate by token count, not turn count — a 5-turn history could range from 500 to 8,000 tokens depending on response lengths.

# Turn-count version — simple, good enough for prototyping
MAX_HISTORY_TURNS = 20

def build_messages(system_prompt, history, new_message, rag_chunks):
    trimmed_history = history[-MAX_HISTORY_TURNS:]
    messages = [{"role": "system", "content": system_prompt}]
    if rag_chunks:
        context_block = "\n\n".join(rag_chunks)
        messages.append({"role": "system", "content": f"Context:\n{context_block}"})
    messages.extend(trimmed_history)
    messages.append({"role": "user", "content": new_message})
    return messages

# Production version: truncate by token count, not turn count
# PROMPT_TOKEN_BUDGET = context_limit - tool_schemas - generation_reserve
# Example for a 128K window: 128000 - 800 (tools) - 1500 (reserve) = 125700
# count_tokens() is assumed to wrap your model's actual tokenizer.
PROMPT_TOKEN_BUDGET = 40_000  # adjust for your application

def build_messages_token_bounded(system_prompt, history, new_message, rag_chunks):
    # The system prompt and RAG chunks are charged against the budget here,
    # so the budget above must not already exclude them.
    fixed_tokens = count_tokens(system_prompt) + sum(count_tokens(c) for c in rag_chunks)
    new_msg_tokens = count_tokens(new_message)
    remaining = PROMPT_TOKEN_BUDGET - fixed_tokens - new_msg_tokens


    # Walk history from newest to oldest, keep what fits
    trimmed_rev = []
    for turn in reversed(history):
        turn_tokens = count_tokens(turn["content"])
        if remaining - turn_tokens < 0:
            break
        trimmed_rev.append(turn)
        remaining -= turn_tokens
    trimmed = list(reversed(trimmed_rev))

    messages = [{"role": "system", "content": system_prompt}]
    if rag_chunks:
        messages.append({"role": "system", "content": "Context:\n" + "\n\n".join(rag_chunks)})
    messages.extend(trimmed)
    messages.append({"role": "user", "content": new_message})
    return messages

The drawback of the sliding window is abrupt forgetting: when turn 1 drops, any fact established there is simply gone. For short-lived task-completion agents, this is fine. For long-running conversational assistants, it creates visible gaps.

Strategy 2: Hierarchical Summarization

Keep recent turns verbatim; compress older turns into a rolling summary.

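# Assumes an async LLM client `llm` and a `format_turns()` helper that renders turns as plain text.
async def maybe_compress_history(history, summary, buffer_size=10):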
    verbatim_turns = history[-buffer_size:]
    turns_to_summarize = history[:-buffer_size]

    if not turns_to_summarize:
        return history, summary

    new_summary = await llm.complete(
        f"Existing summary: {summary}\n\n"
        f"New exchanges to incorporate:\n{format_turns(turns_to_summarize)}\n\n"
        "Update the summary to include these exchanges. "
        "Preserve all concrete facts, decisions, and commitments. "
        "Drop conversational filler. Be dense. Max ~400 tokens."
    )
    return verbatim_turns, new_summary

Cap the summary at 200–400 tokens. Run summarization asynchronously — don't make the user wait for the compression cycle.

Strategy 3: Token Compression (LLMLingua)

Use a compression model to identify and remove low-entropy tokens from prompts, achieving 2–3× compression with minor accuracy loss. The most effective targets are verbose system prompts, RAG context chunks, and few-shot examples.

Never apply compression to the current user message: compressing user input changes its meaning before the model sees it. Test in your specific domain for tasks where precision matters (legal, medical, code).

Strategy 4: Embedding-based Retrieval Over History

Store each conversation turn as a dense vector. At each new turn, embed the current user message and retrieve the most relevant prior turns by similarity. Concretely: as each turn completes, embed the user + assistant text and store it in a vector store alongside the full text. On the next user message, embed it, search for top-k similar turns, inject those into context. Keep only 2–3 verbatim recent turns for coherence.

The effect: only the conversation history relevant to the current question enters the context window. A user asking "what was the budget we discussed?" triggers retrieval of those turns — even if they happened fifty exchanges ago. This requires an embedding model, a vector store, and a retrieval call per user message (adding roughly 50–150ms round-trip with a managed API, under 10ms with a self-hosted model).
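A minimal sketch of that store-then-retrieve loop is shown below. Everything in it is a stand-in chosen so the example runs with the standard library alone: `embed` is a toy bag-of-words counter rather than a real embedding model, and `TurnStore` is a hypothetical in-memory list rather than a vector database. In a real system both would be replaced by your embedding API and vector store.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TurnStore:
    """Hypothetical in-memory stand-in for a vector store, one entry per turn."""

    def __init__(self):
        self.turns = []  # list of (vector, full_text)

    def add(self, user_text, assistant_text):
        text = f"User: {user_text}\nAssistant: {assistant_text}"
        self.turns.append((embed(text), text))

    def search(self, query, k=3):
        q = embed(query)
        ranked = sorted(self.turns, key=lambda turn: cosine(q, turn[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = TurnStore()
store.add("What budget did finance approve?", "The approved budget is 40,000 dollars for Q3.")
store.add("Which framework should we use?", "Django fits the team's experience.")
store.add("When is the launch?", "The launch is planned for early October.")

relevant = store.search("what was the budget we discussed?", k=1)
print(relevant[0])
```

The retrieved turns would then be injected into the prompt alongside the 2–3 most recent verbatim turns.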

The four strategies are not mutually exclusive. Production systems often combine them: a sliding window of 5–8 verbatim turns + rolling summary + retrieval from older history covers all distance scales simultaneously.

9. The Practical Playbook

Short task-completion agents (under 20 turns): Use a sliding window of 10–15 turns. Reserve optimization effort for fixed-cost reduction: audit your system prompt for redundant language, consider dynamic tool registration (load only the tools relevant to the current turn).

Long-running conversational assistants: Implement hierarchical summarization with 8–12 verbatim turns. Cap summaries at 400 tokens. Run asynchronously. Periodically audit system prompt size — prompt creep through edits is real. A prompt that started at 600 tokens can quietly grow to 3,000 across six months of product changes.

Document-heavy research assistants (heavy RAG): Limit retrieval to 3–5 top chunks. Apply token compression to chunks before injection. Sort retrieved chunks so the most relevant appears last in the injected block — adjacent to the user question, within the recency attention peak.

Production agents with many tools: Use dynamic tool registration. A routing classifier (even a keyword matcher) identifies which tools are needed before the main model call and includes only those schemas — reducing 2,000 tokens of tool overhead to ~400 on most turns.

Context ordering (exploit the attention curve): Instead of the common default (system → RAG → history → user), use: system → recent history (most-recent last) → RAG chunks (most relevant last, adjacent to the user message) → current user message. The most relevant content sits at the end of the context, within the recency attention peak. Older history, the least relevant content, occupies the lower-attention middle.
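The recommended ordering can be sketched as a small assembly function. The function name and the `(score, text)` chunk format are assumptions made for illustration, and the sketch assumes a chat API that accepts a system-role message partway through the conversation, as the earlier examples in this post do.

```python
def order_context(system_prompt, recent_history, scored_chunks, user_message):
    """Assemble messages so the most relevant material sits closest to the question.

    scored_chunks is a list of (relevance_score, text) pairs from your retriever.
    """
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(recent_history)  # oldest first, most recent last

    # Ascending sort: the highest-scoring chunk ends up last in the block,
    # directly before the user's message.
    ordered = [text for _, text in sorted(scored_chunks, key=lambda pair: pair[0])]
    if ordered:
        messages.append({"role": "system", "content": "Context:\n" + "\n\n".join(ordered)})

    messages.append({"role": "user", "content": user_message})
    return messages


example = order_context(
    "You are a support agent.",
    [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}],
    [(0.9, "best match"), (0.2, "weak match")],
    "How do I reset my password?",
)
```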

What to monitor:

prompt_tokens / context_limit — alert above 70%, act above 80%
Token count by zone per call — when total grows, know which zone is responsible
Quality signals segmented by context utilization — you may find degradation starts at 60% in your application
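A per-call check along these lines might look like the sketch below. The function name, the zone names, and the returned fields are hypothetical; the 70% and 80% thresholds are the ones listed above.

```python
def utilization_report(zone_tokens, context_limit, alert_at=0.70, act_at=0.80):
    """Summarize window utilization for one call from per-zone token counts."""
    total = sum(zone_tokens.values())
    utilization = total / context_limit
    if utilization > act_at:
        status = "act"
    elif utilization > alert_at:
        status = "alert"
    else:
        status = "ok"
    # The largest zone is the first place to look when the total grows.
    largest_zone = max(zone_tokens, key=zone_tokens.get)
    return {"utilization": utilization, "status": status, "largest_zone": largest_zone}


report = utilization_report(
    {"system": 1_400, "tools": 800, "rag": 20_000, "history": 73_800},
    context_limit=128_000,
)
print(report)
```

Logging this dictionary with every request gives you the per-zone breakdown needed to segment quality signals by utilization later.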

Conclusion: The Window Is a System Resource

A context window isn't a document store you fill until it overflows. It's a compute and memory resource with hard physical limits, a quality curve that degrades well before those limits, and an inference cost that grows with every token you put in it.

In a typical agent, the window is 30–60% consumed before the first user message lands. The fix isn't a bigger context window, though headroom helps. It's building a real budget: measure each zone with an actual tokenizer, set hard limits per zone, implement a context manager that enforces those limits on every call, and track utilization in production dashboards the same way you'd track memory or CPU.

The attention degradation problem — "lost in the middle" — adds a second dimension: even when your window is not full, quality depends on where in the window the important information sits. The primacy bias and recency bias are real, measurable effects that application design can exploit or fall victim to.

The four strategies aren't competitors — most production systems end up combining them. Sliding window for the recent turns, rolling summary for the older ones, compression for the RAG chunks, and retrieval for anything that needs to survive beyond the window. Start with the simplest thing that doesn't break your use case, and add layers as your traffic and conversation length grow.

Context engineering doesn't have the glamour of prompt engineering, but it's where most production LLM failures actually live. Missed retrievals, incoherent multi-turn conversations, bloated inference bills — these trace back to context mismanagement more often than they trace back to the wrong model. It fails silently, which is exactly why it's easy to ignore until you can't.

References

Research Papers

Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" — Stanford / Berkeley / Samaya AI, 2023. The original paper quantifying the U-shaped attention bias across context positions.
"Found in the Middle: Calibrating Positional Attention Bias" — 2024. Proposes an inference-time attention calibration that mitigates the lost-in-the-middle problem, recovering up to 15pp accuracy.
"Context Length Alone Hurts LLM Performance Despite Perfect Retrieval" — 2025. Demonstrates context dilution: longer irrelevant context degrades performance even when the relevant content is guaranteed present.
"KVQuant: Towards 10M Context Length LLM Inference with KV Cache Quantization" — 2024. Explores low-bit quantization of the KV cache, including per-channel key quantization, to enable extreme context lengths.

Technical References

OpenAI — Managing Conversation State — Official docs on conversation history management and token counting.
Anthropic — Context Window Documentation — Claude context limits, caching strategies, and best practices.
LLMLingua — Prompt Compression — Microsoft Research open-source project for token-level prompt compression.
KV Cache Memory: Calculating GPU Requirements for LLM Inference — Interactive calculator for KV cache memory requirements given model architecture parameters.

Background Reading

The Hidden Costs of Context: Managing Token Budgets in Production LLM Systems — TianPan.co, 2025. Production-focused survey of context management challenges.
Context Window Management for LLM Apps: Developer Guide — Redis, 2025. Practical implementation patterns for context management in production.
The Complete Guide to Text Embeddings, Vector Databases & LLMs — Swapnanil Saha, 2026. Deep background on tokenization, BPE, transformer attention, and RAG pipelines referenced throughout this post.

Why AI Code Assistants Waste Context — and How RAG Fixes It

Swapnanil Saha — Tue, 26 May 2026 19:25:33 +0000

Open a large file in your AI code assistant and ask it to refactor a function buried three hundred lines down. Watch it confidently produce something plausible but wrong — using an interface that was deprecated last sprint, calling a helper that doesn't exist in this service, ignoring a constraint in the module-level docstring that it technically "saw." The model didn't forget. The information was technically present in the prompt, but the transformer's attention mechanism never meaningfully focused on it. That's a different kind of failure, and it doesn't get better with a bigger context window.

There's a persistent intuition in this industry that more context is always better. Send the whole file. Send the whole codebase. This intuition breaks in a specific and measurable way. The mechanism is called attention dilution — softmax normalization means that every token in the context competes for a fixed budget of attention weight, and as the sequence grows longer, any given piece of information gets a smaller share of that budget.

This post walks through the transformer attention math to explain exactly why the naive approach fails, then covers how RAG (Retrieval-Augmented Generation) addresses it — by retrieving only the specific code chunks relevant to the current task and injecting those into the context window instead of dumping everything.

Part 1: The Problem with Stuffing

1. The Naive Approach: Just Send Everything

The first instinct when building a code assistant is to send as much context as possible. Your project has a utility module? Include it. There's a shared type definitions file? Throw that in too. If the model's context window is 128,000 tokens, fill it to the brim — more information has to be better, right?

This is called context window stuffing. Three things go wrong with it, and each gets worse as the codebase grows. The first is attention dilution — the focus of this section. The second is position bias (Section 3). The third is raw cost (Section 4). To understand why these happen, you need a concrete model of how a transformer actually reads a prompt.

A transformer does not read a prompt sequentially, the way a human reads a page from left to right. Instead, it processes all prompt tokens in parallel, and every token attends to every other token in the sequence (in decoder-only LLMs, to every token that precedes it). The attention mechanism is the machine that computes how much each token should "look at" every other token when forming its representation.

The output of attention for a single token is a weighted average of all the other tokens' value vectors. The weights are computed by comparing the current token's query vector against every other token's key vector. When you add more tokens to the context, you are not adding more information to a receptive mind — you are adding more competitors for a fixed budget of attention weight.

Analogy: Imagine you are in a room full of people, all talking at once. You can only pay 100 percent of your attention total — it does not grow with the number of people. With 5 people in the room, each gets roughly 20% of your focus. With 500, each gets 0.2%. When the relevant person finally says something, their share of your attention has collapsed to noise. That is what happens to code buried in a long prompt.

2. Why Attention Dilutes: The Math

The attention mechanism was introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). Its core computation is:

Attention(Q, K, V) = softmax( QK^T / √d_k ) · V

Where:

Q — the query matrix (what each token is "asking for")
K — the key matrix (what each token "offers" for comparison)
V — the value matrix (the actual content passed forward if selected)
d_k — the dimension of the key vectors (scales to prevent extreme dot products)
softmax — converts a vector of raw scores into a probability distribution that sums to 1

The notation QK^T means: for each token, compute a dot product between its query vector and every other token's key vector. The dot product is large when two vectors point in the same direction (high relevance between the pair), and near zero when they are orthogonal (unrelated). Multiplying by the transposed key matrix K^T does all N×N such comparisons in a single matrix operation. The result is a matrix of raw relevance scores. Dividing by √d_k prevents those scores from becoming so large that softmax saturates.

The softmax step is the dilution mechanism. Because softmax always outputs a probability distribution — all values sum to exactly 1 — attention weights are a zero-sum resource. When there are N tokens in the context, the average attention weight is 1/N, regardless of what any individual token does. The total budget is fixed at 1.0.

This does not mean every token gets exactly equal attention — the model can still concentrate on a small subset if the dot-product scores separate those tokens sharply from the rest. Softmax is non-linear and can be quite aggressive when there is a large score gap between relevant and irrelevant tokens. But in a real codebase, that gap is rarely clean. Hundreds of unrelated function definitions produce hundreds of tokens with moderately non-zero dot products — they're not completely irrelevant, they just aren't what you need right now. These tokens collectively consume most of the softmax budget. The useful signal must compete against this crowd, and as N grows, the signal's share degrades continuously. It isn't a cliff; it's a steady erosion that compounds with each additional file you stuff in.
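The erosion is easy to see numerically. The sketch below computes the softmax weight of a single relevant token competing against a growing crowd of moderately scored tokens. The scores (4.0 for the signal, 1.0 for each distractor) are made-up illustrative values, not measurements from any model.

```python
import math


def signal_share(n_distractors, signal_score=4.0, distractor_score=1.0):
    """Softmax weight of one relevant token among n_distractors weaker ones."""
    signal = math.exp(signal_score)
    noise = n_distractors * math.exp(distractor_score)
    return signal / (signal + noise)


# The score gap stays constant; only the number of competitors changes.
for n in (10, 100, 1_000, 10_000):
    print(n, round(signal_share(n), 4))
```

Even with a fixed score gap in its favor, the relevant token's share shrinks steadily as more moderately scored tokens join the context.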

Key Insight: The context window limit is not just a practical engineering constraint — it reflects a genuine quality degradation. The problem is not that the model cannot read long inputs. It is that as context grows, every individual piece of information receives proportionally less attention weight. More input does not mean more comprehension; it means each fact competes harder for finite attentional resources.

3. Lost in the Middle: Position Bias

Attention dilution is one problem. A second, independent problem compounds it: position bias. Modern language models do not attend to all positions in their context with equal reliability. They preferentially attend to tokens at the beginning and end of the sequence, and perform significantly worse on information placed in the middle.

This phenomenon was studied in a 2023 paper by Nelson Liu et al. titled Lost in the Middle: How Language Models Use Long Contexts. The researchers tested models on multi-document question answering, varying the position of the document containing the answer. When the answer document was at position 1 or last, accuracy was high. When it was at position 10 of 20 documents, accuracy dropped by more than 30 percentage points — even though the information was technically within the model's context window.

Two mechanisms contribute. The first is RoPE (Rotary Position Embeddings), the positional encoding scheme in most modern open-source language models (LLaMA, Mistral, GPT-NeoX). RoPE encodes position by rotating the query and key vectors by angles proportional to their positions. The dot product between a query at position m and a key at position n includes a term that decays with relative distance (m−n), so semantically relevant tokens far from the query position must overcome a rotational penalty to receive attention weight. Tokens near the start of the sequence are visible to every later position under causal masking, which gives them a structural advantage.

The second mechanism is causal training recency bias. Language models are trained to predict the next token given all previous tokens. This training objective pushes models to weight recent tokens heavily, because the immediately preceding context is almost always the most relevant signal for next-token prediction during training. The middle of a long context rarely dominated training gradients, so models systematically underweight it. This effect was documented in GPT-3.5 era models well before RoPE became standard, so it isn't purely an artifact of positional encoding; it's baked into causal pretraining. Both effects run in the same direction: the middle of a long context is structurally disadvantaged.

A 2024 paper from UW, MIT, and Google (Found in the Middle) demonstrated that this bias can be partially corrected by calibrating attention weights at inference time — but this requires modifying the model's internals, which is not available when calling an API.

Common Mistake: Many teams inject retrieved chunks right after the system prompt, ahead of a long conversation history. This lands retrieved content in a position that gets the worst of both worlds: past the very beginning (losing the primacy advantage) and far from the end, where the user's question and the generation target sit. The safest placement for retrieved code context is immediately before the user's specific question, near the end but not buried in the middle of a long history.

4. The Quadratic Cost Problem

Even if you were willing to accept degraded attention quality, there is a third reason not to stuff context: the compute cost of attention scales quadratically with sequence length.

To compute the full attention matrix, the model must compare every token's query against every other token's key. If your sequence has N tokens, this requires N × N comparisons. Doubling the context length quadruples the compute required for attention.

Time complexity of full self-attention: O(N² · d)

A 4× increase in context length → 16× increase in attention compute. A 10× increase → 100×.

FlashAttention (Dao et al., 2022) improves the memory profile to O(N) via tiling — it never writes the full N×N matrix to GPU memory. But the number of floating-point operations is still O(N²). Latency and cost still scale quadratically with sequence length.

In production, a code assistant filling 100,000 tokens of context is not just 10× slower than one filling 10,000 tokens — it is closer to 100× more expensive in attention compute alone. You are paying more to get worse results.

Part 2: How RAG Fixes It

5. RAG at a Glance: The Core Idea

Retrieval-Augmented Generation reframes the problem. Instead of asking "how can we give the model the whole codebase?", it asks: "how do we figure out which parts of the codebase are relevant to this specific completion request, and send only those?"

The answer has two phases. First, an offline indexing phase where the codebase is processed, divided into chunks, and each chunk is converted into a vector representation (an embedding) that captures its semantic meaning. These vectors are stored in an index optimized for fast similarity search. Second, an online retrieval phase that happens at query time: the developer's current context is converted into a query vector, and the most similar chunks from the index are retrieved and injected into the prompt.

The model then receives a context window that is not a random cross-section of the codebase — it is the small set of pieces most likely to be relevant to the task at hand.

The pipeline:

1. Parse & Chunk: split at function/class boundaries, not arbitrary token counts
2. Embed Chunks: convert each chunk to a vector with a code embedding model
3. Build Search Index: ANN index for dense retrieval + BM25 index for lexical retrieval
4. Embed the Query: convert current cursor context to a query vector
5. Retrieve Top-k: run hybrid search (dense + BM25), fuse results
6. Inject & Generate: inject top 3–5 chunks into the LLM prompt, immediately before the user's request

Steps 1–3 happen once (or on incremental file changes). Steps 4–6 happen on every completion request. The parts where most implementations go wrong: chunking (using fixed-size splits instead of AST boundaries), retrieval (using only dense search and missing exact identifier queries), and injection order (burying retrieved context in the middle of the prompt).

6. Chunking for Code: Why Fixed-Size Fails

Code has structure that text does not. A function is a unit of meaning. Fixed-size chunking — splitting every file every 256 tokens — splits in the middle of functions, destroying logical units.

Consider a Python function that is 80 lines long. With a 50-token chunk size, it gets split into chunks that look like:

Chunk A:
def process_payment(order_id, amount, currency="USD"):
    """Process a payment..."""
    conn = get_db_connection()
    try:
        txn = conn.begin_transaction(

Chunk B:
            order_id=order_id,
            amount=amount,
            currency=currency
        )
    except DatabaseError as e:
        log_error(e)
        raise PaymentError(str(e))

Neither chunk represents the function accurately. The embedding of Chunk A does not represent "a payment processing function" — it represents a truncated fragment.

AST-based chunking uses tree-sitter to parse each file and extract logical units at language-defined boundaries: function definitions, class bodies, method groups. Each chunk's metadata includes file path, start line, end line, and node type. This metadata is as important as the chunk text itself — it tells the retrieval system where in the codebase this chunk lives.
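To make the idea concrete without extra dependencies, the sketch below swaps tree-sitter for Python's built-in `ast` module, which only handles Python source. It is a stand-in for illustration, not the tree-sitter API: it extracts top-level functions and classes and records the same metadata fields described above. The function name and dictionary keys are hypothetical.

```python
import ast


def chunk_python_source(source, file_path="<memory>"):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file_path": file_path,
                "start_line": node.lineno,    # 1-indexed
                "end_line": node.end_lineno,  # available on Python 3.8+
                "node_type": type(node).__name__,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


sample = '''import os

def load(path):
    """Read a file."""
    with open(path) as f:
        return f.read()

class Cache:
    def get(self, key):
        return None
'''

chunks = chunk_python_source(sample, "example.py")
for chunk in chunks:
    print(chunk["node_type"], chunk["start_line"], chunk["end_line"])
```

A production chunker would also descend into classes to emit one chunk per method and would handle decorators, which this sketch leaves out.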

One practical addition: each chunk can be augmented with a small surrounding context for embedding purposes — the preceding import block, the class it belongs to, or the file's module-level docstring. This gives the embedding model enough context to produce a vector that reflects the chunk's role in the larger structure. The key is that this surrounding context is used only for embedding, not retrieved as part of the chunk text.

The overlap trap in code: Sliding window overlap (copying N tokens from one chunk into the next) is useful in prose. In code it often makes things worse: the overlap introduces duplicate logic into separate chunks, making embedding space crowded with near-identical vectors. For code, the recommended approach is to store a "parent context" chunk separately — always inject the enclosing class signature alongside any function chunk, rather than copying the previous function's body into the current chunk. The Continue open-source IDE extension uses this approach.

7. Retrieval Strategies: Dense, Sparse, and Why Code Needs Both

Dense retrieval converts query and each chunk to vectors, then finds the most similar by cosine similarity. It can match meaning even when exact words differ — "how do we handle rate limit errors?" surfaces functions named throttle_on_429 or backoff_retry.

The embedding model used matters significantly. Code-specialized models like voyage-code-3 — purpose-built for code retrieval, top-ranked on code retrieval benchmarks (2025) — produce substantially better representations for function bodies, type signatures, and API calls than general-purpose models. text-embedding-3-large is a strong general-purpose embedding model suited for mixed code + documentation retrieval, but it wasn't specifically designed around code.

BM25 (lexical/keyword retrieval) scores chunks by how often the query's terms appear in them, weighting rarer terms more heavily. It excels at exact matches: a developer looking for PaymentGateway.process_refund will find it immediately. Error codes, configuration key names, and exact API method names are better retrieved lexically than semantically. For code, the asymmetry is important: queries for exact identifiers favor BM25. Queries for concepts and behaviors favor dense retrieval. The right system runs both.
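For readers who want to see the lexical side in code, here is a compact sketch of Okapi BM25 scoring using only the standard library. The tokenizer, the sample chunks, and the parameter defaults (`k1=1.5`, `b=0.75`, common textbook values) are assumptions for illustration; a production system would normally use an existing search library instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep identifiers such as process_refund intact; split on other punctuation.
    return re.findall(r"[A-Za-z0-9_]+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    docs = [tokenize(d) for d in documents]
    avg_len = sum(len(d) for d in docs) / len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        counts = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in counts:
                continue
            idf = math.log((len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = counts[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores


code_chunks = [
    "def process_refund(self, order_id): return self.gateway.refund(order_id)",
    "def process_payment(self, order_id, amount): charge the card",
    "def backoff_retry(fn): retry with exponential delay",
]
scores = bm25_scores("PaymentGateway.process_refund", code_chunks)
print(scores)
```

Only the first chunk contains the exact identifier, so it is the only one that receives a non-zero score for this query.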

8. Hybrid Search and Reciprocal Rank Fusion

Running both methods produces two ranked lists that need combining. BM25 scores and cosine similarity scores live in completely different numerical ranges — you cannot add them directly.

Reciprocal Rank Fusion (RRF) avoids the normalization problem entirely by ignoring raw scores and working only with ranks. The word "reciprocal" means 1/x — the score assigned to a document is the reciprocal of its rank in each list:

RRF_score(d) = Σ_{r ∈ R} 1 / (k + rank_r(d))

Where:

R = set of ranked lists (BM25 list, dense list)
rank_r(d) = position of document d in list r (1-indexed)
k = smoothing constant (default 60, from Cormack, Clarke & Buettcher 2009 — empirically robust across many retrieval tasks). Increasing k makes the formula more conservative, rewarding consistent mid-rank appearances over a single strong rank.
If a document does not appear in a list, its contribution from that list is 0

A document ranked #1 in both lists scores 2/61 ≈ 0.033. A document ranked #1 in one list but #100 in the other scores 1/61 + 1/160 ≈ 0.023. Candidates that both BM25 and semantic search agree on float to the top.
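The formula translates almost directly into code. The sketch below is a minimal implementation under the definitions given above; the function name and the sample file names are made up for illustration.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one list of (doc_id, score)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents missing from a list simply contribute nothing from that list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["refund.py", "payment.py", "retry.py"]
dense_ranking = ["refund.py", "logging.py", "payment.py"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Here `refund.py` is first in both lists and wins, while `payment.py` beats the two files that appear in only one list.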

9. Reranking: The Final Sorting Pass

After hybrid search and RRF, you have ~20 candidate chunks. A cross-encoder reranker takes both the query and a candidate chunk as a single concatenated input and produces a relevance score. Because both texts pass through the model together, the model can attend to query-document relationships that a bi-encoder cannot — query and document never interact during bi-encoder encoding.

The practical architecture: use fast first-stage retrieval (dense + BM25 + RRF) to get the top 20 candidates, then run a cross-encoder on those 20 for final ordering. The top 5 go into the prompt.

Cross-encoder context window limits: Cross-encoders are themselves transformer models with context window limits. General-purpose reranker models like ms-marco-MiniLM-L-12-v2 support 512 subword tokens — which is often enough for a single short function, but not for large class bodies. For retrieval pipelines that surface larger chunks, use a reranker with a larger window: Cohere Rerank 3 supports 4,096 tokens; voyage-rerank-2 supports 16K. If the combined chunk+query still exceeds the limit, truncate the chunk from the bottom — the function signature and docstring are more informative for reranking than the implementation tail.
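Truncating from the bottom can be done line by line so the signature and docstring are always kept first. The sketch below is illustrative only: the function name is hypothetical, and the default token counter is a crude whitespace split standing in for the reranker's real tokenizer, which will also add a few special tokens that this sketch ignores.

```python
def truncate_for_reranker(query, chunk, max_tokens=512, count_tokens=None):
    """Keep leading lines of chunk until the query plus chunk would exceed max_tokens."""
    if count_tokens is None:
        # Rough stand-in; pass the reranker's tokenizer-based counter in practice.
        count_tokens = lambda text: len(text.split())

    budget = max_tokens - count_tokens(query)
    kept = []
    for line in chunk.splitlines():
        cost = count_tokens(line)
        if cost > budget:
            break  # drop this line and everything below it
        kept.append(line)
        budget -= cost
    return "\n".join(kept)


chunk = 'def refund(order):\n    """Refund an order."""\n    a = 1\n    b = 2'
short = truncate_for_reranker("refund order", chunk, max_tokens=8)
print(short)
```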

For code with strong AST chunking and a good code embedding model, hybrid bi-encoder retrieval is often sufficient for most queries. Reranking becomes most valuable when queries are ambiguous or when the codebase has many semantically similar functions. It adds 50–200ms of latency, so benchmark before committing.

Part 3: In Production

10. How Cursor Does It: A Reference Architecture

When you open a project in Cursor, it chunks local files and sends them to its servers, where they are embedded (via OpenAI's API or a custom model) and stored in Turbopuffer — its vector store of choice. File paths are obfuscated client-side before any data leaves your machine. Embeddings are cached by chunk hash, making incremental re-indexing fast.
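Caching by chunk hash is simple to sketch. The snippet below is an illustration of the general idea, not a description of Cursor's internal implementation; `embed_incremental`, `fake_embed`, and the in-memory dict cache are all hypothetical stand-ins.

```python
import hashlib

def chunk_hash(text: str) -> str:
    """Content hash used as the cache key for a chunk."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_incremental(chunks, cache, embed_fn):
    """Embed only chunks whose content hash is not already cached.

    Returns the vectors in input order and the number of embed_fn calls.
    """
    calls = 0
    vectors = []
    for text in chunks:
        key = chunk_hash(text)
        if key not in cache:
            cache[key] = embed_fn(text)
            calls += 1
        vectors.append(cache[key])
    return vectors, calls

def fake_embed(text):
    """Placeholder for a real embedding API call."""
    return [float(len(text))]

cache = {}
_, first_calls = embed_incremental(["def a(): pass", "def b(): pass"], cache, fake_embed)
# Second pass: one chunk unchanged, one edited
_, second_calls = embed_incremental(["def a(): pass", "def b(): return 1"], cache, fake_embed)
```

On the second pass only the edited chunk triggers an embedding call, which is what makes re-indexing after a small edit cheap.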

At query time, Cursor monitors the active cursor position and constructs a composite signal: the current file's surrounding code, any open editor tabs, and recent edit history. This signal is embedded into a query vector, sent to Turbopuffer for ANN search, and the top-k results are retrieved. The actual code is read from local disk; the model only sees the retrieved text.

@Codebase in Cursor's chat is the explicit trigger for a full retrieval pass over the indexed codebase. Without it, Cursor uses a lighter heuristic based on open tabs and file imports. @Docs and @Web extend the same pipeline beyond the local codebase.

One important architectural note: the embedding model used to index the codebase is separate from the generative model used to produce completions. Cursor uses a lightweight, fast embedding model for indexing (optimized for latency and throughput over millions of chunks) and a larger, slower generative model for the actual completion. When building a similar system, these two components have independent optimization concerns — do not assume the same model serves both roles.

GitHub Copilot's context construction follows a similar pattern. For inline completion, it uses the current file content around the cursor plus a Jaccard similarity heuristic to find other open tabs that share significant token overlap with the current file. The @workspace chat participant in VS Code triggers a more thorough indexing-based search, analogous to Cursor's @Codebase. Copilot's default inline completion mode is a fast, low-latency path that does not run full vector retrieval on every keystroke; full retrieval is reserved for explicit chat interactions.
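A Jaccard heuristic of this kind fits in a few lines. The sketch below is an assumption-laden illustration: the identifier regex, the function names, and the sample tabs are hypothetical, and Copilot's actual tokenization and thresholds are not described in this post.

```python
import re

def tokens(source: str) -> set:
    """Identifier-like tokens in a source string."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_open_tabs(current: str, tabs: dict) -> list:
    """Return (score, tab name) pairs, most similar tab first."""
    cur = tokens(current)
    scored = [(jaccard(cur, tokens(text)), name) for name, text in tabs.items()]
    return sorted(scored, reverse=True)

current_file = "def charge(card, amount): return gateway.charge(card, amount)"
open_tabs = {
    "gateway.py": "class Gateway:\n    def charge(self, card, amount): ...",
    "readme.md": "Install with pip",
}
ranked = rank_open_tabs(current_file, open_tabs)
```

Here `gateway.py` shares four of nine distinct tokens with the current file and ranks first, while the README shares none.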

11. Tradeoffs and Limits of Code RAG

Scenario	RAG Behavior	Mitigation
Cross-file dependency reasoning	Each retrieved chunk is a fragment; the model may not understand how three retrieved functions compose at the call site	Include file path + line range metadata; retrieve parent class or module-level imports alongside function bodies
Newly created files not yet indexed	Invisible to retrieval until the index is rebuilt	Incremental indexing on file-save events; maintain a pending index queue
Query is too vague	"fix the bug" → retrieves generic results	Use cursor position + surrounding error message as primary query signal
Minified or generated code	Lock files, protobuf generated code pollute the index	Maintain a .gitignore-style exclude list for the RAG indexer
Very large monorepos	Recall degrades; indexing is slow	Scope index to current working subdirectory or per-service sub-indices
Schema/type changes	Stale embeddings give the model outdated type signatures	Invalidate embeddings on file write by chunk content hash

Does a larger context window make RAG obsolete? As context windows grow to 1M and beyond — Llama 4 Scout hit 10M tokens in 2025, Gemini 1.5 Pro supported 1M — this question keeps coming up. The practical answer is no, though the reasoning matters. A 200,000-line Python codebase easily exceeds 2 million tokens. Most production monorepos are far larger. More importantly, the attention quality degradation described in Sections 2 and 3 doesn't disappear with a larger nominal window. Those long-context models achieve their range through techniques like NTK-aware RoPE scaling (which extends the effective frequency range of positional encodings) and sparse attention patterns (which skip computation on distant token pairs) — these help with extrapolation but don't eliminate the position bias at extremely long ranges. And practically: a 1M-token prompt is expensive and slow even on state-of-the-art hardware. For interactive code assistance, stuffing the full codebase is off the table regardless of window size.

Large context windows and RAG do different jobs. RAG decides what deserves to be in the context window. The context window determines how much you can fit once you've been selective. A well-tuned system retrieves the right 5,000 tokens from a 10M-token codebase and puts them in a 128K window with room left for conversation history and tool outputs.

12. Building Your Own Code RAG Pipeline

Parsing: Use tree-sitter with a recursive AST walk — iterating only over root_node.children misses deeply nested functions and class methods.

# Sketch: AST chunk extraction with tree-sitter (py-tree-sitter 0.22+ API)
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

TARGET_TYPES = {"function_definition", "class_definition"}

def walk_tree(node, source: bytes, file_path: str, chunks: list):
    """Recursively walk the AST so definitions that are not direct children
    of the module (methods inside classes, definitions under if/try blocks
    or decorators) are still found."""
    if node.type in TARGET_TYPES:
        # start_byte/end_byte are byte offsets, so slice the encoded source;
        # slicing the str would drift on any non-ASCII character.
        chunk_text = source[node.start_byte:node.end_byte].decode("utf-8")
        chunks.append({
            "text": chunk_text,
            "file": file_path,
            # start_point/end_point rows are 0-indexed; store 1-indexed lines
            "start_line": node.start_point[0] + 1,
            "end_line": node.end_point[0] + 1,
            "type": node.type,
        })
        # For class_definition, continue recursing to capture methods.
        # For function_definition, stop: we want the whole function,
        # not its nested helpers as separate chunks.
        if node.type == "class_definition":
            for child in node.children:
                walk_tree(child, source, file_path, chunks)
    else:
        for child in node.children:
            walk_tree(child, source, file_path, chunks)

def extract_chunks(source_code: str, file_path: str) -> list[dict]:
    source = source_code.encode("utf-8")
    tree = parser.parse(source)
    chunks = []
    walk_tree(tree.root_node, source, file_path, chunks)
    return chunks

Embedding models:

Model	Context window	Strengths	When to use
`voyage-code-3`	32K tokens	Purpose-built for code; top-ranked on code retrieval benchmarks (2025)	Production code assistant, maximum retrieval quality
`text-embedding-3-large`	8K tokens	Strong general performance; well-supported; large community	Mixed code + documentation retrieval; existing OpenAI integrations
`nomic-embed-code`	8K tokens	Open-weight; can run locally; no API cost	Air-gapped environments; cost-sensitive deployments; on-prem

Vector store:

pgvector in Postgres — sufficient for single-developer or small-team tools
Qdrant — supports both dense and sparse vectors in a single collection, enabling native hybrid search without maintaining two separate stores

Prompt injection template:

You are a coding assistant for this codebase.

## Relevant context from the codebase:

### [payments/gateway.py · lines 42–87]

python
{chunk_1_text}


### [payments/exceptions.py · lines 1–24]

python
{chunk_2_text}


### [payments/models.py · lines 88–112]

python
{chunk_3_text}


## Current task:
{user_request}

Include file path and line numbers in each chunk header. These cost very few tokens but give the model the module structure needed to generate correct imports and references.
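Assembling that template from retrieved chunks might look like the sketch below. The chunk dictionary keys mirror the output of `extract_chunks` above; `format_chunk`, `build_prompt`, and the `max_chunks` cap are hypothetical names chosen for illustration.

```python
def format_chunk(chunk: dict) -> str:
    """Render one retrieved chunk with its file path and line range header."""
    header = f"### [{chunk['file']} · lines {chunk['start_line']}–{chunk['end_line']}]"
    return f"{header}\n\n{chunk['text']}\n"

def build_prompt(chunks: list, user_request: str, max_chunks: int = 5) -> str:
    """Fill the template with at most max_chunks chunks, best-ranked first."""
    parts = [
        "You are a coding assistant for this codebase.",
        "",
        "## Relevant context from the codebase:",
        "",
    ]
    for chunk in chunks[:max_chunks]:
        parts.append(format_chunk(chunk))
    parts += ["## Current task:", user_request]
    return "\n".join(parts)

retrieved = [
    {"file": "payments/gateway.py", "start_line": 42, "end_line": 87,
     "text": "def charge(card, amount): ..."},
    {"file": "payments/exceptions.py", "start_line": 1, "end_line": 24,
     "text": "class PaymentError(Exception): ..."},
]
prompt = build_prompt(retrieved, "Add retry handling to charge().")
```

Capping the chunk count inside `build_prompt` keeps the selectivity decision in one place, so a retriever that returns too many candidates cannot silently inflate the prompt.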

Do not retrieve more than you need. It is tempting to inject 10–15 chunks to "give the model more information." Resist this. Each additional chunk increases context size (paying the quadratic cost from Section 4), increases attention dilution, and reduces the proportion of the context that is highly relevant. In practice, 3–5 high-quality chunks typically outperform 15 lower-quality ones. Invest in retrieval quality, not retrieval quantity.

The Through-Line

The surprising thing about attention dilution is that it isn't a bug you can patch. It's a structural property of softmax normalization — the total attention weight sums to 1.0 regardless of sequence length, so every token you add is competing with every other for a share of that budget. More context doesn't mean more understanding; it means each fact gets a smaller slice. The lost-in-the-middle position bias makes it worse: code injected into the middle of a long prompt is structurally disadvantaged by both RoPE's distance decay and the recency bias that causal pretraining instills. Knowing this changes how you think about the whole problem.
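A toy calculation shows how the softmax budget shrinks. The model below is a deliberate simplification and an assumption of this sketch: one relevant token with a fixed logit competes against distractors that all have logit 0. Real attention scores are not this uniform, but the normalization works the same way.

```python
import math

def relevant_weight(n_tokens: int, relevant_logit: float = 5.0) -> float:
    """Softmax weight on one relevant token (logit = relevant_logit)
    among n_tokens - 1 distractors that all have logit 0."""
    top = math.exp(relevant_logit)
    # Each distractor contributes exp(0) = 1 to the denominator
    return top / (top + (n_tokens - 1))

for n in (100, 10_000, 1_000_000):
    print(n, relevant_weight(n))
```

Under these assumptions the relevant token holds roughly 0.60 of the attention weight at 100 tokens and roughly 0.015 at 10,000 tokens, even though its own score never changed.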

RAG doesn't solve attention dilution — it sidesteps it. Instead of sending everything and hoping the model finds what's relevant, it figures out what's relevant first and sends only that. The context window ends up containing what actually matters for the task: the right type definitions, the right helper functions, the right error handling patterns.

In practice: below roughly 3,000–5,000 lines, context stuffing usually works well enough. Above that, the problems stack up fast. At 50,000+ lines, naive stuffing reliably hurts. At 500,000+ lines, AST chunking, hybrid BM25 + dense retrieval, RRF fusion, and careful prompt injection aren't premature optimization — they're the baseline.

References

Foundational Papers

Vaswani et al. (2017) — Attention Is All You Need. NeurIPS. The original transformer paper introducing scaled dot-product attention.
Liu et al. (2023) — Lost in the Middle: How Language Models Use Long Contexts. Stanford / Berkeley. Empirical study of U-shaped attention bias and the 30% accuracy drop at mid-context positions.
Hsieh et al. (2024) — Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization. UW / MIT / Google. Proposed calibration method that partially corrects positional attention bias at inference time.
Survey (2025) — Retrieval-Augmented Code Generation: A Survey. Comprehensive survey of RAG approaches specifically for code generation and repository-level tasks.

RAG & Retrieval

Cormack, Clarke & Buettcher (2009) — Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods
Dao et al. (2022) — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
RAG vs Large Context Window: Real Trade-offs for AI Apps — Redis Engineering Blog
Context Window Optimization: Why Ranking, Not Stuffing, Is the Scaling Law for Agents — Shaped AI
RAG for LLM Code Generation using AST-Based Chunking — Vishnudhat Natarajan
Better Retrieval Beats Better Models for Large Codebases — Stéphane Derosiaux

Code Assistants & Architecture

How Cursor Actually Indexes Your Codebase — Towards Data Science
How GitHub Copilot Works — Quastor Engineering
What is Retrieval-Augmented Generation? — GitHub Blog

Hybrid Search & Ranking

BM25 vs Dense Retrieval for RAG: What Actually Breaks in Production — Ranjan Kumar
Hybrid Search: BM25 and Dense Retrieval Combined — Michael Brenndoerfer

India's DPDP Act 2023 Explained — And How AI Handles Data Principal Requests at Scale

Swapnanil Saha — Thu, 21 May 2026 21:46:43 +0000

This post is for informational purposes only and does not constitute legal advice. The DPDP Act 2023 and its implementing Rules 2025 are relatively new — requirements may evolve through further notifications or guidance. Verify the current position with a qualified data protection lawyer before making compliance decisions.

Your company just received this email:

"I would like to know all personal data your organisation holds about me. This is a formal request under the DPDP Act."

It lands in a shared privacy@yourcompany.com inbox. Someone reads it. Forwards it to legal. Legal forwards it to engineering. Engineering says they need to check three databases. Nobody notes the date it arrived. Three weeks pass. When someone finally circles back, there are nine days left on the 30-day window the DPDP Rules require. Not enough time to locate the data, get legal sign-off, draft a response in the right language, and send it.

On day 32, you're in violation.

That scenario is the default for most Indian companies right now. Not because they're careless — but because nobody built the infrastructure for it.

I built DPDP Copilot to close that gap: a self-hosted operator tool that accepts public data requests, classifies them with Claude, drafts compliant multilingual replies, tracks every action as immutable evidence, and monitors SLA status in real time.

But before the tool, you need to understand what you're actually dealing with. Let's start with the law.

→ Full tool page and live demo

Part 1: What the DPDP Act 2023 Actually Requires

The Digital Personal Data Protection Act 2023 received presidential assent on 11 August 2023 and represents India's first comprehensive data protection legislation. Its structure borrows from GDPR while adapting to India's specific context — a 1.4 billion-person population, 22 scheduled languages, deep mobile penetration, and a digital public infrastructure layer (UPI, Aadhaar, DigiLocker) that most jurisdictions don't have.

The implementing Rules — the Digital Personal Data Protection Rules 2025 — were notified on 13 November 2025, giving the Act its operational teeth.

Here's what the law actually mandates, stripped of legalese, focusing on the parts most engineering and compliance teams get wrong.

The Four Rights Every Data Principal Has

The Act grants every "data principal" — the person whose data is being processed — four actionable rights. When someone exercises any of these, your organisation (as the "data fiduciary") has a legal obligation to respond.

Right of Access (Section 11)

Any person can ask you: what personal data do you hold about me, and for what purpose? You must provide a summary of the data being processed, the processing activities, and the identities of any other data fiduciaries or processors with whom their data has been shared. The Act doesn't specify a format, but silence is not sufficient.

Right of Correction and Completion (Section 12(a))

If a person believes data you hold is inaccurate, incomplete, or misleading, they can demand you correct or complete it. You must either act on the request or explain in writing why you're not.

Right of Erasure (Section 12(b))

A person can request deletion of their personal data from your systems. There are exceptions — data held for legal obligations, fraud prevention, pending litigation — but these exceptions have to be documented and justified, not just asserted.

Right to Grievance Redressal (Section 13)

Any person can file a grievance if they believe their rights under the Act have been violated. You must provide a mechanism to receive and respond to grievances. The Rules 2025 specify this mechanism must be genuinely accessible.

The Response Timelines

The DPDP Rules 2025 (notified November 2025) set specific mandatory windows for responding to data principal requests:

Access, Correction, and Erasure requests (Sections 11–12): Data fiduciaries must respond within 30 days.
Grievance Redressal (Section 13): Grievances must be resolved within a maximum of 90 days from receipt.

These are calendar days. For reference: GDPR (EU) also requires responses within one month for most data subject requests; California's CCPA gives 45 days. India's framework is broadly comparable to GDPR in its demands — but applies at the scale of 1.4 billion people, across 22 scheduled languages. That's where the operational challenge is categorically harder.

30 days sounds like a lot. For a company with no structured process, it evaporates fast. A request that lands in a shared inbox on a Friday, takes three days to be noticed, gets forwarded twice, waits a week for a legal review, and then requires manual drafting in the data principal's language — you're out of time before anyone writes the first sentence.

A note on the DPDP Copilot tool's SLA default: The tool's internal SLA clock defaults to 7 days — intentionally more conservative than the 30-day legal window. Most mature compliance programmes target internal deadlines that are significantly tighter than the regulatory maximum, so that normal delays (review cycles, approvals, language checks) don't push you to the edge. The 7-day default is configurable via orgs.sla_days. When the Rules are read by your legal team and a specific target is agreed, you set it once in the database.
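The two clocks are easy to compute side by side. The sketch below is written in Python for illustration (the tool itself is JavaScript); the constant and function names are hypothetical, and the 30-day and 7-day values are the ones described in this post.

```python
from datetime import datetime, timedelta, timezone

LEGAL_WINDOW_DAYS = 30   # response window described in this post
INTERNAL_SLA_DAYS = 7    # the tool's default, configurable via orgs.sla_days

def deadlines(received_at: datetime) -> dict:
    """Internal and legal due dates, both counted in calendar days."""
    return {
        "internal_due": received_at + timedelta(days=INTERNAL_SLA_DAYS),
        "legal_due": received_at + timedelta(days=LEGAL_WINDOW_DAYS),
    }

def window_used(received_at: datetime, now: datetime) -> float:
    """Fraction of the legal window already consumed."""
    elapsed = (now - received_at).total_seconds()
    return elapsed / timedelta(days=LEGAL_WINDOW_DAYS).total_seconds()

received = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)
due = deadlines(received)
used_after_three_weeks = window_used(received, received + timedelta(days=21))
```

A request that sits untouched for three weeks has used 70% of the legal window, which matches the nine remaining days in the opening scenario.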

What "Evidence" Actually Means Under DPDP

The Act and the Rules create a documentation burden that most organisations underestimate. You need to be able to prove:

That the request was received on a specific date
That it was handled (classified and routed) in a timely manner
What response you gave and when
Whether the response fulfilled the request or why it couldn't

This is audit evidence. If the Data Protection Board investigates a complaint, you need to produce this trail. A forwarded email chain is not audit evidence. A Slack thread is not audit evidence. An append-only timestamped log — with the original message, the classification, the drafted response, and the send event — is audit evidence.

The Financial Exposure

The Act's First Schedule specifies penalties by category of failure. The two most operationally relevant:

₹250 crore (~$30M USD) — Failure to implement reasonable security safeguards to prevent personal data breaches (Section 8(5)). This is the preventive obligation — having security measures in place. The penalty applies even where a breach subsequently occurs and the fiduciary claims they didn't anticipate it.

₹200 crore — Failure to notify the Data Protection Board and affected data principals when a personal data breach does occur (Section 8(6)). The notification obligation is separate from the security obligation — you can get penalised for both.

Other penalty tiers: ₹200 crore for violations related to children's personal data (Section 9); ₹150 crore for Significant Data Fiduciary obligation failures; ₹50 crore for other provision breaches.

The Data Protection Board, once fully constituted, will have adjudicatory powers to investigate and levy these penalties. Failing to acknowledge or respond to a data principal request, if that person escalates to the Board, creates a documented paper trail of non-compliance before any investigation begins.

Part 2: Why Your Current Process Fails (And Why That's the Default)

Let me describe the most common setup I've seen when talking to Indian companies dealing with DPDP requests:

A privacy@ email address that gets checked sporadically
No clock tracking — the 30-day window doesn't appear anywhere visible until it's almost gone
No classification — the person who reads it decides manually whether it's an access request, deletion request, or complaint
Reply drafted manually, from scratch, in English, by whoever processes it that week
No audit trail beyond the email itself, which may be deleted if an inbox is cleaned

This isn't negligence. It's the logical outcome of a process designed before the Act existed. The process was "email us with your concern" — and it worked fine when data requests were rare. The DPDP Act changes the legal weight of those requests, but most companies haven't updated their infrastructure to match.

The Three Ways Manual Processes Break

1. The deadline blind spot

When a request lands in an email inbox, the 30-day clock doesn't appear anywhere. Nobody stamps the receipt date. Nobody sends an automatic acknowledgement. The request sits until someone opens the inbox. If that takes a week — completely normal for a low-traffic shared inbox — you've already used 23% of your response window without touching the request. Legal review, data location, and drafting will eat most of what's left.

2. Classification inconsistency

"Please delete my data" is an erasure request. "I never gave you permission to use my data" is a grievance. "I want to update my phone number" is a correction request. "Can you send me everything you have on me" is an access request. A trained compliance professional can distinguish these consistently. Your Monday-morning on-call engineer who reads the shared inbox probably cannot — especially for requests written in Hindi, Bengali, or Tamil.

When requests are misclassified, they get routed to the wrong person, get the wrong response template, and sometimes get the wrong legal treatment. An erasure request handled as a grievance will likely produce a response that doesn't fulfil the legal obligation under Section 12(b), even if it sounds polite.

3. Evidence that can't survive audit

An auditor asks: "On what date did you receive and process this erasure request?" If your answer is "let me check the email thread," you have a problem. Email is mutable, searchable by keyword but not by event type, and has no integrity guarantees. An auditor looking for "REQUEST_CREATED at timestamp T" followed by "REPLY_SENT at timestamp T+22 days" needs a structured log, not an inbox.

Part 3: The Role of AI in DPDP Compliance

When I was designing DPDP Copilot, the central question was: where does AI actually add value, and where does it introduce risk?

DPDP compliance has two types of tasks: tasks that require human judgment about legal gray areas, and tasks that require consistent application of known rules to varied inputs. AI is well-suited to the second category and badly suited to the first.

Deciding whether your company has a legal obligation to retain data for a pending investigation? That's human judgment. Classifying an incoming message as an Access request vs. an Erasure request? That's pattern recognition on natural language — exactly what a well-prompted LLM is built for.

Classification: Where LLMs Outperform Rules

Naive rule-based classification for DPDP requests fails quickly. "Please delete my account" is an erasure request. "I want my data removed from your marketing list" is also an erasure request but uses entirely different vocabulary. "Remove me" submitted in a support ticket might be an erasure request or might just be asking to be unsubscribed from emails — context determines which.

A rules-based system that catches "delete my data" literally will miss most real-world submissions. People write in fragments, in their native language, with emotional context, in ways that don't follow a template.

An LLM with a well-structured prompt classifies these correctly without needing exhaustive keyword lists. The DPDP Copilot classification prompt:

Classify this message into exactly one of: Grievance, Access, Rectification, Deletion.
Respond as {"type":"<classification>"}.

Message:
${text}

The system prompt establishes the legal framework — "You are a DPDP compliance assistant classifying data principal requests under India's DPDP Act 2023." The model maps the message to the correct legal category.

The output is constrained to a JSON object with a single key. The application validates that type is one of the four legal categories. If the model returns something outside those four values, it's rejected and retried — the system never persists a classification it can't validate.
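The validate-then-retry step can be sketched as below. This is a Python illustration, not the tool's actual code: `parse_classification`, `classify_with_retry`, and the `call_model` callable are hypothetical names, and the retry count is an assumption.

```python
import json

ALLOWED_TYPES = {"Grievance", "Access", "Rectification", "Deletion"}

class ClassificationError(ValueError):
    """Raised when the model output is not a valid classification."""

def parse_classification(raw: str) -> str:
    """Parse the model output and return the type only if it is allowed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ClassificationError(f"not valid JSON: {raw!r}") from exc
    value = payload.get("type") if isinstance(payload, dict) else None
    if not isinstance(value, str) or value not in ALLOWED_TYPES:
        raise ClassificationError(f"unexpected type: {value!r}")
    return value

def classify_with_retry(call_model, text: str, max_attempts: int = 3) -> str:
    """Call the model until it returns a valid type or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_classification(call_model(text))
        except ClassificationError as exc:
            last_error = exc
    raise last_error

# Stub model: one malformed answer, then a valid one
answers = iter(['{"type":"Erasure"}', '{"type":"Deletion"}'])
result = classify_with_retry(lambda text: next(answers), "Please delete my data")
```

Nothing is returned to the caller until a value from the allowed set has been parsed, so an unvalidated label never reaches the database.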

Multilingual Reply Drafting: Where AI Eliminates Weeks of Work

This is where AI creates the most leverage in the Indian compliance context.

India has 22 scheduled languages. The DPDP Act creates a right to grievance redressal — and for that mechanism to be genuinely accessible (which the Rules 2025 require), you need to respond in a language the person can understand.

Without AI, producing compliant response templates in Hindi, Bengali, Tamil, and Marathi means hiring translators, reviewing legal language, maintaining version parity across languages, and updating all templates whenever requirements change. That's a significant operational cost — one that most companies defer indefinitely, defaulting to English-only responses that disadvantage non-English speakers.

With a well-prompted LLM, drafting happens at response time. The model understands DPDP legal obligations and drafts a response that:

Acknowledges the specific request type (not a generic "thank you for reaching out")
Confirms receipt and logging with a reference number
States the applicable response timeline
Explains the next step the data principal should expect
Is written in the language they chose

The system prompt for drafting:

You are a DPDP compliance officer drafting replies to data principal requests under 
India's Digital Personal Data Protection Act 2023. Write professional, empathetic 
replies that: acknowledge the request type, confirm receipt and logging, state the 
applicable response timeline, and explain the next step. Keep the tone formal but accessible.

The user message to the model specifies the request type and target language:

Draft a DPDP-compliant reply in ${language} for a ${type} request.

Customer message:
${text}

These are suggested replies — an operator reviews them before sending. The human stays in the loop for all final communications.

Prompt Caching: Making AI Cost-Efficient at Scale

The system prompts for both classification and drafting use cache_control: { type: 'ephemeral' } via the Anthropic SDK, enabling prompt caching.

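If you're processing a steady stream of data principal requests, the system prompt, which is identical for every request, is cached by Anthropic's API after the first call. Subsequent calls that hit the cache are billed at a fraction of the full input token price for the cached prefix. Two conditions apply: the ephemeral cache expires after a few minutes without a hit, and a prefix shorter than the model's minimum cacheable length is not cached at all, so the saving depends on prompt size and request rate. When both conditions are met, prompt caching can reduce API costs by 50–80% for the classification and drafting steps.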

This is a small architectural detail that has no effect on the first request and compounding positive effect on the hundredth. If you're building compliance tooling that processes high volumes, prompt caching is the difference between a sustainable per-request cost and one that makes the tool impractical at production scale.
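The request shape that enables caching is small. The sketch below builds the arguments in Python rather than JavaScript and makes no network call; the `max_tokens` value and the helper name are assumptions, while the system prompt text, user prompt, and model name are the ones quoted in this post.

```python
SYSTEM_PROMPT = (
    "You are a DPDP compliance assistant classifying data principal "
    "requests under India's DPDP Act 2023."
)

USER_TEMPLATE = (
    "Classify this message into exactly one of: "
    "Grievance, Access, Rectification, Deletion.\n"
    'Respond as {"type":"<classification>"}.\n\n'
    "Message:\n"
)

def build_classification_request(text: str, model: str = "claude-sonnet-4-6") -> dict:
    """Keyword arguments for a Messages API call with a cacheable system block."""
    return {
        "model": model,
        "max_tokens": 64,  # assumption: the JSON reply is very short
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Marks the end of the prefix that may be cached
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": USER_TEMPLATE + text}],
    }

request = build_classification_request("Please delete my data")
# With the official Python SDK this would be sent as:
#   anthropic.Anthropic().messages.create(**request)
```

Only the user message varies between requests, so everything before the `cache_control` marker is eligible for reuse.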

Retry Logic: Resilience Against Transient Failures

The LLM calls use exponential backoff retry logic:

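import Anthropic from '@anthropic-ai/sdk'

const MAX_RETRIES = 3 // total attempts, including the first call

async function callWithRetry(fn) {
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // Retry only transient failures: rate limits and server-side errors
      const isRetryable =
        err instanceof Anthropic.RateLimitError ||
        err instanceof Anthropic.InternalServerError
      if (!isRetryable || attempt === MAX_RETRIES - 1) throw err
      // Exponential backoff: 1s after the first failure, 2s after the second
      await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000))
    }
  }
}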

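Only rate limit errors and server errors trigger retries, not client errors (bad API key, invalid request format). The delay doubles with each attempt: 1 second after the first failure, then 2 seconds after the second. With three attempts total, a third failure is thrown rather than retried, so a 4-second wait would only occur if a fourth attempt were allowed. A transient API hiccup doesn't fail the entire processing pipeline for a data principal's submission.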
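A Python port of the same loop makes the delay schedule easy to check. The `RetryableError` class and the injectable `sleep` parameter are assumptions of this sketch, added so the waits can be recorded instead of actually slept.

```python
import time

class RetryableError(Exception):
    """Stand-in for rate-limit and server errors."""

def call_with_retry(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying retryable failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RetryableError:
            # The last attempt re-raises instead of sleeping again
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

delays = []
attempts = {"count": 0}

def flaky():
    """Fails twice with a transient error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RetryableError("transient")
    return "ok"

result = call_with_retry(flaky, sleep=delays.append)
```

With three attempts the recorded waits are 1 and 2 seconds; non-retryable exceptions are not caught at all and propagate on the first failure.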

Part 4: DPDP Copilot — The Tool in Detail

With the legal and AI context established, here's how DPDP Copilot works end to end.

The Public Request Form

The entry point for data principals is /grievance — no login required. Requiring a login to submit a data rights request is a barrier that conflicts with the spirit of the Act. If someone can't easily submit an erasure request, the mechanism isn't truly accessible.

The form collects:

The request message (free text — people write what they mean in their own words)
Preferred response language (English, Hindi, Bengali, Tamil, Marathi)

There's no account creation, no verification code, no CAPTCHA wall. Data principals submit and receive an acknowledgement. The contact information is embedded in the message body — a known limitation of the current implementation, and a deliberate choice for the initial version: forcing a structured contact field requires more UI complexity and doesn't add meaningful compliance value until outbound email delivery is implemented.

What Happens in the Background on Submission

When the form is submitted, a single API call to POST /api/public/requests triggers a multi-step synchronous pipeline:

Step 1: Request creation

The system creates a database record with:

A UUID as the request ID
The raw message text
The chosen language
type: 'PENDING' — not yet classified
sla_due_at: now() + 7 days — the internal SLA clock starts at submission. This 7-day default is configurable via orgs.sla_days and is intentionally conservative relative to the 30-day legal window.
org_id from the active organisation configuration

Step 2: Evidence logging — REQUEST_CREATED

An evidence_events record is written immediately after creation:

{
  "event_type": "REQUEST_CREATED",
  "event_data": { "source": "public_form", "language": "Hindi" },
  "created_at": "2025-05-25T10:00:00.000Z"
}

This is the legal timestamp of receipt. The moment the request hits the database, it's on record. The evidence log is append-only at the application level — there are no delete or update operations on evidence_events.

Step 3: AI classification

The message text goes to Claude for classification. The model returns a JSON object. The application parses it and validates that type is one of { Grievance, Access, Rectification, Deletion }. Any other value throws an error. The request record is updated with the validated type.

Step 4: Evidence logging — REQUEST_CLASSIFIED

{
  "event_type": "REQUEST_CLASSIFIED",
  "event_data": { "type": "Deletion" },
  "created_at": "2025-05-25T10:00:01.342Z"
}

The classification result and timestamp are immutable facts in the evidence record from this point forward.

Step 5: AI reply drafting

Claude drafts a response in the data principal's chosen language, using the classified request type and the original message as context.

Step 6: Evidence logging — REPLY_SUGGESTED

{
  "event_type": "REPLY_SUGGESTED",
  "event_data": { "language": "Hindi", "model": "claude-sonnet-4-6" },
  "created_at": "2025-05-25T10:00:02.891Z"
}

The entire pipeline — creation, classification, drafting — runs in under 5 seconds for a typical request. By the time an operator opens the inbox, the request is already classified, a draft reply exists, and the SLA clock has been running since submission.
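The six steps can be condensed into one function. This is a minimal in-memory sketch in Python, not the tool's implementation: the `classify` and `draft` callables stand in for the Claude calls, and a plain list stands in for the evidence_events table.

```python
import uuid
from datetime import datetime, timedelta, timezone

def handle_submission(text, language, classify, draft, evidence, sla_days=7):
    """Create a request, then classify and draft, logging each step."""
    now = datetime.now(timezone.utc)
    request = {
        "id": str(uuid.uuid4()),
        "text": text,
        "language": language,
        "type": "PENDING",
        "sla_due_at": now + timedelta(days=sla_days),
    }

    def log(event_type, event_data):
        # Append only: events are never updated or removed
        evidence.append({
            "request_id": request["id"],
            "event_type": event_type,
            "event_data": event_data,
            "created_at": datetime.now(timezone.utc),
        })

    log("REQUEST_CREATED", {"source": "public_form", "language": language})
    request["type"] = classify(text)
    log("REQUEST_CLASSIFIED", {"type": request["type"]})
    request["draft"] = draft(text, request["type"], language)
    log("REPLY_SUGGESTED", {"language": language})
    return request

evidence_log = []
submitted = handle_submission(
    "Please delete my data",
    "Hindi",
    classify=lambda text: "Deletion",
    draft=lambda text, kind, lang: f"[{lang}] draft reply for a {kind} request",
    evidence=evidence_log,
)
```

Because REQUEST_CREATED is logged before the first model call, a failed classification still leaves a receipt record with the request in the PENDING state.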

The Operator Inbox

The inbox at / is protected by authentication. It shows all requests for the active organisation, each with:

Request type (Grievance, Access, Rectification, Deletion, or PENDING if classification failed)
Message preview
Live SLA status (Within SLA / Due Soon / Overdue)
Creation timestamp

The SLA status is computed at read time — not stored as a cached value. The computeSlaStatus function runs on every page load:

export function computeSlaStatus(slaDueAt) {
  const now = new Date()
  const due = new Date(slaDueAt)
  const diffHours = (due - now) / (1000 * 60 * 60)

  if (diffHours < 0) return 'OVERDUE'
  if (diffHours < 24) return 'DUE_SOON'
  return 'WITHIN_SLA'
}

The status shown in the inbox reflects the current moment — not the status at the last time the record was updated. A request that was WITHIN_SLA yesterday is automatically DUE_SOON or OVERDUE today without any scheduled job or background worker.

The inbox is sorted by SLA urgency by default, so operators see the most at-risk requests first.
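Sorting by urgency and computing status at read time combine naturally. The sketch below is a Python port for illustration; it assumes "SLA urgency" means earliest due date first, which the post does not spell out, and `inbox_view` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

def compute_sla_status(sla_due_at: datetime, now: datetime) -> str:
    """Same thresholds as computeSlaStatus: overdue, under 24 hours, or fine."""
    diff_hours = (sla_due_at - now).total_seconds() / 3600
    if diff_hours < 0:
        return "OVERDUE"
    if diff_hours < 24:
        return "DUE_SOON"
    return "WITHIN_SLA"

def inbox_view(requests: list, now: datetime) -> list:
    """Requests ordered by due date, each paired with its current status."""
    ordered = sorted(requests, key=lambda r: r["sla_due_at"])
    return [(r["id"], compute_sla_status(r["sla_due_at"], now)) for r in ordered]

now = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
requests = [
    {"id": "a", "sla_due_at": now + timedelta(days=3)},
    {"id": "b", "sla_due_at": now + timedelta(hours=5)},
    {"id": "c", "sla_due_at": now - timedelta(hours=2)},
]
view = inbox_view(requests, now)
```

Passing `now` in as an argument keeps the function pure, which makes the threshold logic straightforward to unit test.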

The Request Detail Page

Clicking into any request shows everything an operator needs to review, respond, and close:

The original message — verbatim, exactly as submitted. No interpretation layer between the operator and what the data principal actually wrote.

The AI-drafted reply — pre-populated with DPDP-compliant language in the data principal's chosen language. The operator can read it, edit it in the text area, and send it. The draft is a starting point, not a cage.

The resolution checklist — structured prompts for the operator to work through before closing the request:

Has the relevant data been located?
Has the requested action (access/correction/deletion) been taken?
Has the data principal been notified?

The evidence timeline — every event in chronological order with timestamps, event types, and metadata.

The export controls — one click to download the full evidence trail as PDF or CSV.

Marking a Reply as Sent

When an operator sends the response (currently: manually via email or another channel, then clicks "Mark as Sent" in the tool), the system:

Updates the request status to CLOSED
Logs REPLY_SENT to the evidence table:

   {
     "event_type": "REPLY_SENT",
     "event_data": { "operator": "admin", "channel": "manual" },
     "created_at": "2025-05-27T14:22:00.000Z"
   }

The gap between REQUEST_CREATED and REPLY_SENT timestamps is the documented response time. If an auditor asks "how long did you take to respond to this erasure request?" — the answer is computable from the evidence log to the second.
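As a rough illustration of that calculation, here is a sketch that derives the response time from a list of evidence rows. responseTimeSeconds is a hypothetical helper name, and the input shape (objects with event_type and created_at) is assumed from the evidence_events columns.

```javascript
// Hypothetical helper: elapsed time between receipt and reply, in whole seconds.
// Assumes `events` is an array of rows read from evidence_events.
function responseTimeSeconds(events) {
  const find = (type) => events.find((e) => e.event_type === type)
  const created = find('REQUEST_CREATED')
  const sent = find('REPLY_SENT')
  if (!created || !sent) return null // request has not been answered yet

  const ms = new Date(sent.created_at) - new Date(created.created_at)
  return Math.floor(ms / 1000)
}

const seconds = responseTimeSeconds([
  { event_type: 'REQUEST_CREATED', created_at: '2025-05-25T10:00:00.000Z' },
  { event_type: 'REPLY_SENT', created_at: '2025-05-27T14:22:00.000Z' },
])
// seconds is 188520, i.e. 52 hours and 22 minutes
```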

Part 5: The Evidence Architecture

The evidence design is the most important part of DPDP Copilot from a compliance standpoint. Everything else is workflow tooling. The evidence log is what you use when the Data Protection Board comes calling.

Append-Only by Design

The evidence_events table has no update or delete paths in the application. Once an event is written, it stays. There's no "edit evidence" API, no admin panel for removing events, no soft-delete flag.

Audit evidence that can be modified isn't evidence; it's a story you're telling. An append-only log where every event has a database-generated timestamp (not an application-provided one) is as close to tamper-evident as you can get in a PostgreSQL-backed application.

The schema:

CREATE TABLE IF NOT EXISTS evidence_events (
  id          uuid PRIMARY KEY,
  request_id  uuid REFERENCES requests(id),
  event_type  text NOT NULL,
  event_data  jsonb,
  created_at  timestamptz DEFAULT now() NOT NULL,
  org_id      uuid NOT NULL REFERENCES orgs(id)
);

The created_at field uses DEFAULT now(), so the timestamp comes from the database server rather than the application's Date.now(). Managed PostgreSQL instances typically keep their clocks NTP-synchronized, which gives every event a single authoritative time source. Application clocks can drift.

The Four Event Types

REQUEST_CREATED — logged at the moment of database insertion, before any processing. This is the legal timestamp of receipt.

REQUEST_CLASSIFIED — logged immediately after the AI classification succeeds and the type is validated. Contains the classified type in event_data. If classification fails and retries are exhausted, this event is not logged — the absence of this event tells you classification failed.

REPLY_SUGGESTED — logged when the AI draft is written to the request record. Contains the language and model used.

REPLY_SENT — logged when an operator marks the reply as sent. Contains the operator identity and channel. This closes the request lifecycle in the evidence log.

The presence of all four events, in order, within the applicable window means the request was handled correctly from intake to response. An auditor reviewing the CSV export can verify this in seconds.
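The same check can be automated. The sketch below tests only presence and order; it does not check the response window. isCompleteTrail and EXPECTED_ORDER are hypothetical names, and the row shape is assumed from the evidence_events schema.

```javascript
// The four lifecycle events, in the order they should appear.
const EXPECTED_ORDER = [
  'REQUEST_CREATED',
  'REQUEST_CLASSIFIED',
  'REPLY_SUGGESTED',
  'REPLY_SENT',
]

// Hypothetical helper: true only if exactly these four events exist
// and their timestamps put them in the expected order.
function isCompleteTrail(events) {
  const ordered = [...events].sort(
    (a, b) => new Date(a.created_at) - new Date(b.created_at)
  )
  const types = ordered.map((e) => e.event_type)
  return (
    types.length === EXPECTED_ORDER.length &&
    types.every((type, i) => type === EXPECTED_ORDER[i])
  )
}
```

A trail with a missing REQUEST_CLASSIFIED row returns false, which matches the rule that the absence of that event signals a failed classification.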

Organisation Scoping

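Every evidence event carries an org_id. Every query on evidence_events is scoped to the active organisation. The schema lets one database hold evidence for multiple organisations with strictly isolated trails, although the current application serves a single active organisation per deployment, selected through DEFAULT_ORG_ID.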

The org_id in evidence events is written by the application using the resolved organisation context — not passed in by the caller. A data principal submitting a request cannot specify or forge the organisation context; it's resolved server-side from the environment configuration.
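To make the scoping concrete, here is a sketch of how an org-scoped evidence query could be built. It is an assumption-heavy illustration: evidenceQueryFor is a hypothetical helper, and the { text, values } shape is the parameterized query object that the node-postgres client accepts, which may or may not be the client this tool uses. The key point is that orgId comes from server-side context, never from the request body.

```javascript
// Hypothetical helper: build a parameterized, org-scoped query.
// `orgId` must come from server-side configuration (for example
// process.env.DEFAULT_ORG_ID), not from anything the caller sends.
function evidenceQueryFor(requestId, orgId) {
  return {
    text:
      'SELECT event_type, event_data, created_at ' +
      'FROM evidence_events ' +
      'WHERE request_id = $1 AND org_id = $2 ' +
      'ORDER BY created_at ASC',
    values: [requestId, orgId],
  }
}
```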

What the Export Looks Like

CSV export for a complete request:

event_type,created_at
REQUEST_CREATED,2025-05-25T10:00:00.000Z
REQUEST_CLASSIFIED,2025-05-25T10:00:01.342Z
REPLY_SUGGESTED,2025-05-25T10:00:02.891Z
REPLY_SENT,2025-05-27T14:22:00.000Z

Four rows. Auditor reads it: request received Sunday 10:00 AM, responded Tuesday 2:22 PM — 52 hours, well within any reasonable response window.

PDF export includes:

Organisation name and request ID
Request type and creation timestamp
Original message (verbatim)
Suggested reply (the draft that was reviewed and sent)
Full evidence timeline

The PDF is generated server-side using Puppeteer with Chromium. The HTML template is a known XSS risk in the current implementation (user-provided message text is interpolated directly into HTML) — the fix is explicit HTML escaping before interpolation, which is on the roadmap.
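The escaping step itself is small. This is a minimal sketch of the kind of helper that fix would need, not code from the repository; escapeHtml is a hypothetical name.

```javascript
// Hypothetical helper: neutralise the five characters that let text
// break out of an HTML element or attribute. The ampersand is replaced
// first so the entities produced by later steps are not double-escaped.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')
}

// Interpolate only the escaped value into the template.
const userMessage = '<script>alert(1)</script>'
const fragment = `<p class="message">${escapeHtml(userMessage)}</p>`
```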

Part 6: SLA Architecture — The Compliance Clock

SLA management is where many compliance tools fall down. They either store SLA status as a static database field, which goes stale the moment the clock passes the deadline, or they rely on background jobs, which can fail silently and leave the status indicator wrong.

DPDP Copilot takes a third approach: compute SLA status at read time, every time.

How the Internal SLA Clock Works

The sla_due_at timestamp is written once, at request creation: now() + sla_days. The default is 7 days — more conservative than the 30-day legal window, so normal review and approval cycles don't consume the entire legal budget. That's the only mutation to this field — it never changes after the request is created.
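The deadline arithmetic is simple enough to show directly. This sketch does the calculation in JavaScript for illustration; the tool may well compute it in SQL instead. computeSlaDueAt is a hypothetical name.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Hypothetical helper: deadline = creation time + slaDays (default 7).
function computeSlaDueAt(createdAt, slaDays = 7) {
  return new Date(new Date(createdAt).getTime() + slaDays * MS_PER_DAY)
}

const dueAt = computeSlaDueAt('2025-05-25T10:00:00.000Z')
// dueAt.toISOString() is '2025-06-01T10:00:00.000Z'
```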

On every inbox load, every request detail page load, the computeSlaStatus(slaDueAt) function runs in the API layer:

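export function computeSlaStatus(slaDueAt) {
  // Hours until the deadline, measured against the current server time
  const diffHours = (new Date(slaDueAt) - new Date()) / (1000 * 60 * 60)

  if (diffHours < 0)  return 'OVERDUE'
  if (diffHours < 24) return 'DUE_SOON'
  return 'WITHIN_SLA'
}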

No database update. No background worker. No scheduled job. The status shown to the operator is always accurate as of the current server time.

DUE_SOON triggers at 24 hours remaining — a one-day warning before the internal deadline. This gives operators a meaningful heads-up without creating false urgency days in advance.
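To see the thresholds in action, it helps to make the clock injectable. The variant below is a sketch for experimentation and testing, not the function from the codebase: slaStatusAt is a hypothetical name, and the only change is the optional now parameter.

```javascript
// Same thresholds as computeSlaStatus, but the current time can be passed in.
function slaStatusAt(slaDueAt, now = new Date()) {
  const diffHours = (new Date(slaDueAt) - now) / (1000 * 60 * 60)
  if (diffHours < 0) return 'OVERDUE'
  if (diffHours < 24) return 'DUE_SOON'
  return 'WITHIN_SLA'
}

const due = '2025-06-01T10:00:00.000Z'
slaStatusAt(due, new Date('2025-05-30T10:00:00.000Z')) // 'WITHIN_SLA' (48 hours left)
slaStatusAt(due, new Date('2025-05-31T10:00:01.000Z')) // 'DUE_SOON' (just under 24 hours left)
slaStatusAt(due, new Date('2025-06-01T10:00:01.000Z')) // 'OVERDUE' (one second past)
```

The same stored sla_due_at yields three different statuses depending only on when it is read, which is the whole point of computing at read time.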

Setting the Right Internal SLA for Your Organisation

The DPDP Rules 2025 set a 30-day legal maximum for access/correction/erasure responses. How you set your internal target depends on your process:

A startup where one person handles requests end-to-end: 7–10 days is achievable and leaves buffer
A mid-size company where requests go through legal review and data lookup across multiple systems: 14–21 days as the internal target, with the legal 30-day window as the backstop
A large enterprise with formal approval workflows: set the SLA to match your internal SLA policy; use the evidence log to track compliance with your own commitments

The configurable orgs.sla_days field in the database — not yet wired to request creation in the current version, but in the roadmap — will let each organisation set its own target without changing code.

The Status vs. SLA Distinction

Early versions of DPDP Copilot conflated two concepts in a single field: the workflow status of the request (is it open or closed?) and the computed SLA urgency (is it still within its deadline?). The second database migration separates these:

-- migration 002_split_status_from_sla.sql
ALTER TABLE requests ADD COLUMN IF NOT EXISTS status text DEFAULT 'OPEN' NOT NULL;

UPDATE requests
SET status = CASE
  WHEN sla_status = 'CLOSED' THEN 'CLOSED'
  ELSE 'OPEN'
END;

After this migration:

status is the workflow state: OPEN or CLOSED. Closed means a reply was sent and the request is resolved.
The live SLA urgency is always computed by computeSlaStatus at read time.

This matters for reporting. You want to answer: "Of all requests that were open during the last month, what percentage were responded to within the internal SLA?" That question requires separating workflow state from deadline state.
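A simplified version of that report might look like the sketch below. It only considers closed requests, and it assumes each row has been joined with the timestamp of its REPLY_SENT event under a hypothetical reply_sent_at field; neither the field nor slaComplianceRate exists in the tool today.

```javascript
// Hypothetical report: percentage of closed requests answered on or
// before their internal deadline. Returns null when nothing is closed.
function slaComplianceRate(rows) {
  const closed = rows.filter((r) => r.status === 'CLOSED' && r.reply_sent_at)
  if (closed.length === 0) return null

  const onTime = closed.filter(
    (r) => new Date(r.reply_sent_at) <= new Date(r.sla_due_at)
  )
  return (onTime.length / closed.length) * 100
}
```

Because status and deadline are separate, the filter on workflow state and the comparison against sla_due_at stay independent of each other.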

Part 7: Multilingual Compliance at Scale

The multilingual capability deserves more attention than it typically gets in discussions of DPDP tooling.

Why Language Matters for DPDP

India's 2011 census (the most recent with detailed language data) recorded 19,569 raw mother tongue entries from respondents — often cited as "over 19,500 languages spoken in some capacity" — which consolidate into 121 languages with more than 10,000 speakers each. The DPDP Act and Rules 2025 require that grievance mechanisms be accessible, which practically means: if your users write to you in Hindi, a response mechanism that only understands English is not accessible.

DPDP Copilot supports five languages in the current version:

English — the default, always available
Hindi — 528 million speakers (2011 census)
Bengali — 97 million speakers
Tamil — 69 million speakers
Marathi — 83 million speakers

The public request form shows these as radio button options. The selection flows into the API request, through the drafting call, and into the AI prompt.
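Since the selection ends up inside a prompt, it is worth validating it against the supported set before use. The sketch below shows one way to do that; the two-letter codes and the names SUPPORTED_LANGUAGES and resolveLanguage are assumptions for illustration, not the tool's actual form values.

```javascript
// Assumed form values (ISO 639-1 codes) mapped to the five supported languages.
const SUPPORTED_LANGUAGES = {
  en: 'English',
  hi: 'Hindi',
  bn: 'Bengali',
  ta: 'Tamil',
  mr: 'Marathi',
}

// Hypothetical helper: return the language name for a submitted code,
// falling back to English for anything missing or unrecognised.
function resolveLanguage(code) {
  const key = String(code ?? '').trim().toLowerCase()
  const known = Object.prototype.hasOwnProperty.call(SUPPORTED_LANGUAGES, key)
  return known ? SUPPORTED_LANGUAGES[key] : SUPPORTED_LANGUAGES.en
}
```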

What the AI Draft Looks Like in Practice

Here's the same erasure request processed in two languages.

Input (English):

"I never gave consent for you to sell my data. Please delete everything you have about me immediately."

Suggested reply in English:

Dear Data Principal,

Thank you for your request submitted on 25 May 2025. We have received and logged your request for erasure of personal data under Section 12(b) of the Digital Personal Data Protection Act, 2023.

Your request has been assigned reference number [REF-ID]. Our compliance team will review your request, locate the relevant data, and initiate the erasure process in accordance with applicable legal requirements. We will respond with the outcome within the timeframe required under the DPDP Act.

Please retain this acknowledgement for your records.

Suggested reply in Hindi:

प्रिय डेटा प्रिंसिपल,

25 मई 2025 को प्रस्तुत आपके अनुरोध के लिए धन्यवाद। हमने डिजिटल व्यक्तिगत डेटा संरक्षण अधिनियम, 2023 की धारा 12(ख) के अंतर्गत आपके व्यक्तिगत डेटा के विलोपन के अनुरोध को प्राप्त कर दर्ज किया है।

आपके अनुरोध को संदर्भ संख्या [REF-ID] दी गई है। हमारी अनुपालन टीम आपके अनुरोध की समीक्षा करेगी, संबंधित डेटा का पता लगाएगी और लागू कानूनी आवश्यकताओं के अनुसार विलोपन प्रक्रिया शुरू करेगी। हम डीपीडीपी अधिनियम के तहत निर्धारित समय-सीमा के भीतर आपको परिणाम की सूचना देंगे।

कृपया इस पावती को अपने रिकॉर्ड के लिए सुरक्षित रखें।

The structure is identical. The legal references are consistent. The tone is professional but accessible. An operator who reviews the Hindi draft can run it through a translation tool to verify quality before sending — the AI draft is a starting point, not a blindly trusted final output.

Why Separate Prompts Per Language Matter

A naive approach would translate a fixed English template into other languages once, then serve those static translations. This works for simple acknowledgements but fails for personalised responses that need to reference the specific request content.

Because DPDP Copilot drafts replies by passing the original message to the model, the suggested reply can acknowledge specific details the data principal mentioned — not just their request type. If someone writes "I asked you to stop sending me SMS messages three months ago and you're still doing it," a good response acknowledges that history. A static template can't.

The LLM approach generates a response that's contextually appropriate in the data principal's language — which is a qualitatively different outcome from translation.
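To illustrate the difference from a static template, here is a sketch of how a drafting prompt could carry both the language and the original message. The wording and the buildDraftPrompt helper are hypothetical; the actual prompt used by the tool is not shown in this article.

```javascript
// Hypothetical prompt builder: the original message travels with the
// request type and target language, so the draft can reference specifics.
function buildDraftPrompt({ message, requestType, language }) {
  return [
    `Draft a reply to a ${requestType} request made under India's DPDP Act, 2023.`,
    `Write the reply in ${language}.`,
    'Acknowledge the specific details the data principal mentions.',
    'Original message:',
    '"""',
    message,
    '"""',
  ].join('\n')
}

const prompt = buildDraftPrompt({
  message: "I asked you to stop sending me SMS messages three months ago and you're still doing it.",
  requestType: 'Grievance',
  language: 'Hindi',
})
```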

Part 8: The Data Architecture

Schema Design for Compliance

The database schema is designed around compliance requirements first, application convenience second.

-- Three tables, three responsibilities

CREATE TABLE orgs (
  id         uuid PRIMARY KEY,
  name       text NOT NULL,
  created_at timestamptz DEFAULT now(),
  sla_days   integer DEFAULT 7 NOT NULL
);

CREATE TABLE requests (
  id              uuid PRIMARY KEY,
  message         text NOT NULL,
  type            text NOT NULL,
  status          text DEFAULT 'OPEN' NOT NULL,
  suggested_reply text,
  sla_due_at      timestamptz,
  org_id          uuid NOT NULL REFERENCES orgs(id),
  created_at      timestamptz DEFAULT now() NOT NULL
);

CREATE TABLE evidence_events (
  id          uuid PRIMARY KEY,
  request_id  uuid REFERENCES requests(id),
  event_type  text NOT NULL,
  event_data  jsonb,
  created_at  timestamptz DEFAULT now() NOT NULL,
  org_id      uuid NOT NULL REFERENCES orgs(id)
);

The orgs.sla_days field exists and is populated but not yet wired to request creation — the 7-day hardcode is the current implementation. When that field is connected, different organisations can run different internal SLA targets. The schema is ready for that; the application code isn't yet.

The evidence_events.event_data field is jsonb — flexible enough to store different metadata per event type without schema changes. As the tool evolves (new event types, operator attribution, channel tracking), existing rows aren't invalidated.

Index Strategy

Two composite indexes:

CREATE INDEX requests_org_created_idx
  ON requests (org_id, created_at DESC);

CREATE INDEX evidence_events_request_org_created_idx
  ON evidence_events (request_id, org_id, created_at);

The first index supports the inbox query: "give me all requests for this org, sorted by most recent." The second supports the request detail query: "give me all evidence events for this request in this org, in chronological order."

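Both indexes include org_id because every query in the application is org-scoped, but only the first leads with it. requests_org_created_idx starts with org_id, so the org filter narrows the scan before the created_at ordering is applied. evidence_events_request_org_created_idx leads with request_id, the most selective column for the detail query; org_id comes second as the scoping check, and created_at last so events come back in chronological order.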

Part 9: Deployment Architecture

Self-Hosted by Design

DPDP Copilot is self-hosted. That's a deliberate product decision, not an oversight.

DPDP requests often contain sensitive personal data: names, contact details, account information, and sometimes sensitive categories like health or financial details. The organisation processing these requests is the data fiduciary. Routing that data through a third-party SaaS for classification and storage creates its own compliance risk: you're a data fiduciary handing data principal requests to an outside data processor, with all the consent and data transfer implications that entails.

Running the tool in your own infrastructure — whether on-premises or in a cloud account you control — keeps the data principal's message in your trust boundary. The only data that leaves your environment is the message text sent to Anthropic's API for classification and drafting. That's a single, scoped, auditable data transfer that you control.

Docker Compose Quickstart

# Clone and configure
git clone https://github.com/swapnanil/dpdp-copilot
cd dpdp-copilot
cp .env.example .env

# .env minimum required:
# ANTHROPIC_API_KEY=sk-ant-...
# DATABASE_URL=postgresql://user:pass@db:5432/dpdp
# ADMIN_USER=compliance_admin
# ADMIN_PASS=your_secure_password
# DEFAULT_ORG_ID=          # fill after running seed
# ADMIN_SESSION_SECRET=    # openssl rand -hex 32

# Start database
docker compose up db -d

# Run migrations
docker compose run --rm migrate

# Seed initial org (note the UUID it prints)
docker compose run --rm seed

# Start the application
docker compose up app

Open http://localhost:3000 for the operator inbox.
Open http://localhost:3000/grievance for the public form.

Environment Configuration Reference

Variable	Required	Description
`ANTHROPIC_API_KEY`	Yes	Your Anthropic API key for Claude
`DATABASE_URL`	Yes	PostgreSQL connection string
`ADMIN_USER`	Yes	Operator login username
`ADMIN_PASS`	Yes	Operator login password
`DEFAULT_ORG_ID`	Yes	UUID of the active organisation (from seed)
`ADMIN_SESSION_SECRET`	Production	Signs session cookies — `openssl rand -hex 32`
`MODEL`	No	Claude model (default: `claude-sonnet-4-6`)
`MAX_TOKENS`	No	Reply draft length (default: 1024)
`PUPPETEER_EXECUTABLE_PATH`	Docker	Chromium path — set automatically in Docker

Production Considerations

Session signing: Generate ADMIN_SESSION_SECRET with openssl rand -hex 32. In development you can skip this; in production the session cookie must be signed or it's trivially forgeable.
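For readers unfamiliar with what signing buys you, here is a minimal sketch of HMAC-signed cookie values using Node's built-in crypto module. It is an illustration of the general technique, not the tool's session code; signSession and verifySession are hypothetical names.

```javascript
const crypto = require('node:crypto')

// Append an HMAC-SHA256 of the value so tampering can be detected.
function signSession(value, secret) {
  const mac = crypto.createHmac('sha256', secret).update(value).digest('hex')
  return `${value}.${mac}`
}

// Return the original value if the signature matches, otherwise null.
function verifySession(cookie, secret) {
  const idx = cookie.lastIndexOf('.')
  if (idx < 0) return null
  const value = cookie.slice(0, idx)
  const given = Buffer.from(cookie.slice(idx + 1), 'hex')
  const expected = crypto.createHmac('sha256', secret).update(value).digest()
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (given.length !== expected.length) return null
  return crypto.timingSafeEqual(given, expected) ? value : null
}
```

Without the secret, an attacker who edits the value cannot produce a matching signature, so verifySession rejects the cookie.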

Database: The Docker Compose setup runs Postgres in a container. For production, use a managed database (AWS RDS, Google Cloud SQL, Supabase) with automated backups. The evidence table is your legal record — you want it on infrastructure with point-in-time recovery.

HTTPS: Run behind a reverse proxy (nginx, Caddy) that terminates TLS. Session cookies should have Secure and SameSite=Strict — these aren't set in the current implementation but are straightforward to add in a production nginx config.

Rate limiting: The public /grievance form has no rate limiting in the current version. A reverse proxy rate limit on the public intake endpoint prevents abuse without touching the application code.

Part 10: Known Limitations and What's Next

Honesty about limitations is part of useful tooling documentation. Here's what DPDP Copilot currently doesn't do and what the roadmap looks like.

Current Limitations

No outbound delivery: The "send reply" workflow doesn't actually send anything. It marks the reply as sent in the evidence log and sets the request to CLOSED. The operator is responsible for actually sending the drafted reply via their existing channel (email, portal, etc.). This is a limitation of the MVP, not the design — real outbound email delivery is the obvious next step.

Single-admin authentication: The current auth model is a single username/password pair from environment variables. There's no user table, no role model, no per-operator audit trail. Multiple operators can't be tracked individually. This is fine for a team of one; it's a problem for a compliance team of five.

Static org configuration: The active organisation is selected via DEFAULT_ORG_ID in the environment. There's no UI for switching organisations or a multi-tenant router. The database schema supports multiple orgs; the application routing doesn't.

No structured contact data: Contact information is embedded in the free-form message. There's no contact_email or contact_phone field. This means there's no reliable way to programmatically address the data principal in the reply or route the response to them.

PDF XSS risk: The PDF template interpolates user-provided text directly into HTML without escaping. A malicious actor could potentially inject HTML into the generated PDF. This is a known issue and is the highest-priority security fix.

No notifications: Operators have no way to be alerted when a new request comes in or when a request is approaching its internal SLA deadline. Checking the inbox manually is the only current mechanism.

The Roadmap

Outbound reply delivery: Send the drafted reply via email (SendGrid, AWS SES, or SMTP) directly from the tool. Logs the delivery event to the evidence table. The operator reviews the draft, edits if needed, and clicks Send — not "Copy this and email it manually."

SLA alerts: Email or Slack notification when a request enters DUE_SOON status. Optional daily digest of all open requests with their current SLA status.

Multi-operator support: A users table, per-operator login, and role assignment (reviewer vs. approver). Evidence events attributed to specific operators. Audit trail for who touched what.

Structured contact fields: Separate contact_email from the message body at intake. Validate format. Apply retention controls — contact data should be deletable when the request is closed without deleting the evidence trail.

Configurable SLA: Wire orgs.sla_days to request creation. Different organisations have different internal SLA commitments — the schema already supports this.

Approval workflow: A draft reply requires supervisor approval before it can be sent. The evidence log records who approved and when. This is an operational pattern for organisations where a junior compliance analyst drafts but a senior officer approves.

Analytics dashboard: How many requests per week? What types? Average response time? What percentage are within the internal SLA? This is a reporting requirement for any compliance programme worth its name.

Part 11: How DPDP Copilot Fits Into a Broader Compliance Programme

DPDP Copilot handles the data principal rights workflow. That's one piece of a complete DPDP compliance programme. Here's how it fits:

What DPDP Copilot Covers

Receiving data principal requests (Access, Rectification, Deletion, Grievance)
Classifying them correctly and consistently
Drafting multilingual responses
Tracking internal SLA deadlines
Generating the audit evidence trail
Exporting evidence for regulatory review

What It Doesn't Cover

Data discovery: Finding where a person's data actually lives across your systems. DPDP Copilot receives and tracks the request but doesn't automate the underlying data lookup. That's a data catalogue problem.
Consent management: Recording and tracking what data was collected under what consent. That's a separate consent registry.
Privacy notices: Generating or maintaining the notice required under Section 5 of the Act. That's a legal document workflow.
Data breach notification: Section 8(6) requires notifying the Data Protection Board and affected persons of personal data breaches. That's a separate incident response workflow.
Cross-border transfer compliance: The Act restricts transfers of personal data to certain countries. That's a data governance and infrastructure question.

A full DPDP compliance programme needs all of these. DPDP Copilot handles the rights management piece — the part that creates the most immediate operational urgency because it has a hard deadline on individual transactions and a direct escalation path to the Data Protection Board.

The Risk Reduction Calculation

Before DPDP Copilot:

Time to acknowledge a request: hours to days (depends on inbox monitoring)
Time to classify a request: manual, inconsistent, language-dependent
Time to draft a response: hours (finding a template, adapting it, translating it)
Deadline tracking: none — someone has to remember
Evidence: none — email threads that can be deleted

After DPDP Copilot:

Time to acknowledge: seconds (the evidence log records receipt immediately on submission)
Time to classify: 1–2 seconds (LLM call)
Time to draft a response: 2–3 seconds (LLM call)
Deadline tracking: automatic, live-computed, visible in the operator inbox
Evidence: append-only database log, exportable as PDF or CSV in one click

The reduction in time-to-first-action is the most important improvement. The legal clock starts when the request is submitted — not when someone reads it. DPDP Copilot ensures that classification and drafting are done before any human even opens the inbox. The operator's job is review and send, not receive-classify-draft-send.

Part 12: Who Should Use This

Compliance and legal teams at Indian companies processing personal data of Indian residents under the DPDP Act. If you're a data fiduciary — collecting or processing personal data — you have obligations under this Act. If you don't have a structured process for handling data principal requests, you need one.

Engineering teams building privacy infrastructure who need a reference implementation of DPDP request handling. The codebase is open-source. The data model, the API structure, the evidence logging pattern, the SLA computation logic — all of it is readable, runnable, and adaptable.

Startups at the early compliance stage who don't yet have a dedicated compliance team. The tool runs on a single machine. Configuration is a .env file. The public form can be linked from your privacy policy. You don't need a compliance department to run it — you need someone who checks the inbox.

Organisations handling multilingual Indian user bases where an English-only inbox isn't accessible to all the people it's supposed to serve. If your users write to you in Hindi and Tamil, they deserve responses in Hindi and Tamil — and the time cost of manual translation has historically made that impractical. It isn't anymore.

A Complete Example Walkthrough

Let me walk through a real scenario end-to-end, using the tool as a data principal and then as an operator.

As the Data Principal

You purchased something from a company. You're now getting SMS marketing messages you didn't opt in to. You want to file an erasure request and a grievance.

You go to https://yourcompany.com/grievance.

You write:

"I never gave you permission to send me SMS promotions. I want you to delete my phone number and all data you hold about me. I also want to formally complain about this."

You select Hindi as your preferred language and submit.

You receive an acknowledgement: "Your request has been received and logged. Reference: [UUID]. Our compliance team will be in touch with the outcome."

As the Compliance Operator

You open the operator inbox the next morning. You see a new request, classified as Grievance (the model detected the formal complaint language alongside the deletion request), with WITHIN_SLA status.

You click into the request. You read the original message. The suggested reply in Hindi is already drafted. You read it — it acknowledges the complaint, confirms the erasure request has been noted, and explains next steps in Hindi.

You make a small edit to reference your company's specific erasure process. You click "Send Reply" — which in the current version means you copy the draft, send it via your email system, and then click "Mark as Sent" in the tool.

The evidence timeline now shows:

REQUEST_CREATED     2025-05-25 10:00:00
REQUEST_CLASSIFIED  2025-05-25 10:00:01  (Grievance)
REPLY_SUGGESTED     2025-05-25 10:00:02
REPLY_SENT          2025-05-26 09:15:00

Total response time: 23 hours. Well within any reasonable SLA window. The CSV export documents this. If the data principal escalates to the Data Protection Board, you have a timestamped, exportable record of the complete interaction.

Quick Reference

Public form: GET /grievance

API Endpoints:

Method	Path	Auth	Description
`POST`	`/api/public/requests`	None	Submit a data principal request
`POST`	`/api/login`	None	Operator login
`POST`	`/api/logout`	None	Operator logout
`GET`	`/api/requests`	Operator	List all requests with live SLA status
`GET`	`/api/requests/:id`	Operator	Request detail + evidence timeline
`POST`	`/api/requests/:id/send-reply`	Operator	Mark reply sent, close request
`GET`	`/api/requests/:id/export/pdf`	Operator	Download PDF evidence report
`GET`	`/api/requests/:id/export/csv`	Operator	Download CSV evidence export

Request lifecycle:

Public form submission
  → Internal SLA clock starts (7 days in the current version; per-org configuration is on the roadmap)
  → REQUEST_CREATED logged
  → AI classification (Grievance / Access / Rectification / Deletion)
  → REQUEST_CLASSIFIED logged
  → AI reply drafted in chosen language
  → REPLY_SUGGESTED logged
  → Operator reviews in inbox
  → Operator marks reply sent
  → REPLY_SENT logged
  → Request status: CLOSED

Legal response windows under DPDP Rules 2025:

Request type	Section	Legal window
Access	Section 11	30 days
Correction / Erasure	Section 12	30 days
Grievance	Section 13	90 days

Final Thought

The DPDP Act's data principal rights framework isn't complicated. Four rights, two response windows, one evidence requirement. The complexity is operational — handling a high-variance stream of natural language requests, in multiple languages, against a hard time constraint, with an audit trail that has to survive regulatory scrutiny.

Manual processes fail under those conditions not because of negligence but because the requirements are genuinely hard to satisfy with shared inboxes and email chains.

DPDP Copilot automates the classification and drafting — the two tasks that are the most time-consuming and the most error-prone. It makes the internal SLA clock visible before it expires. It generates the audit evidence as a byproduct of normal operation, not as a separate reporting task.

The tool is open-source, self-hosted, and runs on a single Docker Compose command. If you're an Indian company with DPDP obligations and no structured data rights workflow, this is where to start.

→ View the full tool page, docs, live demo, and GitHub repo

Built by Swapnanil Saha — swapnanilsaha.com

How to Stop Evaluating LLM Outputs by Gut Feel

Swapnanil Saha — Thu, 21 May 2026 05:25:31 +0000

The standard workflow for evaluating LLM output quality goes something like this: someone reads Response A, reads Response B, and says "I think A is better." Everyone nods. The prompt ships.

This is a problem for three reasons:

It doesn't scale. You can't manually review 500 eval pairs after every prompt change.
It's inconsistent. The same person evaluating the same pair on different days produces different results.
It doesn't tell you why. "Response A is better" doesn't tell you what to fix when Response B becomes the baseline.

I built LLM Eval Suite to replace gut feel with structured, evidence-backed scoring — for any task type, with CI integration.

→ Full tool page

The Core Insight: Evidence, Not Opinion

Every score in LLM Eval Suite is accompanied by a verbatim quote from the response being evaluated. Not "this response has poor faithfulness" — but:

Faithfulness: 1.0/10
Quote: "30-day return policy means no questions asked"
Reasoning: "Source document specifies 14 days. This is a clear hallucination, not an interpretation."

This changes what you can do with the output. You can show it to a stakeholder. You can track it over time. You can build a regression test from it. You can tell the model what specifically went wrong.

Six Evaluation Capabilities

Multi-Dimensional Scoring

Ten task presets — QA, summarisation, RAG, code generation, creative writing, classification, translation, and more. Each preset activates the dimensions that matter for that task:

Task Type	Key Dimensions
`qa`	Faithfulness, Completeness, Conciseness, Relevance
`summarisation`	Coverage, Compression, Accuracy, Readability
`rag`	Faithfulness, Answer Relevancy, Context Precision, Context Recall
`code`	Correctness, Efficiency, Readability, Security

Every dimension score comes with verbatim evidence from the response text.

docker-compose run cli eval \
  --file examples/eval_qa.json \
  --mode compare \
  --format markdown

Regression Testing

Save any eval report as a named baseline:

docker-compose run cli regression save results.json --id prod-baseline

Run future evals against it:

docker-compose run cli regression run results.json --id prod-baseline --format markdown

Per-dimension deltas are compared against configurable thresholds. Exit code 1 when scores drop below your floor. This is the feature that makes the tool useful in CI.
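Conceptually, the check works like the sketch below. This is an illustration of the idea, not the tool's source: findRegressions, exitCodeFor, the score objects, and the default allowed drop of 0.5 are all assumptions.

```javascript
// Hypothetical check: flag every dimension whose score dropped by more
// than its allowed amount relative to the baseline.
function findRegressions(baseline, current, thresholds = {}, defaultDrop = 0.5) {
  const regressions = []
  for (const [dimension, baseScore] of Object.entries(baseline)) {
    const delta = current[dimension] - baseScore
    const allowedDrop = thresholds[dimension] ?? defaultDrop
    if (delta < -allowedDrop) regressions.push({ dimension, delta })
  }
  return regressions
}

// A CI wrapper would exit non-zero when anything regressed.
function exitCodeFor(regressions) {
  return regressions.length > 0 ? 1 : 0
}

const regressions = findRegressions(
  { faithfulness: 9, conciseness: 8 },
  { faithfulness: 7.5, conciseness: 7.8 },
  { faithfulness: 1.0 }
)
// Only faithfulness is flagged: it dropped 1.5 against an allowed 1.0.
```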

GitHub Actions Integration

- name: Run LLM eval
  run: |
    docker-compose run cli eval \
      --file evals/suite.json \
      --mode rank \
      --format junit \
      --output results.xml

- uses: mikepenz/action-junit-report@v3
  with:
    report_paths: results.xml

- name: Regression check
  run: |
    docker-compose run cli regression run \
      results.json --id prod-baseline
    # exits 1 if any dimension drops beyond threshold

This gates model upgrades, prompt changes, and fine-tune releases automatically. The JUnit XML output integrates with any CI system that understands test reports.

Hallucination Detection

Claim-level analysis against a source document. Each claim in the response is classified as supported or unsupported — binary, not "mostly faithful."

docker-compose run cli hallucination \
  --response output.txt \
  --source source.txt \
  --format markdown

Risk levels: none / low / moderate / high / critical, with a safe_to_use boolean for downstream gating. This is what you run before using LLM output in a production pipeline where accuracy matters.

Example output:

hallucination_risk: high
safe_to_use: false

Claim: "30-day return policy"
  status: unsupported
  evidence: "Source specifies 14 days"
  severity: critical

Claim: "no questions asked"
  status: unsupported
  evidence: "Source makes no mention of return conditions"
  severity: high

Prompt Sensitivity Analysis

Test 2–5 prompt variants against a fixed response. Per-dimension variance tells you which dimensions are fragile across phrasings and which are stable.

docker-compose run cli sensitivity \
  --file examples/prompt_variants.json \
  --format markdown

Know which prompt phrasings shift your scores before you deploy. High-variance dimensions across prompts signal that your evaluation isn't measuring the response — it's measuring the prompt wording.

Panel Evaluation

Run N independent judge passes on the same evaluation. Mean and variance per dimension expose where judges agree and where they disagree.

docker-compose run cli panel \
  --file examples/eval_qa.json \
  --judges 5 \
  --format markdown

High-variance dimensions are flagged for human review automatically. Panel mode is the right choice for subjective tasks such as creative writing, where a single judge's opinion is too weak a signal.
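The aggregation behind a panel run can be sketched in a few lines. This is an illustration under assumptions: the function name `panel_summary`, the variance threshold of 1.0, the dimension names, and the scores are all made up, and the tool may use a different statistic or cutoff.

```python
from statistics import mean, pvariance


def panel_summary(judge_scores, variance_threshold=1.0):
    """Aggregate N judge passes into mean and variance per dimension.

    `judge_scores` is a list of {dimension: score} dicts, one per judge.
    The threshold for flagging is a hypothetical default.
    """
    summary = {}
    for dimension in judge_scores[0]:
        values = [scores[dimension] for scores in judge_scores]
        variance = pvariance(values)
        summary[dimension] = {
            "mean": mean(values),
            "variance": variance,
            # Judges disagree too much: send this dimension to a human.
            "needs_review": variance > variance_threshold,
        }
    return summary


judge_scores = [
    {"faithfulness": 9.0, "tone": 4.0},
    {"faithfulness": 9.0, "tone": 8.0},
    {"faithfulness": 8.0, "tone": 6.0},
    {"faithfulness": 9.0, "tone": 3.0},
    {"faithfulness": 9.0, "tone": 9.0},
]

summary = panel_summary(judge_scores)
for dimension, stats in summary.items():
    print(dimension, stats)
```

With these invented scores the judges agree on faithfulness and split widely on tone, so only tone is flagged.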

RAGAS-Compatible RAG Preset

The rag task type treats the four RAGAS metrics (faithfulness, answer relevancy, context precision, context recall) as first-class evaluation dimensions with equal weighting. The output follows RAGAS reporting conventions, so you can plug it into an existing RAGAS workflow or use it as a drop-in alternative.
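Equal weighting means each of the four dimensions contributes a quarter of the overall score. A minimal sketch, assuming scores on a 0 to 1 scale and a hypothetical helper named `rag_overall`:

```python
RAG_DIMENSIONS = (
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
)


def rag_overall(scores):
    """Equal-weight average of the four RAGAS dimensions."""
    missing = [d for d in RAG_DIMENSIONS if d not in scores]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    weight = 1 / len(RAG_DIMENSIONS)
    return sum(weight * scores[d] for d in RAG_DIMENSIONS)


overall = rag_overall(
    {
        "faithfulness": 0.9,
        "answer_relevancy": 0.8,
        "context_precision": 0.7,
        "context_recall": 0.6,
    }
)
print(round(overall, 2))
```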

Example: Two Responses In, Clear Winner Out

Input:

{
  "task_type": "qa",
  "eval_mode": "compare",
  "source": "Refunds are accepted within 14 days if the item is unused.",
  "responses": [
    {
      "label": "Response A",
      "text": "You can get a refund within 14 days if the item hasn't been used."
    },
    {
      "label": "Response B",
      "text": "Our 30-day return policy means no questions asked."
    }
  ]
}

Output:

winner: Response A
margin: clear

Response B — Faithfulness
  score: 1.0/10
  quote: "30-day return policy, no questions asked"
  reasoning: "Source specifies 14 days. 'No questions asked' is not in the source.
              Two distinct hallucinations in one sentence."

Response A — Faithfulness
  score: 9.5/10
  quote: "within 14 days if the item hasn't been used"
  reasoning: "Accurately paraphrases the source with no additions."

Why This Matters in Production

LLM evaluation is usually treated as a one-time concern — you evaluate before you ship. But models change, prompts drift, data distributions shift, and retrieval quality fluctuates. A system that was 90% faithful in January may be 75% faithful in April because the upstream data changed.

The regression testing and CI integration in LLM Eval Suite are designed for this reality. You run evals continuously, not just at release time. The baseline is the floor — if you drop below it, the pipeline stops.

→ View the full tool page, docs, and GitHub repo

Stop Getting 'It Depends' Answers About RAG Architecture

Swapnanil Saha — Thu, 21 May 2026 05:09:30 +0000

Ask five AI engineers which vector database to use for your RAG system. You'll get five different answers, and they'll all start with "it depends."

It depends on your data volume. It depends on your query patterns. It depends on whether you need GDPR compliance. It depends on your team's infra maturity. It depends on your budget. It depends on whether you're doing hybrid search.

The "it depends" answer is technically correct and operationally useless. It turns an architecture decision into an unbounded research project.

I built RAG Readiness to make one specific recommendation per component — and explain why.

→ Full tool page

The Design Principle: Opinions, Not Options

Most RAG tooling and documentation presents you with a comparison table. Pinecone vs. Weaviate vs. Qdrant vs. Chroma. BM25 vs. dense vs. hybrid. ada-002 vs. text-embedding-3-large.

Comparison tables are useful if you already know which dimensions matter for your use case. They're paralyzing if you don't.

RAG Readiness is opinionated by design. You describe your use case, your data, your constraints. The tool returns one choice per component — with full reasoning.

If GDPR applies, managed cloud vector databases are eliminated from consideration before the LLM is even called. That's a rule, not an LLM judgment. The recommendation you receive is already constraint-filtered.
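A rule of this kind is just a deterministic filter that runs before any model is involved. The sketch below assumes a hypothetical `VectorDB` record and a `gdpr` flag in the constraints; the candidate list and its `self_hostable` flags are illustrative, not the tool's actual catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorDB:
    name: str
    self_hostable: bool


# Illustrative catalogue; the flags are assumptions for this example.
CANDIDATES = [
    VectorDB("Pinecone", self_hostable=False),
    VectorDB("Qdrant", self_hostable=True),
    VectorDB("Weaviate", self_hostable=True),
    VectorDB("Chroma", self_hostable=True),
]


def filter_candidates(candidates, constraints):
    """Drop options that violate hard constraints before the LLM sees them."""
    if constraints.get("gdpr"):
        # Hard rule, not a judgment call: managed-only options are removed.
        return [db for db in candidates if db.self_hostable]
    return list(candidates)


allowed = filter_candidates(CANDIDATES, {"gdpr": True})
print([db.name for db in allowed])
```

Only the surviving candidates would then be passed into the prompt, so the model cannot recommend an option the rule has already excluded.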

Six Modes, One Tool

Architecture Recommendation

The core mode. Answer a structured set of questions about your use case — document types, query patterns, scale, compliance requirements, team capabilities. Get back:

Vector database: one specific choice with rationale
Embedding model: one specific choice
Chunking strategy: one specific approach with parameters
Retrieval method: dense / BM25 / hybrid — one answer
Reranker: whether you need one and which

python main.py audit --interactive
# or from file:
python main.py audit --file examples/usecase_legal_contracts.json --with-cost

Architecture Diagnosis

You already have a RAG system. It's not working. This mode takes your existing architecture and the problems you're seeing, and returns a root-cause analysis per component with severity levels and one specific fix.

Not "improve your chunking" — "switch from fixed 512-token chunks to parent-child hierarchical chunking with 512-token child nodes. Your documents have multi-clause structure that fixed chunks split mid-sentence."

python main.py diagnose --file examples/diagnosis_pinecone_fixed.json

Example output:

overall_severity: critical

chunking_strategy — critical
  "Fixed 512-token chunks split mid-clause in long legal documents"
  Fix: Parent-child hierarchical chunking, 512-token child nodes

retrieval_method — high
  "Dense-only misses exact terms like dollar amounts and clause references"
  Fix: Hybrid BM25 + dense with RRF fusion

quick_fix: Enable 10% token overlap today. Takes 20 minutes, reduces
           the worst failures while you implement the full fix.
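The "RRF fusion" in the retrieval fix refers to reciprocal rank fusion, which merges ranked lists by summing 1 / (k + rank) for each document across the lists. A minimal sketch follows; the document IDs are invented, and k = 60 is the commonly used default rather than a value taken from the tool.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword retriever and a dense retriever.
bm25_ranking = ["clause_7", "clause_2", "clause_9"]
dense_ranking = ["clause_2", "clause_4", "clause_7"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, which is why the exact-term matches from BM25 are not lost when the dense retriever ranks them lower.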

Multi-Use-Case Session

Run up to 5 parallel audits in a single request — useful when you're scoping a RAG platform that needs to serve multiple internal teams.

The output includes cross-cutting insights: which components can be shared across use cases, where requirements conflict (the legal team needs GDPR-compliant storage; the sales team wants managed cloud), and which use case to build first for the highest return on the shared infrastructure investment.

Implementation Bundle

Once you have an architecture you trust, generate a complete implementation starter kit:

python main.py bundle <session-id>

Output: a requirements.txt, docker-compose.yml, .env.example, and migration guide tailored to the recommended architecture. If you have an existing stack, you get ordered migration steps with rollback notes.

Cost Estimation

Rule-based monthly cost breakdown per component — no LLM call. Lookup tables for vector DB pricing tiers, embedding API costs, reranker inference, and LLM costs at your estimated query volume.

python main.py cost <session-id>

Returns a line-item breakdown, optimization tips (e.g., "switching to a self-hosted embedding model saves ~$800/month at this query volume"), and a hosting model classification (managed vs. self-hosted trade-off at your scale).
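A lookup-table estimator of this kind can be sketched as follows. Every number in the tables is a placeholder invented for the example, not real vendor pricing, and the table layout and the function name `estimate_monthly_cost` are assumptions.

```python
# Placeholder USD rates, invented for illustration; not real vendor pricing.
VECTOR_DB_MONTHLY = {"managed": 70.0, "self_hosted": 25.0}
EMBEDDING_PER_1K_QUERIES = {"api": 0.02, "self_hosted": 0.0}


def estimate_monthly_cost(vector_db_hosting, embedding_hosting, queries_per_month):
    """Return a line-item monthly estimate from static lookup tables."""
    line_items = {
        "vector_db": VECTOR_DB_MONTHLY[vector_db_hosting],
        "embeddings": EMBEDDING_PER_1K_QUERIES[embedding_hosting]
        * queries_per_month
        / 1000,
    }
    line_items["total"] = sum(line_items.values())
    return line_items


estimate = estimate_monthly_cost("managed", "api", queries_per_month=1_000_000)
for item, cost in estimate.items():
    print(f"{item}: ${cost:.2f}")
```

Because nothing here calls a model, the same inputs always produce the same breakdown, which is the point of keeping cost estimation rule-based.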

RAGAS Eval Dataset Generation

Generate evaluation questions grounded in your actual use case and query patterns — not generic retrieval questions.

python main.py eval-dataset <session-id> --num-questions 20

Output includes easy/medium/hard distribution, RAGAS metric mapping (which questions test faithfulness vs. answer relevancy vs. context precision), an annotation guide, and a time estimate for human review.

Session Persistence and Refinement

Every audit persists to SQLite. You can refine against new constraints:

python main.py refine <session-id> --feedback "Qdrant was too heavy for our infra team"

The tool re-runs with the feedback as an additional constraint. Refinement history is tracked — you can see how the recommendation evolved across iterations.
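One way to track refinement history in SQLite is a table with one row per iteration. The schema, the function names, and the recommendations stored below are hypothetical; the sketch only shows the shape of the idea using the standard `sqlite3` module.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (assumed) iterations table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS iterations (
               session_id TEXT NOT NULL,
               iteration INTEGER NOT NULL,
               feedback TEXT,
               recommendation TEXT NOT NULL,
               PRIMARY KEY (session_id, iteration)
           )"""
    )
    return conn


def record_iteration(conn, session_id, recommendation, feedback=None):
    """Append the next iteration for a session and return its number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(iteration), 0) FROM iterations WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    iteration = row[0] + 1
    conn.execute(
        "INSERT INTO iterations VALUES (?, ?, ?, ?)",
        (session_id, iteration, feedback, json.dumps(recommendation)),
    )
    conn.commit()
    return iteration


def history(conn, session_id):
    """Return [(iteration, feedback, recommendation)] in order."""
    rows = conn.execute(
        "SELECT iteration, feedback, recommendation FROM iterations "
        "WHERE session_id = ? ORDER BY iteration",
        (session_id,),
    ).fetchall()
    return [(i, fb, json.loads(rec)) for i, fb, rec in rows]


conn = open_store()
# Recommendations below are invented values for the example.
record_iteration(conn, "s1", {"vector_db": "Qdrant"})
record_iteration(
    conn,
    "s1",
    {"vector_db": "Chroma"},
    feedback="Qdrant was too heavy for our infra team",
)
for entry in history(conn, "s1"):
    print(entry)
```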

A Complete Quickstart

git clone https://github.com/swapnanil/rag-readiness
cd rag-readiness
cp .env.example .env  # add your ANTHROPIC_API_KEY
docker-compose up -d api  # -d runs the API detached so the commands below can use the same shell

# New architecture audit (interactive)
python main.py audit --interactive

# Diagnose a broken stack
python main.py diagnose --interactive

# Multi-use-case session
python main.py multi-audit examples/multi_usecase_lexvault.json

# List sessions and refine
python main.py sessions
python main.py refine <session-id> --feedback "need self-hosted only"

# Cost breakdown and eval dataset
python main.py cost <session-id>
python main.py eval-dataset <session-id> --num-questions 20

The Pre-Scoring Layer

Before any LLM call, a rule-based pre-scorer computes a complexity score (1–10) from the use case inputs. This has two effects:

It calibrates the LLM prompt — a complexity-1 use case gets a simpler, more direct recommendation; a complexity-9 use case gets a recommendation with more explicit trade-off reasoning.
It runs conflict detection — if your inputs contain contradictory constraints (e.g., "GDPR compliant" + "use Pinecone"), the conflict is flagged before the LLM is called, not discovered in the output.
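Both effects can be sketched as plain functions over the use case inputs. The input field names, the scoring weights, and the `MANAGED_ONLY` set are assumptions made for this example; the tool's actual rules are not shown in this post.

```python
# Hypothetical set of vector databases offered only as a managed service.
MANAGED_ONLY = {"pinecone"}


def complexity_score(use_case):
    """Rule-based complexity from 1 to 10; the weights are illustrative."""
    score = 1
    score += min(3, len(use_case.get("document_types", [])))
    if use_case.get("documents", 0) > 1_000_000:
        score += 2
    if use_case.get("compliance"):
        score += 2
    if use_case.get("hybrid_search"):
        score += 1
    return max(1, min(10, score))


def detect_conflicts(use_case):
    """Flag contradictory constraints before any LLM call is made."""
    conflicts = []
    compliance = {c.lower() for c in use_case.get("compliance", [])}
    preferred = use_case.get("preferred_vector_db", "").lower()
    if "gdpr" in compliance and preferred in MANAGED_ONLY:
        conflicts.append(
            f"GDPR compliance conflicts with managed-only vector DB '{preferred}'"
        )
    return conflicts


use_case = {
    "document_types": ["pdf", "docx"],
    "documents": 2_000_000,
    "compliance": ["GDPR"],
    "hybrid_search": True,
    "preferred_vector_db": "Pinecone",
}

score = complexity_score(use_case)
conflicts = detect_conflicts(use_case)
print(score, conflicts)
```

In this sketch the score would steer how much trade-off reasoning the prompt asks for, and a non-empty conflict list would be returned to the user without spending an LLM call.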

Who This Is For

AI engineers starting a new RAG project who want a structured starting point rather than a blank page
Engineering leads who need to scope a RAG system for a business use case and justify the architecture choices to non-technical stakeholders
Teams with an existing RAG system that isn't performing as expected and need a systematic diagnosis, not a hunch

The tool is open-source, runs locally, and persists everything to SQLite. Your use case details don't leave your environment beyond the single LLM API call per audit.

→ View the full tool page, docs, and GitHub repo

Building Distributed Systems, Backend Infrastructure & AI Platforms — My Engineering Journey

Swapnanil Saha — Tue, 19 May 2026 09:34:58 +0000

Hey everyone 👋

I’m Swapnanil Saha, a backend and distributed systems engineer from Mumbai, India with 9+ years of experience building high-performance infrastructure systems, backend platforms, optimization pipelines, and AI-driven architectures.

🌐 Website: swapnanilsaha.com

💻 GitHub: github.com/swapnanil

🔗 LinkedIn: linkedin.com/in/swapnanil