<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Swapnanil Saha</title>
    <description>The latest articles on DEV Community by Swapnanil Saha (@swapnanilsaha).</description>
    <link>https://dev.to/swapnanilsaha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3939906%2F9f37b94e-be6e-42b9-a63e-34b65dca3522.jpeg</url>
      <title>DEV Community: Swapnanil Saha</title>
      <link>https://dev.to/swapnanilsaha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/swapnanilsaha"/>
    <language>en</language>
    <item>
      <title>Building Vectr, Part 1: Why grep Fails When You Don't Know the Keywords</title>
      <dc:creator>Swapnanil Saha</dc:creator>
      <pubDate>Tue, 09 Jun 2026 14:10:06 +0000</pubDate>
      <link>https://dev.to/swapnanilsaha/building-vectr-part-1-why-grep-fails-when-you-dont-know-the-keywords-17e7</link>
      <guid>https://dev.to/swapnanilsaha/building-vectr-part-1-why-grep-fails-when-you-dont-know-the-keywords-17e7</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 1 of the Building Vectr series (1 of 3).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You get dropped into an unfamiliar codebase. Not a toy project — real production code, 8,000 files, three years of accumulated complexity and clever abstractions. Your job is to fix a bug in the request validation pipeline. What does an AI code editor do next?&lt;/p&gt;

&lt;p&gt;This post is about a problem I kept running into, a tax I kept paying, and the indexing system I built to eliminate it. It covers the technical decisions behind &lt;a href="https://swapnanilsaha.com/tools/vectr/" rel="noopener noreferrer"&gt;Vectr&lt;/a&gt;'s search layer: why naive chunking produces bad embeddings, how tree-sitter solves the code-parsing problem, what BM25 does that vector search can't, and why you need a symbol graph for questions that text search cannot answer at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1 — The Problem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Re-discovery Tax
&lt;/h3&gt;

&lt;p&gt;If you're a human engineer navigating an unfamiliar codebase, here's what you probably do: you ask someone who knows it, or you grep for the error message, or you open the entry point and follow imports until you find the thing. Your brain does semantic compression the whole way — building a model of the system, discarding noise, following intuitions about where complexity tends to live. By the time you've read 20 files, you have a rough map that persists across days and sessions.&lt;/p&gt;

&lt;p&gt;An AI code editor has the same tools — read files, run shell commands, grep — but completely different economics. Every &lt;code&gt;Read&lt;/code&gt; call costs tokens. Every &lt;code&gt;Bash&lt;/code&gt; call for grep costs a turn. Unlike a human who can skim-read at 1,000 words per minute and discard irrelevant content almost for free, an AI editor pays full price for every character it reads: it sits in the context window whether or not it was useful. Read the wrong 500-line file and you've burned context that could have held the answer.&lt;/p&gt;

&lt;p&gt;The result, on unfamiliar codebases, is what I started calling the &lt;strong&gt;re-discovery tax&lt;/strong&gt;: a cluster of navigation calls at the start of every session, before any actual implementation begins, spent on figuring out where things are. And because AI editors have no persistent memory between sessions, they pay this tax again and again — every session, on the same codebase.&lt;/p&gt;

&lt;p&gt;In benchmarks I ran against real open-source codebases (more detail in Part 3), the re-discovery tax on CPython internals ranged from &lt;strong&gt;6 to 23 tool calls per task&lt;/strong&gt; before the first file write. Some sessions spent more turns navigating than implementing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key observation:&lt;/strong&gt; The re-discovery tax is paid every session, not once. A human engineer's mental map of a codebase accumulates and compounds. An AI editor's map is fully rebuilt from scratch at the start of each session. The economic gap widens as the codebase grows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Why grep Fails at the Boundary of Your Knowledge
&lt;/h3&gt;

&lt;p&gt;Before explaining what I built, I want to be precise about where grep breaks down — because "just use grep" is the natural reaction, and it's not obviously wrong until you try to use it systematically on unfamiliar code.&lt;/p&gt;

&lt;p&gt;grep is a brilliant tool for confirming hypotheses you already have. If you know what you're looking for, it's nearly perfect. The problem is the case that isn't really an edge case: &lt;em&gt;you don't know what you're looking for.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Say you're trying to understand how a Django application validates incoming JSON payloads before they hit the ORM layer. You might grep for &lt;code&gt;validate&lt;/code&gt;. You'll get 200 results across 40 files — field validators, form validators, configuration validators, test fixtures. None of them are obviously the thing you want. You grep for &lt;code&gt;json.loads&lt;/code&gt;. You get 30 results. You grep for &lt;code&gt;request.data&lt;/code&gt;. That gets you closer, maybe. But you spent four greps and 15 minutes before you found the right file.&lt;/p&gt;

&lt;p&gt;The deeper problem: grep requires you to already have a mental model of the codebase's naming conventions. An AI editor running on an unfamiliar codebase doesn't know whether payload validation is called &lt;code&gt;validate_payload&lt;/code&gt;, &lt;code&gt;check_request&lt;/code&gt;, &lt;code&gt;parse_input&lt;/code&gt;, or &lt;code&gt;_pre_process&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; Think of keyword search as asking for directions by street name in a city you've never visited. "Where is Maple Street?" gets a precise answer. But "where is the street with the good coffee shop near the park?" — keyword search has nothing to offer. You need a different kind of index: one that understands &lt;em&gt;what places are for&lt;/em&gt;, not just what they're called.&lt;/p&gt;

&lt;p&gt;Semantic search inverts this. It maps your query and every code chunk into the same high-dimensional vector space, then finds the chunks closest to your query by meaning — regardless of whether they share any words. "JWT validation logic" finds &lt;code&gt;verify_token&lt;/code&gt; even if neither of those words appears in the function body.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 2 — Building the Index
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Chunking Problem: Why Line Windows Break on Code
&lt;/h3&gt;

&lt;p&gt;Prose text has a natural unit of meaning: the paragraph. You can split a Wikipedia article into 200-word chunks, embed each one, and get a reasonable search system. Code doesn't work this way.&lt;/p&gt;

&lt;p&gt;The standard naive approach for code indexing is the same line-window strategy borrowed from document search: take a sliding window of N lines with M lines of overlap, create a chunk, embed it, move the window. A common default might be 150-line windows with 50 lines of overlap. Simple, language-agnostic, works on any file format.&lt;/p&gt;

&lt;p&gt;The problem is what happens at the window boundaries. Consider this function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_workspace_changes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;force&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChangeResult&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Process all pending changes in a workspace, optionally forcing re-indexing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_pending_changes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;force&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ChangeKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DELETED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove_chunks_for_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChangeResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;removed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChangeKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MODIFIED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChangeKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CREATED&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;language_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChangeResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indexed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_changes_processed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a 150-line window happens to cut through this function, neither resulting chunk is independently meaningful. The chunk with just the body is missing the parameter names and return type. The chunk with just the signature has no implementation context. The embedding of a half-function is significantly worse than the embedding of the complete thing.&lt;/p&gt;

&lt;p&gt;The fix: split at semantic boundaries. Functions should be complete units. Classes should contain their methods, or each method should be its own chunk with the class header prepended for context.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why completeness matters:&lt;/strong&gt; An embedding model compresses everything in its context into a single fixed-size vector. A complete function gives the model everything it needs to capture the function's purpose, parameters, return behavior, and side effects in that vector. A half-function forces the model to compress an ambiguous fragment — the resulting vector is a blurred average of possible interpretations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Parsing Code with tree-sitter
&lt;/h3&gt;

&lt;p&gt;tree-sitter is a parser library that produces concrete syntax trees for source code — every construct in the language has a named node with exact byte boundaries in the source. Unlike a regex-based approach, tree-sitter actually parses the grammar and handles edge cases correctly: nested functions, decorators on multiple lines, multiline function signatures, arrow functions in JavaScript, generic bounds in Rust.&lt;/p&gt;

&lt;p&gt;For Python, the tree-sitter query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scheme"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function_definition&lt;/span&gt;
  &lt;span class="nv"&gt;name:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;identifier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;@name&lt;/span&gt;
  &lt;span class="nv"&gt;parameters:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;@params&lt;/span&gt;
  &lt;span class="nv"&gt;body:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;@body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;@function&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;class_definition&lt;/span&gt;
  &lt;span class="nv"&gt;name:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;identifier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;@name&lt;/span&gt;
  &lt;span class="nv"&gt;superclasses:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;argument_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nv"&gt;?&lt;/span&gt; &lt;span class="nv"&gt;@bases&lt;/span&gt;
  &lt;span class="nv"&gt;body:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;@body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;@class&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matches any function or class definition anywhere in the file and captures the name, parameters, and body as named nodes with precise byte-range positions. You can then slice the original source file at those byte positions to extract complete, syntactically valid chunks.&lt;/p&gt;

&lt;p&gt;For classes, Vectr attaches the full class signature — including the base class list captured by &lt;code&gt;@bases&lt;/code&gt; — as a header to each method chunk. So the chunk for &lt;code&gt;WorkspaceLock.acquire()&lt;/code&gt; includes its inheritance context. A method of &lt;code&gt;AuthenticatedView(LoginRequiredMixin, View)&lt;/code&gt; has a meaningfully different semantic context than a method of a plain &lt;code&gt;View&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A subtlety: very large functions.&lt;/strong&gt; AST-aware chunking breaks down for functions that are genuinely enormous — 500+ lines. Vectr handles this by further splitting large functions at their major control-flow boundaries (default threshold: 200 lines). The resulting sub-chunks each include the function signature as a header to preserve context. Their embedding quality is better than one giant embedding, though still lower than a naturally small function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Code-Specific Embeddings Running Locally
&lt;/h3&gt;

&lt;p&gt;Not all embedding models are equally good at code. Models trained primarily on prose text have learned representations of natural language semantics. Code has different regularities: symbol names, type signatures, control flow patterns, API call chains. Code-aware models routinely outperform general-purpose models by 10–20% on tasks like "find the function that handles X."&lt;/p&gt;

&lt;p&gt;Vectr uses &lt;code&gt;Snowflake/snowflake-arctic-embed-m-v1.5&lt;/code&gt;, a 110-million-parameter model that produces 768-dimensional embedding vectors and runs in under 100ms per batch on a modern laptop CPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why local inference instead of an API?&lt;/strong&gt; Two practical constraints:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cost: a tool that fires 20–50 search calls per session would accumulate non-trivial API costs quickly. Local inference is free at query time after the one-time model download.&lt;/li&gt;
&lt;li&gt;Data privacy: many codebases cannot be sent to third-party APIs. Internal tools, proprietary algorithms, customer data models — many organizations have policies or contractual obligations that prohibit sending source code to external services.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tradeoff: the model weighs roughly 440MB and needs to be downloaded on first run. This is a real friction point.&lt;/p&gt;

&lt;p&gt;One critical detail: queries and chunks are embedded with different input prefixes. Queries use &lt;code&gt;Represent this query for searching relevant code:&lt;/code&gt;, chunks use &lt;code&gt;Represent this code snippet:&lt;/code&gt;. arctic-embed-m is a single encoder, but it was trained with different prefixes for query-side and document-side inputs. Using the wrong prefix reduces the cosine similarity between semantically related query-chunk pairs — the vectors for "user authentication" and &lt;code&gt;verify_token&lt;/code&gt; end up further apart in embedding space than they should be. Getting this wrong costs 5–15% in retrieval quality.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3 — The Search Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hybrid Search: Why BM25 and Vector Search Need Each Other
&lt;/h3&gt;

&lt;p&gt;Vector search handles concept queries well. But if you search for &lt;code&gt;_handle_workspace_lock_conflict&lt;/code&gt; — an exact function name — a vector search might not rank it first. The embedding is just one point in a crowded neighborhood of similar-looking function names. BM25, on the other hand, will find it immediately: exact string matches get the highest possible score.&lt;/p&gt;

&lt;p&gt;The inverse is also true: BM25 cannot find "retry logic with exponential backoff" if the function is called &lt;code&gt;_schedule_attempt_with_delay&lt;/code&gt; and its docstring says nothing about backoff. Zero keyword overlap means zero BM25 score. Vector search finds it because the semantic cluster it belongs to is close to the query in embedding space.&lt;/p&gt;

&lt;p&gt;The right system uses both. Every query in Vectr runs both a vector search and a BM25 search in parallel, then combines the two ranked lists using a weighted formula.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BM25 scoring formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score(D, Q) = Σᵢ IDF(qᵢ) · [ tf(qᵢ, D) · (k₁ + 1) ] / [ tf(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl) ]

IDF(qᵢ) = log( (N − nᵢ + 0.5) / (nᵢ + 0.5) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tf(qᵢ, D)&lt;/code&gt; — term frequency of qᵢ in document D&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;N&lt;/code&gt; — total documents; &lt;code&gt;nᵢ&lt;/code&gt; — documents containing qᵢ&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;|D|&lt;/code&gt; — document length in tokens; &lt;code&gt;avgdl&lt;/code&gt; — average document length&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;k₁ = 1.5&lt;/code&gt; (term-frequency saturation), &lt;code&gt;b = 0.75&lt;/code&gt; (length normalization)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the Robertson–Sparck Jones variant. Some implementations add +1 inside the IDF log to prevent negative values for very common terms.&lt;/p&gt;

&lt;p&gt;The weight assigned to each approach depends on codebase familiarity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;BM25 weight&lt;/th&gt;
&lt;th&gt;Vector weight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Large unfamiliar codebase&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small familiar codebase&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explicit symbol name in query&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Natural language concept query&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These weights are the actual values used in Vectr's implementation, tuned against the benchmark dataset.&lt;/p&gt;

&lt;p&gt;The benchmark on Apache Camel (58,000+ Java files) showed a &lt;strong&gt;73% reduction in Read+Bash navigation calls&lt;/strong&gt; compared to the baseline AI editor with no index.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Symbol Graph: What Text Search Cannot Answer
&lt;/h3&gt;

&lt;p&gt;Semantic search and BM25 handle "find me the code for this concept" well. But there's a different navigation pattern that neither handles: "find me everything that calls this function."&lt;/p&gt;

&lt;p&gt;Vectr builds a symbol graph during indexing. For each file, tree-sitter extracts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Definitions&lt;/strong&gt; — every function, class, method, and module-level constant with name and line number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call edges&lt;/strong&gt; — every call site, mapping callee name to the calling function's context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import edges&lt;/strong&gt; — every import statement, mapping the imported symbol to its likely source module&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP routes&lt;/strong&gt; — Flask/FastAPI &lt;code&gt;@router.get()&lt;/code&gt;, Express &lt;code&gt;app.post()&lt;/code&gt;, Spring &lt;code&gt;@GetMapping&lt;/code&gt; — extracted as named symbols&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The resulting graph enables exact lookups. &lt;code&gt;vectr_locate("WorkspaceLock")&lt;/code&gt; returns a file path and line number in under 10ms — no embedding, no ranking, pure symbol table lookup. &lt;code&gt;vectr_trace("acquire_lock")&lt;/code&gt; returns all callers and all callees in one round-trip. These are not search results — they are graph traversals, and they produce exact answers rather than relevance rankings.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Text search vs. graph traversal:&lt;/strong&gt; These are not competing approaches — they answer different questions. "Find code that does X" is a search problem. "Find who calls Y" or "find where Z is defined" is a graph traversal problem. Relying only on text search for definition lookups is like looking up a phone number by describing the person rather than looking them up by name.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Six Fallback Strategies in vectr_locate
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;vectr_locate&lt;/code&gt; runs six fallback strategies in sequence, stopping at the first match:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Exact match&lt;/strong&gt; — direct lookup in the symbol table. Sub-millisecond. Highest confidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suffix match&lt;/strong&gt; — &lt;code&gt;Lock&lt;/code&gt; matches &lt;code&gt;WorkspaceLock&lt;/code&gt;, &lt;code&gt;AcquireLock&lt;/code&gt;, &lt;code&gt;LockManager&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same-module priority&lt;/strong&gt; — if a caller file is provided, search definitions within the same module first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unique name&lt;/strong&gt; — if there is exactly one symbol across the entire codebase whose name contains your query string, return it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import chain follow&lt;/strong&gt; — follow import statements from a given file to find where the name likely comes from.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fuzzy (Levenshtein ≤ 2)&lt;/strong&gt; — edit distance ≤ 2 across all symbol names. Catches typos. Lowest confidence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each strategy produces a &lt;code&gt;LocateResult&lt;/code&gt; with a &lt;code&gt;resolution_strategy&lt;/code&gt; field. An exact match means you can act on the result immediately. A fuzzy match with edit distance 2 means you should verify before relying on it. A silent wrong navigation is worse than no navigation at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4 — The Runtime Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  mtime Cache and Incremental Re-indexing
&lt;/h3&gt;

&lt;p&gt;The first time you run &lt;code&gt;vectr start&lt;/code&gt; on a large codebase, indexing takes time. CPython's 4,000+ files: about 8 minutes. Django's ~1,800 Python files: about 2 minutes. Apache Camel's 58,000+ Java files: closer to 45 minutes.&lt;/p&gt;

&lt;p&gt;During initial indexing, Vectr writes a file at &lt;code&gt;~/.cache/vectr/{hash}/index_cache.json&lt;/code&gt; that stores the modification timestamp of every indexed file. The &lt;code&gt;{hash}&lt;/code&gt; is a short SHA-256 hash of the absolute workspace root path. On subsequent runs, only files whose mtime has changed are re-indexed. On a typical active session where you've modified 5–10 files, subsequent re-indexing takes under 5 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling deletions:&lt;/strong&gt; Vectr also stores the complete set of indexed file paths. At startup, it diffs this set against the current file tree and removes all chunks belonging to deleted files before re-indexing modified ones. Process deletions first, then updates, then new files — this prevents a renamed file from leaving orphaned chunks in the index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The watchdog listener:&lt;/strong&gt; During an active session, Vectr runs a watchdog filesystem listener on the workspace root. When a file is saved, the listener queues it for re-indexing in the background. Events are debounced at 300ms — only the last write in a burst counts. Without debouncing, a single save in a project using aggressive auto-formatting would trigger 3–5 redundant re-index operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  .vectrignore: Keeping the Index Clean
&lt;/h3&gt;

&lt;p&gt;Vectr reads a &lt;code&gt;.vectrignore&lt;/code&gt; file from the workspace root using glob patterns. The syntax follows &lt;code&gt;.gitignore&lt;/code&gt; conventions — trailing slash for directories, &lt;code&gt;*&lt;/code&gt; for single-level wildcard, &lt;code&gt;**&lt;/code&gt; for recursive match (via Python's &lt;code&gt;pathlib.Path.match()&lt;/code&gt;) — but Vectr does not implement the full gitignore specification: the &lt;code&gt;!&lt;/code&gt; negation prefix is not supported.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vendor/
node_modules/
dist/
*.pb.go        # generated protobuf Go files
*.min.js       # minified JavaScript
__pycache__/
.venv/
coverage/
*.snap         # Jest snapshots
migrations/    # Django database migrations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A codebase with &lt;code&gt;node_modules/&lt;/code&gt; will typically contain 5–20x more code from installed packages than from the project itself. Excluding vendor directories before the initial index run is the single most impactful configuration change most users can make.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Actually Happens When You Call vectr_search
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Query string is embedded using arctic-embed-m with query prefix
   → 768-dimensional float vector, ~15ms on CPU

2. Vector similarity search against ChromaDB store
   → Top-20 chunks by cosine similarity, with scores

3. Same query runs through BM25 index (rank-bm25, in-memory)
   → Top-20 chunks by BM25 score, with scores

4. Two ranked lists are merged
   → Weight BM25/vector based on codebase characterization
   → Normalized scores combined; top-N results selected (default N=5)

5. Symbol names in the query are detected (camelCase, snake_case, PascalCase)
   → If found: also run vectr_locate as a side channel
   → Merge symbol lookup results into final output if relevant

6. Final top-N chunks returned with:
   file path, start line, end line, matched text, search method
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result for &lt;code&gt;vectr_search("workspace lock acquisition and release")&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] resolver.rs:214 — WorkspaceLock::acquire()
    Acquires the workspace-scoped lock. Blocks if another process holds it.

[2] resolver.rs:267 — WorkspaceLock::release()
    Releases the workspace-scoped lock. Validates that the current process
    holds the lock before releasing (returns Err if not held).

[3] workspace.py:89 — _acquire_workspace_lock(path)
    Context manager: acquires, yields, releases on exit.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of reading 15 files to find these three functions, the AI editor reads one search result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5 — Design Decisions I'd Make Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Python 3.14 requirement.&lt;/strong&gt; The codebase uses &lt;code&gt;match/case&lt;/code&gt; pattern matching extensively and some &lt;code&gt;asyncio&lt;/code&gt; patterns that behave differently in earlier versions. In retrospect, 3.11 would probably work with a few hours of refactoring. The 3.14 requirement has been the single biggest adoption friction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChromaDB as the vector store.&lt;/strong&gt; A vector store handles embedding persistence and similarity search. ChromaDB works, but the full HNSW index with persistence, the Python client layer, and the inter-process communication overhead add about 200ms specifically to ChromaDB's startup contribution — not total Vectr startup (~280ms including mtime diffing and watchdog initialization). For v2, I'd consider a lighter in-process option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The BM25 implementation.&lt;/strong&gt; The &lt;code&gt;rank-bm25&lt;/code&gt; library is pure Python and fast enough for codebases under 50,000 chunks. Beyond that, it starts to show latency. The right long-term solution is integrating BM25 scoring directly into the vector store query pipeline. For current use cases (most codebases are under 20K chunks), it's fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The indexing layer is the foundation, not the product. What it enables is an AI code editor that can navigate a large unfamiliar codebase as efficiently as a human engineer who has worked in it for months — finding the right functions in one or two calls instead of fifteen.&lt;/p&gt;

&lt;p&gt;But the index tells you &lt;em&gt;where things are&lt;/em&gt;. It doesn't tell you &lt;em&gt;why things are the way they are&lt;/em&gt; — the non-obvious invariants, the patterns that emerge from reading 50 files, the bugs that were fixed by changing two lines in a place that looks completely unrelated.&lt;/p&gt;

&lt;p&gt;That's what Part 2 addresses: a note store where an AI editor can save findings in structured, tagged form — "the lock acquisition logic is at resolver.rs:214, and it acquires an exclusive file lock using fcntl.flock, not a threading primitive" — and retrieve them in under 50ms at the start of any future session. When &lt;code&gt;/compact&lt;/code&gt; runs and replaces the conversation with a summary, exact signatures and line numbers evaporate — but notes don't. The indexer tells you where to look. The working memory layer tells you what you already know about what you found.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary of core decisions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AST-aware chunking via tree-sitter&lt;/td&gt;
&lt;td&gt;Complete functions as the unit of meaning. Biggest quality improvement over naive line windows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local embeddings (arctic-embed-m)&lt;/td&gt;
&lt;td&gt;No API cost, no data leaving the machine. One-time 440MB download.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid BM25 + vector search&lt;/td&gt;
&lt;td&gt;Concept queries route to vector. Exact symbol names route to BM25.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Symbol graph&lt;/td&gt;
&lt;td&gt;Definitions, call edges, import edges, HTTP routes — exact graph traversal for questions text search cannot answer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Six fallback strategies in vectr_locate&lt;/td&gt;
&lt;td&gt;Exact → suffix → same_module → unique_name → import_chain → fuzzy. Each result carries its resolution strategy.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mtime cache + watchdog&lt;/td&gt;
&lt;td&gt;Sub-5-second re-indexing on subsequent runs. In-session saves trigger background re-indexing automatically.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>mcp</category>
      <category>semanticsearch</category>
      <category>developertools</category>
      <category>codeindexer</category>
    </item>
    <item>
      <title>Vectr — Code Intelligence AI Tool</title>
      <dc:creator>Swapnanil Saha</dc:creator>
      <pubDate>Tue, 26 May 2026 20:02:47 +0000</pubDate>
      <link>https://dev.to/swapnanilsaha/vectr-code-intelligence-ai-tool-320m</link>
      <guid>https://dev.to/swapnanilsaha/vectr-code-intelligence-ai-tool-320m</guid>
      <description>&lt;p&gt;You log off for the day after two hours of research. You know the entry point is &lt;code&gt;EvaluateSegments&lt;/code&gt; in &lt;code&gt;targeting/segment/evaluator.go&lt;/code&gt;. You know the nil visitor_id case is unhandled. You know &lt;code&gt;bidder/auction.go&lt;/code&gt; calls this function and can't have its interface changed.&lt;/p&gt;

&lt;p&gt;Next morning, Claude Code knows none of that. It starts fresh. It greps, reads files, consumes 8,000 tokens rediscovering what you already found. Every session is day one.&lt;/p&gt;

&lt;p&gt;This is the actual friction in AI-assisted development — not the quality of code generation, but the complete absence of working memory across session boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with how AI assistants use context
&lt;/h2&gt;

&lt;p&gt;On a codebase with 40,000 files, the AI runs &lt;code&gt;rg -l "authenticate"&lt;/code&gt;, gets 200 results, reads 8 complete files — 12,000 tokens gone for one query. And the next session, it starts over from zero: no memory of what it found, no record of what's still missing.&lt;/p&gt;

&lt;p&gt;A 200,000-token context window sounds vast, but a 40,000-file codebase is vastly larger. Assistants compensate by running grep-style searches, finding matching files, then reading entire files to locate the relevant function. Within a session, experienced users manage this. The real problem is across sessions. Every conversation starts empty. Research done Monday is redone Thursday.&lt;/p&gt;

&lt;p&gt;Humans solve this differently. A developer who worked on a feature last week doesn't remember every line — but they remember that targeting code lives in &lt;code&gt;targeting/&lt;/code&gt;, that segment evaluation has an edge case around nil visitor IDs, and that the auction pipeline calls &lt;code&gt;EvaluateSegments&lt;/code&gt;. They remember at different levels of fidelity, and they can re-read the details in seconds when needed. They can &lt;em&gt;afford&lt;/em&gt; to forget, because retrieval is fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Vectr does
&lt;/h2&gt;

&lt;p&gt;Vectr is a local codebase indexer that gives an AI assistant the same layered recall capability. It provides three kinds of knowledge — and a memory system for working state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Codebase map.&lt;/strong&gt; At startup, Vectr makes one LLM call over the directory structure and README to build a ~300-token plain-English passport. It captures module purposes, tech stack, entry points, and domain vocabulary. Every session, the AI gets this for free via &lt;code&gt;vectr_map&lt;/code&gt; — no file reading required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectr_map() →
"Go DSP ad server. Main modules: targeting/ (audience matching),
bidder/ (bid logic), tracker/ (event recording).
Entry: bidder/pipeline.go:RunBidPipeline
Domain terms: segment, visitor_id, bid_request, floor_price"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2: Symbol graph.&lt;/strong&gt; Vectr uses tree-sitter to extract every function, class, and method into a persistent SQLite-backed graph with call relationships. &lt;code&gt;vectr_locate&lt;/code&gt; finds where a symbol is defined — file, line number, kind — without returning any code content. &lt;code&gt;vectr_trace&lt;/code&gt; follows the call graph in either direction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectr_locate("EvaluateSegments") →
[function] EvaluateSegments  targeting/segment/evaluator.go:45

vectr_trace("EvaluateSegments", direction="callers") →
Called by (2):
  RunBidPipeline  in bidder/pipeline.go:88
  RequestBid      in bidder/auction.go:134
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 3: Content search.&lt;/strong&gt; AST-aware chunks — split at function and class boundaries, never mid-logic — are embedded with &lt;code&gt;Snowflake/snowflake-arctic-embed-m-v1.5&lt;/code&gt; (local, no API key, ~440MB download once). Adaptive hybrid search: vector similarity + BM25 keyword, with weights tuned per codebase fingerprint — small repos lean on BM25, large ones on semantics, static-typed monorepos use graph traversal first. Override with &lt;code&gt;VECTR_EMBED_MODEL=&amp;lt;hf-model-id&amp;gt;&lt;/code&gt; for any sentence-transformers compatible model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectr_search("nil visitor_id handling segment evaluation") →
[1] targeting/segment/evaluator.go  lines 45-89  score 0.934
    symbol: EvaluateSegments
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The part that's actually new: working memory
&lt;/h2&gt;

&lt;p&gt;The layer that makes Vectr different from every other code search tool is the bidirectional protocol between the AI and the memory store.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vectr_remember&lt;/code&gt; lets the AI save a working note:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectr_remember(
  "Implementing segment targeting. Entry: EvaluateSegments() in evaluator.go:45.
   Need to add nil guard for visitor_id before line 61.
   bidder/auction.go calls this — cannot change its interface.
   Missing: integration test for multi-segment visitor with expired segments.",
  tags=["segment-targeting", "wip"],
  priority="high"
)
→ "Stored note #4. Recall with vectr_recall — &amp;lt;50ms, verbatim, any time."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The note survives &lt;code&gt;/compact&lt;/code&gt; and new sessions intact. Vectr returns it in under 50ms on demand.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vectr_evict_hint&lt;/code&gt; signals which retrieved chunks vectr can re-retrieve without re-reading the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectr_evict_hint() →
"Vectr has 6 chunks (~3,840 tokens) indexed and instantly retrievable.
Vectr can re-retrieve these in &amp;lt;50ms — no need to re-read them:
  targeting/segment/evaluator.go  [lines 40-110 (EvaluateSegments)]
  bidder/auction.go  [lines 88-134 (RequestBid)]
Recall latency: &amp;lt;50ms. Nothing will be lost."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next morning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectr_recall("segment targeting") →
[HIGH] [seg, wip] (14h ago)
  Implementing segment targeting. Entry: EvaluateSegments() in evaluator.go:45.
  Need to add nil guard for visitor_id before line 61.
  bidder/auction.go calls this — cannot change its interface.
  Missing: integration test for multi-segment visitor with expired segments.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three MCP calls, roughly five seconds, and the AI is fully context-loaded — without re-reading any code.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run it
&lt;/h2&gt;

&lt;p&gt;Two install options depending on your environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A — pip (recommended for individual developers):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/swapnanil/vectr

&lt;span class="nb"&gt;cd&lt;/span&gt; /path/to/your/project
vectr start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B — Docker (for servers and CI pipelines):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/swapnanil/vectr
docker-compose up api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On first run, Vectr downloads the embedding model (~440MB), indexes the workspace, builds the symbol graph, and writes MCP configuration files for Cursor and Claude Code. No configuration files to write, no environment variables required for local-only use.&lt;/p&gt;

&lt;p&gt;Other CLI commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stop and restart on a different workspace&lt;/span&gt;
vectr restart &lt;span class="nt"&gt;--path&lt;/span&gt; /path/to/other/project

&lt;span class="c"&gt;# Write CLAUDE.md + .mcp.json without starting the server&lt;/span&gt;
vectr init

&lt;span class="c"&gt;# Stop the server&lt;/span&gt;
vectr stop

&lt;span class="c"&gt;# Search from the terminal&lt;/span&gt;
vectr search &lt;span class="s2"&gt;"JWT token validation"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you set &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; (or &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; + &lt;code&gt;LLM_MODEL&lt;/code&gt;), Vectr also builds the codebase passport on startup — one LLM call, ~$0.005, cached permanently.&lt;/p&gt;

&lt;p&gt;Once running, any MCP-compatible AI code editor automatically picks up vectr's twelve MCP tools (&lt;code&gt;vectr_map&lt;/code&gt;, &lt;code&gt;vectr_map_save&lt;/code&gt;, &lt;code&gt;vectr_locate&lt;/code&gt;, &lt;code&gt;vectr_trace&lt;/code&gt;, &lt;code&gt;vectr_search&lt;/code&gt;, &lt;code&gt;vectr_remember&lt;/code&gt;, &lt;code&gt;vectr_recall&lt;/code&gt;, &lt;code&gt;vectr_evict_hint&lt;/code&gt;, &lt;code&gt;vectr_snapshot&lt;/code&gt;, &lt;code&gt;vectr_snapshot_list&lt;/code&gt;, &lt;code&gt;vectr_forget&lt;/code&gt;, &lt;code&gt;vectr_status&lt;/code&gt;) without any manual configuration. The MCP server runs at &lt;code&gt;localhost:8765/mcp&lt;/code&gt; — any compatible client connects with two lines of JSON config.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark results: Camel Run 2
&lt;/h2&gt;

&lt;p&gt;To measure the cross-session memory benefit, the benchmark uses a two-phase design: Phase 1 explores the codebase and stores notes with &lt;code&gt;vectr_remember&lt;/code&gt;; Phase 2 opens a cold session, calls &lt;code&gt;vectr_recall()&lt;/code&gt;, and implements. Vanilla Phase 2 re-reads from scratch.&lt;/p&gt;

&lt;p&gt;The Camel codebase is 5,856 files of enterprise Java — the kind of thing where the model has no meaningful training coverage.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Vanilla Phase 2&lt;/th&gt;
&lt;th&gt;Vectr Phase 2&lt;/th&gt;
&lt;th&gt;Cost Δ&lt;/th&gt;
&lt;th&gt;Tool calls Δ&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;custom_component&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.56 · 134s · 51 tools&lt;/td&gt;
&lt;td&gt;$0.36 · 195s · 11 tools&lt;/td&gt;
&lt;td&gt;−35%&lt;/td&gt;
&lt;td&gt;−78%&lt;/td&gt;
&lt;td&gt;0 bytes (failure) vs 9,398 bytes (5 files)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;route_policy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$1.15 · 430s · 59 tools&lt;/td&gt;
&lt;td&gt;$0.35 · 177s · 16 tools&lt;/td&gt;
&lt;td&gt;−70%&lt;/td&gt;
&lt;td&gt;−73%&lt;/td&gt;
&lt;td&gt;both 280-line impl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;type_converter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.48 · 187s · 25 tools&lt;/td&gt;
&lt;td&gt;$0.20 · 86s · 11 tools&lt;/td&gt;
&lt;td&gt;−57%&lt;/td&gt;
&lt;td&gt;−56%&lt;/td&gt;
&lt;td&gt;both working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Totals (Camel)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2.19 · 751s · 135 tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.92 · 458s · 38 tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−58%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−72%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−40% input tokens&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;custom_component&lt;/code&gt; result shows the failure mode most clearly: vanilla ran out of context budget navigating the unfamiliar Java package hierarchy and produced nothing. Vectr's Phase 2 started with structured notes from Phase 1 — ~200 tokens replacing hundreds of re-discovery tool calls — and delivered a complete 5-file implementation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;route_policy&lt;/code&gt; shows the efficiency case where both sides succeeded: 3× cheaper, 2.4× faster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Vectr helps in proportion to how much re-discovery work Phase 2 would otherwise do. Single-session tasks on well-known codebases see minimal benefit. Large unfamiliar codebases and cross-session continuation tasks see the most.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Django results were mixed: complex ORM internals showed −24% tokens, −60% cost; well-known APIs where the model already has training coverage showed no benefit. The mechanism is the same in both cases — Vectr just doesn't help where re-discovery cost is already low.&lt;/p&gt;

&lt;h2&gt;
  
  
  A session with the full stack
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Morning — session start (3 calls, ~5 seconds):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectr_map()                                          → structural overview (247 tokens)
vectr_recall()                                       → yesterday's notes, verbatim
vectr_locate("EvaluateSegments")                     → file:line, no code read
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;During the session:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectr_search("visitor_id nil handling")              → 3 chunks, 580 tokens
vectr_trace("EvaluateSegments", direction="callers") → 2 callers identified
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;End of session:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectr_remember("Segment targeting done...")          → note stored
vectr_evict_hint()                                   → marks 3,840 tokens of chunks as re-retrievable in &amp;lt;50ms
vectr_snapshot("segment-targeting-day1")             → full session saved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full context in three calls, five seconds. No file reading on reconnect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Vectr is open source at &lt;a href="https://github.com/swapnanil/vectr" rel="noopener noreferrer"&gt;github.com/swapnanil/vectr&lt;/a&gt;. The current build supports Python, JavaScript, TypeScript, Go, Rust, and Java for AST chunking and symbol extraction. Planned: adaptive retrieval strategy selection based on codebase fingerprint (Java monorepos benefit from graph traversal; dynamic Python codebases respond better to semantic search), and LLM-generated symbol descriptions generated lazily on first access.&lt;/p&gt;

&lt;p&gt;If you work on a large codebase and your AI assistant spends the first five minutes of every session re-reading the same files, try Vectr. The full tool page is at &lt;a href="https://swapnanilsaha.com/tools/vectr/" rel="noopener noreferrer"&gt;swapnanilsaha.com/tools/vectr/&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwaredevelopment</category>
      <category>tooling</category>
    </item>
    <item>
      <title>LLM Context Window Token Budget: Why Your Window Fills Up Fast</title>
      <dc:creator>Swapnanil Saha</dc:creator>
      <pubDate>Tue, 26 May 2026 19:59:18 +0000</pubDate>
      <link>https://dev.to/swapnanilsaha/llm-context-window-token-budget-why-your-window-fills-up-fast-4c05</link>
      <guid>https://dev.to/swapnanilsaha/llm-context-window-token-budget-why-your-window-fills-up-fast-4c05</guid>
      <description>&lt;p&gt;You build something with GPT-4o. The model supports 128,000 tokens. You think: that's enough for a full novel. Then, four or five conversation turns in, the model starts forgetting things that were said earlier. Eight turns in, you hit an error. You check the token count — you've used over 100,000 tokens, and you've typed maybe 400 words.&lt;/p&gt;

&lt;p&gt;This isn't a bug. It's the predictable consequence of not accounting for where those tokens actually go. A context window isn't blank space waiting to be filled with your words. By the time the first user message arrives, it is already partially consumed — by system instructions, by tool definitions, by retrieved documents, by the tokens the model itself generated in earlier turns. In a production AI agent, 30–60% of the context window is gone before a user types anything.&lt;/p&gt;

&lt;p&gt;What follows is a precise accounting of where those tokens go — the four layers that consume the window before users say anything, why the effective limit is substantially lower than the advertised one, what happens to response quality as the window approaches capacity, and which engineering patterns actually manage it at production scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: The Problem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Illusion of Abundance
&lt;/h3&gt;

&lt;p&gt;GPT-4o supports 128K tokens. Claude 3.5 supports 200K. Gemini 1.5 Pro has been demonstrated at a million tokens — roughly 750,000 words, about ten average novels. The numbers sound absurdly generous. How could you possibly run out?&lt;/p&gt;

&lt;p&gt;Start with a calibration exercise. What is 128,000 tokens, actually?&lt;/p&gt;

&lt;p&gt;In English prose, one token is roughly four characters — about three-quarters of a word. A 1,000-word article runs to around 1,300 tokens, so 128K tokens can hold close to 96,000 words of clean text. That genuinely is a lot.&lt;/p&gt;

&lt;p&gt;But text in an LLM application is rarely clean English prose. It is JSON payloads from tool calls. It is API responses full of structured data. It is code. It is URLs. It is conversation history with speaker labels, timestamps, and formatting. All of these serialize into tokens at rates much higher than 4 characters per token.&lt;/p&gt;

&lt;p&gt;Then there is the question of performance. The advertised number represents a technical limit — the longest sequence the model can physically process. It does not represent the length at which the model operates at peak accuracy. Research has repeatedly found a significant gap between the two. Long-context benchmarks like RULER (2024) and HELMET (2024) found that in adversarial multi-document tasks, most frontier LLMs showed accuracy drops well before 32K tokens — GPT-4o fell from near-perfect baseline scores to the high-60s percentage range at 32K in some configurations. The technical limit says 128K. The accuracy cliff arrives much earlier.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Effective Limit Is Not the Advertised Limit&lt;/strong&gt;&lt;br&gt;
Models claiming 200K context windows show measurable quality degradation around 130K tokens in practice. Treating the advertised number as your operating budget is how production systems quietly degrade without triggering any explicit error.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cost is the third angle. Every token in the context is a token billed. At GPT-4o's pricing, 128K tokens of input costs several dollars per call — and agents often make dozens of calls per session, each with the full accumulated context. The monthly bill from a badly-managed context window can surprise you well before any error appears in the logs.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. How Tokens Are Counted — and Why the Count Surprises You
&lt;/h3&gt;

&lt;p&gt;An LLM does not read text. It reads a sequence of integers. Before any word reaches the model, it passes through a tokenizer that converts characters into integer IDs from a vocabulary of roughly 50,000–200,000 entries. The tokenizer used by GPT-4 and GPT-4o is called &lt;code&gt;cl100k_base&lt;/code&gt;; it has about 100,000 vocabulary entries. OpenAI's newer models use &lt;code&gt;o200k_base&lt;/code&gt;, with about 200,000.&lt;/p&gt;

&lt;p&gt;The vocabulary is built using &lt;strong&gt;BPE&lt;/strong&gt; — Byte Pair Encoding. The name comes from the construction: you start with individual characters, then repeatedly merge the pair of adjacent symbols that appears most often in your training corpus, replacing each occurrence of that pair with a new combined token. Do this enough times and common English words end up as single tokens. The algorithm learns what to merge entirely from what was common in the training text — mostly English prose on the internet. That's why "the", "is", "running" each become a single token, while "tokenization" becomes &lt;code&gt;["token", "ization"]&lt;/code&gt; — less common as a whole word, so BPE never fully merged it. Characters and raw bytes are the fallback for anything the vocabulary doesn't cover. The consequence is simple: anything that wasn't well-represented in training data — JSON brackets, URL slashes, code indentation — never got merged aggressively, so those sequences remain expensive in tokens relative to the characters they contain.&lt;/p&gt;

&lt;p&gt;The rule-of-thumb of 1 token ≈ 4 characters holds for clean English prose — decent enough for napkin estimates. It falls apart under several conditions that appear constantly in real applications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Numbers tokenize unexpectedly.&lt;/strong&gt; BPE learns tokens from frequency in training data. The number "2023" is common in training data — it became a single token. But "2026" is less common, and "19847" is rare — these get split into per-digit or per-pair tokens. The price "USD 1,234,567.89" produces approximately 10–12 tokens, because the commas, period, digits, and currency symbol may each claim separate tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URLs are disproportionately expensive.&lt;/strong&gt; A URL like &lt;code&gt;https://api.example.com/v2/users/12345&lt;/code&gt; looks compact — 38 characters, which by the prose rule should be about 9–10 tokens. In practice it is closer to 15–20 tokens. Slashes, dots, hyphens, underscores, and alphanumeric path segments each claim their own tokens or merge into small fragments, because URLs are structurally uncommon in prose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON and structured data use roughly 2x the token count of plain text.&lt;/strong&gt; Consider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Plain text: The user's name is Alice, she is 28 years old, and her account is active.
JSON:       {"user": {"name": "Alice", "age": 28, "status": "active"}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The plain text version: approximately 18 tokens. The JSON version: approximately 22 tokens — and this is a trivially small object. Real API responses with deeply nested keys, repeated field names, and verbose formatting can be far more expensive. Every brace, colon, and comma is a token or part of a token. A 500-word JSON payload can use 800+ tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code tokenizes inefficiently in some languages.&lt;/strong&gt; Research found that Python uses roughly 46% more tokens than equivalent Haskell to express the same computational idea. Python's indentation-based structure requires whitespace tokens, and Python's identifiers were less densely represented in the pre-GPT-4 training corpora.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Analogy: The Luggage Weight Problem&lt;/strong&gt;&lt;br&gt;
Think of the context window as checked baggage with a weight limit, not a size limit. A suitcase full of dense sweaters weighs less than one with foam packing material filling the same volume. Plain prose is the dense sweaters — you pack a lot of meaning into few tokens. JSON, URLs, and code are the foam — structurally bulky, meaning-sparse, yet they count toward the same limit.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 2: The Consumers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3. The Four Layers That Eat Your Context Window
&lt;/h3&gt;

&lt;p&gt;Every LLM API call is a full context payload assembled from four distinct layers. Most developers think about only one: the user's current message. The other three arrive already loaded — silent costs that accumulate before the user types anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: The System Prompt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system prompt is the foundational layer. It is always present, on every API call. A minimal system prompt — "You are a helpful assistant" — costs about 7 tokens. But real production system prompts are not minimal.&lt;/p&gt;

&lt;p&gt;A typical customer-facing chatbot system prompt contains: the model's persona and tone guidelines, a list of topics it should and should not address, instructions about response format, domain-specific knowledge, legal disclaimers, and formatting instructions. Measured in practice, these range from 800 to 2,500 tokens. They are charged on every single API call. A 1,500-token system prompt running 1,000 calls per day costs you 1.5 million input tokens per day before a user says anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Tool Schemas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you give an LLM access to external tools, you must describe each tool to the model in the context window. These descriptions are written in JSON and can be verbose. A single moderately documented tool schema costs roughly 200 tokens. An agent with five tools carries around 1,000 tokens of tool descriptions on every call, before any user input. The JSON structure alone — all those braces, colons, and quoted keys — is part of why the token cost is higher than reading the description would suggest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Retrieved Context (RAG)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many production LLM applications retrieve relevant documents from a database and inject them as supporting material. A typical RAG retrieval returns 3–8 document chunks, each 300–600 tokens. Three chunks at 400 tokens each: 1,200 tokens. Eight chunks at 500 tokens each: 4,000 tokens. In a research assistant with a generous retrieval budget, you might inject 8,000–12,000 tokens of context per query.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Hidden Fixed Cost&lt;/strong&gt;&lt;br&gt;
System prompt + tool schemas is your fixed cost floor. It doesn't change turn-to-turn. It can easily reach 2,000–4,000 tokens in a real agent — charged on every single API call in your fleet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: Conversation History&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model has no persistent memory. You create the illusion of memory by re-sending the full conversation history on every API call. Every turn appends two new entries (a user message and a model response) to a history that is re-sent in its entirety. Model responses can be long — a detailed answer with a code snippet might be 600–800 tokens. After ten exchanges, the conversation history alone can be 8,000–12,000 tokens.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Context Creep — Watching the Window Fill
&lt;/h3&gt;

&lt;p&gt;The process by which a context window fills over a conversation has a name in production systems: &lt;strong&gt;context creep&lt;/strong&gt;. Consider a realistic customer support agent: 1,200-token system prompt, three tool schemas totaling 600 tokens, RAG retrieval returning two chunks (~800 tokens per turn), user messages averaging 60 tokens, model responses averaging 350 tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context budget:
  Fixed overhead: 1,200 + 600 = 1,800 tokens
  Per-turn RAG:   800 tokens
  Per-turn history growth: 60 (user) + 350 (model) = 410 tokens

  Turns until 80% of 128K:
    (1,800 + n × 800 + n × 410) ≥ 102,400
    n × 1,210 ≥ 100,600
    n ≈ 84 turns

  If model reply averages 800 tokens instead:
    Per-turn growth: 60 + 800 = 860
    n × 1,660 ≥ 100,600
    n ≈ 60 turns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change the model reply length to 800 tokens — a detailed-answer agent — and the window hits 80% around turn 60 rather than 84. Quality degradation begins before you hit the hard limit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: The Physics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5. KV Cache Memory — Why Context Has a Physical Cost
&lt;/h3&gt;

&lt;p&gt;The context window limit is not an arbitrary policy. It is enforced by physics — GPU memory.&lt;/p&gt;

&lt;p&gt;The transformer's attention mechanism works by comparing every token in the context with every other token. For each token, the model creates a query ("what am I looking for?"), and every other token offers a key ("what do I contain?"). A third vector — the value — carries the actual information that gets passed when attention is high. Assembled across all tokens, these become the matrices &lt;strong&gt;Q&lt;/strong&gt;, &lt;strong&gt;K&lt;/strong&gt;, and &lt;strong&gt;V&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The QKᵀ product is an n × n matrix where n is the sequence length. Doubling n quadruples this computation.&lt;/p&gt;

&lt;p&gt;There are two distinct computational phases in LLM inference. &lt;strong&gt;Prefill&lt;/strong&gt; processes the entire input prompt at once — O(n²) per attention layer. Implementations like FlashAttention reduce the memory bandwidth pressure dramatically via tiled computation, but the asymptotic complexity doesn't change. &lt;strong&gt;Decode&lt;/strong&gt; generates one token at a time, attending only to the current token against the cached history — O(n) per step with the KV cache. Without caching, decode would also be O(n²). The KV cache converts decode from O(n²) to O(n) at the cost of memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KV Cache Memory Formula (Multi-Head Attention):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KV_memory = 2 × n_layers × n_heads × d_head × seq_len × bytes_per_param
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a 7B-parameter model with standard MHA (32 layers, 32 heads, head_dim 128) at bfloat16 (2 bytes):&lt;/p&gt;

&lt;p&gt;KV_memory per token ≈ 2 × 32 × 32 × 128 × 1 × 2 = 524,288 bytes ≈ 0.5 MB&lt;/p&gt;

&lt;p&gt;At 128K context: 0.5 MB × 128,000 = &lt;strong&gt;64 GB&lt;/strong&gt; of KV cache alone — more than the model weights at bfloat16 (~14 GB).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note on GQA and MLA:&lt;/strong&gt; Most modern models (Llama 3, Mistral, GPT-4o) use Grouped-Query Attention (GQA), which reduces the KV cache by sharing key-value heads across groups of query heads. A model with 32 query heads and 8 KV heads (4× reduction) brings the per-token cache from ~0.5 MB to ~0.125 MB — about 16 GB at 128K context. Still the dominant memory consumer at long contexts. DeepSeek-class models use Multi-head Latent Attention (MLA), which compresses the K and V projections into a low-rank latent space before storing them, achieving 5–10× memory reduction over standard MHA.&lt;/p&gt;

&lt;p&gt;A 70B MHA model (80 layers, 64 heads, head_dim 128, bfloat16) runs to roughly &lt;strong&gt;2.5 MB per token&lt;/strong&gt;: 2 × 80 × 64 × 128 × 2 bytes = 2,621,440 bytes. At 128K context that's ~320 GB — which is why providers either cap context length aggressively for large models, or charge steeply for long-context calls. GQA with 8 KV heads drops it to ~40 GB, still substantial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt; (available from OpenAI, Anthropic, Google) caches the computed KV activations for repeated prompt prefixes. Subsequent calls beginning with the same prefix pay 50–75% less for those cached tokens and benefit from lower latency because the prefill phase for the cached portion is skipped. A stable system prompt is an ideal caching candidate. One practical constraint: both OpenAI and Anthropic require a minimum prefix length of at least 1,024 tokens before caching activates. A 200-token system prompt won't benefit — another reason to consolidate instructions into one substantial block rather than spreading them across multiple small messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KV cache quantization&lt;/strong&gt; is an active area of production optimization: storing the K and V tensors in lower-precision formats (int8 or int4) cuts KV cache memory by 2–4× with modest accuracy penalties. Research like KVQuant explores going to 2-bit precision for certain layers while targeting 10M-token contexts on commodity hardware.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Lost in the Middle — Why Performance Collapses Before You Hit the Limit
&lt;/h3&gt;

&lt;p&gt;Memory is the first constraint. Attention quality is the second — and it bites you even when your window is half-empty.&lt;/p&gt;

&lt;p&gt;In 2023, researchers at Stanford and UC Berkeley published "Lost in the Middle." They gave LLMs a task requiring them to find a specific document from a set of twenty documents, all injected into the context window. The position of the relevant document was varied systematically.&lt;/p&gt;

&lt;p&gt;When the relevant document was first or last, models retrieved it accurately. When it was in the middle positions, accuracy dropped by more than 30%. Newer models — Claude 3.5, GPT-4o — have partially mitigated this bias through long-context fine-tuning. "Partially" is doing a lot of work there: independent evaluations continue to find meaningful position-dependent performance gaps in all current models, even at lengths well within their advertised limits.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Analogy: The Lecture Hall Effect&lt;/strong&gt;&lt;br&gt;
Students reliably remember a lecture's opening and closing. What happened in the middle of hour one is murky. LLMs have an analogous concentration pattern: strong attention to the beginning and end of the context, with a trough in the middle.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The mechanism is structural. RoPE (Rotary Position Embedding), used in most modern architectures, encodes position as a rotation applied to query and key vectors. The mathematical property of this rotation is that the similarity score between two vectors naturally decreases as the distance between their positions increases. At short contexts, the decay is a feature. At long contexts, it becomes a bug: tokens in the middle of a 100K-token window are thousands of positions away from both the beginning and from where the model is currently generating, so their similarity scores are systematically suppressed.&lt;/p&gt;

&lt;p&gt;A separate effect, &lt;strong&gt;context dilution&lt;/strong&gt;, compounds this: longer surrounding irrelevant context degrades performance even when the relevant content is guaranteed present. The model's attention distributes across noise, reducing effective attention for the signal — like finding one red marble in a bag of ten thousand, even knowing it's there.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A Subtle RAG Bug&lt;/strong&gt;&lt;br&gt;
If your RAG system retrieves 8 documents and inserts them in the middle of a long conversation history, the most relevant chunks may be in the attention trough. The model generates a response, you see no error, but the answer doesn't reflect those documents. The failure is silent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 4: Solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7. Token Budget Math — Calculating Your Real Available Space
&lt;/h3&gt;

&lt;p&gt;Every LLM application needs an explicit token budget with five zones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Zone&lt;/th&gt;
&lt;th&gt;Typical token range&lt;/th&gt;
&lt;th&gt;Fixed or variable?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System Prompt&lt;/td&gt;
&lt;td&gt;500–2,500&lt;/td&gt;
&lt;td&gt;Fixed per application&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Schemas&lt;/td&gt;
&lt;td&gt;200–400 per tool&lt;/td&gt;
&lt;td&gt;Fixed per agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG Context&lt;/td&gt;
&lt;td&gt;0–12,000&lt;/td&gt;
&lt;td&gt;Variable per turn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conversation History&lt;/td&gt;
&lt;td&gt;0 → grows&lt;/td&gt;
&lt;td&gt;Grows each turn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation Reserve&lt;/td&gt;
&lt;td&gt;500–2,000&lt;/td&gt;
&lt;td&gt;Reserved explicitly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The generation reserve must be reserved explicitly — if your prompt consumes the entire window, the model either generates nothing or truncates its response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A worked example.&lt;/strong&gt; Customer support agent, GPT-4o (128K):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total window:          128,000 tokens
System prompt:          -1,400 tokens  (measured)
Tool schemas (4 tools):   -800 tokens  (measured)
Generation reserve:     -1,500 tokens  (set by us)
─────────────────────────────────────────
Available for dynamic:  124,300 tokens

  Of that:
    RAG budget:           20,000 tokens  (5 chunks × 4,000 avg)
    History budget:       ~104,300 tokens (fills over time)

  ─────────────────────────────────────────
  Turns until 80% full:
    80% of 128K = 102,400 prompt tokens
    Fixed overhead = 1,400 + 800 = 2,200
    Per-turn RAG = 800
    Per-turn growth = user avg (60) + model avg (350) = 410
    Turns until (2,200 + n × 800 + n × 410) ≥ 102,400
    n × 1,210 ≥ 100,200
    n ≈ 82 turns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;82 turns sounds comfortable. But this assumes constant 350-token model replies. A user who triggers several detailed answers can double the history growth rate, cutting that to ~41 turns before the 80% threshold.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Measure, Don't Estimate&lt;/strong&gt;&lt;br&gt;
The system prompt and tool schema token counts must be measured with the actual tokenizer, not estimated from character counts. Log &lt;code&gt;prompt_tokens&lt;/code&gt; and &lt;code&gt;completion_tokens&lt;/code&gt; from every API response. The distribution of &lt;code&gt;prompt_tokens&lt;/code&gt; over time is your context growth curve.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  8. Four Strategies for Managing Context Window Limits
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strategy 1: Sliding Window&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Keep only the most recent turns of conversation verbatim. In production, truncate by token count, not turn count — a 5-turn history could range from 500 to 8,000 tokens depending on response lengths.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn-count version — simple, good enough for prototyping
&lt;/span&gt;&lt;span class="n"&gt;MAX_HISTORY_TURNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rag_chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;trimmed_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;MAX_HISTORY_TURNS&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rag_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context_block&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trimmed_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new_message&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;

&lt;span class="c1"&gt;# Production version — truncate by token count, not turn count
# HISTORY_TOKEN_BUDGET = context_limit - fixed_costs - generation_reserve
# Example for 128K window: 128000 - 2200 (sys+tools) - 1500 (reserve) - 20000 (RAG) ≈ 104000
&lt;/span&gt;&lt;span class="n"&gt;HISTORY_TOKEN_BUDGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40_000&lt;/span&gt;  &lt;span class="c1"&gt;# adjust for your application
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_messages_token_bounded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rag_chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;fixed_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rag_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;new_msg_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HISTORY_TOKEN_BUDGET&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fixed_tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;new_msg_tokens&lt;/span&gt;

    &lt;span class="c1"&gt;# Walk history from newest to oldest, keep what fits
&lt;/span&gt;    &lt;span class="n"&gt;trimmed_rev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;turn_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;turn_tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;trimmed_rev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;turn_tokens&lt;/span&gt;
    &lt;span class="n"&gt;trimmed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trimmed_rev&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rag_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_chunks&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new_message&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The drawback of the sliding window is abrupt forgetting: when turn 1 drops, any fact established there is simply gone. For short-lived task-completion agents, this is fine. For long-running conversational assistants, it creates visible gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy 2: Hierarchical Summarization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Keep recent turns verbatim; compress older turns into a rolling summary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;maybe_compress_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;verbatim_turns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;turns_to_summarize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;turns_to_summarize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;

    &lt;span class="n"&gt;new_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Existing summary: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New exchanges to incorporate:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;format_turns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turns_to_summarize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Update the summary to include these exchanges. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Preserve all concrete facts, decisions, and commitments. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Drop conversational filler. Be dense. Max ~400 tokens.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;verbatim_turns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_summary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cap the summary at 200–400 tokens. Run summarization asynchronously — don't make the user wait for the compression cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy 3: Token Compression (LLMLingua)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use a compression model to identify and remove low-entropy tokens from prompts, achieving 2–3× compression with minor accuracy loss. The most effective targets are verbose system prompts, RAG context chunks, and few-shot examples.&lt;/p&gt;

&lt;p&gt;Never apply compression to the current user message — compressing user input changes their meaning before the model sees it. Test in your specific domain for tasks where precision matters (legal, medical, code).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy 4: Embedding-based Retrieval Over History&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Store each conversation turn as a dense vector. At each new turn, embed the current user message and retrieve the most relevant prior turns by similarity. Concretely: as each turn completes, embed the user + assistant text and store it in a vector store alongside the full text. On the next user message, embed it, search for top-k similar turns, inject those into context. Keep only 2–3 verbatim recent turns for coherence.&lt;/p&gt;

&lt;p&gt;The effect: only the conversation history relevant to the current question enters the context window. A user asking "what was the budget we discussed?" triggers retrieval of those turns — even if they happened fifty exchanges ago. This requires an embedding model, a vector store, and a retrieval call per user message (adding roughly 50–150ms round-trip with a managed API, under 10ms with a self-hosted model).&lt;/p&gt;

&lt;p&gt;The four strategies are not mutually exclusive. Production systems often combine them: a sliding window of 5–8 verbatim turns + rolling summary + retrieval from older history covers all distance scales simultaneously.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. The Practical Playbook
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Short task-completion agents (under 20 turns):&lt;/strong&gt; Use a sliding window of 10–15 turns. Reserve optimization effort for fixed-cost reduction: audit your system prompt for redundant language, consider dynamic tool registration (load only the tools relevant to the current turn).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-running conversational assistants:&lt;/strong&gt; Implement hierarchical summarization with 8–12 verbatim turns. Cap summaries at 400 tokens. Run asynchronously. Periodically audit system prompt size — prompt creep through edits is real. A prompt that started at 600 tokens can quietly grow to 3,000 across six months of product changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document-heavy research assistants (heavy RAG):&lt;/strong&gt; Limit retrieval to 3–5 top chunks. Apply token compression to chunks before injection. Sort retrieved chunks so the most relevant appears last in the injected block — adjacent to the user question, within the recency attention peak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production agents with many tools:&lt;/strong&gt; Use dynamic tool registration. A routing classifier (even a keyword matcher) identifies which tools are needed before the main model call and includes only those schemas — reducing 2,000 tokens of tool overhead to ~400 on most turns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context ordering (exploit the attention curve):&lt;/strong&gt; Instead of the framework default (system → history → RAG → user), use: system → recent history (most-recent last) → RAG chunks (most relevant last, adjacent to the user message) → current user message. The most relevant content sits at the end of the context, within the recency attention peak. Older history — the least relevant content — occupies the lower-attention middle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to monitor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;prompt_tokens / context_limit&lt;/code&gt; — alert above 70%, act above 80%&lt;/li&gt;
&lt;li&gt;Token count by zone per call — when total grows, know which zone is responsible&lt;/li&gt;
&lt;li&gt;Quality signals segmented by context utilization — you may find degradation starts at 60% in your application&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion: The Window Is a System Resource
&lt;/h2&gt;

&lt;p&gt;A context window isn't a document store you fill until it overflows. It's a compute and memory resource with hard physical limits, a quality curve that degrades well before those limits, and an inference cost that grows with every token you put in it.&lt;/p&gt;

&lt;p&gt;In a typical agent, the window is 30–60% consumed before the first user message lands. The fix isn't a bigger context window, though headroom helps. It's building a real budget: measure each zone with an actual tokenizer, set hard limits per zone, implement a context manager that enforces those limits on every call, and track utilization in production dashboards the same way you'd track memory or CPU.&lt;/p&gt;

&lt;p&gt;The attention degradation problem — "lost in the middle" — adds a second dimension: even when your window is not full, quality depends on where in the window the important information sits. The primacy bias and recency bias are real, measurable effects that application design can exploit or fall victim to.&lt;/p&gt;

&lt;p&gt;The four strategies aren't competitors — most production systems end up combining them. Sliding window for the recent turns, rolling summary for the older ones, compression for the RAG chunks, and retrieval for anything that needs to survive beyond the window. Start with the simplest thing that doesn't break your use case, and add layers as your traffic and conversation length grow.&lt;/p&gt;

&lt;p&gt;Context engineering doesn't have the glamour of prompt engineering, but it's where most production LLM failures actually live. Missed retrievals, incoherent multi-turn conversations, bloated inference bills — these trace back to context mismanagement more often than they trace back to the wrong model. It fails silently, which is exactly why it's easy to ignore until you can't.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Research Papers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Liu et al., &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;"Lost in the Middle: How Language Models Use Long Contexts"&lt;/a&gt; — Stanford / Berkeley / Samaya AI, 2023. The original paper quantifying the U-shaped attention bias across context positions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2406.16008" rel="noopener noreferrer"&gt;"Found in the Middle: Calibrating Positional Attention Bias"&lt;/a&gt; — 2024. Proposes an architectural fix to the lost-in-the-middle problem, recovering up to 15pp accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2510.05381" rel="noopener noreferrer"&gt;"Context Length Alone Hurts LLM Performance Despite Perfect Retrieval"&lt;/a&gt; — 2025. Demonstrates context dilution: longer irrelevant context degrades performance even when the relevant content is guaranteed present.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2401.18079" rel="noopener noreferrer"&gt;"KVQuant: Towards 10M Context Length LLM Inference with KV Cache Quantization"&lt;/a&gt; — 2024. Explores per-channel quantization of the KV cache to enable extreme context lengths on commodity hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technical References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://platform.openai.com/docs/guides/conversation-state" rel="noopener noreferrer"&gt;OpenAI — Managing Conversation State&lt;/a&gt; — Official docs on conversation history management and token counting.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/context-windows" rel="noopener noreferrer"&gt;Anthropic — Context Window Documentation&lt;/a&gt; — Claude context limits, caching strategies, and best practices.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/microsoft/LLMLingua" rel="noopener noreferrer"&gt;LLMLingua — Prompt Compression&lt;/a&gt; — Microsoft Research open-source project for token-level prompt compression.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://mbrenndoerfer.com/writing/kv-cache-memory-calculation-llm-inference-gpu" rel="noopener noreferrer"&gt;KV Cache Memory: Calculating GPU Requirements for LLM Inference&lt;/a&gt; — Interactive calculator for KV cache memory requirements given model architecture parameters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Background Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://tianpan.co/blog/2025-11-11-managing-token-budgets-production-llm-systems" rel="noopener noreferrer"&gt;The Hidden Costs of Context: Managing Token Budgets in Production LLM Systems&lt;/a&gt; — TianPan.co, 2025. Production-focused survey of context management challenges.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://redis.io/blog/context-window-management-llm-apps-developer-guide/" rel="noopener noreferrer"&gt;Context Window Management for LLM Apps: Developer Guide&lt;/a&gt; — Redis, 2025. Practical implementation patterns for context management in production.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://swapnanilsaha.com/blog/text-embeddings-llms-rag-complete-guide/" rel="noopener noreferrer"&gt;The Complete Guide to Text Embeddings, Vector Databases &amp;amp; LLMs&lt;/a&gt; — Swapnanil Saha, 2026. Deep background on tokenization, BPE, transformer attention, and RAG pipelines referenced throughout this post.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why AI Code Assistants Waste Context — and How RAG Fixes It</title>
      <dc:creator>Swapnanil Saha</dc:creator>
      <pubDate>Tue, 26 May 2026 19:25:33 +0000</pubDate>
      <link>https://dev.to/swapnanilsaha/why-ai-code-assistants-waste-context-and-how-rag-fixes-it-1gej</link>
      <guid>https://dev.to/swapnanilsaha/why-ai-code-assistants-waste-context-and-how-rag-fixes-it-1gej</guid>
      <description>&lt;p&gt;Open a large file in your AI code assistant and ask it to refactor a function buried three hundred lines down. Watch it confidently produce something plausible but wrong — using an interface that was deprecated last sprint, calling a helper that doesn't exist in this service, ignoring a constraint in the module-level docstring that it technically "saw." The model didn't forget. The information was technically present in the prompt, but the transformer's attention mechanism never meaningfully focused on it. That's a different kind of failure, and it doesn't get better with a bigger context window.&lt;/p&gt;

&lt;p&gt;There's a persistent intuition in this industry that more context is always better. Send the whole file. Send the whole codebase. This intuition breaks in a specific and measurable way. The mechanism is called &lt;strong&gt;attention dilution&lt;/strong&gt; — softmax normalization means that every token in the context competes for a fixed budget of attention weight, and as the sequence grows longer, any given piece of information gets a smaller share of that budget.&lt;/p&gt;

&lt;p&gt;This post walks through the transformer attention math to explain exactly why the naive approach fails, then covers how RAG (Retrieval-Augmented Generation) addresses it — by retrieving only the specific code chunks relevant to the current task and injecting those into the context window instead of dumping everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: The Problem with Stuffing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Naive Approach: Just Send Everything
&lt;/h3&gt;

&lt;p&gt;The first instinct when building a code assistant is to send as much context as possible. Your project has a utility module? Include it. There's a shared type definitions file? Throw that in too. If the model's context window is 128,000 tokens, fill it to the brim — more information has to be better, right?&lt;/p&gt;

&lt;p&gt;This is called &lt;strong&gt;context window stuffing&lt;/strong&gt;. Three things go wrong with it, and each gets worse as the codebase grows. The first is attention dilution — the focus of this section. The second is position bias (Section 3). The third is raw cost (Section 4). To understand why these happen, you need a concrete model of how a transformer actually reads a prompt.&lt;/p&gt;

&lt;p&gt;A transformer does not read a prompt sequentially, the way a human reads a page from left to right. Instead, it processes all tokens &lt;em&gt;simultaneously&lt;/em&gt;, and every token attends to every other token in the sequence. The attention mechanism is the machine that computes how much each token should "look at" every other token when forming its representation.&lt;/p&gt;

&lt;p&gt;The output of attention for a single token is a weighted average of all the other tokens' value vectors. The weights are computed by comparing the current token's &lt;em&gt;query&lt;/em&gt; vector against every other token's &lt;em&gt;key&lt;/em&gt; vector. When you add more tokens to the context, you are not adding more information to a receptive mind — you are adding more competitors for a fixed budget of attention weight.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; Imagine you are in a room full of people, all talking at once. You can only pay 100 percent of your attention total — it does not grow with the number of people. With 5 people in the room, each gets roughly 20% of your focus. With 500, each gets 0.2%. When the relevant person finally says something, their share of your attention has collapsed to noise. That is what happens to code buried in a long prompt.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  2. Why Attention Dilutes: The Math
&lt;/h3&gt;

&lt;p&gt;The attention mechanism was introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). Its core computation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attention(Q, K, V) = softmax( QK^T / √d_k ) · V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Q&lt;/code&gt; — the query matrix (what each token is "asking for")&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;K&lt;/code&gt; — the key matrix (what each token "offers" for comparison)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V&lt;/code&gt; — the value matrix (the actual content passed forward if selected)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;d_k&lt;/code&gt; — the dimension of the key vectors (scales to prevent extreme dot products)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;softmax&lt;/code&gt; — converts a vector of raw scores into a probability distribution that sums to 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The notation &lt;code&gt;QK^T&lt;/code&gt; means: for each token, compute a dot product between its query vector and every other token's key vector. The dot product is large when two vectors point in the same direction (high relevance between the pair), and near zero when they are orthogonal (unrelated). Multiplying by the transposed key matrix &lt;code&gt;K^T&lt;/code&gt; does all N×N such comparisons in a single matrix operation. The result is a matrix of raw relevance scores. Dividing by &lt;code&gt;√d_k&lt;/code&gt; prevents those scores from becoming so large that softmax saturates.&lt;/p&gt;

&lt;p&gt;The softmax step is the dilution mechanism. Because softmax always outputs a probability distribution — all values sum to exactly 1 — attention weights are a zero-sum resource. When there are N tokens in the context, the &lt;em&gt;average&lt;/em&gt; attention weight is 1/N, regardless of what any individual token does. The total budget is fixed at 1.0.&lt;/p&gt;

&lt;p&gt;This does not mean every token gets exactly equal attention — the model can still concentrate on a small subset if the dot-product scores separate those tokens sharply from the rest. Softmax is non-linear and can be quite aggressive when there is a large score gap between relevant and irrelevant tokens. But in a real codebase, that gap is rarely clean. Hundreds of unrelated function definitions produce hundreds of tokens with moderately non-zero dot products — they're not completely irrelevant, they just aren't what you need right now. These tokens collectively consume most of the softmax budget. The useful signal must compete against this crowd, and as N grows, the signal's share degrades continuously. It isn't a cliff; it's a steady erosion that compounds with each additional file you stuff in.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; The context window limit is not just a practical engineering constraint — it reflects a genuine quality degradation. The problem is not that the model &lt;em&gt;cannot read&lt;/em&gt; long inputs. It is that as context grows, every individual piece of information receives proportionally less attention weight. More input does not mean more comprehension; it means each fact competes harder for finite attentional resources.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  3. Lost in the Middle: Position Bias
&lt;/h3&gt;

&lt;p&gt;Attention dilution is one problem. A second, independent problem compounds it: &lt;strong&gt;position bias&lt;/strong&gt;. Modern language models do not attend to all positions in their context with equal reliability. They preferentially attend to tokens at the beginning and end of the sequence, and perform significantly worse on information placed in the middle.&lt;/p&gt;

&lt;p&gt;This phenomenon was studied in a 2023 paper by Nelson Liu et al. titled &lt;em&gt;Lost in the Middle: How Language Models Use Long Contexts&lt;/em&gt;. The researchers tested models on multi-document question answering, varying the position of the document containing the answer. When the answer document was at position 1 or last, accuracy was high. When it was at position 10 of 20 documents, accuracy dropped by more than 30 percentage points — even though the information was technically within the model's context window.&lt;/p&gt;

&lt;p&gt;Two mechanisms contribute. The first is &lt;strong&gt;RoPE&lt;/strong&gt; (Rotary Position Embeddings), the positional encoding scheme in most modern open-source language models (LLaMA, Mistral, GPT-NeoX). RoPE encodes position by rotating the query and key vectors by angles proportional to their positions. The dot product between a query at position &lt;em&gt;m&lt;/em&gt; and a key at position &lt;em&gt;n&lt;/em&gt; includes a term that decays with relative distance (m−n) — semantically relevant tokens far from the query position must overcome a rotational penalty to receive attention weight. Tokens near the start of the sequence are close to almost every other position, giving them a structural advantage.&lt;/p&gt;

&lt;p&gt;The second mechanism is &lt;strong&gt;causal training recency bias&lt;/strong&gt;. Language models are trained to predict the next token given all previous tokens. This reward signal pushes models to weight recent tokens heavily — the immediately preceding context is almost always the most relevant signal for next-token prediction during training. The middle of a long context rarely dominated training gradients, so models systematically underweight it. This effect was documented in GPT-3.5 era models well before RoPE became standard — it isn't purely an artifact of positional encoding, it's baked into causal pretraining. Both effects run in the same direction: the middle of a long context is structurally disadvantaged.&lt;/p&gt;

&lt;p&gt;A 2024 paper from UW, MIT, and Google (&lt;em&gt;Found in the Middle&lt;/em&gt;) demonstrated that this bias can be partially corrected by calibrating attention weights at inference time — but this requires modifying the model's internals, which is not available when calling an API.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Common Mistake:&lt;/strong&gt; Many teams inject retrieved chunks at the &lt;em&gt;end&lt;/em&gt; of the prompt, after a long system prompt and conversation history. This lands retrieved content in a position that gets the worst of both worlds: far from the beginning (losing the primacy advantage) and not at the very end (which is reserved for the generation target itself). The safest placement for retrieved code context is &lt;strong&gt;immediately before the user's specific question&lt;/strong&gt;, near the end but not buried in the middle of a long history.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  4. The Quadratic Cost Problem
&lt;/h3&gt;

&lt;p&gt;Even if you were willing to accept degraded attention quality, there is a third reason not to stuff context: the compute cost of attention scales &lt;strong&gt;quadratically&lt;/strong&gt; with sequence length.&lt;/p&gt;

&lt;p&gt;To compute the full attention matrix, the model must compare every token's query against every other token's key. If your sequence has N tokens, this requires N × N comparisons. Doubling the context length quadruples the compute required for attention.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Time complexity of full self-attention: O(N² · d)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 4× increase in context length → 16× increase in attention compute. A 10× increase → 100×.&lt;/p&gt;

&lt;p&gt;FlashAttention (Dao et al., 2022) improves the &lt;em&gt;memory&lt;/em&gt; profile to O(N) via tiling — it never writes the full N×N matrix to GPU memory. But the number of floating-point operations is still O(N²). Latency and cost still scale quadratically with sequence length.&lt;/p&gt;

&lt;p&gt;In production, a code assistant filling 100,000 tokens of context is not just 10× slower than one filling 10,000 tokens — it is closer to 100× more expensive in attention compute alone. You are paying more to get worse results.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: How RAG Fixes It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5. RAG at a Glance: The Core Idea
&lt;/h3&gt;

&lt;p&gt;Retrieval-Augmented Generation reframes the problem. Instead of asking "how can we give the model the whole codebase?", it asks: "how do we figure out which parts of the codebase are relevant to this specific completion request, and send only those?"&lt;/p&gt;

&lt;p&gt;The answer has two phases. First, an &lt;strong&gt;offline indexing phase&lt;/strong&gt; where the codebase is processed, divided into chunks, and each chunk is converted into a vector representation (an embedding) that captures its semantic meaning. These vectors are stored in an index optimized for fast similarity search. Second, an &lt;strong&gt;online retrieval phase&lt;/strong&gt; that happens at query time: the developer's current context is converted into a query vector, and the most similar chunks from the index are retrieved and injected into the prompt.&lt;/p&gt;

&lt;p&gt;The model then receives a context window that is not a random cross-section of the codebase — it is the small set of pieces most likely to be relevant to the task at hand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pipeline:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Parse &amp;amp; Chunk&lt;/strong&gt; — split at function/class boundaries, not arbitrary token counts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embed Chunks&lt;/strong&gt; — convert each chunk to a vector with a code embedding model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Search Index&lt;/strong&gt; — ANN index for dense retrieval + BM25 index for lexical retrieval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embed the Query&lt;/strong&gt; — convert current cursor context to a query vector&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve Top-k&lt;/strong&gt; — run hybrid search (dense + BM25), fuse results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject &amp;amp; Generate&lt;/strong&gt; — inject top 3–5 chunks into the LLM prompt, immediately before the user's request&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1–3 happen once (or on incremental file changes). Steps 4–6 happen on every completion request. The parts where most implementations go wrong: chunking (using fixed-size splits instead of AST boundaries), retrieval (using only dense search and missing exact identifier queries), and injection order (burying retrieved context in the middle of the prompt).&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Chunking for Code: Why Fixed-Size Fails
&lt;/h3&gt;

&lt;p&gt;Code has structure that text does not. A function is a unit of meaning. &lt;strong&gt;Fixed-size chunking&lt;/strong&gt; — splitting every file every 256 tokens — splits in the middle of functions, destroying logical units.&lt;/p&gt;

&lt;p&gt;Consider a Python function that is 80 lines long. With a 50-token chunk size, it gets split into chunks that look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk A: def process_payment(order_id, amount, currency="USD"):
    """Process a payment..."""
    conn = get_db_connection()
    try:
        txn = conn.begin_transaction(

Chunk B:   order_id=order_id,
  amount=amount,
  currency=currency
)    except DatabaseError as e:
        log_error(e)
        raise PaymentError(str(e))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Neither chunk represents the function accurately. The embedding of Chunk A does not represent "a payment processing function" — it represents a truncated fragment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AST-based chunking&lt;/strong&gt; uses tree-sitter to parse each file and extract logical units at language-defined boundaries: function definitions, class bodies, method groups. Each chunk's metadata includes file path, start line, end line, and node type. This metadata is as important as the chunk text itself — it tells the retrieval system where in the codebase this chunk lives.&lt;/p&gt;

&lt;p&gt;One practical addition: each chunk can be augmented with a small surrounding context for embedding purposes — the preceding import block, the class it belongs to, or the file's module-level docstring. This gives the embedding model enough context to produce a vector that reflects the chunk's role in the larger structure. The key is that this surrounding context is used only for embedding, not retrieved as part of the chunk text.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The overlap trap in code:&lt;/strong&gt; Sliding window overlap (copying N tokens from one chunk into the next) is useful in prose. In code it often makes things worse: the overlap introduces duplicate logic into separate chunks, making embedding space crowded with near-identical vectors. For code, the recommended approach is to store a "parent context" chunk separately — always inject the enclosing class signature alongside any function chunk, rather than copying the previous function's body into the current chunk. The &lt;code&gt;Continue&lt;/code&gt; open-source IDE extension uses this approach.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  7. Retrieval Strategies: Dense, Sparse, and Why Code Needs Both
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dense retrieval&lt;/strong&gt; converts query and each chunk to vectors, then finds the most similar by cosine similarity. It can match meaning even when exact words differ — "how do we handle rate limit errors?" surfaces functions named &lt;code&gt;throttle_on_429&lt;/code&gt; or &lt;code&gt;backoff_retry&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The embedding model used matters significantly. Code-specialized models like &lt;code&gt;voyage-code-3&lt;/code&gt; — purpose-built for code retrieval, top-ranked on code retrieval benchmarks (2025) — produce substantially better representations for function bodies, type signatures, and API calls than general-purpose models. &lt;code&gt;text-embedding-3-large&lt;/code&gt; is a strong general-purpose embedding model suited for mixed code + documentation retrieval, but it wasn't specifically designed around code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BM25 (lexical/keyword retrieval)&lt;/strong&gt; counts words. It excels at exact matches — a developer looking for &lt;code&gt;PaymentGateway.process_refund&lt;/code&gt; will find it immediately. Error codes, configuration key names, and exact API method names are better retrieved lexically than semantically. For code, the asymmetry is important: queries for exact identifiers favor BM25. Queries for concepts and behaviors favor dense retrieval. The right system runs both.&lt;/p&gt;




&lt;h3&gt;
  
  
  8. Hybrid Search and Reciprocal Rank Fusion
&lt;/h3&gt;

&lt;p&gt;Running both methods produces two ranked lists that need combining. BM25 scores and cosine similarity scores live in completely different numerical ranges — you cannot add them directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt; avoids the normalization problem entirely by ignoring raw scores and working only with ranks. The word "reciprocal" means 1/x — the score assigned to a document is the reciprocal of its rank in each list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RRF_score(d) = Σ_{r ∈ R} 1 / (k + rank_r(d))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;R&lt;/code&gt; = set of ranked lists (BM25 list, dense list)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rank_r(d)&lt;/code&gt; = position of document &lt;code&gt;d&lt;/code&gt; in list &lt;code&gt;r&lt;/code&gt; (1-indexed)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;k&lt;/code&gt; = smoothing constant (default 60, from Cormack, Clarke &amp;amp; Buettcher 2009 — empirically robust across many retrieval tasks). Increasing k makes the formula more conservative, rewarding consistent mid-rank appearances over a single strong rank.&lt;/li&gt;
&lt;li&gt;If a document does not appear in a list, its contribution from that list is 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A document ranked #1 in both lists scores ≈ 0.033. A document ranked #1 in one list but #100 in the other scores ≈ 0.022. Candidates that both BM25 and semantic search agree on float to the top.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. Reranking: The Final Sorting Pass
&lt;/h3&gt;

&lt;p&gt;After hybrid search and RRF, you have ~20 candidate chunks. A &lt;strong&gt;cross-encoder reranker&lt;/strong&gt; takes both the query and a candidate chunk as a single concatenated input and produces a relevance score. Because both texts pass through the model together, the model can attend to query-document relationships that a bi-encoder cannot — query and document never interact during bi-encoder encoding.&lt;/p&gt;

&lt;p&gt;The practical architecture: use fast bi-encoder retrieval (dense + BM25 + RRF) to get the top 20 candidates, then run a cross-encoder on those 20 for final ordering. The top 5 go into the prompt.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-encoder context window limits:&lt;/strong&gt; Cross-encoders are themselves transformer models with context window limits. General-purpose reranker models like &lt;code&gt;ms-marco-MiniLM-L-12-v2&lt;/code&gt; support 512 subword tokens — which is often enough for a single short function, but not for large class bodies. For retrieval pipelines that surface larger chunks, use a reranker with a larger window: Cohere Rerank 3 supports 4,096 tokens; voyage-rerank-2 supports 16K. If the combined chunk+query still exceeds the limit, truncate the chunk from the bottom — the function signature and docstring are more informative for reranking than the implementation tail.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For code with strong AST chunking and a good code embedding model, hybrid bi-encoder retrieval is often sufficient for most queries. Reranking becomes most valuable when queries are ambiguous or when the codebase has many semantically similar functions. It adds 50–200ms of latency, so benchmark before committing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: In Production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  10. How Cursor Does It: A Reference Architecture
&lt;/h3&gt;

&lt;p&gt;When you open a project in Cursor, it chunks local files and sends them to its servers, where they are embedded (via OpenAI's API or a custom model) and stored in &lt;strong&gt;Turbopuffer&lt;/strong&gt; — its vector store of choice. File paths are obfuscated client-side before any data leaves your machine. Embeddings are cached by chunk hash, making incremental re-indexing fast.&lt;/p&gt;

&lt;p&gt;At query time, Cursor monitors the active cursor position and constructs a composite signal: the current file's surrounding code, any open editor tabs, and recent edit history. This signal is embedded into a query vector, sent to Turbopuffer for ANN search, and the top-k results are retrieved. The actual code is read from local disk; the model only sees the retrieved text.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@Codebase&lt;/code&gt; in Cursor's chat is the explicit trigger for a full retrieval pass over the indexed codebase. Without it, Cursor uses a lighter heuristic based on open tabs and file imports. &lt;code&gt;@Docs&lt;/code&gt; and &lt;code&gt;@Web&lt;/code&gt; extend the same pipeline beyond the local codebase.&lt;/p&gt;

&lt;p&gt;One important architectural note: the embedding model used to index the codebase is separate from the generative model used to produce completions. Cursor uses a lightweight, fast embedding model for indexing (optimized for latency and throughput over millions of chunks) and a larger, slower generative model for the actual completion. When building a similar system, these two components have independent optimization concerns — do not assume the same model serves both roles.&lt;/p&gt;

&lt;p&gt;GitHub Copilot's context construction follows a similar pattern. For inline completion, it uses the current file content around the cursor plus a Jaccard similarity heuristic to find other open tabs that share significant token overlap with the current file. The &lt;code&gt;@workspace&lt;/code&gt; symbol in VS Code triggers a more thorough indexing-based search, analogous to Cursor's &lt;code&gt;@Codebase&lt;/code&gt;. Copilot's default inline completion mode is a fast, low-latency path that does not run full vector retrieval on every keystroke — full retrieval is reserved for explicit chat interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. Tradeoffs and Limits of Code RAG
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;RAG Behavior&lt;/th&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cross-file dependency reasoning&lt;/td&gt;
&lt;td&gt;Each retrieved chunk is a fragment; the model may not understand how three retrieved functions compose at the call site&lt;/td&gt;
&lt;td&gt;Include file path + line range metadata; retrieve parent class or module-level imports alongside function bodies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Newly created files not yet indexed&lt;/td&gt;
&lt;td&gt;Invisible to retrieval until the index is rebuilt&lt;/td&gt;
&lt;td&gt;Incremental indexing on file-save events; maintain a pending index queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query is too vague&lt;/td&gt;
&lt;td&gt;"fix the bug" → retrieves generic results&lt;/td&gt;
&lt;td&gt;Use cursor position + surrounding error message as primary query signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minified or generated code&lt;/td&gt;
&lt;td&gt;Lock files, protobuf generated code pollute the index&lt;/td&gt;
&lt;td&gt;Maintain a .gitignore-style exclude list for the RAG indexer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very large monorepos&lt;/td&gt;
&lt;td&gt;Recall degrades; indexing is slow&lt;/td&gt;
&lt;td&gt;Scope index to current working subdirectory or per-service sub-indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema/type changes&lt;/td&gt;
&lt;td&gt;Stale embeddings give the model outdated type signatures&lt;/td&gt;
&lt;td&gt;Invalidate embeddings on file write by chunk content hash&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Does a larger context window make RAG obsolete?&lt;/strong&gt; As context windows grow to 1M and beyond — Llama 4 Scout hit 10M tokens in 2025, Gemini 1.5 Pro supported 1M — this question keeps coming up. The practical answer is no, though the reasoning matters. A 200,000-line Python codebase easily exceeds 2 million tokens. Most production monorepos are far larger. More importantly, the attention quality degradation described in Sections 2 and 3 doesn't disappear with a larger nominal window. Those long-context models achieve their range through techniques like NTK-aware RoPE scaling (which extends the effective frequency range of positional encodings) and sparse attention patterns (which skip computation on distant token pairs) — these help with extrapolation but don't eliminate the position bias at extremely long ranges. And practically: a 1M-token prompt is expensive and slow even on state-of-the-art hardware. For interactive code assistance, stuffing the full codebase is off the table regardless of window size.&lt;/p&gt;

&lt;p&gt;Large context windows and RAG do different jobs. RAG decides &lt;em&gt;what&lt;/em&gt; deserves to be in the context window. The context window determines &lt;em&gt;how much&lt;/em&gt; you can fit once you've been selective. A well-tuned system retrieves the right 5,000 tokens from a 10M-token codebase and puts them in a 128K window with room left for conversation history and tool outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  12. Building Your Own Code RAG Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Parsing:&lt;/strong&gt; Use tree-sitter with a recursive AST walk — iterating only over &lt;code&gt;root_node.children&lt;/code&gt; misses deeply nested functions and class methods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudo-code: AST chunk extraction with tree-sitter (v0.21+ API)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tree_sitter_python&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tspython&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tree_sitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Parser&lt;/span&gt;

&lt;span class="n"&gt;PY_LANGUAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tspython&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;language&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Parser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PY_LANGUAGE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;TARGET_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_definition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_definition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;walk_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Recursively walk the AST to catch nested definitions
    (methods inside classes, functions inside functions, etc.)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TARGET_TYPES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunk_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_byte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_byte&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_point&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_point&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="c1"&gt;# For class_definition, continue recursing to capture methods.
&lt;/span&gt;        &lt;span class="c1"&gt;# For function_definition, stop — we want the whole function,
&lt;/span&gt;        &lt;span class="c1"&gt;# not its nested helpers as separate chunks.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_definition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;walk_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;walk_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="nf"&gt;walk_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Embedding models:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Context window&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;voyage-code-3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16K tokens&lt;/td&gt;
&lt;td&gt;Purpose-built for code; top-ranked on code retrieval benchmarks (2025)&lt;/td&gt;
&lt;td&gt;Production code assistant, maximum retrieval quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;text-embedding-3-large&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8K tokens&lt;/td&gt;
&lt;td&gt;Strong general performance; well-supported; large community&lt;/td&gt;
&lt;td&gt;Mixed code + documentation retrieval; existing OpenAI integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nomic-embed-code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8K tokens&lt;/td&gt;
&lt;td&gt;Open-weight; can run locally; no API cost&lt;/td&gt;
&lt;td&gt;Air-gapped environments; cost-sensitive deployments; on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Vector store:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pgvector in Postgres — sufficient for single-developer or small-team tools&lt;/li&gt;
&lt;li&gt;Qdrant — supports both dense and sparse vectors in a single collection, enabling native hybrid search without maintaining two separate stores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prompt injection template:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a coding assistant for this codebase.

## Relevant context from the codebase:

### [payments/gateway.py · lines 42–87]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
python&lt;br&gt;
{chunk_1_text}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### [payments/exceptions.py · lines 1–24]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
python&lt;br&gt;
{chunk_2_text}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### [payments/models.py · lines 88–112]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
python&lt;br&gt;
{chunk_3_text}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Current task:
{user_request}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Include file path and line numbers in each chunk header. These cost very few tokens but give the model the module structure needed to generate correct imports and references.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Do not retrieve more than you need.&lt;/strong&gt; It is tempting to inject 10–15 chunks to "give the model more information." Resist this. Each additional chunk increases context size (paying the quadratic cost from Section 4), increases attention dilution, and reduces the proportion of the context that is highly relevant. In practice, 3–5 high-quality chunks typically outperform 15 lower-quality ones. Invest in retrieval quality, not retrieval quantity.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Through-Line
&lt;/h2&gt;

&lt;p&gt;The surprising thing about attention dilution is that it isn't a bug you can patch. It's a structural property of softmax normalization — the total attention weight sums to 1.0 regardless of sequence length, so every token you add is competing with every other for a share of that budget. More context doesn't mean more understanding; it means each fact gets a smaller slice. The lost-in-the-middle position bias makes it worse: code injected into the middle of a long prompt is structurally disadvantaged by both RoPE's distance decay and the recency bias that causal pretraining instills. Knowing this changes how you think about the whole problem.&lt;/p&gt;

&lt;p&gt;RAG doesn't solve attention dilution — it sidesteps it. Instead of sending everything and hoping the model finds what's relevant, it figures out what's relevant first and sends only that. The context window ends up containing what actually matters for the task: the right type definitions, the right helper functions, the right error handling patterns.&lt;/p&gt;

&lt;p&gt;In practice: below roughly 3,000–5,000 lines, context stuffing usually works well enough. Above that, the problems stack up fast. At 50,000+ lines, naive stuffing reliably hurts. At 500,000+ lines, AST chunking, hybrid BM25 + dense retrieval, RRF fusion, and careful prompt injection aren't premature optimization — they're the baseline.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Foundational Papers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vaswani et al. (2017) — &lt;em&gt;Attention Is All You Need&lt;/em&gt;. NeurIPS. The original transformer paper introducing scaled dot-product attention.&lt;/li&gt;
&lt;li&gt;Liu et al. (2023) — &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;&lt;em&gt;Lost in the Middle: How Language Models Use Long Contexts&lt;/em&gt;&lt;/a&gt;. Stanford / Berkeley. Empirical study of U-shaped attention bias and the 30% accuracy drop at mid-context positions.&lt;/li&gt;
&lt;li&gt;He et al. (2024) — &lt;a href="https://arxiv.org/abs/2406.16008" rel="noopener noreferrer"&gt;&lt;em&gt;Found in the Middle: Calibrating Positional Attention Bias&lt;/em&gt;&lt;/a&gt;. UW / MIT / Google. Proposed calibration method that partially corrects RoPE position bias at inference time.&lt;/li&gt;
&lt;li&gt;Survey (2025) — &lt;a href="https://arxiv.org/abs/2510.04905" rel="noopener noreferrer"&gt;&lt;em&gt;Retrieval-Augmented Code Generation: A Survey&lt;/em&gt;&lt;/a&gt;. Comprehensive survey of RAG approaches specifically for code generation and repository-level tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RAG &amp;amp; Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cormack, Clarke &amp;amp; Buettcher (2009) — &lt;em&gt;Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Dao et al. (2022) — &lt;em&gt;FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://redis.io/blog/rag-vs-large-context-window-ai-apps/" rel="noopener noreferrer"&gt;RAG vs Large Context Window: Real Trade-offs for AI Apps&lt;/a&gt; — Redis Engineering Blog&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.shaped.ai/blog/context-window-optimization-why-ranking-not-stuffing-is-the-scaling-law-for-agents" rel="noopener noreferrer"&gt;Context Window Optimization: Why Ranking, Not Stuffing, Is the Scaling Law for Agents&lt;/a&gt; — Shaped AI&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@vishnudhat/rag-for-llm-code-generation-using-ast-based-chunking-for-codebase-c55bbd60836e" rel="noopener noreferrer"&gt;RAG for LLM Code Generation using AST-Based Chunking&lt;/a&gt; — Vishnudhat Natarajan&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sderosiaux.substack.com/p/better-retrieval-beats-better-models" rel="noopener noreferrer"&gt;Better Retrieval Beats Better Models for Large Codebases&lt;/a&gt; — Stéphane Derosiaux&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code Assistants &amp;amp; Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://towardsdatascience.com/how-cursor-actually-indexes-your-codebase/" rel="noopener noreferrer"&gt;How Cursor Actually Indexes Your Codebase&lt;/a&gt; — Towards Data Science&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.quastor.org/p/github-copilot-works" rel="noopener noreferrer"&gt;How GitHub Copilot Works&lt;/a&gt; — Quastor Engineering&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.blog/ai-and-ml/generative-ai/what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/" rel="noopener noreferrer"&gt;What is Retrieval-Augmented Generation?&lt;/a&gt; — GitHub Blog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Search &amp;amp; Ranking&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ranjankumar.in/bm25-vs-dense-retrieval-for-rag-engineers" rel="noopener noreferrer"&gt;BM25 vs Dense Retrieval for RAG: What Actually Breaks in Production&lt;/a&gt; — Ranjan Kumar&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://mbrenndoerfer.com/writing/hybrid-search-bm25-dense-retrieval-fusion" rel="noopener noreferrer"&gt;Hybrid Search: BM25 and Dense Retrieval Combined&lt;/a&gt; — Michael Brenndoerfer&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>codeassistants</category>
      <category>llm</category>
      <category>rag</category>
      <category>productivity</category>
    </item>
    <item>
      <title>India's DPDP Act 2023 Explained — And How AI Handles Data Principal Requests at Scale</title>
      <dc:creator>Swapnanil Saha</dc:creator>
      <pubDate>Thu, 21 May 2026 21:46:43 +0000</pubDate>
      <link>https://dev.to/swapnanilsaha/indias-dpdp-act-2023-explained-and-how-ai-handles-data-principal-requests-at-scale-38ib</link>
      <guid>https://dev.to/swapnanilsaha/indias-dpdp-act-2023-explained-and-how-ai-handles-data-principal-requests-at-scale-38ib</guid>
      <description>&lt;p&gt;&lt;em&gt;This post is for informational purposes only and does not constitute legal advice. The DPDP Act 2023 and its implementing Rules 2025 are relatively new — requirements may evolve through further notifications or guidance. Verify the current position with a qualified data protection lawyer before making compliance decisions.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Your company just received this email:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I would like to know all personal data your organisation holds about me. This is a formal request under the DPDP Act."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It lands in a shared &lt;code&gt;privacy@yourcompany.com&lt;/code&gt; inbox. Someone reads it. Forwards it to legal. Legal forwards it to engineering. Engineering says they need to check three databases. Nobody notes the date it arrived. Three weeks pass. When someone finally circles back, there are nine days left on the 30-day window the DPDP Rules require. Not enough time to locate the data, get legal sign-off, draft a response in the right language, and send it.&lt;/p&gt;

&lt;p&gt;On day 32, you're in violation.&lt;/p&gt;

&lt;p&gt;That scenario is the default for most Indian companies right now. Not because they're careless — but because nobody built the infrastructure for it.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;DPDP Copilot&lt;/strong&gt; to close that gap: a self-hosted operator tool that accepts public data requests, classifies them with Claude, drafts compliant multilingual replies, tracks every action as immutable evidence, and monitors SLA status in real time.&lt;/p&gt;

&lt;p&gt;But before the tool, you need to understand what you're actually dealing with. Let's start with the law.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://swapnanilsaha.com/tools/dpdp-copilot/" rel="noopener noreferrer"&gt;→ Full tool page and live demo&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: What the DPDP Act 2023 Actually Requires
&lt;/h2&gt;

&lt;p&gt;The Digital Personal Data Protection Act 2023 received presidential assent on 11 August 2023 and represents India's first comprehensive data protection legislation. Its structure borrows from GDPR while adapting to India's specific context — a 1.4 billion-person population, 22 scheduled languages, deep mobile penetration, and a digital public infrastructure layer (UPI, Aadhaar, DigiLocker) that most jurisdictions don't have.&lt;/p&gt;

&lt;p&gt;The implementing Rules — the Digital Personal Data Protection Rules 2025 — were notified on 13 November 2025, giving the Act its operational teeth.&lt;/p&gt;

&lt;p&gt;Here's what the law actually mandates, stripped of legalese, focusing on the parts most engineering and compliance teams get wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Four Rights Every Data Principal Has
&lt;/h3&gt;

&lt;p&gt;The Act grants every "data principal" — the person whose data is being processed — four actionable rights. When someone exercises any of these, your organisation (as the "data fiduciary") has a legal obligation to respond.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right of Access (Section 11)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Any person can ask you: what personal data do you hold about me, and for what purpose? You must provide a summary of the data being processed, the processing activities, and the identities of any other data fiduciaries or processors with whom their data has been shared. The Act doesn't specify a format, but silence is not sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right of Correction and Completion (Section 12(a))&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a person believes data you hold is inaccurate, incomplete, or misleading, they can demand you correct or complete it. You must either act on the request or explain in writing why you're not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right of Erasure (Section 12(b))&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A person can request deletion of their personal data from your systems. There are exceptions — data held for legal obligations, fraud prevention, pending litigation — but these exceptions have to be documented and justified, not just asserted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right to Grievance Redressal (Section 13)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Any person can file a grievance if they believe their rights under the Act have been violated. You must provide a mechanism to receive and respond to grievances. The Rules 2025 specify this mechanism must be genuinely accessible.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Response Timelines
&lt;/h3&gt;

&lt;p&gt;The DPDP Rules 2025 (notified November 2025) set specific mandatory windows for responding to data principal requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Access, Correction, and Erasure requests (Sections 11–12)&lt;/strong&gt;: Data fiduciaries must respond within &lt;strong&gt;30 days&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grievance Redressal (Section 13)&lt;/strong&gt;: Grievances must be resolved within a maximum of &lt;strong&gt;90 days&lt;/strong&gt; from receipt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are calendar days. For reference: GDPR (EU) also requires responses within one month for most data subject requests; California's CCPA gives 45 days. India's framework is broadly comparable to GDPR in its demands — but applies at the scale of 1.4 billion people, across 22 scheduled languages. That's where the operational challenge is categorically harder.&lt;/p&gt;

&lt;p&gt;30 days sounds like a lot. For a company with no structured process, it evaporates fast. A request that lands in a shared inbox on a Friday, takes three days to be noticed, gets forwarded twice, waits a week for a legal review, and then requires manual drafting in the data principal's language — you're out of time before anyone writes the first sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on the DPDP Copilot tool's SLA default&lt;/strong&gt;: The tool's internal SLA clock defaults to 7 days — intentionally more conservative than the 30-day legal window. Most mature compliance programmes target internal deadlines that are significantly tighter than the regulatory maximum, so that normal delays (review cycles, approvals, language checks) don't push you to the edge. The 7-day default is configurable via &lt;code&gt;orgs.sla_days&lt;/code&gt;. When the Rules are read by your legal team and a specific target is agreed, you set it once in the database.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "Evidence" Actually Means Under DPDP
&lt;/h3&gt;

&lt;p&gt;The Act and the Rules create a documentation burden that most organisations underestimate. You need to be able to prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That the request was received on a specific date&lt;/li&gt;
&lt;li&gt;That it was handled (classified and routed) in a timely manner&lt;/li&gt;
&lt;li&gt;What response you gave and when&lt;/li&gt;
&lt;li&gt;Whether the response fulfilled the request or why it couldn't&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is audit evidence. If the Data Protection Board investigates a complaint, you need to produce this trail. A forwarded email chain is not audit evidence. A Slack thread is not audit evidence. An append-only timestamped log — with the original message, the classification, the drafted response, and the send event — is audit evidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Financial Exposure
&lt;/h3&gt;

&lt;p&gt;The Act's First Schedule specifies penalties by category of failure. The two most operationally relevant:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;₹250 crore (~$30M USD)&lt;/strong&gt; — Failure to implement &lt;strong&gt;reasonable security safeguards&lt;/strong&gt; to prevent personal data breaches (Section 8(5)). This is the preventive obligation — having security measures in place. The penalty applies even where a breach subsequently occurs and the fiduciary claims they didn't anticipate it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;₹200 crore&lt;/strong&gt; — Failure to &lt;strong&gt;notify the Data Protection Board and affected data principals&lt;/strong&gt; when a personal data breach does occur (Section 8(6)). The notification obligation is separate from the security obligation — you can get penalised for both.&lt;/p&gt;

&lt;p&gt;Other penalty tiers: ₹200 crore for violations related to children's personal data (Section 9); ₹150 crore for Significant Data Fiduciary obligation failures; ₹50 crore for other provision breaches.&lt;/p&gt;

&lt;p&gt;The Data Protection Board, once fully constituted, will have adjudicatory powers to investigate and levy these penalties. Failing to acknowledge or respond to a data principal request, if that person escalates to the Board, creates a documented paper trail of non-compliance before any investigation begins.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Why Your Current Process Fails (And Why That's the Default)
&lt;/h2&gt;

&lt;p&gt;Let me describe the most common setup I've seen when talking to Indian companies dealing with DPDP requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;privacy@&lt;/code&gt; email address that gets checked sporadically&lt;/li&gt;
&lt;li&gt;No clock tracking — the 30-day window doesn't appear anywhere visible until it's almost gone&lt;/li&gt;
&lt;li&gt;No classification — the person who reads it decides manually whether it's an access request, deletion request, or complaint&lt;/li&gt;
&lt;li&gt;Reply drafted manually, from scratch, in English, by whoever processes it that week&lt;/li&gt;
&lt;li&gt;No audit trail beyond the email itself, which may be deleted if an inbox is cleaned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't negligence. It's the logical outcome of a process designed before the Act existed. The process was "email us with your concern" — and it worked fine when data requests were rare. The DPDP Act changes the legal weight of those requests, but most companies haven't updated their infrastructure to match.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Ways Manual Processes Break
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. The deadline blind spot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a request lands in an email inbox, the 30-day clock doesn't appear anywhere. Nobody stamps the receipt date. Nobody sends an automatic acknowledgement. The request sits until someone opens the inbox. If that takes a week — completely normal for a low-traffic shared inbox — you've already used 23% of your response window without touching the request. Legal review, data location, and drafting will eat most of what's left.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Classification inconsistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Please delete my data" is an erasure request. "I never gave you permission to use my data" is a grievance. "I want to update my phone number" is a correction request. "Can you send me everything you have on me" is an access request. A trained compliance professional can distinguish these consistently. Your Monday-morning on-call engineer who reads the shared inbox probably cannot — especially for requests written in Hindi, Bengali, or Tamil.&lt;/p&gt;

&lt;p&gt;When requests are misclassified, they get routed to the wrong person, get the wrong response template, and sometimes get the wrong legal treatment. An erasure request handled as a grievance will likely produce a response that doesn't fulfil the legal obligation under Section 12(b), even if it sounds polite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Evidence that can't survive audit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An auditor asks: "On what date did you receive and process this erasure request?" If your answer is "let me check the email thread," you have a problem. Email is mutable, searchable by keyword but not by event type, and has no integrity guarantees. An auditor looking for "REQUEST_CREATED at timestamp T" followed by "REPLY_SENT at timestamp T+22 days" needs a structured log, not an inbox.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: The Role of AI in DPDP Compliance
&lt;/h2&gt;

&lt;p&gt;When I was designing DPDP Copilot, the central question was: &lt;strong&gt;where does AI actually add value, and where does it introduce risk?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DPDP compliance has two types of tasks: tasks that require human judgment about legal gray areas, and tasks that require consistent application of known rules to varied inputs. AI is well-suited to the second category and badly suited to the first.&lt;/p&gt;

&lt;p&gt;Deciding whether your company has a legal obligation to retain data for a pending investigation? That's human judgment. Classifying an incoming message as an Access request vs. an Erasure request? That's pattern recognition on natural language — exactly what a well-prompted LLM is built for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Classification: Where LLMs Outperform Rules
&lt;/h3&gt;

&lt;p&gt;Naive rule-based classification for DPDP requests fails quickly. "Please delete my account" is an erasure request. "I want my data removed from your marketing list" is also an erasure request but uses entirely different vocabulary. "Remove me" submitted in a support ticket might be an erasure request or might just be asking to be unsubscribed from emails — context determines which.&lt;/p&gt;

&lt;p&gt;A rules-based system that catches "delete my data" literally will miss most real-world submissions. People write in fragments, in their native language, with emotional context, in ways that don't follow a template.&lt;/p&gt;

&lt;p&gt;An LLM with a well-structured prompt classifies these correctly without needing exhaustive keyword lists. The DPDP Copilot classification prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Classify&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;message&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;into&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;exactly&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;one&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Grievance,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Access,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Rectification,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Deletion.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Respond&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;classification&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Message:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt establishes the legal framework — "You are a DPDP compliance assistant classifying data principal requests under India's DPDP Act 2023." The model maps the message to the correct legal category.&lt;/p&gt;

&lt;p&gt;The output is constrained to a JSON object with a single key. The application validates that &lt;code&gt;type&lt;/code&gt; is one of the four legal categories. If the model returns something outside those four values, it's rejected and retried — the system never persists a classification it can't validate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multilingual Reply Drafting: Where AI Eliminates Weeks of Work
&lt;/h3&gt;

&lt;p&gt;This is where AI creates the most leverage in the Indian compliance context.&lt;/p&gt;

&lt;p&gt;India has 22 scheduled languages. The DPDP Act creates a right to grievance redressal — and for that mechanism to be genuinely accessible (which the Rules 2025 require), you need to respond in a language the person can understand.&lt;/p&gt;

&lt;p&gt;Without AI, producing compliant response templates in Hindi, Bengali, Tamil, and Marathi means hiring translators, reviewing legal language, maintaining version parity across languages, and updating all templates whenever requirements change. That's a significant operational cost — one that most companies defer indefinitely, defaulting to English-only responses that disadvantage non-English speakers.&lt;/p&gt;

&lt;p&gt;With a well-prompted LLM, drafting happens at response time. The model understands DPDP legal obligations and drafts a response that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Acknowledges the specific request type (not a generic "thank you for reaching out")&lt;/li&gt;
&lt;li&gt;Confirms receipt and logging with a reference number&lt;/li&gt;
&lt;li&gt;States the applicable response timeline&lt;/li&gt;
&lt;li&gt;Explains the next step the data principal should expect&lt;/li&gt;
&lt;li&gt;Is written in the language they chose&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system prompt for drafting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a DPDP compliance officer drafting replies to data principal requests under 
India's Digital Personal Data Protection Act 2023. Write professional, empathetic 
replies that: acknowledge the request type, confirm receipt and logging, state the 
applicable response timeline, and explain the next step. Keep the tone formal but accessible.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user message to the model specifies the request type and target language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Draft a DPDP-compliant reply in ${language} for a ${type} request.

Customer message:
${text}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are &lt;strong&gt;suggested&lt;/strong&gt; replies — an operator reviews them before sending. The human stays in the loop for all final communications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Caching: Making AI Cost-Efficient at Scale
&lt;/h3&gt;

&lt;p&gt;The system prompts for both classification and drafting use &lt;code&gt;cache_control: { type: 'ephemeral' }&lt;/code&gt; via the Anthropic SDK, enabling prompt caching.&lt;/p&gt;

&lt;p&gt;If you're processing dozens of data principal requests per day, the system prompt — which is identical for every request — gets cached by Anthropic's API after the first call. Subsequent calls are billed at a fraction of the full input token cost. At scale, prompt caching reduces API costs by 50–80% for the classification and drafting steps.&lt;/p&gt;

&lt;p&gt;This is a small architectural detail that has no effect on the first request and compounding positive effect on the hundredth. If you're building compliance tooling that processes high volumes, prompt caching is the difference between a sustainable per-request cost and one that makes the tool impractical at production scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retry Logic: Resilience Against Transient Failures
&lt;/h3&gt;

&lt;p&gt;The LLM calls use exponential backoff retry logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;callWithRetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isRetryable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RateLimitError&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
        &lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;InternalServerError&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;isRetryable&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only rate limit errors and server errors trigger retries — not client errors (bad API key, invalid request format). The delay doubles with each attempt: 1 second, then 2, then 4. Three attempts total. A transient API hiccup doesn't fail the entire processing pipeline for a data principal's submission.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: DPDP Copilot — The Tool in Detail
&lt;/h2&gt;

&lt;p&gt;With the legal and AI context established, here's how DPDP Copilot works end to end.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Public Request Form
&lt;/h3&gt;

&lt;p&gt;The entry point for data principals is &lt;code&gt;/grievance&lt;/code&gt; — no login required. Requiring a login to submit a data rights request is a barrier that conflicts with the spirit of the Act. If someone can't easily submit an erasure request, the mechanism isn't truly accessible.&lt;/p&gt;

&lt;p&gt;The form collects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The request message (free text — people write what they mean in their own words)&lt;/li&gt;
&lt;li&gt;Preferred response language (English, Hindi, Bengali, Tamil, Marathi)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's no account creation, no verification code, no CAPTCHA wall. Data principals submit and receive an acknowledgement. The contact information is embedded in the message body — a known limitation of the current implementation, and a deliberate choice for the initial version: forcing a structured contact field requires more UI complexity and doesn't add meaningful compliance value until outbound email delivery is implemented.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Happens in the Background on Submission
&lt;/h3&gt;

&lt;p&gt;When the form is submitted, a single API call to &lt;code&gt;POST /api/public/requests&lt;/code&gt; triggers a multi-step synchronous pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Request creation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system creates a database record with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A UUID as the request ID&lt;/li&gt;
&lt;li&gt;The raw message text&lt;/li&gt;
&lt;li&gt;The chosen language&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;type: 'PENDING'&lt;/code&gt; — not yet classified&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sla_due_at: now() + 7 days&lt;/code&gt; — the internal SLA clock starts at submission. This 7-day default is configurable via &lt;code&gt;orgs.sla_days&lt;/code&gt; and is intentionally conservative relative to the 30-day legal window.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;org_id&lt;/code&gt; from the active organisation configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Evidence logging — REQUEST_CREATED&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;evidence_events&lt;/code&gt; record is written immediately after creation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"REQUEST_CREATED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"public_form"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hindi"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-05-25T10:00:00.000Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the legal timestamp of receipt. The moment the request hits the database, it's on record. The evidence log is append-only at the application level — there are no delete or update operations on &lt;code&gt;evidence_events&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: AI classification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The message text goes to Claude for classification. The model returns a JSON object. The application parses it and validates that &lt;code&gt;type&lt;/code&gt; is one of &lt;code&gt;{ Grievance, Access, Rectification, Deletion }&lt;/code&gt;. Any other value throws an error. The request record is updated with the validated type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Evidence logging — REQUEST_CLASSIFIED&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"REQUEST_CLASSIFIED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deletion"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-05-25T10:00:01.342Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The classification result and timestamp are immutable facts in the evidence record from this point forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: AI reply drafting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude drafts a response in the data principal's chosen language, using the classified request type and the original message as context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Evidence logging — REPLY_SUGGESTED&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"REPLY_SUGGESTED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hindi"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-6"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-05-25T10:00:02.891Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire pipeline — creation, classification, drafting — runs in under 5 seconds for a typical request. By the time an operator opens the inbox, the request is already classified, a draft reply exists, and the SLA clock has been running since submission.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Operator Inbox
&lt;/h3&gt;

&lt;p&gt;The inbox at &lt;code&gt;/&lt;/code&gt; is protected by authentication. It shows all requests for the active organisation, each with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request type (Grievance, Access, Rectification, Deletion, or PENDING if classification failed)&lt;/li&gt;
&lt;li&gt;Message preview&lt;/li&gt;
&lt;li&gt;Live SLA status (Within SLA / Due Soon / Overdue)&lt;/li&gt;
&lt;li&gt;Creation timestamp&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SLA status is computed at read time — not stored as a cached value. The &lt;code&gt;computeSlaStatus&lt;/code&gt; function runs on every page load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;computeSlaStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slaDueAt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;due&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slaDueAt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;diffHours&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;due&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diffHours&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OVERDUE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diffHours&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DUE_SOON&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;WITHIN_SLA&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The status shown in the inbox reflects the current moment — not the status at the last time the record was updated. A request that was &lt;code&gt;WITHIN_SLA&lt;/code&gt; yesterday is automatically &lt;code&gt;DUE_SOON&lt;/code&gt; or &lt;code&gt;OVERDUE&lt;/code&gt; today without any scheduled job or background worker.&lt;/p&gt;

&lt;p&gt;The inbox is sorted by SLA urgency by default, so operators see the most at-risk requests first.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Request Detail Page
&lt;/h3&gt;

&lt;p&gt;Clicking into any request shows everything an operator needs to review, respond, and close:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The original message&lt;/strong&gt; — verbatim, exactly as submitted. No interpretation layer between the operator and what the data principal actually wrote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI-drafted reply&lt;/strong&gt; — pre-populated with DPDP-compliant language in the data principal's chosen language. The operator can read it, edit it in the text area, and send it. The draft is a starting point, not a cage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The resolution checklist&lt;/strong&gt; — structured prompts for the operator to work through before closing the request:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has the relevant data been located?&lt;/li&gt;
&lt;li&gt;Has the requested action (access/correction/deletion) been taken?&lt;/li&gt;
&lt;li&gt;Has the data principal been notified?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The evidence timeline&lt;/strong&gt; — every event in chronological order with timestamps, event types, and metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The export controls&lt;/strong&gt; — one click to download the full evidence trail as PDF or CSV.&lt;/p&gt;

&lt;h3&gt;
  
  
  Marking a Reply as Sent
&lt;/h3&gt;

&lt;p&gt;When an operator sends the response (currently: manually via email or another channel, then clicks "Mark as Sent" in the tool), the system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Updates the request &lt;code&gt;status&lt;/code&gt; to &lt;code&gt;CLOSED&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Logs &lt;code&gt;REPLY_SENT&lt;/code&gt; to the evidence table:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"REPLY_SENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"event_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"operator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"channel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"manual"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-05-27T14:22:00.000Z"&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gap between &lt;code&gt;REQUEST_CREATED&lt;/code&gt; and &lt;code&gt;REPLY_SENT&lt;/code&gt; timestamps is the documented response time. If an auditor asks "how long did you take to respond to this erasure request?" — the answer is computable from the evidence log to the second.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: The Evidence Architecture
&lt;/h2&gt;

&lt;p&gt;The evidence design is the most important part of DPDP Copilot from a compliance standpoint. Everything else is workflow tooling. The evidence log is what you use when the Data Protection Board comes calling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Append-Only by Design
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;evidence_events&lt;/code&gt; table has no update or delete paths in the application. Once an event is written, it stays. There's no "edit evidence" API, no admin panel for removing events, no soft-delete flag.&lt;/p&gt;

&lt;p&gt;Audit evidence that can be modified isn't evidence; it's a story you're telling. An append-only log where every event has a database-generated timestamp (not an application-provided one) is as close to tamper-evident as you can get in a PostgreSQL-backed application.&lt;/p&gt;

&lt;p&gt;The schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;evidence_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;          &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;request_id&lt;/span&gt;  &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;event_type&lt;/span&gt;  &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;event_data&lt;/span&gt;  &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;  &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;org_id&lt;/span&gt;      &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;orgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;created_at&lt;/code&gt; field uses &lt;code&gt;DEFAULT now()&lt;/code&gt; — the database server's timestamp, not the application's &lt;code&gt;Date.now()&lt;/code&gt;. Database server clocks in a managed PostgreSQL instance are NTP-synchronized and authoritative. Application clocks can drift.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Four Event Types
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;REQUEST_CREATED&lt;/strong&gt; — logged at the moment of database insertion, before any processing. This is the legal timestamp of receipt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REQUEST_CLASSIFIED&lt;/strong&gt; — logged immediately after the AI classification succeeds and the type is validated. Contains the classified type in &lt;code&gt;event_data&lt;/code&gt;. If classification fails and retries are exhausted, this event is not logged — the absence of this event tells you classification failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REPLY_SUGGESTED&lt;/strong&gt; — logged when the AI draft is written to the request record. Contains the language and model used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REPLY_SENT&lt;/strong&gt; — logged when an operator marks the reply as sent. Contains the operator identity and channel. This closes the request lifecycle in the evidence log.&lt;/p&gt;

&lt;p&gt;The presence of all four events, in order, within the applicable window means the request was handled correctly from intake to response. An auditor reviewing the CSV export can verify this in seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Organisation Scoping
&lt;/h3&gt;

&lt;p&gt;Every evidence event carries an &lt;code&gt;org_id&lt;/code&gt;. Every query on &lt;code&gt;evidence_events&lt;/code&gt; is scoped to the active organisation. A single deployment can serve multiple organisations, and their evidence trails are strictly isolated.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;org_id&lt;/code&gt; in evidence events is written by the application using the resolved organisation context — not passed in by the caller. A data principal submitting a request cannot specify or forge the organisation context; it's resolved server-side from the environment configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Export Looks Like
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CSV export&lt;/strong&gt; for a complete request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;event&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;created&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;
&lt;span class="k"&gt;REQUEST&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;CREATED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="ld"&gt;2025-05-25T10:00:00&lt;/span&gt;&lt;span class="mf"&gt;.000&lt;/span&gt;&lt;span class="k"&gt;Z&lt;/span&gt;
&lt;span class="k"&gt;REQUEST&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;CLASSIFIED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="ld"&gt;2025-05-25T10:00:01&lt;/span&gt;&lt;span class="mf"&gt;.342&lt;/span&gt;&lt;span class="k"&gt;Z&lt;/span&gt;
&lt;span class="k"&gt;REPLY&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;SUGGESTED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="ld"&gt;2025-05-25T10:00:02&lt;/span&gt;&lt;span class="mf"&gt;.891&lt;/span&gt;&lt;span class="k"&gt;Z&lt;/span&gt;
&lt;span class="k"&gt;REPLY&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;SENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="ld"&gt;2025-05-27T14:22:00&lt;/span&gt;&lt;span class="mf"&gt;.000&lt;/span&gt;&lt;span class="k"&gt;Z&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four rows. Auditor reads it: request received Sunday 10:00 AM, responded Tuesday 2:22 PM — 52 hours, well within any reasonable response window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PDF export&lt;/strong&gt; includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organisation name and request ID&lt;/li&gt;
&lt;li&gt;Request type and creation timestamp&lt;/li&gt;
&lt;li&gt;Original message (verbatim)&lt;/li&gt;
&lt;li&gt;Suggested reply (the draft that was reviewed and sent)&lt;/li&gt;
&lt;li&gt;Full evidence timeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The PDF is generated server-side using Puppeteer with Chromium. The HTML template is a known XSS risk in the current implementation (user-provided message text is interpolated directly into HTML) — the fix is explicit HTML escaping before interpolation, which is on the roadmap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 6: SLA Architecture — The Compliance Clock
&lt;/h2&gt;

&lt;p&gt;SLA management is where most compliance tools fail. They either track SLA status as a static database field (which becomes stale the moment the clock ticks past the deadline) or they rely on background jobs (which can fail silently and leave the status indicator wrong).&lt;/p&gt;

&lt;p&gt;DPDP Copilot takes a third approach: &lt;strong&gt;compute SLA status at read time, every time.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Internal SLA Clock Works
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;sla_due_at&lt;/code&gt; timestamp is written once, at request creation: &lt;code&gt;now() + sla_days&lt;/code&gt;. The default is 7 days — more conservative than the 30-day legal window, so normal review and approval cycles don't consume the entire legal budget. That's the only mutation to this field — it never changes after the request is created.&lt;/p&gt;

&lt;p&gt;On every inbox load, every request detail page load, the &lt;code&gt;computeSlaStatus(slaDueAt)&lt;/code&gt; function runs in the API layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;diffHours&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slaDueAt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diffHours&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OVERDUE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diffHours&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DUE_SOON&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
                     &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;WITHIN_SLA&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No database update. No background worker. No scheduled job. The status shown to the operator is always accurate as of the current server time.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DUE_SOON&lt;/code&gt; triggers at 24 hours remaining — a one-day warning before the internal deadline. This gives operators a meaningful heads-up without creating false urgency days in advance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting the Right Internal SLA for Your Organisation
&lt;/h3&gt;

&lt;p&gt;The DPDP Rules 2025 set a 30-day legal maximum for access/correction/erasure responses. How you set your internal target depends on your process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A startup where one person handles requests end-to-end: 7–10 days is achievable and leaves buffer&lt;/li&gt;
&lt;li&gt;A mid-size company where requests go through legal review and data lookup across multiple systems: 14–21 days as the internal target, with the legal 30-day window as the backstop&lt;/li&gt;
&lt;li&gt;A large enterprise with formal approval workflows: set the SLA to match your internal SLA policy; use the evidence log to track compliance with your own commitments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The configurable &lt;code&gt;orgs.sla_days&lt;/code&gt; field in the database — not yet wired to request creation in the current version, but in the roadmap — will let each organisation set its own target without changing code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Status vs. SLA Distinction
&lt;/h3&gt;

&lt;p&gt;Early versions of DPDP Copilot conflated two concepts in a single field: the workflow status of the request (open, closed?) and the computed SLA urgency (within deadline?). The second database migration separates these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- migration 002_split_status_from_sla.sql&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'OPEN'&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CASE&lt;/span&gt;
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;sla_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'CLOSED'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'CLOSED'&lt;/span&gt;
  &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'OPEN'&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this migration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;status&lt;/code&gt; is the workflow state: &lt;code&gt;OPEN&lt;/code&gt; or &lt;code&gt;CLOSED&lt;/code&gt;. Closed means a reply was sent and the request is resolved.&lt;/li&gt;
&lt;li&gt;The live SLA urgency is always computed by &lt;code&gt;computeSlaStatus&lt;/code&gt; at read time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters for reporting. You want to answer: "Of all requests that were open during the last month, what percentage were responded to within the internal SLA?" That question requires separating workflow state from deadline state.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 7: Multilingual Compliance at Scale
&lt;/h2&gt;

&lt;p&gt;The multilingual capability deserves more attention than it typically gets in discussions of DPDP tooling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Language Matters for DPDP
&lt;/h3&gt;

&lt;p&gt;India's 2011 census (the most recent with detailed language data) recorded 19,569 raw mother tongue entries from respondents — often cited as "over 19,500 languages spoken in some capacity" — which consolidate into 121 languages with more than 10,000 speakers each. The DPDP Act and Rules 2025 require that grievance mechanisms be accessible, which practically means: if your users write to you in Hindi, a response mechanism that only understands English is not accessible.&lt;/p&gt;

&lt;p&gt;DPDP Copilot supports five languages in the current version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English&lt;/strong&gt; — the default, always available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi&lt;/strong&gt; — 528 million speakers (2011 census)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bengali&lt;/strong&gt; — 97 million speakers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tamil&lt;/strong&gt; — 69 million speakers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marathi&lt;/strong&gt; — 83 million speakers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The public request form shows these as radio button options. The selection flows into the API request, through the drafting call, and into the AI prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the AI Draft Looks Like in Practice
&lt;/h3&gt;

&lt;p&gt;Here's the same erasure request processed in two languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input (English):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I never gave consent for you to sell my data. Please delete everything you have about me immediately."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Suggested reply in English:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Dear Data Principal,&lt;/p&gt;

&lt;p&gt;Thank you for your request submitted on 25 May 2025. We have received and logged your request for erasure of personal data under Section 12(b) of the Digital Personal Data Protection Act, 2023.&lt;/p&gt;

&lt;p&gt;Your request has been assigned reference number [REF-ID]. Our compliance team will review your request, locate the relevant data, and initiate the erasure process in accordance with applicable legal requirements. We will respond with the outcome within the timeframe required under the DPDP Act.&lt;/p&gt;

&lt;p&gt;Please retain this acknowledgement for your records.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Suggested reply in Hindi:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;प्रिय डेटा प्रिंसिपल,&lt;/p&gt;

&lt;p&gt;25 मई 2025 को प्रस्तुत आपके अनुरोध के लिए धन्यवाद। हमने डिजिटल व्यक्तिगत डेटा संरक्षण अधिनियम, 2023 की धारा 12(ख) के अंतर्गत आपके व्यक्तिगत डेटा के विलोपन के अनुरोध को प्राप्त कर दर्ज किया है।&lt;/p&gt;

&lt;p&gt;आपके अनुरोध को संदर्भ संख्या [REF-ID] दी गई है। हमारी अनुपालन टीम आपके अनुरोध की समीक्षा करेगी, संबंधित डेटा का पता लगाएगी और लागू कानूनी आवश्यकताओं के अनुसार विलोपन प्रक्रिया शुरू करेगी। हम डीपीडीपी अधिनियम के तहत निर्धारित समय-सीमा के भीतर आपको परिणाम की सूचना देंगे।&lt;/p&gt;

&lt;p&gt;कृपया इस पावती को अपने रिकॉर्ड के लिए सुरक्षित रखें।&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The structure is identical. The legal references are consistent. The tone is professional but accessible. An operator who reviews the Hindi draft can run it through a translation tool to verify quality before sending — the AI draft is a starting point, not a blindly trusted final output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Separate Prompts Per Language Matter
&lt;/h3&gt;

&lt;p&gt;A naive approach would translate a fixed English template into other languages once, then serve those static translations. This works for simple acknowledgements but fails for personalised responses that need to reference the specific request content.&lt;/p&gt;

&lt;p&gt;Because DPDP Copilot drafts replies by passing the original message to the model, the suggested reply can acknowledge specific details the data principal mentioned — not just their request type. If someone writes "I asked you to stop sending me SMS messages three months ago and you're still doing it," a good response acknowledges that history. A static template can't.&lt;/p&gt;

&lt;p&gt;The LLM approach generates a response that's contextually appropriate in the data principal's language — which is a qualitatively different outcome from translation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 8: The Data Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Schema Design for Compliance
&lt;/h3&gt;

&lt;p&gt;The database schema is designed around compliance requirements first, application convenience second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Three tables, three responsibilities&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orgs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;         &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;       &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="n"&gt;sla_days&lt;/span&gt;   &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;message&lt;/span&gt;         &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;type&lt;/span&gt;            &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;status&lt;/span&gt;          &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'OPEN'&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;suggested_reply&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;sla_due_at&lt;/span&gt;      &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;org_id&lt;/span&gt;          &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;orgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;evidence_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;          &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;request_id&lt;/span&gt;  &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;event_type&lt;/span&gt;  &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;event_data&lt;/span&gt;  &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;  &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;org_id&lt;/span&gt;      &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;orgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;orgs.sla_days&lt;/code&gt; field exists and is populated but not yet wired to request creation — the 7-day hardcode is the current implementation. When that field is connected, different organisations can run different internal SLA targets. The schema is ready for that; the application code isn't yet.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;evidence_events.event_data&lt;/code&gt; field is &lt;code&gt;jsonb&lt;/code&gt; — flexible enough to store different metadata per event type without schema changes. As the tool evolves (new event types, operator attribution, channel tracking), existing rows aren't invalidated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Strategy
&lt;/h3&gt;

&lt;p&gt;Two composite indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;requests_org_created_idx&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;evidence_events_request_org_created_idx&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;evidence_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first index supports the inbox query: "give me all requests for this org, sorted by most recent." The second supports the request detail query: "give me all evidence events for this request in this org, in chronological order."&lt;/p&gt;

&lt;p&gt;Both indexes include &lt;code&gt;org_id&lt;/code&gt; as the leading column because every query in the application is org-scoped. An index that starts with &lt;code&gt;org_id&lt;/code&gt; is used by the query planner even for queries that also filter by &lt;code&gt;request_id&lt;/code&gt; — the org scope eliminates most of the table before the planner looks at other columns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 9: Deployment Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Self-Hosted by Design
&lt;/h3&gt;

&lt;p&gt;DPDP Copilot is self-hosted. That's a deliberate product decision, not an oversight.&lt;/p&gt;

&lt;p&gt;DPDP requests often contain sensitive personal data — names, contact details, account information, and sometimes sensitive categories of data like health information or financial details. The organisation processing these requests is the data fiduciary. Routing that data through a third-party SaaS for classification and storage creates its own compliance risk: you're a data processor, processing data principal requests by sending them to another data processor, with all the consent and data transfer implications that entails.&lt;/p&gt;

&lt;p&gt;Running the tool in your own infrastructure — whether on-premises or in a cloud account you control — keeps the data principal's message in your trust boundary. The only data that leaves your environment is the message text sent to Anthropic's API for classification and drafting. That's a single, scoped, auditable data transfer that you control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Compose Quickstart
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and configure&lt;/span&gt;
git clone https://github.com/swapnanil/dpdp-copilot
&lt;span class="nb"&gt;cd &lt;/span&gt;dpdp-copilot
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env

&lt;span class="c"&gt;# .env minimum required:&lt;/span&gt;
&lt;span class="c"&gt;# ANTHROPIC_API_KEY=sk-ant-...&lt;/span&gt;
&lt;span class="c"&gt;# DATABASE_URL=postgresql://user:pass@db:5432/dpdp&lt;/span&gt;
&lt;span class="c"&gt;# ADMIN_USER=compliance_admin&lt;/span&gt;
&lt;span class="c"&gt;# ADMIN_PASS=your_secure_password&lt;/span&gt;
&lt;span class="c"&gt;# DEFAULT_ORG_ID=          # fill after running seed&lt;/span&gt;
&lt;span class="c"&gt;# ADMIN_SESSION_SECRET=    # openssl rand -hex 32&lt;/span&gt;

&lt;span class="c"&gt;# Start database&lt;/span&gt;
docker compose up db &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Run migrations&lt;/span&gt;
docker compose run &lt;span class="nt"&gt;--rm&lt;/span&gt; migrate

&lt;span class="c"&gt;# Seed initial org (note the UUID it prints)&lt;/span&gt;
docker compose run &lt;span class="nt"&gt;--rm&lt;/span&gt; seed

&lt;span class="c"&gt;# Start the application&lt;/span&gt;
docker compose up app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:3000&lt;/code&gt; for the operator inbox.&lt;br&gt;
Open &lt;code&gt;http://localhost:3000/grievance&lt;/code&gt; for the public form.&lt;/p&gt;
&lt;h3&gt;
  
  
  Environment Configuration Reference
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Required&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Your Anthropic API key for Claude&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DATABASE_URL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;PostgreSQL connection string&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ADMIN_USER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Operator login username&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ADMIN_PASS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Operator login password&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DEFAULT_ORG_ID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;UUID of the active organisation (from seed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ADMIN_SESSION_SECRET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Production&lt;/td&gt;
&lt;td&gt;Signs session cookies — &lt;code&gt;openssl rand -hex 32&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MODEL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Claude model (default: &lt;code&gt;claude-sonnet-4-6&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MAX_TOKENS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Reply draft length (default: 1024)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PUPPETEER_EXECUTABLE_PATH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;Chromium path — set automatically in Docker&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Production Considerations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Session signing&lt;/strong&gt;: Generate &lt;code&gt;ADMIN_SESSION_SECRET&lt;/code&gt; with &lt;code&gt;openssl rand -hex 32&lt;/code&gt;. In development you can skip this; in production the session cookie must be signed or it's trivially forgeable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;: The Docker Compose setup runs Postgres in a container. For production, use a managed database (AWS RDS, Google Cloud SQL, Supabase) with automated backups. The evidence table is your legal record — you want it on infrastructure with point-in-time recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTPS&lt;/strong&gt;: Run behind a reverse proxy (nginx, Caddy) that terminates TLS. Session cookies should have &lt;code&gt;Secure&lt;/code&gt; and &lt;code&gt;SameSite=Strict&lt;/code&gt; — these aren't set in the current implementation but are straightforward to add in a production nginx config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt;: The public &lt;code&gt;/grievance&lt;/code&gt; form has no rate limiting in the current version. A reverse proxy rate limit on the public intake endpoint prevents abuse without touching the application code.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 10: Known Limitations and What's Next
&lt;/h2&gt;

&lt;p&gt;Honesty about limitations is part of useful tooling documentation. Here's what DPDP Copilot currently doesn't do and what the roadmap looks like.&lt;/p&gt;
&lt;h3&gt;
  
  
  Current Limitations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;No outbound delivery&lt;/strong&gt;: The "send reply" workflow doesn't actually send anything. It marks the reply as sent in the evidence log and sets the request to &lt;code&gt;CLOSED&lt;/code&gt;. The operator is responsible for actually sending the drafted reply via their existing channel (email, portal, etc.). This is a limitation of the MVP, not the design — real outbound email delivery is the obvious next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-admin authentication&lt;/strong&gt;: The current auth model is a single username/password pair from environment variables. There's no user table, no role model, no per-operator audit trail. Multiple operators can't be tracked individually. This is fine for a team of one; it's a problem for a compliance team of five.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static org configuration&lt;/strong&gt;: The active organisation is selected via &lt;code&gt;DEFAULT_ORG_ID&lt;/code&gt; in the environment. There's no UI for switching organisations or a multi-tenant router. The database schema supports multiple orgs; the application routing doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No structured contact data&lt;/strong&gt;: Contact information is embedded in the free-form message. There's no &lt;code&gt;contact_email&lt;/code&gt; or &lt;code&gt;contact_phone&lt;/code&gt; field. This means there's no reliable way to programmatically address the data principal in the reply or route the response to them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PDF XSS risk&lt;/strong&gt;: The PDF template interpolates user-provided text directly into HTML without escaping. A malicious actor could potentially inject HTML into the generated PDF. This is a known issue and is the highest-priority security fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No notifications&lt;/strong&gt;: Operators have no way to be alerted when a new request comes in or when a request is approaching its internal SLA deadline. Checking the inbox manually is the only current mechanism.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Roadmap
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Outbound reply delivery&lt;/strong&gt;: Send the drafted reply via email (SendGrid, AWS SES, or SMTP) directly from the tool. Logs the delivery event to the evidence table. The operator reviews the draft, edits if needed, and clicks Send — not "Copy this and email it manually."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLA alerts&lt;/strong&gt;: Email or Slack notification when a request enters &lt;code&gt;DUE_SOON&lt;/code&gt; status. Optional daily digest of all open requests with their current SLA status.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-operator support&lt;/strong&gt;: A &lt;code&gt;users&lt;/code&gt; table, per-operator login, and role assignment (reviewer vs. approver). Evidence events attributed to specific operators. Audit trail for who touched what.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured contact fields&lt;/strong&gt;: Separate &lt;code&gt;contact_email&lt;/code&gt; from the message body at intake. Validate format. Apply retention controls — contact data should be deletable when the request is closed without deleting the evidence trail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configurable SLA&lt;/strong&gt;: Wire &lt;code&gt;orgs.sla_days&lt;/code&gt; to request creation. Different organisations have different internal SLA commitments — the schema already supports this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approval workflow&lt;/strong&gt;: A draft reply requires supervisor approval before it can be sent. The evidence log records who approved and when. This is an operational pattern for organisations where a junior compliance analyst drafts but a senior officer approves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analytics dashboard&lt;/strong&gt;: How many requests per week? What types? Average response time? What percentage are within the internal SLA? This is a reporting requirement for any compliance programme worth its name.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 11: How DPDP Copilot Fits Into a Broader Compliance Programme
&lt;/h2&gt;

&lt;p&gt;DPDP Copilot handles the data principal rights workflow. That's one piece of a complete DPDP compliance programme. Here's how it fits:&lt;/p&gt;
&lt;h3&gt;
  
  
  What DPDP Copilot Covers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Receiving data principal requests (Access, Rectification, Deletion, Grievance)&lt;/li&gt;
&lt;li&gt;Classifying them correctly and consistently&lt;/li&gt;
&lt;li&gt;Drafting multilingual responses&lt;/li&gt;
&lt;li&gt;Tracking internal SLA deadlines&lt;/li&gt;
&lt;li&gt;Generating the audit evidence trail&lt;/li&gt;
&lt;li&gt;Exporting evidence for regulatory review&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  What It Doesn't Cover
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data discovery&lt;/strong&gt;: Finding where a person's data actually lives across your systems. DPDP Copilot receives and tracks the request but doesn't automate the underlying data lookup. That's a data catalogue problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consent management&lt;/strong&gt;: Recording and tracking what data was collected under what consent. That's a separate consent registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy notices&lt;/strong&gt;: Generating or maintaining the notice required under Section 5 of the Act. That's a legal document workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data breach notification&lt;/strong&gt;: Section 8(6) requires prompt notification of significant breaches to the Data Protection Board and affected persons. That's a separate incident response workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-border transfer compliance&lt;/strong&gt;: The Act restricts transfers of personal data to certain countries. That's a data governance and infrastructure question.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A full DPDP compliance programme needs all of these. DPDP Copilot handles the rights management piece — the part that creates the most immediate operational urgency because it has a hard deadline on individual transactions and a direct escalation path to the Data Protection Board.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Risk Reduction Calculation
&lt;/h3&gt;

&lt;p&gt;Before DPDP Copilot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to acknowledge a request: hours to days (depends on inbox monitoring)&lt;/li&gt;
&lt;li&gt;Time to classify a request: manual, inconsistent, language-dependent&lt;/li&gt;
&lt;li&gt;Time to draft a response: hours (finding a template, adapting it, translating it)&lt;/li&gt;
&lt;li&gt;Deadline tracking: none — someone has to remember&lt;/li&gt;
&lt;li&gt;Evidence: none — email threads that can be deleted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After DPDP Copilot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to acknowledge: seconds (the evidence log records receipt immediately on submission)&lt;/li&gt;
&lt;li&gt;Time to classify: 1–2 seconds (LLM call)&lt;/li&gt;
&lt;li&gt;Time to draft a response: 2–3 seconds (LLM call)&lt;/li&gt;
&lt;li&gt;Deadline tracking: automatic, live-computed, visible in the operator inbox&lt;/li&gt;
&lt;li&gt;Evidence: append-only database log, exportable as PDF or CSV in one click&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reduction in time-to-first-action is the most important improvement. The legal clock starts when the request is submitted — not when someone reads it. DPDP Copilot ensures that classification and drafting are done before any human even opens the inbox. The operator's job is review and send, not receive-classify-draft-send.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 12: Who Should Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Compliance and legal teams at Indian companies&lt;/strong&gt; processing personal data of Indian residents under the DPDP Act. If you're a data fiduciary — collecting or processing personal data — you have obligations under this Act. If you don't have a structured process for handling data principal requests, you need one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering teams building privacy infrastructure&lt;/strong&gt; who need a reference implementation of DPDP request handling. The codebase is open-source. The data model, the API structure, the evidence logging pattern, the SLA computation logic — all of it is readable, runnable, and adaptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Startups at the early compliance stage&lt;/strong&gt; who don't yet have a dedicated compliance team. The tool runs on a single machine. Configuration is a &lt;code&gt;.env&lt;/code&gt; file. The public form can be linked from your privacy policy. You don't need a compliance department to run it — you need someone who checks the inbox.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Organisations handling multilingual Indian user bases&lt;/strong&gt; where an English-only inbox isn't accessible to all the people it's supposed to serve. If your users write to you in Hindi and Tamil, they deserve responses in Hindi and Tamil — and the time cost of manual translation has historically made that impractical. It isn't anymore.&lt;/p&gt;


&lt;h2&gt;
  
  
  A Complete Example Walkthrough
&lt;/h2&gt;

&lt;p&gt;Let me walk through a real scenario end-to-end, using the tool as a data principal and then as an operator.&lt;/p&gt;
&lt;h3&gt;
  
  
  As the Data Principal
&lt;/h3&gt;

&lt;p&gt;You purchased something from a company. You're now getting SMS marketing messages you didn't opt in to. You want to file an erasure request and a grievance.&lt;/p&gt;

&lt;p&gt;You go to &lt;code&gt;https://yourcompany.com/grievance&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You write:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I never gave you permission to send me SMS promotions. I want you to delete my phone number and all data you hold about me. I also want to formally complain about this."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You select &lt;strong&gt;Hindi&lt;/strong&gt; as your preferred language and submit.&lt;/p&gt;

&lt;p&gt;You receive an acknowledgement: "Your request has been received and logged. Reference: [UUID]. Our compliance team will be in touch with the outcome."&lt;/p&gt;
&lt;h3&gt;
  
  
  As the Compliance Operator
&lt;/h3&gt;

&lt;p&gt;You open the operator inbox the next morning. You see a new request, classified as &lt;strong&gt;Grievance&lt;/strong&gt; (the model detected the formal complaint language alongside the deletion request), with &lt;code&gt;WITHIN_SLA&lt;/code&gt; status.&lt;/p&gt;

&lt;p&gt;You click into the request. You read the original message. The suggested reply in Hindi is already drafted. You read it — it acknowledges the complaint, confirms the erasure request has been noted, and explains next steps in Hindi.&lt;/p&gt;

&lt;p&gt;You make a small edit to reference your company's specific erasure process. You click "Send Reply" — which in the current version means you copy the draft, send it via your email system, and then click "Mark as Sent" in the tool.&lt;/p&gt;

&lt;p&gt;The evidence timeline now shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REQUEST_CREATED     2025-05-25 10:00:00
REQUEST_CLASSIFIED  2025-05-25 10:00:01  (Grievance)
REPLY_SUGGESTED     2025-05-25 10:00:02
REPLY_SENT          2025-05-26 09:15:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total response time: 23 hours. Well within any reasonable SLA window. The CSV export documents this. If the data principal escalates to the Data Protection Board, you have a timestamped, exportable record of the complete interaction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Public form&lt;/strong&gt;: &lt;code&gt;GET /grievance&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Endpoints&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Auth&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;POST&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api/public/requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Submit a data principal request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;POST&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api/login&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Operator login&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;POST&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api/logout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Operator logout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api/requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operator&lt;/td&gt;
&lt;td&gt;List all requests with live SLA status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api/requests/:id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operator&lt;/td&gt;
&lt;td&gt;Request detail + evidence timeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;POST&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api/requests/:id/send-reply&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operator&lt;/td&gt;
&lt;td&gt;Mark reply sent, close request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api/requests/:id/export/pdf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operator&lt;/td&gt;
&lt;td&gt;Download PDF evidence report&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api/requests/:id/export/csv&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operator&lt;/td&gt;
&lt;td&gt;Download CSV evidence export&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Request lifecycle&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Public form submission
  → Internal SLA clock starts (configurable, default 7 days)
  → REQUEST_CREATED logged
  → AI classification (Grievance / Access / Rectification / Deletion)
  → REQUEST_CLASSIFIED logged
  → AI reply drafted in chosen language
  → REPLY_SUGGESTED logged
  → Operator reviews in inbox
  → Operator marks reply sent
  → REPLY_SENT logged
  → Request status: CLOSED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Legal response windows under DPDP Rules 2025&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Request type&lt;/th&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;Legal window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Access&lt;/td&gt;
&lt;td&gt;Section 11&lt;/td&gt;
&lt;td&gt;30 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Correction / Erasure&lt;/td&gt;
&lt;td&gt;Section 12&lt;/td&gt;
&lt;td&gt;30 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grievance&lt;/td&gt;
&lt;td&gt;Section 13&lt;/td&gt;
&lt;td&gt;90 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The DPDP Act's data principal rights framework isn't complicated. Four rights, two response windows, one evidence requirement. The complexity is operational — handling a high-variance stream of natural language requests, in multiple languages, against a hard time constraint, with an audit trail that has to survive regulatory scrutiny.&lt;/p&gt;

&lt;p&gt;Manual processes fail under those conditions not because of negligence but because the requirements are genuinely hard to satisfy with shared inboxes and email chains.&lt;/p&gt;

&lt;p&gt;DPDP Copilot automates the classification and drafting — the two tasks that are the most time-consuming and the most error-prone. It makes the internal SLA clock visible before it expires. It generates the audit evidence as a byproduct of normal operation, not as a separate reporting task.&lt;/p&gt;

&lt;p&gt;The tool is open-source, self-hosted, and runs on a single Docker Compose command. If you're an Indian company with DPDP obligations and no structured data rights workflow, this is where to start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://swapnanilsaha.com/tools/dpdp-copilot/" rel="noopener noreferrer"&gt;→ View the full tool page, docs, live demo, and GitHub repo&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Swapnanil Saha — &lt;a href="https://swapnanilsaha.com" rel="noopener noreferrer"&gt;swapnanilsaha.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dpdp</category>
      <category>compliance</category>
      <category>dataprotection</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to Stop Evaluating LLM Outputs by Gut Feel</title>
      <dc:creator>Swapnanil Saha</dc:creator>
      <pubDate>Thu, 21 May 2026 05:25:31 +0000</pubDate>
      <link>https://dev.to/swapnanilsaha/how-to-stop-evaluating-llm-outputs-by-gut-feel-ml9</link>
      <guid>https://dev.to/swapnanilsaha/how-to-stop-evaluating-llm-outputs-by-gut-feel-ml9</guid>
      <description>&lt;p&gt;The standard workflow for evaluating LLM output quality goes something like this: someone reads Response A, reads Response B, and says "I think A is better." Everyone nods. The prompt ships.&lt;/p&gt;

&lt;p&gt;This is a problem for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It doesn't scale.&lt;/strong&gt; You can't manually review 500 eval pairs after every prompt change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's inconsistent.&lt;/strong&gt; The same person evaluating the same pair on different days produces different results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It doesn't tell you why.&lt;/strong&gt; "Response A is better" doesn't tell you what to fix when Response B becomes the baseline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I built &lt;strong&gt;LLM Eval Suite&lt;/strong&gt; to replace gut feel with structured, evidence-backed scoring — for any task type, with CI integration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://swapnanilsaha.com/tools/llm-eval-suite/" rel="noopener noreferrer"&gt;→ Full tool page&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Insight: Evidence, Not Opinion
&lt;/h2&gt;

&lt;p&gt;Every score in LLM Eval Suite is accompanied by a verbatim quote from the response being evaluated. Not "this response has poor faithfulness" — but:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Faithfulness: 1.0/10
Quote: "30-day return policy, no questions asked"
Reasoning: "Source document specifies 14 days. This is a clear hallucination, not an interpretation."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This changes what you can do with the output. You can show it to a stakeholder. You can track it over time. You can build a regression test from it. You can tell the model what specifically went wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Six Evaluation Capabilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multi-Dimensional Scoring
&lt;/h3&gt;

&lt;p&gt;Ten task presets — QA, summarisation, RAG, code generation, creative writing, classification, translation, and more. Each preset activates the dimensions that matter for that task:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Key Dimensions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qa&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Faithfulness, Completeness, Conciseness, Relevance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;summarisation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Coverage, Compression, Accuracy, Readability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rag&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Faithfulness, Answer Relevancy, Context Precision, Context Recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Correctness, Efficiency, Readability, Security&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every dimension score comes with verbatim evidence from the response text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run cli &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--file&lt;/span&gt; examples/eval_qa.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--mode&lt;/span&gt; compare &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; markdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Regression Testing
&lt;/h3&gt;

&lt;p&gt;Save any eval report as a named baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run cli regression save results.json &lt;span class="nt"&gt;--id&lt;/span&gt; prod-baseline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run future evals against it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run cli regression run results.json &lt;span class="nt"&gt;--id&lt;/span&gt; prod-baseline &lt;span class="nt"&gt;--format&lt;/span&gt; markdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per-dimension deltas are compared against configurable thresholds. &lt;strong&gt;Exit code 1 when scores drop below your floor.&lt;/strong&gt; This is the feature that makes the tool useful in CI.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Actions Integration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run LLM eval&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;docker-compose run cli eval \&lt;/span&gt;
      &lt;span class="s"&gt;--file evals/suite.json \&lt;/span&gt;
      &lt;span class="s"&gt;--mode rank \&lt;/span&gt;
      &lt;span class="s"&gt;--format junit \&lt;/span&gt;
      &lt;span class="s"&gt;--output results.xml&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mikepenz/action-junit-report@v3&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;report_paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;results.xml&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Regression check&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;docker-compose run cli regression run \&lt;/span&gt;
      &lt;span class="s"&gt;results.json --id prod-baseline&lt;/span&gt;
    &lt;span class="s"&gt;# exits 1 if any dimension drops beyond threshold&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gates model upgrades, prompt changes, and fine-tune releases automatically. The JUnit XML output integrates with any CI system that understands test reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hallucination Detection
&lt;/h3&gt;

&lt;p&gt;Claim-level analysis against a source document. Each claim in the response is classified as supported or unsupported — binary, not "mostly faithful."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run cli hallucination &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--response&lt;/span&gt; output.txt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; source.txt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; markdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Risk levels: none / low / moderate / high / critical, with a &lt;code&gt;safe_to_use&lt;/code&gt; boolean for downstream gating. This is what you run before using LLM output in a production pipeline where accuracy matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;hallucination_risk&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;
&lt;span class="na"&gt;safe_to_use&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="na"&gt;Claim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30-day&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;return&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;policy"&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unsupported&lt;/span&gt;
  &lt;span class="na"&gt;evidence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;specifies&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;14&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days"&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;

&lt;span class="na"&gt;Claim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;questions&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;asked"&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unsupported&lt;/span&gt;
  &lt;span class="na"&gt;evidence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;makes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mention&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;return&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;conditions"&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prompt Sensitivity Analysis
&lt;/h3&gt;

&lt;p&gt;Test 2–5 prompt variants against a fixed response. Per-dimension variance tells you which dimensions are fragile across phrasings and which are stable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run cli sensitivity &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--file&lt;/span&gt; examples/prompt_variants.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; markdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Know which prompt phrasings shift your scores before you deploy. High-variance dimensions across prompts signal that your evaluation isn't measuring the response — it's measuring the prompt wording.&lt;/p&gt;

&lt;h3&gt;
  
  
  Panel Evaluation
&lt;/h3&gt;

&lt;p&gt;Run N independent judge passes on the same evaluation. Mean and variance per dimension expose where judges agree and where they disagree.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run cli panel &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--file&lt;/span&gt; examples/eval_qa.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--judges&lt;/span&gt; 5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; markdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;High-variance dimensions are flagged for human review automatically. The panel mode is the right choice when you're evaluating subjective tasks like creative writing where a single judge's opinion is insufficient signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAGAS-Compatible RAG Preset
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;rag&lt;/code&gt; task type maps the four RAGAS metrics — &lt;strong&gt;faithfulness, answer relevancy, context precision, context recall&lt;/strong&gt; — as first-class evaluation dimensions with equal weighting. The output is compatible with RAGAS reporting conventions, so you can integrate this into existing RAGAS workflows or use it as a drop-in alternative.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example: Two Responses In, Clear Winner Out
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qa"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"eval_mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"compare"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Refunds are accepted within 14 days if the item is unused."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"responses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Response A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You can get a refund within 14 days if the item hasn't been used."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Response B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Our 30-day return policy means no questions asked."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;winner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Response A&lt;/span&gt;
&lt;span class="na"&gt;margin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clear&lt;/span&gt;

&lt;span class="s"&gt;Response B — Faithfulness&lt;/span&gt;
  &lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.0/10&lt;/span&gt;
  &lt;span class="s"&gt;quote&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30-day&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;return&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;policy,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;questions&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;asked"&lt;/span&gt;
  &lt;span class="na"&gt;reasoning&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;specifies&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;14&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'No&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;questions&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;asked'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;source.&lt;/span&gt;
              &lt;span class="s"&gt;Two&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;distinct&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hallucinations&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;one&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sentence."&lt;/span&gt;

&lt;span class="s"&gt;Response A — Faithfulness&lt;/span&gt;
  &lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;9.5/10&lt;/span&gt;
  &lt;span class="s"&gt;quote&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;within&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;14&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;if&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;item&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hasn't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;been&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;used"&lt;/span&gt;
  &lt;span class="na"&gt;reasoning&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accurately&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;paraphrases&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;additions."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why This Matters in Production
&lt;/h2&gt;

&lt;p&gt;LLM evaluation is usually treated as a one-time concern — you evaluate before you ship. But models change, prompts drift, data distributions shift, and retrieval quality fluctuates. A system that was 90% faithful in January may be 75% faithful in April because the upstream data changed.&lt;/p&gt;

&lt;p&gt;The regression testing and CI integration in LLM Eval Suite are designed for this reality. You run evals continuously, not just at release time. The baseline is the floor — if you drop below it, the pipeline stops.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://swapnanilsaha.com/tools/llm-eval-suite/" rel="noopener noreferrer"&gt;→ View the full tool page, docs, and GitHub repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>testing</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Stop Getting 'It Depends' Answers About RAG Architecture</title>
      <dc:creator>Swapnanil Saha</dc:creator>
      <pubDate>Thu, 21 May 2026 05:09:30 +0000</pubDate>
      <link>https://dev.to/swapnanilsaha/stop-getting-it-depends-answers-about-rag-architecture-1em7</link>
      <guid>https://dev.to/swapnanilsaha/stop-getting-it-depends-answers-about-rag-architecture-1em7</guid>
      <description>&lt;p&gt;Ask five AI engineers which vector database to use for your RAG system. You'll get five different answers, and they'll all start with "it depends."&lt;/p&gt;

&lt;p&gt;It depends on your data volume. It depends on your query patterns. It depends on whether you need GDPR compliance. It depends on your team's infra maturity. It depends on your budget. It depends on whether you're doing hybrid search.&lt;/p&gt;

&lt;p&gt;The "it depends" answer is technically correct and operationally useless. It turns an architecture decision into an unbounded research project.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;RAG Readiness&lt;/strong&gt; to make one specific recommendation per component — and explain why.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://swapnanilsaha.com/tools/rag-readiness/" rel="noopener noreferrer"&gt;→ Full tool page&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Design Principle: Opinions, Not Options
&lt;/h2&gt;

&lt;p&gt;Most RAG tooling and documentation presents you with a comparison table. Pinecone vs. Weaviate vs. Qdrant vs. Chroma. BM25 vs. dense vs. hybrid. ada-002 vs. text-embedding-3-large.&lt;/p&gt;

&lt;p&gt;Comparison tables are useful if you already know which dimensions matter for your use case. They're paralyzing if you don't.&lt;/p&gt;

&lt;p&gt;RAG Readiness is opinionated by design. You describe your use case, your data, your constraints. The tool returns &lt;strong&gt;one choice per component&lt;/strong&gt; — with full reasoning.&lt;/p&gt;

&lt;p&gt;If GDPR applies, managed cloud vector databases are eliminated from consideration before the LLM is even called. That's a rule, not an LLM judgment. The recommendation you receive is already constraint-filtered.&lt;/p&gt;




&lt;h2&gt;
  
  
  Six Modes, One Tool
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture Recommendation
&lt;/h3&gt;

&lt;p&gt;The core mode. Answer a structured set of questions about your use case — document types, query patterns, scale, compliance requirements, team capabilities. Get back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector database&lt;/strong&gt;: one specific choice with rationale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model&lt;/strong&gt;: one specific choice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking strategy&lt;/strong&gt;: one specific approach with parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval method&lt;/strong&gt;: dense / BM25 / hybrid — one answer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranker&lt;/strong&gt;: whether you need one and which
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py audit &lt;span class="nt"&gt;--interactive&lt;/span&gt;
&lt;span class="c"&gt;# or from file:&lt;/span&gt;
python main.py audit &lt;span class="nt"&gt;--file&lt;/span&gt; examples/usecase_legal_contracts.json &lt;span class="nt"&gt;--with-cost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Architecture Diagnosis
&lt;/h3&gt;

&lt;p&gt;You already have a RAG system. It's not working. This mode takes your existing architecture and the problems you're seeing, and returns a root-cause analysis per component with severity levels and one specific fix.&lt;/p&gt;

&lt;p&gt;Not "improve your chunking" — "switch from fixed 512-token chunks to parent-child hierarchical chunking with 512-token child nodes. Your documents have multi-clause structure that fixed chunks split mid-sentence."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py diagnose &lt;span class="nt"&gt;--file&lt;/span&gt; examples/diagnosis_pinecone_fixed.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;overall_severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;

&lt;span class="s"&gt;chunking_strategy — critical&lt;/span&gt;
  &lt;span class="s"&gt;"Fixed 512-token chunks split mid-clause in long legal documents"&lt;/span&gt;
  &lt;span class="s"&gt;Fix&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Parent-child hierarchical chunking, 512-token child nodes&lt;/span&gt;

&lt;span class="s"&gt;retrieval_method — high&lt;/span&gt;
  &lt;span class="s"&gt;"Dense-only misses exact terms like dollar amounts and clause references"&lt;/span&gt;
  &lt;span class="s"&gt;Fix&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Hybrid BM25 + dense with RRF fusion&lt;/span&gt;

&lt;span class="na"&gt;quick_fix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enable 10% token overlap today. Takes 20 minutes, reduces&lt;/span&gt;
           &lt;span class="s"&gt;the worst failures while you implement the full fix.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multi-Use-Case Session
&lt;/h3&gt;

&lt;p&gt;Run up to 5 parallel audits in a single request — useful when you're scoping a RAG platform that needs to serve multiple internal teams.&lt;/p&gt;

&lt;p&gt;The output includes cross-cutting insights: which components can be shared across use cases, where requirements conflict (the legal team needs GDPR-compliant storage; the sales team wants managed cloud), and which use case to build first for the highest return on the shared infrastructure investment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Bundle
&lt;/h3&gt;

&lt;p&gt;Once you have an architecture you trust, generate a complete implementation starter kit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py bundle &amp;lt;session-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: a &lt;code&gt;requirements.txt&lt;/code&gt;, &lt;code&gt;docker-compose.yml&lt;/code&gt;, &lt;code&gt;.env.example&lt;/code&gt;, and migration guide tailored to the recommended architecture. If you have an existing stack, you get ordered migration steps with rollback notes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Estimation
&lt;/h3&gt;

&lt;p&gt;Rule-based monthly cost breakdown per component — &lt;strong&gt;no LLM call&lt;/strong&gt;. Lookup tables for vector DB pricing tiers, embedding API costs, reranker inference, and LLM costs at your estimated query volume.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py cost &amp;lt;session-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns a line-item breakdown, optimization tips (e.g., "switching to a self-hosted embedding model saves ~$800/month at this query volume"), and a hosting model classification (managed vs. self-hosted trade-off at your scale).&lt;/p&gt;

&lt;h3&gt;
  
  
  RAGAS Eval Dataset Generation
&lt;/h3&gt;

&lt;p&gt;Generate evaluation questions grounded in your actual use case and query patterns — not generic retrieval questions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py eval-dataset &amp;lt;session-id&amp;gt; &lt;span class="nt"&gt;--num-questions&lt;/span&gt; 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output includes easy/medium/hard distribution, RAGAS metric mapping (which questions test faithfulness vs. answer relevancy vs. context precision), an annotation guide, and a time estimate for human review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Session Persistence and Refinement
&lt;/h2&gt;

&lt;p&gt;Every audit persists to SQLite. You can refine against new constraints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py refine &amp;lt;session-id&amp;gt; &lt;span class="nt"&gt;--feedback&lt;/span&gt; &lt;span class="s2"&gt;"Qdrant was too heavy for our infra team"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool re-runs with the feedback as an additional constraint. Refinement history is tracked — you can see how the recommendation evolved across iterations.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Complete Quickstart
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/swapnanil/rag-readiness
&lt;span class="nb"&gt;cd &lt;/span&gt;rag-readiness
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# add your ANTHROPIC_API_KEY&lt;/span&gt;
docker-compose up api

&lt;span class="c"&gt;# New architecture audit (interactive)&lt;/span&gt;
python main.py audit &lt;span class="nt"&gt;--interactive&lt;/span&gt;

&lt;span class="c"&gt;# Diagnose a broken stack&lt;/span&gt;
python main.py diagnose &lt;span class="nt"&gt;--interactive&lt;/span&gt;

&lt;span class="c"&gt;# Multi-use-case session&lt;/span&gt;
python main.py multi-audit examples/multi_usecase_lexvault.json

&lt;span class="c"&gt;# List sessions and refine&lt;/span&gt;
python main.py sessions
python main.py refine &amp;lt;session-id&amp;gt; &lt;span class="nt"&gt;--feedback&lt;/span&gt; &lt;span class="s2"&gt;"need self-hosted only"&lt;/span&gt;

&lt;span class="c"&gt;# Cost breakdown and eval dataset&lt;/span&gt;
python main.py cost &amp;lt;session-id&amp;gt;
python main.py eval-dataset &amp;lt;session-id&amp;gt; &lt;span class="nt"&gt;--num-questions&lt;/span&gt; 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Pre-Scoring Layer
&lt;/h2&gt;

&lt;p&gt;Before any LLM call, a rule-based pre-scorer computes a complexity score (1–10) from the use case inputs. This has two effects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It calibrates the LLM prompt — a complexity-1 use case gets a simpler, more direct recommendation; a complexity-9 use case gets a recommendation with more explicit trade-off reasoning.&lt;/li&gt;
&lt;li&gt;It runs conflict detection — if your inputs contain contradictory constraints (e.g., "GDPR compliant" + "use Pinecone"), the conflict is flagged before the LLM is called, not discovered in the output.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI engineers&lt;/strong&gt; starting a new RAG project who want a structured starting point rather than a blank page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering leads&lt;/strong&gt; who need to scope a RAG system for a business use case and justify the architecture choices to non-technical stakeholders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams with an existing RAG system&lt;/strong&gt; that isn't performing as expected and need a systematic diagnosis, not a hunch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tool is open-source, runs locally, and persists everything to SQLite. Your use case details don't leave your environment beyond the single LLM API call per audit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://swapnanilsaha.com/tools/rag-readiness/" rel="noopener noreferrer"&gt;→ View the full tool page, docs, and GitHub repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building Distributed Systems, Backend Infrastructure &amp; AI Platforms — My Engineering Journey</title>
      <dc:creator>Swapnanil Saha</dc:creator>
      <pubDate>Tue, 19 May 2026 09:34:58 +0000</pubDate>
      <link>https://dev.to/swapnanilsaha/building-distributed-systems-backend-infrastructure-ai-platforms-my-engineering-journey-54p3</link>
      <guid>https://dev.to/swapnanilsaha/building-distributed-systems-backend-infrastructure-ai-platforms-my-engineering-journey-54p3</guid>
      <description>&lt;p&gt;Hey everyone 👋&lt;/p&gt;

&lt;p&gt;I’m &lt;a href="https://swapnanilsaha.com" rel="noopener noreferrer"&gt;Swapnanil Saha&lt;/a&gt;, a backend and distributed systems engineer from Mumbai, India with 9+ years of experience building high-performance infrastructure systems, backend platforms, optimization pipelines, and AI-driven architectures.&lt;/p&gt;

&lt;p&gt;🌐 Website: &lt;a href="https://swapnanilsaha.com" rel="noopener noreferrer"&gt;swapnanilsaha.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💻 GitHub: &lt;a href="https://github.com/swapnanil" rel="noopener noreferrer"&gt;github.com/swapnanil&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔗 LinkedIn: &lt;a href="https://www.linkedin.com/in/swapnanil/" rel="noopener noreferrer"&gt;linkedin.com/in/swapnanil&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Work On
&lt;/h2&gt;

&lt;p&gt;Over the years, I’ve worked on systems involving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large-scale distributed backend systems&lt;/li&gt;
&lt;li&gt;Low-latency request processing&lt;/li&gt;
&lt;li&gt;High-QPS APIs&lt;/li&gt;
&lt;li&gt;Real-time optimization systems&lt;/li&gt;
&lt;li&gt;ML/CVR prediction pipelines&lt;/li&gt;
&lt;li&gt;AI infrastructure and automation tooling&lt;/li&gt;
&lt;li&gt;Performance engineering and infra optimization&lt;/li&gt;
&lt;li&gt;Data-intensive backend architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of my work has focused on solving problems where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scalability matters,&lt;/li&gt;
&lt;li&gt;latency matters,&lt;/li&gt;
&lt;li&gt;reliability matters,&lt;/li&gt;
&lt;li&gt;and infrastructure efficiency matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I enjoy working on systems that operate under real production constraints and require practical engineering tradeoffs at scale.&lt;/p&gt;




&lt;h1&gt;
  
  
  Areas I’m Currently Exploring
&lt;/h1&gt;

&lt;p&gt;Lately, I’ve been spending more time exploring the intersection of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI systems&lt;/li&gt;
&lt;li&gt;distributed infrastructure&lt;/li&gt;
&lt;li&gt;backend optimization&lt;/li&gt;
&lt;li&gt;and production-scale ML platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some areas I’m currently interested in:&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Infrastructure
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;inference systems&lt;/li&gt;
&lt;li&gt;orchestration&lt;/li&gt;
&lt;li&gt;model serving&lt;/li&gt;
&lt;li&gt;prompt pipelines&lt;/li&gt;
&lt;li&gt;scalable AI tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ML &amp;amp; Prediction Systems
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CVR prediction systems&lt;/li&gt;
&lt;li&gt;feature pipelines&lt;/li&gt;
&lt;li&gt;production inference flows&lt;/li&gt;
&lt;li&gt;experimentation frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Backend &amp;amp; Distributed Systems
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;concurrency&lt;/li&gt;
&lt;li&gt;caching systems&lt;/li&gt;
&lt;li&gt;event-driven architectures&lt;/li&gt;
&lt;li&gt;reliability engineering&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;scaling strategies&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Tech Stack
&lt;/h1&gt;

&lt;p&gt;Some technologies I frequently work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Java&lt;/li&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;SQL&lt;/li&gt;
&lt;li&gt;Linux systems&lt;/li&gt;
&lt;li&gt;REST/gRPC architectures&lt;/li&gt;
&lt;li&gt;Distributed caching systems&lt;/li&gt;
&lt;li&gt;Cloud infrastructure&lt;/li&gt;
&lt;li&gt;Performance optimization tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently spending more time improving my:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go&lt;/li&gt;
&lt;li&gt;Rust&lt;/li&gt;
&lt;li&gt;AI systems engineering skills&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Why I Built My Portfolio Website
&lt;/h1&gt;

&lt;p&gt;I recently launched my personal website:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://swapnanilsaha.com" rel="noopener noreferrer"&gt;https://swapnanilsaha.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is to create a central place for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;projects&lt;/li&gt;
&lt;li&gt;engineering writeups&lt;/li&gt;
&lt;li&gt;architecture ideas&lt;/li&gt;
&lt;li&gt;experiments&lt;/li&gt;
&lt;li&gt;future AI infrastructure work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also wanted a clean place to document lessons from building production systems and exploring modern AI infrastructure.&lt;/p&gt;




&lt;h1&gt;
  
  
  Topics I’ll Be Writing About
&lt;/h1&gt;

&lt;p&gt;Going forward, I plan to write about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scaling backend systems&lt;/li&gt;
&lt;li&gt;low-latency engineering&lt;/li&gt;
&lt;li&gt;distributed systems architecture&lt;/li&gt;
&lt;li&gt;AI infra tooling&lt;/li&gt;
&lt;li&gt;production ML systems&lt;/li&gt;
&lt;li&gt;performance optimization&lt;/li&gt;
&lt;li&gt;backend engineering patterns&lt;/li&gt;
&lt;li&gt;practical engineering lessons from real systems&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Open to Connecting
&lt;/h1&gt;

&lt;p&gt;I enjoy discussing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;distributed systems&lt;/li&gt;
&lt;li&gt;backend architecture&lt;/li&gt;
&lt;li&gt;AI infrastructure&lt;/li&gt;
&lt;li&gt;performance engineering&lt;/li&gt;
&lt;li&gt;scalable systems design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feel free to connect through my website or social profiles.&lt;/p&gt;

&lt;p&gt;🌐 Website: &lt;a href="https://swapnanilsaha.com" rel="noopener noreferrer"&gt;swapnanilsaha.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💻 GitHub: &lt;a href="https://github.com/swapnanil" rel="noopener noreferrer"&gt;github.com/swapnanil&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔗 LinkedIn: &lt;a href="https://www.linkedin.com/in/swapnanil/" rel="noopener noreferrer"&gt;linkedin.com/in/swapnanil&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading 🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>infrastructure</category>
    </item>
  </channel>
</rss>
