Swapnanil Saha

Posted on May 26 • Originally published at swapnanilsaha.com

Why AI Code Assistants Waste Context — and How RAG Fixes It

#codeassistants #llm #rag #productivity

Open a large file in your AI code assistant and ask it to refactor a function buried three hundred lines down. Watch it confidently produce something plausible but wrong — using an interface that was deprecated last sprint, calling a helper that doesn't exist in this service, ignoring a constraint in the module-level docstring that it technically "saw." The model didn't forget. The information was technically present in the prompt, but the transformer's attention mechanism never meaningfully focused on it. That's a different kind of failure, and it doesn't get better with a bigger context window.

There's a persistent intuition in this industry that more context is always better. Send the whole file. Send the whole codebase. This intuition breaks in a specific and measurable way. The mechanism is called attention dilution — softmax normalization means that every token in the context competes for a fixed budget of attention weight, and as the sequence grows longer, any given piece of information gets a smaller share of that budget.

This post walks through the transformer attention math to explain exactly why the naive approach fails, then covers how RAG (Retrieval-Augmented Generation) addresses it — by retrieving only the specific code chunks relevant to the current task and injecting those into the context window instead of dumping everything.

Part 1: The Problem with Stuffing

1. The Naive Approach: Just Send Everything

The first instinct when building a code assistant is to send as much context as possible. Your project has a utility module? Include it. There's a shared type definitions file? Throw that in too. If the model's context window is 128,000 tokens, fill it to the brim — more information has to be better, right?

This is called context window stuffing. Three things go wrong with it, and each gets worse as the codebase grows. The first is attention dilution — the focus of this section. The second is position bias (Section 3). The third is raw cost (Section 4). To understand why these happen, you need a concrete model of how a transformer actually reads a prompt.

A transformer does not read a prompt sequentially, the way a human reads a page from left to right. Instead, it processes all tokens simultaneously, and every token attends to every other token in the sequence. The attention mechanism is the machine that computes how much each token should "look at" every other token when forming its representation.

The output of attention for a single token is a weighted average of all the other tokens' value vectors. The weights are computed by comparing the current token's query vector against every other token's key vector. When you add more tokens to the context, you are not adding more information to a receptive mind — you are adding more competitors for a fixed budget of attention weight.

Analogy: Imagine you are in a room full of people, all talking at once. You can only pay 100 percent of your attention total — it does not grow with the number of people. With 5 people in the room, each gets roughly 20% of your focus. With 500, each gets 0.2%. When the relevant person finally says something, their share of your attention has collapsed to noise. That is what happens to code buried in a long prompt.

2. Why Attention Dilutes: The Math

The attention mechanism was introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). Its core computation is:

Attention(Q, K, V) = softmax( QK^T / √d_k ) · V

Where:

Q — the query matrix (what each token is "asking for")
K — the key matrix (what each token "offers" for comparison)
V — the value matrix (the actual content passed forward if selected)
d_k — the dimension of the key vectors (scales to prevent extreme dot products)
softmax — converts a vector of raw scores into a probability distribution that sums to 1

The notation QK^T means: for each token, compute a dot product between its query vector and every other token's key vector. The dot product is large when two vectors point in the same direction (high relevance between the pair), and near zero when they are orthogonal (unrelated). Multiplying by the transposed key matrix K^T does all N×N such comparisons in a single matrix operation. The result is a matrix of raw relevance scores. Dividing by √d_k prevents those scores from becoming so large that softmax saturates.
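To make the formula concrete, here is a minimal pure-Python sketch of scaled dot-product attention for a single head. The function names (`softmax`, `attention`) and the toy vectors are illustrative assumptions, and the sketch leaves out causal masking and batching, which real implementations handle with tensor operations.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Q, K: N rows of dimension d_k. V: N rows of any dimension.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One row of QK^T / sqrt(d_k): compare this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy example: 3 tokens, d_k = 2 (values chosen arbitrarily for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(Q, K, V)
```

Every row of `weights` sums to 1, which is the fixed budget the rest of this section is about.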

The softmax step is the dilution mechanism. Because softmax always outputs a probability distribution — all values sum to exactly 1 — attention weights are a zero-sum resource. When there are N tokens in the context, the average attention weight is 1/N, regardless of what any individual token does. The total budget is fixed at 1.0.

This does not mean every token gets exactly equal attention — the model can still concentrate on a small subset if the dot-product scores separate those tokens sharply from the rest. Softmax is non-linear and can be quite aggressive when there is a large score gap between relevant and irrelevant tokens. But in a real codebase, that gap is rarely clean. Hundreds of unrelated function definitions produce hundreds of tokens with moderately non-zero dot products — they're not completely irrelevant, they just aren't what you need right now. These tokens collectively consume most of the softmax budget. The useful signal must compete against this crowd, and as N grows, the signal's share degrades continuously. It isn't a cliff; it's a steady erosion that compounds with each additional file you stuff in.
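The erosion can be sketched numerically. The example below assumes one relevant token with a post-scaling score of 4.0 and N−1 distractor tokens that each score 1.0; both numbers are made up for illustration, and `relevant_share` is a hypothetical helper name.

```python
import math

def relevant_share(n_tokens, relevant_score=4.0, distractor_score=1.0):
    """Softmax weight of one relevant token among n_tokens - 1 equal distractors."""
    relevant = math.exp(relevant_score)
    distractors = (n_tokens - 1) * math.exp(distractor_score)
    return relevant / (relevant + distractors)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"N={n:>7}: relevant token share = {relevant_share(n):.4f}")
```

Under these assumed scores the relevant token holds roughly 69% of the weight at N=10 and under 2% at N=1,000, even though its own score never changed.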

Key Insight: The context window limit is not just a practical engineering constraint — it reflects a genuine quality degradation. The problem is not that the model cannot read long inputs. It is that as context grows, every individual piece of information receives proportionally less attention weight. More input does not mean more comprehension; it means each fact competes harder for finite attentional resources.

3. Lost in the Middle: Position Bias

Attention dilution is one problem. A second, independent problem compounds it: position bias. Modern language models do not attend to all positions in their context with equal reliability. They preferentially attend to tokens at the beginning and end of the sequence, and perform significantly worse on information placed in the middle.

This phenomenon was studied in a 2023 paper by Nelson Liu et al. titled Lost in the Middle: How Language Models Use Long Contexts. The researchers tested models on multi-document question answering, varying the position of the document containing the answer. When the answer document was first or last, accuracy was high. When it sat in the middle of a 20-document context, accuracy dropped sharply (the paper reports drops of more than 20% for GPT-3.5-Turbo), even though the information was technically within the model's context window.

Two mechanisms contribute. The first is RoPE (Rotary Position Embeddings), the positional encoding scheme in most modern open-source language models (LLaMA, Mistral, GPT-NeoX). RoPE encodes position by rotating the query and key vectors by angles proportional to their positions. The dot product between a query at position m and a key at position n depends only on the relative distance (m−n), and its magnitude tends to decay as that distance grows, so semantically relevant tokens far from the query position must overcome a rotational penalty to receive attention weight. Under causal masking, tokens near the start of the sequence are also visible to every later position, which gives them a structural advantage.
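A small sketch can show the rotation at work. It assumes the original RoFormer layout (adjacent dimension pairs, base 10000); the helper names `rope`, `dot`, and `score` are hypothetical. It only demonstrates that the query-key score depends on the relative offset, because the long-range decay is a trend in the upper bound rather than a strictly monotonic drop.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # pair frequency: base^(-2k/d) with i = 2k
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n."""
    return dot(rope(q, m), rope(k, n))

q = [0.5, -1.0, 2.0, 0.3]   # arbitrary toy vectors
k = [1.5, 0.2, -0.7, 1.1]

near_start = score(q, k, 10, 3)      # offset 7, early in the sequence
far_along = score(q, k, 110, 103)    # same offset 7, much later
same_position = score(q, k, 5, 5)    # offset 0: rotation cancels out
```

Here `near_start` and `far_along` agree up to floating-point error, and `same_position` equals the plain dot product of `q` and `k`.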

The second mechanism is causal training recency bias. Language models are trained to predict the next token given all previous tokens. This reward signal pushes models to weight recent tokens heavily — the immediately preceding context is almost always the most relevant signal for next-token prediction during training. The middle of a long context rarely dominated training gradients, so models systematically underweight it. This effect was documented in GPT-3.5 era models well before RoPE became standard — it isn't purely an artifact of positional encoding, it's baked into causal pretraining. Both effects run in the same direction: the middle of a long context is structurally disadvantaged.

A 2024 paper from UW, MIT, and Google (Found in the Middle) demonstrated that this bias can be partially corrected by calibrating attention weights at inference time — but this requires modifying the model's internals, which is not available when calling an API.

Common Mistake: Many teams inject retrieved chunks right after the system prompt and before a long conversation history. This lands retrieved content in a position that gets the worst of both worlds: far from the beginning (losing the primacy advantage) and far from the end (losing the recency advantage). The safest placement for retrieved code context is immediately before the user's specific question, near the end of the prompt rather than buried in the middle of a long history.

4. The Quadratic Cost Problem

Even if you were willing to accept degraded attention quality, there is a third reason not to stuff context: the compute cost of attention scales quadratically with sequence length.

To compute the full attention matrix, the model must compare every token's query against every other token's key. If your sequence has N tokens, this requires N × N comparisons. Doubling the context length quadruples the compute required for attention.

Time complexity of full self-attention: O(N² · d)

A 4× increase in context length → 16× increase in attention compute. A 10× increase → 100×.
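The ratios above follow directly from the N² term. As a quick check, the sketch below uses the common approximation of 2·N²·d multiply-adds per attention layer (QK^T plus the weighted sum over V); the model width of 4096 is an arbitrary assumption and cancels out of the ratios.

```python
def attention_flops(n_tokens, d_model):
    """Approximate multiply-adds for QK^T plus the weighted sum over V: 2 * N^2 * d."""
    return 2 * n_tokens * n_tokens * d_model

baseline = attention_flops(10_000, 4096)
for n in (10_000, 40_000, 100_000):
    ratio = attention_flops(n, 4096) / baseline
    print(f"N={n:>7}: {ratio:.0f}x the attention compute of N=10,000")
```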

FlashAttention (Dao et al., 2022) improves the memory profile to O(N) via tiling — it never writes the full N×N matrix to GPU memory. But the number of floating-point operations is still O(N²). Latency and cost still scale quadratically with sequence length.

In production, a code assistant filling 100,000 tokens of context is not just 10× slower than one filling 10,000 tokens — it is closer to 100× more expensive in attention compute alone. You are paying more to get worse results.

Part 2: How RAG Fixes It

5. RAG at a Glance: The Core Idea

Retrieval-Augmented Generation reframes the problem. Instead of asking "how can we give the model the whole codebase?", it asks: "how do we figure out which parts of the codebase are relevant to this specific completion request, and send only those?"

The answer has two phases. First, an offline indexing phase where the codebase is processed, divided into chunks, and each chunk is converted into a vector representation (an embedding) that captures its semantic meaning. These vectors are stored in an index optimized for fast similarity search. Second, an online retrieval phase that happens at query time: the developer's current context is converted into a query vector, and the most similar chunks from the index are retrieved and injected into the prompt.

The model then receives a context window that is not a random cross-section of the codebase — it is the small set of pieces most likely to be relevant to the task at hand.

The pipeline:

1. Parse & Chunk: split at function/class boundaries, not arbitrary token counts
2. Embed Chunks: convert each chunk to a vector with a code embedding model
3. Build Search Index: ANN index for dense retrieval + BM25 index for lexical retrieval
4. Embed the Query: convert current cursor context to a query vector
5. Retrieve Top-k: run hybrid search (dense + BM25), fuse results
6. Inject & Generate: inject top 3–5 chunks into the LLM prompt, immediately before the user's request

Steps 1–3 happen once (or on incremental file changes). Steps 4–6 happen on every completion request. The parts where most implementations go wrong: chunking (using fixed-size splits instead of AST boundaries), retrieval (using only dense search and missing exact identifier queries), and injection order (burying retrieved context in the middle of the prompt).

6. Chunking for Code: Why Fixed-Size Fails

Code has structure that text does not. A function is a unit of meaning. Fixed-size chunking — splitting every file every 256 tokens — splits in the middle of functions, destroying logical units.

Consider a Python function that is 80 lines long. With a 50-token chunk size, it gets split into chunks that look like:

Chunk A:
def process_payment(order_id, amount, currency="USD"):
    """Process a payment..."""
    conn = get_db_connection()
    try:
        txn = conn.begin_transaction(

Chunk B:
            order_id=order_id,
            amount=amount,
            currency=currency
        )
    except DatabaseError as e:
        log_error(e)
        raise PaymentError(str(e))

Neither chunk represents the function accurately. The embedding of Chunk A does not represent "a payment processing function" — it represents a truncated fragment.

AST-based chunking uses tree-sitter to parse each file and extract logical units at language-defined boundaries: function definitions, class bodies, method groups. Each chunk's metadata includes file path, start line, end line, and node type. This metadata is as important as the chunk text itself — it tells the retrieval system where in the codebase this chunk lives.

One practical addition: each chunk can be augmented with a small surrounding context for embedding purposes — the preceding import block, the class it belongs to, or the file's module-level docstring. This gives the embedding model enough context to produce a vector that reflects the chunk's role in the larger structure. The key is that this surrounding context is used only for embedding, not retrieved as part of the chunk text.

The overlap trap in code: Sliding window overlap (copying N tokens from one chunk into the next) is useful in prose. In code it often makes things worse: the overlap introduces duplicate logic into separate chunks, making embedding space crowded with near-identical vectors. For code, the recommended approach is to store a "parent context" chunk separately — always inject the enclosing class signature alongside any function chunk, rather than copying the previous function's body into the current chunk. The Continue open-source IDE extension uses this approach.

7. Retrieval Strategies: Dense, Sparse, and Why Code Needs Both

Dense retrieval converts query and each chunk to vectors, then finds the most similar by cosine similarity. It can match meaning even when exact words differ — "how do we handle rate limit errors?" surfaces functions named throttle_on_429 or backoff_retry.
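A minimal sketch of the dense step, assuming the embeddings already exist. The three-dimensional vectors below are hand-written stand-ins (real models emit hundreds to thousands of dimensions), and the search is exhaustive rather than an ANN index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dense_top_k(query_vec, chunk_vecs, k=3):
    """Return chunk ids sorted by cosine similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]

# Hypothetical embeddings for three chunks.
chunk_vecs = {
    "throttle_on_429": [0.9, 0.1, 0.0],
    "backoff_retry": [0.8, 0.3, 0.1],
    "render_invoice_pdf": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for "how do we handle rate limit errors?"
top = dense_top_k(query_vec, chunk_vecs, k=2)
```

With these made-up vectors the two rate-limit helpers rank ahead of the unrelated invoice function, even though the query shares no words with their names.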

The embedding model used matters significantly. Code-specialized models like voyage-code-3 — purpose-built for code retrieval, top-ranked on code retrieval benchmarks (2025) — produce substantially better representations for function bodies, type signatures, and API calls than general-purpose models. text-embedding-3-large is a strong general-purpose embedding model suited for mixed code + documentation retrieval, but it wasn't specifically designed around code.

BM25 (lexical/keyword retrieval) counts words. It excels at exact matches — a developer looking for PaymentGateway.process_refund will find it immediately. Error codes, configuration key names, and exact API method names are better retrieved lexically than semantically. For code, the asymmetry is important: queries for exact identifiers favor BM25. Queries for concepts and behaviors favor dense retrieval. The right system runs both.
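For comparison, here is a compact sketch of Okapi BM25 scoring. It assumes a deliberately naive identifier tokenizer, the commonly used defaults k1=1.5 and b=0.75, and the IDF variant with +1 inside the logarithm so scores stay non-negative. A production system would use an inverted index instead of rescanning every document.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split code into lowercase identifier tokens (a deliberately simple tokenizer)."""
    return [t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` (id -> text) against `query` with Okapi BM25."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    doc_freq = Counter()
    for toks in tokenized.values():
        doc_freq.update(set(toks))  # count each term once per document

    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores[doc_id] = score
    return scores

# Hypothetical three-file corpus.
docs = {
    "gateway.py": "class PaymentGateway:\n    def process_refund(self, order_id): ...",
    "retry.py": "def backoff_retry(fn, attempts=3): ...",
    "invoice.py": "def render_invoice_pdf(order_id): ...",
}
scores = bm25_scores("PaymentGateway.process_refund", docs)
best = max(scores, key=scores.get)
```

Only `gateway.py` contains the exact identifiers, so it is the only document with a non-zero score.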

8. Hybrid Search and Reciprocal Rank Fusion

Running both methods produces two ranked lists that need combining. BM25 scores and cosine similarity scores live in completely different numerical ranges — you cannot add them directly.

Reciprocal Rank Fusion (RRF) avoids the normalization problem entirely by ignoring raw scores and working only with ranks. The word "reciprocal" means 1/x — the score assigned to a document is the reciprocal of its rank in each list:

RRF_score(d) = Σ_{r ∈ R} 1 / (k + rank_r(d))

Where:

R = set of ranked lists (BM25 list, dense list)
rank_r(d) = position of document d in list r (1-indexed)
k = smoothing constant (default 60, from Cormack, Clarke & Buettcher 2009 — empirically robust across many retrieval tasks). Increasing k makes the formula more conservative, rewarding consistent mid-rank appearances over a single strong rank.
If a document does not appear in a list, its contribution from that list is 0

A document ranked #1 in both lists scores 2/61 ≈ 0.033. A document ranked #1 in one list but #100 in the other scores 1/61 + 1/160 ≈ 0.023. Candidates that both BM25 and semantic search agree on float to the top.
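The formula translates almost line for line into code. This is a minimal sketch in which each ranked list is simply a list of chunk ids ordered best-first; the ids themselves are hypothetical.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of ids (each ordered best-first) into (id, score) pairs."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Ids missing from a list simply receive no contribution from it.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["gateway.process_refund", "models.Refund", "retry.backoff_retry"]
dense_ranking = ["gateway.process_refund", "retry.backoff_retry", "exceptions.PaymentError"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

In this toy run `retry.backoff_retry` appears in both lists at ranks 3 and 2, so it outranks `models.Refund`, which only BM25 found at rank 2.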

9. Reranking: The Final Sorting Pass

After hybrid search and RRF, you have ~20 candidate chunks. A cross-encoder reranker takes both the query and a candidate chunk as a single concatenated input and produces a relevance score. Because both texts pass through the model together, the model can attend to query-document relationships that a bi-encoder cannot — query and document never interact during bi-encoder encoding.

The practical architecture: use fast bi-encoder retrieval (dense + BM25 + RRF) to get the top 20 candidates, then run a cross-encoder on those 20 for final ordering. The top 5 go into the prompt.
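The two-stage shape can be sketched independently of any particular reranker. In the example below, `score_pair` is the slot where a real cross-encoder call would go. The `token_overlap_score` function is only a stand-in so the sketch runs without a model, and it is not a cross-encoder.

```python
from typing import Callable, Sequence

def retrieve_then_rerank(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Re-order first-stage candidates with a pairwise scorer and keep the best top_n."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def token_overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer for this sketch only: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().replace("(", " ").replace("_", " ").split())
    return len(query_words & chunk_words) / max(len(query_words), 1)

# Pretend these came back from hybrid retrieval (hypothetical chunks).
candidates = [
    "def render_invoice_pdf(order_id): ...",
    "def process_refund(order_id): ...",
    "def backoff_retry(fn): ...",
]
top = retrieve_then_rerank("process refund", candidates, token_overlap_score, top_n=2)
```

Keeping the scorer as a parameter makes it easy to benchmark the pipeline with and without a real reranker before committing to the extra latency.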

Cross-encoder context window limits: Cross-encoders are themselves transformer models with context window limits. General-purpose reranker models like ms-marco-MiniLM-L-12-v2 support 512 subword tokens — which is often enough for a single short function, but not for large class bodies. For retrieval pipelines that surface larger chunks, use a reranker with a larger window: Cohere Rerank 3 supports 4,096 tokens; voyage-rerank-2 supports 16K. If the combined chunk+query still exceeds the limit, truncate the chunk from the bottom — the function signature and docstring are more informative for reranking than the implementation tail.

For code with strong AST chunking and a good code embedding model, hybrid bi-encoder retrieval is often sufficient for most queries. Reranking becomes most valuable when queries are ambiguous or when the codebase has many semantically similar functions. It adds 50–200ms of latency, so benchmark before committing.

Part 3: In Production

10. How Cursor Does It: A Reference Architecture

When you open a project in Cursor, it chunks local files and sends them to its servers, where they are embedded (via OpenAI's API or a custom model) and stored in Turbopuffer — its vector store of choice. File paths are obfuscated client-side before any data leaves your machine. Embeddings are cached by chunk hash, making incremental re-indexing fast.
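Caching by chunk hash is simple to sketch. The class below is an illustration of the idea and not Cursor's implementation; `EmbeddingCache` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real embedding call.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so unchanged chunks are never re-embedded."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # hypothetical callable: str -> list[float]
        self._store = {}           # sha256 hex digest -> embedding
        self.misses = 0            # number of times embed_fn was actually called

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(chunk_text)
        return self._store[key]

def fake_embed(text):
    """Placeholder for a real embedding call; returns a trivial 2-d vector."""
    return [float(len(text)), float(text.count("\n"))]

cache = EmbeddingCache(fake_embed)
cache.get("def a():\n    return 1")
cache.get("def a():\n    return 1")  # unchanged chunk: served from the cache
cache.get("def a():\n    return 2")  # edited chunk: new hash, embedded again
```

After these three lookups `cache.misses` is 2: re-indexing a file costs one embedding call per changed chunk, not one per chunk.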

At query time, Cursor monitors the active cursor position and constructs a composite signal: the current file's surrounding code, any open editor tabs, and recent edit history. This signal is embedded into a query vector, sent to Turbopuffer for ANN search, and the top-k results are retrieved. The actual code is read from local disk; the model only sees the retrieved text.

@Codebase in Cursor's chat is the explicit trigger for a full retrieval pass over the indexed codebase. Without it, Cursor uses a lighter heuristic based on open tabs and file imports. @Docs and @Web extend the same pipeline beyond the local codebase.

One important architectural note: the embedding model used to index the codebase is separate from the generative model used to produce completions. Cursor uses a lightweight, fast embedding model for indexing (optimized for latency and throughput over millions of chunks) and a larger, slower generative model for the actual completion. When building a similar system, these two components have independent optimization concerns — do not assume the same model serves both roles.

GitHub Copilot's context construction follows a similar pattern. For inline completion, it uses the current file content around the cursor plus a Jaccard similarity heuristic to find other open tabs that share significant token overlap with the current file. The @workspace symbol in VS Code triggers a more thorough indexing-based search, analogous to Cursor's @Codebase. Copilot's default inline completion mode is a fast, low-latency path that does not run full vector retrieval on every keystroke — full retrieval is reserved for explicit chat interactions.
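A Jaccard heuristic of this kind takes only a few lines. The sketch below illustrates the general technique over identifier sets and is not Copilot's actual code; the function names and sample files are hypothetical.

```python
import re

def token_set(code):
    """Set of identifier-like tokens in a piece of source text."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code))

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over identifier sets; 0.0 when both are empty."""
    set_a, set_b = token_set(a), token_set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def rank_open_tabs(current_file, open_tabs):
    """Order open tabs (path -> source) by token overlap with the current file."""
    return sorted(
        open_tabs,
        key=lambda path: jaccard(current_file, open_tabs[path]),
        reverse=True,
    )

current = "def charge(order_id, amount):\n    return gateway.charge(order_id, amount)"
tabs = {
    "README.md": "Project setup notes",
    "gateway.py": "def charge(order_id, amount): ...",
}
ranked = rank_open_tabs(current, tabs)
```

The appeal for inline completion is cost: no embedding call and no index lookup, just set arithmetic on text that is already in memory.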

11. Tradeoffs and Limits of Code RAG

Scenario	RAG Behavior	Mitigation
Cross-file dependency reasoning	Each retrieved chunk is a fragment; the model may not understand how three retrieved functions compose at the call site	Include file path + line range metadata; retrieve parent class or module-level imports alongside function bodies
Newly created files not yet indexed	Invisible to retrieval until the index is rebuilt	Incremental indexing on file-save events; maintain a pending index queue
Query is too vague	"fix the bug" → retrieves generic results	Use cursor position + surrounding error message as primary query signal
Minified or generated code	Lock files, protobuf generated code pollute the index	Maintain a .gitignore-style exclude list for the RAG indexer
Very large monorepos	Recall degrades; indexing is slow	Scope index to current working subdirectory or per-service sub-indices
Schema/type changes	Stale embeddings give the model outdated type signatures	Invalidate embeddings on file write by chunk content hash

Does a larger context window make RAG obsolete? As context windows grow to 1M and beyond — Llama 4 Scout hit 10M tokens in 2025, Gemini 1.5 Pro supported 1M — this question keeps coming up. The practical answer is no, though the reasoning matters. A 200,000-line Python codebase easily exceeds 2 million tokens. Most production monorepos are far larger. More importantly, the attention quality degradation described in Sections 2 and 3 doesn't disappear with a larger nominal window. Those long-context models achieve their range through techniques like NTK-aware RoPE scaling (which extends the effective frequency range of positional encodings) and sparse attention patterns (which skip computation on distant token pairs) — these help with extrapolation but don't eliminate the position bias at extremely long ranges. And practically: a 1M-token prompt is expensive and slow even on state-of-the-art hardware. For interactive code assistance, stuffing the full codebase is off the table regardless of window size.

Large context windows and RAG do different jobs. RAG decides what deserves to be in the context window. The context window determines how much you can fit once you've been selective. A well-tuned system retrieves the right 5,000 tokens from a 10M-token codebase and puts them in a 128K window with room left for conversation history and tool outputs.

12. Building Your Own Code RAG Pipeline

Parsing: Use tree-sitter with a recursive AST walk — iterating only over root_node.children misses deeply nested functions and class methods.

# Sketch: AST chunk extraction with tree-sitter (py-tree-sitter v0.22+ API)
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

TARGET_TYPES = {"function_definition", "class_definition"}

def walk_tree(node, source: bytes, file_path: str, chunks: list) -> None:
    """Recursively walk the AST to catch nested definitions
    (methods inside classes, definitions inside if/try blocks, etc.)."""
    if node.type in TARGET_TYPES:
        # tree-sitter reports byte offsets, so slice the encoded source
        # rather than the str (they differ once non-ASCII text appears).
        chunk_text = source[node.start_byte:node.end_byte].decode("utf-8")
        chunks.append({
            "text": chunk_text,
            "file": file_path,
            "start_line": node.start_point[0] + 1,  # rows are 0-based
            "end_line": node.end_point[0] + 1,
            "type": node.type,
        })
        # For class_definition, keep recursing to capture methods.
        # For function_definition, stop: we want the whole function,
        # not its nested helpers as separate chunks.
        if node.type == "class_definition":
            for child in node.children:
                walk_tree(child, source, file_path, chunks)
    else:
        for child in node.children:
            walk_tree(child, source, file_path, chunks)

def extract_chunks(source_code: str, file_path: str) -> list[dict]:
    source = source_code.encode("utf-8")
    tree = parser.parse(source)
    chunks: list[dict] = []
    walk_tree(tree.root_node, source, file_path, chunks)
    return chunks

Embedding models:

Model	Context window	Strengths	When to use
`voyage-code-3`	32K tokens	Purpose-built for code; top-ranked on code retrieval benchmarks (2025)	Production code assistant, maximum retrieval quality
`text-embedding-3-large`	8K tokens	Strong general performance; well-supported; large community	Mixed code + documentation retrieval; existing OpenAI integrations
`nomic-embed-code`	8K tokens	Open-weight; can run locally; no API cost	Air-gapped environments; cost-sensitive deployments; on-prem

Vector store:

pgvector in Postgres — sufficient for single-developer or small-team tools
Qdrant — supports both dense and sparse vectors in a single collection, enabling native hybrid search without maintaining two separate stores

Prompt injection template:

You are a coding assistant for this codebase.

## Relevant context from the codebase:

### [payments/gateway.py · lines 42–87]

python
{chunk_1_text}


### [payments/exceptions.py · lines 1–24]

python
{chunk_2_text}


### [payments/models.py · lines 88–112]

python
{chunk_3_text}


## Current task:
{user_request}

Include file path and line numbers in each chunk header. These cost very few tokens but give the model the module structure needed to generate correct imports and references.
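Filling the template programmatically might look like the sketch below. It assumes chunk dictionaries with the `text`, `file`, `start_line`, and `end_line` keys produced by the chunking step. Wrapping each chunk in a python code fence is an assumption of this sketch, and `build_prompt` is a hypothetical helper name.

```python
FENCE = "`" * 3  # built programmatically so the fence can appear inside this example

def build_prompt(chunks, user_request):
    """Assemble the prompt: retrieved chunks first, the user's request last."""
    parts = [
        "You are a coding assistant for this codebase.",
        "",
        "## Relevant context from the codebase:",
        "",
    ]
    for chunk in chunks:
        header = f"### [{chunk['file']} · lines {chunk['start_line']}–{chunk['end_line']}]"
        parts += [header, "", f"{FENCE}python", chunk["text"], FENCE, ""]
    # The request goes last, so retrieved context sits immediately before it.
    parts += ["## Current task:", user_request]
    return "\n".join(parts)

example_chunks = [
    {
        "file": "payments/gateway.py",
        "start_line": 42,
        "end_line": 87,
        "text": "def process_refund(order_id):\n    ...",
    },
]
prompt = build_prompt(example_chunks, "Add retry logic to process_refund.")
```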

Do not retrieve more than you need. It is tempting to inject 10–15 chunks to "give the model more information." Resist this. Each additional chunk increases context size (paying the quadratic cost from Section 4), increases attention dilution, and reduces the proportion of the context that is highly relevant. In practice, 3–5 high-quality chunks typically outperform 15 lower-quality ones. Invest in retrieval quality, not retrieval quantity.

The Through-Line

The surprising thing about attention dilution is that it isn't a bug you can patch. It's a structural property of softmax normalization — the total attention weight sums to 1.0 regardless of sequence length, so every token you add is competing with every other for a share of that budget. More context doesn't mean more understanding; it means each fact gets a smaller slice. The lost-in-the-middle position bias makes it worse: code injected into the middle of a long prompt is structurally disadvantaged by both RoPE's distance decay and the recency bias that causal pretraining instills. Knowing this changes how you think about the whole problem.

RAG doesn't solve attention dilution — it sidesteps it. Instead of sending everything and hoping the model finds what's relevant, it figures out what's relevant first and sends only that. The context window ends up containing what actually matters for the task: the right type definitions, the right helper functions, the right error handling patterns.

In practice: below roughly 3,000–5,000 lines, context stuffing usually works well enough. Above that, the problems stack up fast. At 50,000+ lines, naive stuffing reliably hurts. At 500,000+ lines, AST chunking, hybrid BM25 + dense retrieval, RRF fusion, and careful prompt injection aren't premature optimization — they're the baseline.

References

Foundational Papers

Vaswani et al. (2017) — Attention Is All You Need. NeurIPS. The original transformer paper introducing scaled dot-product attention.
Liu et al. (2023) — Lost in the Middle: How Language Models Use Long Contexts. Stanford / Berkeley. Empirical study of the U-shaped performance curve and the accuracy drop at mid-context positions.
Hsieh et al. (2024) — Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization. UW / MIT / Google. Proposed calibration method that partially corrects positional attention bias at inference time.
Survey (2025) — Retrieval-Augmented Code Generation: A Survey. Comprehensive survey of RAG approaches specifically for code generation and repository-level tasks.

RAG & Retrieval

Cormack, Clarke & Buettcher (2009) — Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods
Dao et al. (2022) — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
RAG vs Large Context Window: Real Trade-offs for AI Apps — Redis Engineering Blog
Context Window Optimization: Why Ranking, Not Stuffing, Is the Scaling Law for Agents — Shaped AI
RAG for LLM Code Generation using AST-Based Chunking — Vishnudhat Natarajan
Better Retrieval Beats Better Models for Large Codebases — Stéphane Derosiaux

Code Assistants & Architecture

How Cursor Actually Indexes Your Codebase — Towards Data Science
How GitHub Copilot Works — Quastor Engineering
What is Retrieval-Augmented Generation? — GitHub Blog

Hybrid Search & Ranking

BM25 vs Dense Retrieval for RAG: What Actually Breaks in Production — Ranjan Kumar
Hybrid Search: BM25 and Dense Retrieval Combined — Michael Brenndoerfer
