Adam Poulemanos for Knitli

Context Engineering: How We Work Around the Goldfish Problem

Originally published at blog.knitli.com

tl;dr

  • Context engineering is the practice of deciding what information goes into a large language model's (LLM's) context window and when
  • The dominant approach today is summarization: using an LLM to compress context when the window fills up
  • Summarization works well for some tasks but loses critical details in others, forcing agents to re-retrieve the same information repeatedly
  • Other approaches like RAG and fine-tuning exist, each with real tradeoffs
  • Understanding these tradeoffs helps you choose the right tools and know when to trust them

If Context is King, Context Engineering is Kingmaking

In my last post, I explained that LLMs are goldfish. They can only see what fits in their context window, and they forget everything else. I also showed how context poisoning happens when you dump too much irrelevant information into that window, making it harder for the model to find what matters.

So how do engineers actually deal with this? That's where context engineering comes in.

Context engineering is the practice of deciding what information an LLM sees, when it sees it, and how much of it gets included. It's the difference between an AI that gives you useful answers and one that hallucinates or misses obvious details.

The Summarization Approach (What Most Tools Do Today)

Here's how most AI coding tools handle context limits:

The agent starts working on your task. It reads files, makes changes, runs commands, and accumulates history. All of this fills the context window.

When the window approaches its limit—usually around 95% full—the system needs to make room for more.

The standard solution: call another LLM to summarize everything that's happened so far. The summarization LLM gets the entire conversation history and a prompt like "compress this to save tokens." It produces a shortened version, the system discards the original details, and the agent continues with this compressed summary as its only record of what came before.

This is called "auto-compact," "context compression," or "hierarchical summarization," but it's all the same basic idea. Claude Code does it. Cursor does it. Most agent frameworks do it.
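In rough outline, the mechanism looks something like the sketch below. This isn't any particular tool's implementation; `count_tokens` and `call_llm` are hypothetical stand-ins for whatever tokenizer and model API a given tool uses.

```python
# Sketch of an "auto-compact" loop, assuming hypothetical count_tokens
# and call_llm helpers. Not any specific tool's implementation.

CONTEXT_LIMIT = 200_000       # the model's context window, in tokens
COMPACT_THRESHOLD = 0.95      # compress when the window is ~95% full


def count_tokens(messages: list[str]) -> int:
    # Placeholder: a real tool would use the model's own tokenizer.
    return sum(len(m.split()) for m in messages)


def call_llm(prompt: str) -> str:
    # Placeholder for an actual inference call.
    raise NotImplementedError


def maybe_compact(history: list[str]) -> list[str]:
    """Replace the full history with a summary once the window is nearly full."""
    if count_tokens(history) < CONTEXT_LIMIT * COMPACT_THRESHOLD:
        return history  # still room: keep everything verbatim
    summary = call_llm(
        "Compress this conversation to save tokens, keeping key decisions:\n\n"
        + "\n".join(history)
    )
    # The original details are discarded; the summary becomes the only record.
    return ["[summary of earlier work]\n" + summary]
```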

Why is this approach so common? Because it's a reasonable response to a hard constraint. Context windows are finite. Work sessions aren't. Something has to give, and summarization is cheap to implement and works surprisingly well for many tasks.

But it has real limitations.

Where Summarization Works and Where It Doesn't

Summarization works well when:

  • The task is mostly linear (do step A, then B, then C)
  • Earlier details genuinely don't matter once completed
  • The agent won't need to revisit specific information from early in the session

Summarization struggles when:

  • The task involves debugging or iterative refinement
  • The agent needs to compare current state to earlier state
  • Specific details (exact error messages, variable names, code snippets) matter more than general narrative

Here's a concrete example of the second case:

An agent is debugging a function. It reads the function definition, identifies a bug, makes a fix, tests it, sees a new error, and reads the function again to understand the new error.

Then the context window fills up. The system summarizes.

The summary might say: "Fixed bug in calculate_total function, encountered new error."

But it doesn't include the actual function code, the specific error message, or the change that was made. Those details are gone.

Two turns later, the agent needs to understand why the new error is happening. It doesn't have the function code anymore—that got summarized away. So it re-reads the file, re-retrieving context it already had.

This happens often in debugging workflows. Agents spend time and tokens re-reading information they've already seen because summarization discarded the details they need.

It's like taking notes during a meeting by writing "discussed the budget" and then, when someone asks you what the actual numbers were, having to go back and re-watch the recording.

The deeper problem: summarization is lossy in unpredictable ways. The LLM doing the compression has to guess what's important. Sometimes it guesses wrong. When that happens, the agent either fails or has to backtrack and reconstruct context from scratch.

Other Approaches and Their Tradeoffs

Summarization dominates because it's easy and often "good enough." There are other approaches, each with its own pros and cons:

RAG: Retrieval Augmented Generation

RAG treats your codebase (or other data) like a searchable database. It breaks everything into chunks, converts them into numerical representations called embeddings (essentially coordinates in a high-dimensional space where similar content clusters together), and stores them. When the agent needs information, it searches for relevant chunks and adds them to the context.

The appeal: RAG lets you work with massive codebases without loading everything at once. You retrieve only what's relevant for each query.

The tradeoff: The quality of RAG depends entirely on how you implement it. Naive implementations use simple similarity matching—essentially asking "which chunks of text sound most like this query?" This works okay for documentation but breaks down for code. A function definition might have low textual similarity to a query about debugging an error that function causes. Dependencies three files away don't "sound like" the immediate problem, even when they're critical to understanding it.
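To make "simple similarity matching" concrete, here's a minimal sketch of the naive version. The `embed` function is a hypothetical stand-in for whatever embedding model a tool uses; nothing here knows anything about code structure.

```python
# Naive RAG retrieval: rank chunks purely by embedding similarity.
# `embed` is a hypothetical placeholder for a real embedding model.
import math


def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder for a real embedding call


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings sit closest to the query's.

    Note what's missing: no imports, no call graph, no types. A chunk can
    be critical to the query and still rank low because it doesn't "sound
    like" it.
    """
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]
```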

More sophisticated RAG systems understand code structure: they know about function calls, imports, type definitions, and can traverse these relationships. This makes retrieval much more accurate but is significantly harder to build.

The practical result: RAG quality varies enormously between tools. When evaluating a tool that uses RAG, the question isn't "does it use RAG" but "how smart is its retrieval?"

Caching: Remember What You've Already Seen

Some systems cache frequently accessed context so they don't have to re-retrieve or re-process it. If an agent reads the same file five times during a session, caching means you only pay the retrieval cost once.

The appeal: Caching directly addresses the re-retrieval problem that summarization creates.

The tradeoff: Caches take memory. They can become stale if files change. And deciding what to cache (and when to invalidate it) adds complexity.
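As an illustration of both the win and the staleness question, here's a minimal file cache keyed on modification time. It's a sketch, not how any particular tool does it; real systems also have to decide what else to cache (tool output, search results) and when to evict it.

```python
# Sketch of a read cache keyed on file modification time. Staleness is
# handled by re-reading when the mtime changes; eviction is ignored here.
import os

_cache: dict[str, tuple[float, str]] = {}


def read_file_cached(path: str) -> str:
    mtime = os.path.getmtime(path)
    entry = _cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]  # cache hit: no re-read, no re-processing
    with open(path, encoding="utf-8") as f:
        text = f.read()
    _cache[path] = (mtime, text)  # miss or stale entry: refresh
    return text
```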

Agents: Let the Model Search for Itself

Agent systems give the LLM tools to retrieve its own context. Instead of pre-selecting information, you let the model search files, run commands, or call APIs to find what it needs.

The appeal: Agents can adapt. They search for what they need in the moment and course-correct based on what they find.

The tradeoff: Agents are slower and more expensive. Every search is another API call (called an "inference call"), which means more tokens—the basic units that AI providers charge you for—and more compute. Agents also make mistakes: they search for the wrong things, miss obvious information, or get stuck in loops. And because the model has to reason about what to retrieve at each step, the whole process uses tokens fast.
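For a sense of why the cost adds up, here's a sketch of the basic loop: every pass through it is another inference call, and the growing message history rides along each time. `call_llm` and the tool functions are hypothetical placeholders, not any real framework's API.

```python
# Sketch of an agent loop: the model picks a tool, sees the result, and
# decides again. Each iteration is another inference call; the growing
# message history is sent with every one. All names here are hypothetical.

def call_llm(messages: list[dict]) -> dict:
    # Placeholder: returns either {"tool": name, "args": {...}} or
    # {"tool": None, "content": final_answer}.
    raise NotImplementedError


def search_files(query: str) -> str: ...
def run_command(cmd: str) -> str: ...


TOOLS = {"search_files": search_files, "run_command": run_command}


def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # cap steps so a confused agent can't loop forever
        reply = call_llm(messages)  # one more inference call, more tokens
        if reply["tool"] is None:
            return reply["content"]  # the model decided it's done
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "stopped: step limit reached"
```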

Fine-tuning: Bake It Into the Model

Fine-tuning means retraining the model on your specific codebase or domain so it "learns" your patterns and doesn't need them in the context window.

The appeal: Once fine-tuned, the model already "knows" your code. No retrieval needed.

The tradeoff: Fine-tuning is expensive and inflexible. You need GPU time, training data, and constant retraining as your codebase changes. Fine-tuned models also aren't great at specific details—they learn general patterns but still hallucinate function names or recent changes. For fast-moving projects, fine-tuning can't keep up.

The Real Challenge: Context Engineering is a Hard Problem

Good context engineering for coding tasks requires several things that are genuinely difficult:

Understanding code structure: What depends on what? Which files matter for which tasks? How does information flow through the system? This requires parsing and analyzing code, not just treating it as text (a small sketch of what that means follows below).

Dynamic decision-making: Different questions need different context. Understanding what a function does requires different information than debugging why it crashes, which requires different information than refactoring it for performance.

Precision: Pulling the right information without including noise. Every irrelevant token makes it harder for the model to find what matters.
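To make the first of these concrete: even a rough dependency map means parsing code instead of reading it as text. Here's a minimal sketch using Python's standard `ast` module to list what a file imports; real structure-aware retrieval goes much further (call graphs, type relationships, cross-file references).

```python
# Sketch: extract a file's imports with Python's standard ast module.
# This only scratches the surface of "understanding code structure,"
# but it already treats code as code rather than as plain text.
import ast


def imported_modules(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    modules: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules
```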

Most tools make pragmatic tradeoffs here. They use approaches that are cheap to implement and work well enough for common cases, even if they break down on complex tasks. That's not incompetence—it's engineering under constraints.

But it does mean that for complex, real-world work, context engineering is often the limiting factor. Not model capability. Not prompt quality. Whether the model has the right information to work with.

The Hidden Costs of Poor Context Engineering

When context engineering breaks down, the costs show up in three places:

Money: Every token you process costs money. When you re-retrieve the same information repeatedly, you're paying to process those tokens over and over. When you include irrelevant context "just in case," you're paying for all of it. For teams using AI at scale, this can significantly increase infrastructure costs.

Speed: Processing large contexts takes time. The more tokens you feed the model, the longer it takes to respond. When agents have to search repeatedly for information they've already seen, tasks stretch out.

Reliability: When the model has to work with lossy summaries or sift through irrelevant information, it makes mistakes. It latches onto the wrong details, misses important nuance, or hallucinates. This is why AI coding tools sometimes confidently suggest fixes that break your code or miss bugs that are obvious if you have the right context.

What You Can Do About It

If you're using AI coding tools, here are some practical things to keep in mind:

Watch for re-retrieval patterns. If you notice an agent reading the same file multiple times in a session, that's a sign that context is being lost. Some tools handle this better than others.

Match tools to tasks. Summarization-based tools work fine for straightforward, linear tasks. For debugging or iterative work, look for tools with smarter context management.

Ask about context strategy. When evaluating AI coding tools, ask: How do they handle long sessions? What happens when the context window fills up? Do they use RAG, and if so, how sophisticated is the retrieval?

Keep sessions focused. Shorter, focused sessions are less likely to hit context limits than sprawling multi-hour sessions. If you're doing complex work, sometimes starting fresh with targeted context is more effective than continuing a bloated session.

Provide explicit context. Don't assume the tool will find what it needs. If you know a specific file or function is relevant, mention it directly.

How I'm Trying to Fix It

With Knitli, I'm working on context engineering that understands code structure—tracking dependencies, call graphs, repository patterns, and type relationships so retrieval is precise and adaptive rather than approximate and sweeping. My goal: assemble exactly the context each task needs, avoiding both the re-retrieval problem and context pollution.

My first attempt at that is CodeWeaver, which you can try today. It's rough around the edges and doesn't fully achieve that goal yet, but it attacks the problem more directly than existing tools do. It's also fully open source and free.

I'm not claiming I've solved context engineering. It's a genuinely hard problem. But I think current approaches leave a lot of room for improvement, and I'm focused on closing that gap.

If you're interested in following along, you can learn more at knitli.com, or try CodeWeaver and get involved in making something better.


What context engineering challenges have you run into with AI coding tools? I'd love to hear your experiences in the comments.
