What is Context and Why Does it Matter?
Everything a running model is aware of is its context: the messages from you, its previous responses, tool calls, tool responses, and even system and developer messages. Every interaction with a model grows its context.
Context is a scarce resource. Each model has a context limit measured in tokens – the numbers can seem large but are often deceptive. Put a few MCP tool calls in there and what seems like an endless context can fill up rapidly. Once it’s full, no more messages can be sent to the model.
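To get a feel for how quickly tokens accumulate, you can count them yourself. Here’s a minimal sketch using OpenAI’s tiktoken library – the encoding name is an assumption, so pick the one that matches your model:

```python
# Minimal token-counting sketch using tiktoken (pip install tiktoken).
# The encoding name below is an assumption -- match it to your model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return how many tokens this text will consume in context."""
    return len(enc.encode(text))

print(count_tokens("Put a few MCP tool calls in there and watch it fill."))
```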
If you’re building any serious AI application you will eventually fill your context, and you’ll have to decide what happens next. The model is only aware of what’s in its context – start a new one from scratch and it knows nothing of the conversation to date.
Understanding context is the difference between playing with AI wrappers and truly building something that lasts.
What Impacts Context
Most people underestimate what impacts context and often forget all the invisible parts that quickly flood in.
Input and output messages fill up context. A budget of perhaps 300,000 words may seem like a lot, but if you’re getting ChatGPT to edit a 5,000-word article, each round costs roughly 10,000 words (your draft in, the edited draft out) – just 30 back-and-forth full drafts and the whole context is full.
Now imagine you’re developing an agent that’s scraping websites and receiving JSON API responses that are several KB each. Without management your agent is going to grind to a halt quickly.
Uploaded files count towards your context too – depending on the model, the full file or only part of it may be counted.
Any model that is given access to tools will create at least two context messages for each use – one to call the tool and one for the response. How much data those carry depends on the tool, but with average HTML page sizes now sitting around 100 KB, scraping a few websites will add a lot to your context.
Every message is also wrapped in a model-specific structure defining its type and role. This all eats into your context as well.
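As an illustration, here’s roughly what that wrapping looks like in an OpenAI-style chat payload – field names vary by provider, so treat this as indicative rather than exact:

```python
# Illustrative only: an OpenAI-style message list. Every role marker,
# key name and tool-call envelope below costs tokens on every request.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Scrape https://example.com for me."},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "fetch_page",  # hypothetical tool
                      "arguments": '{"url": "https://example.com"}'}},
    ]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": "<html>...100 KB of markup...</html>"},
]
```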
If you’ve got several tools available, the orchestrator has to describe each of them to the model. That’s more context.
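Those descriptions are typically JSON Schema definitions sent with every single request. A hedged illustration of what just one tool definition might look like (the tool itself is made up):

```python
# Illustrative only: one tool definition in the JSON Schema style most
# providers use. Multiply this by every tool you register.
tools = [{
    "type": "function",
    "function": {
        "name": "fetch_page",  # hypothetical tool
        "description": "Fetch the HTML of a web page by URL.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "The page URL"},
            },
            "required": ["url"],
        },
    },
}]
```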
This all adds up very quickly.
So how do you deal with it when your context fills up?
Memories are Not Directly Context
The concepts of memory and context are separate. Memory is retrieval; context is execution. Many AI systems (but importantly not the models themselves) contain memory systems, which allow a model to access information that was discussed with it previously. These sit outside the model and can be thought of as a datastore.
In some cases, models may explicitly say “give me a memory about this” whilst in other instances the orchestrator may detect the need for a memory to be present and provide it to the model.
In either case the relevant memories end up in the context (either as system messages or tool results) – but the memory system itself lives outside it. So memories do impact context – but only in the same way as other system or tool messages.
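To make the separation concrete, here’s a toy sketch: the datastore lives outside the model, and only recalled memories are written into the context. The keyword-overlap retrieval is purely for illustration – real systems typically use embeddings:

```python
# A toy memory store: retrieval lives outside the model, and only the
# matching memories are injected into context as a system message.
class MemoryStore:
    def __init__(self) -> None:
        self.memories: list[str] = []

    def remember(self, fact: str) -> None:
        self.memories.append(fact)

    def recall(self, query: str, limit: int = 3) -> list[str]:
        # Naive keyword overlap keeps the sketch short.
        words = set(query.lower().split())
        scored = [(len(words & set(m.lower().split())), m)
                  for m in self.memories]
        return [m for score, m in sorted(scored, reverse=True)[:limit]
                if score > 0]

def with_memories(store: MemoryStore, messages: list[dict],
                  query: str) -> list[dict]:
    recalled = store.recall(query)
    if not recalled:
        return messages
    memory_msg = {"role": "system",
                  "content": "Relevant memories:\n" + "\n".join(recalled)}
    return [memory_msg] + messages
```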
Compaction is a Blunt Tool
Context compaction takes the entire context and summarises it. When the context is almost full, the orchestrator hands the whole thing to another model and says ‘create a summary of this context in a few thousand words’. If you use Claude Code you’ll see this happen quite often, and you can view the generated summary. A compaction will often take up more than 20% of your new context.

Image: Claude Code – compaction allocated 22.5% of entire context
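In code, a blunt compaction pass might look something like this sketch – summarise() stands in for whatever call your orchestrator makes to the summarising model, and the threshold is an assumption:

```python
# A blunt compaction pass: once usage nears the limit, replace
# everything after the system messages with a single summary.
COMPACT_THRESHOLD = 0.9  # assumption: compact at 90% of the window

def compact(messages: list[dict], used_tokens: int, window: int,
            summarise) -> list[dict]:
    if used_tokens < COMPACT_THRESHOLD * window:
        return messages  # still plenty of room
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # summarise() is assumed to call another model with something like
    # "Create a summary of this context in a few thousand words".
    summary = summarise(rest)
    return system + [{"role": "assistant", "content": summary}]
```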
This works – but it’s a blunt tool. Compaction optimises for space, not intent. Specific messages won’t be present, fine detail is flattened, and key knowledge may be lost. It’s up to another AI model to determine what it thinks is relevant – everything else is gone.
This gets worse when you fill your context window up again (now effectively 20% smaller than before): the next compaction summarises both the original summary and the new messages. Earlier information gradually becomes a copy of a copy.
If you’re building a long-running app do you really want your agent to forget what it’s been working towards?
Trim or Roll Your Context
Trimming the context is the more surgical approach: the orchestrator dynamically decides how to handle individual messages within the context.
You could remove older tool calls from the context entirely, or delete bulky payloads (such as big API responses).
If a large part of the context is filled with system messages you may dynamically adapt them (such as removing unused memories).
The exact way to trim context depends on what you are doing with your agent. Some context may be critical and should never be trimmed but other context may be irrelevant moments later.
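As a sketch, one trimming pass might look like this – it assumes old tool responses are safe to truncate and everything else should be kept, which will not be true for every agent:

```python
# Surgical trimming: truncate old, bulky tool responses while leaving
# the rest of the conversation intact.
MAX_TOOL_RESULT = 500  # assumed per-message character budget

def trim(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    trimmed = []
    cutoff = len(messages) - keep_recent  # recent messages stay whole
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and i < cutoff:
            body = msg["content"] or ""
            if len(body) > MAX_TOOL_RESULT:
                msg = {**msg,
                       "content": body[:MAX_TOOL_RESULT] + " …[trimmed]"}
        trimmed.append(msg)
    return trimmed
```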
Any serious app is going to need to implement some form of context trimming otherwise you’ll end up with agents that can’t run for extended periods of time without losing their focus and purpose.
A related technique is a rolling context: a first-in-first-out approach where old messages age out as the context fills up. Think of rolling as natural forgetting and trimming as surgical-precision editing of what the model remembers.
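A rolling window is straightforward to sketch – a FIFO queue where the oldest non-system messages fall out first. This assumes a count_tokens() helper like the tiktoken one earlier:

```python
# A rolling window: evict the oldest non-system messages until the
# context fits the budget. System messages are pinned.
def roll(messages: list[dict], budget: int) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: list[dict]) -> int:
        return sum(count_tokens(m.get("content") or "") for m in msgs)

    while rest and total(system + rest) > budget:
        rest.pop(0)  # first in, first out
    return system + rest
```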
You can combine rolling and trimming together. You could even combine rolling, trimming and compaction if it makes sense in your context.
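One plausible way to combine them, reusing the trim(), roll() and compact() sketches above: trim first because it is cheap and surgical, roll next, and compact only as a last resort since it loses the most detail:

```python
# Sketch of a combined strategy: cheapest, least destructive first.
def manage_context(messages: list[dict], budget: int,
                   summarise) -> list[dict]:
    messages = trim(messages)           # 1. surgical cuts
    messages = roll(messages, budget)   # 2. age out old messages
    used = sum(count_tokens(m.get("content") or "") for m in messages)
    return compact(messages, used, budget, summarise)  # 3. last resort
```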
Don’t Just Throw Everything in There
In your own system, the only thing going into your context is what you put in it. If you add 100 MCP servers that each expose 20 tools, your model is going to have a huge amount of its context filled with JSON schema definitions that don’t actually help you.
Is it better to provide hundreds of tools to your agent, or a handful of perfectly crafted tools that need only minimal input and produce minimal output? Does the model need all those properties on that gigantic object, or just a couple?
How does your agent actually need to use the content? Does it need to read a whole file? If not, give it a tool to search within the file and never load the full file into context.
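For example, instead of a read_file tool that dumps an entire file into context, you might expose a narrow search tool that returns only the matching lines – the function here is hypothetical:

```python
import re

# A narrow tool: return only matching lines (plus a little surrounding
# context), never the whole file.
def search_file(path: str, pattern: str, context_lines: int = 1) -> str:
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    hits = []
    for i, line in enumerate(lines):
        if re.search(pattern, line):
            lo, hi = max(0, i - context_lines), i + context_lines + 1
            hits.append("\n".join(f"{n + 1}: {text}" for n, text
                                  in enumerate(lines[lo:hi], start=lo)))
    return "\n---\n".join(hits) or "no matches"
```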
In our orchestrator we treat tool design as an integral part of agentic and context design. I’ve learned the hard way that shrinking the surface area of what we expose makes a big difference.
Do your system prompts need to be the size of a small novel? Probably not.
Doing a different task than you were a minute ago that doesn’t need full continuity? Wipe your context and start afresh.
Be smarter with context. Any technique to tidy up your context – whether trimming or compacting – ultimately degrades an agent’s working memory, and that’s never as good as just not filling your context in the first place. Give a model just what it needs and no more.
Context Impacts Performance
Some models produce their best output with a certain amount of their context filled, and quality degrades beyond that point. Others simply slow down as context grows. Each model is different.
I find that Anthropic’s Opus model gives its best outputs at around 60-75% of the context window. The first half can be a slog getting it up to speed and the final part feels like it has a tendency to drift.
Plan your context around the task and the models you’re using.
The Real-World Cost
Developers don’t often like to think about cost – but your context directly drives it. There is a cost for every token sent to a model and received from one, and each time you send a request the entire previous context is sent with it.
Many providers offer prompt caching: if a follow-up call starts with the same tokens as a previous request, those tokens are billed at a lower rate than non-cached tokens.
A giant context costs a lot of tokens, but much of it may be cached.
A constantly trimmed context may be smaller in token count, but it will rarely be cached because the bulk of it changes every time.
A compacted context, where the summary message sits immediately after the system messages, provides a good basis for cache optimisation – but it grows back quickly, and the lost detail may force extra round-trips as the model re-establishes what it forgot.
Context size does not necessarily equate to cost. Everything is a trade-off of:
- Number of tokens in the context
- Proportion of cacheable context
- Whether a shorter context results in the model ultimately requiring more context than it would otherwise
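A back-of-the-envelope calculation makes the trade-off concrete. The prices below are placeholders, not any provider’s real rates:

```python
# Illustrative pricing only -- substitute your provider's real rates.
PRICE_IN = 3.00 / 1_000_000      # $ per uncached input token (assumed)
PRICE_CACHED = 0.30 / 1_000_000  # $ per cached input token (assumed)

def request_cost(total_tokens: int, cached_tokens: int) -> float:
    uncached = total_tokens - cached_tokens
    return uncached * PRICE_IN + cached_tokens * PRICE_CACHED

# A giant but mostly-cached context vs a small, never-cached one:
print(request_cost(150_000, cached_tokens=140_000))  # ≈ $0.072
print(request_cost(40_000, cached_tokens=0))         # ≈ $0.120
```

Under these made-up rates, the bigger context is actually the cheaper request – which is exactly why cacheability belongs in the trade-off.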
You should optimise on a case-by-case basis, blending compaction, rolling and trimming where each makes sense.
How you structure your context will have a direct impact on your monthly AI bill and the difference can be staggering.
Next Time…
I’ll be following up shortly with part two taking a look at a real-world example of how I’ve used a combination of all the techniques detailed here to build a software testing AI agent that can run for hours without exhausting context or drifting off-task.