Notes from building infrastructure for 17,000+ LLMs
One of the promises of modern AI infrastructure is simple:
You should be able to switch models whenever you want.
Different models have different strengths. Some are faster. Some are cheaper. Some reason better. Some support large context windows.
In theory, you route requests dynamically and get the best of each.
In practice, something breaks almost immediately.
Context windows don’t match.
The Moment Everything Breaks
Imagine this common scenario:
A conversation begins on a large-context model. Maybe one with a 128k context window.
The system prompt is fairly large.
The user has been chatting for a while.
Tools have been called.
A RAG system has pulled in documents.
Everything works.
Then your router decides to switch to a smaller model. Maybe for latency or cost reasons.
Suddenly the entire state no longer fits.
The request fails or the model behaves unpredictably.
This happens because the model’s context window is not just holding messages. It contains the entire runtime state:
- system prompts
- recent conversation turns
- tool calls and tool outputs
- RAG results
- web search context
- other metadata
When you exceed the limit, something has to give.
Most teams end up writing custom logic to handle this:
- truncating older messages
- prioritizing certain content
- summarizing conversation history
- trying to prevent context overflow
This logic grows quickly and often becomes fragile.
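That ad-hoc logic often looks something like the naive sketch below: drop the oldest messages until the conversation fits. This is a hypothetical helper, and the per-message token counts are assumed to be precomputed.

```python
# A sketch of the ad-hoc truncation logic many teams end up writing.
# Hypothetical helper; per-message token counts assumed precomputed.
def truncate_to_fit(messages, system_prompt_tokens, context_limit):
    """Drop the oldest messages until the conversation fits the window."""
    budget = context_limit - system_prompt_tokens
    kept, used = [], 0
    # Walk newest-to-oldest, keeping whatever still fits.
    for msg in reversed(messages):
        if used + msg["tokens"] > budget:
            break
        kept.append(msg)
        used += msg["tokens"]
    return list(reversed(kept))
```

The fragility shows up fast: this silently discards tool outputs and RAG context along with chit-chat, which is exactly why it grows into priority rules and summarization.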
We ran into this problem while building Backboard, which currently routes across 17,000+ LLMs.
So we built a system to handle it automatically.
The Core Idea: Treat Context Like a Budget
The approach we landed on was surprisingly simple.
Instead of filling the entire context window with raw state, we reserve a portion of it as a stable budget.
When a request is routed to a model, we allocate the context window like this:
- ~20% reserved for raw state
- ~80% available for summarization
The system calculates how many tokens fit inside that 20% allocation.
Within that space we prioritize the most important live inputs:
- system prompt
- most recent messages
- tool calls, RAG results, web search context
Everything else becomes eligible for summarization.
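The budget split above can be sketched in a few lines. The exact ratio, priority ordering, and component names here are illustrative assumptions, not the production implementation.

```python
# A minimal sketch of the context-budget split described above.
# The 20% ratio and the component ordering are illustrative assumptions.
RAW_STATE_RATIO = 0.20  # share of the window reserved for raw (verbatim) state

def allocate_budget(context_limit, components):
    """Fit prioritized components into the raw-state budget.

    `components` is a list of (name, tokens) pairs, ordered highest
    priority first (system prompt, recent messages, tool/RAG/search
    context). Whatever does not fit becomes eligible for summarization.
    """
    raw_budget = int(context_limit * RAW_STATE_RATIO)
    raw, to_summarize, used = [], [], 0
    for name, tokens in components:
        if used + tokens <= raw_budget:
            raw.append(name)
            used += tokens
        else:
            to_summarize.append(name)
    return raw, to_summarize
```

For an 8k-window model the raw budget is 1,600 tokens, so a 500-token system prompt and 800 tokens of recent messages stay verbatim while a 600-token RAG blob gets queued for compression.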
The Summarization Strategy
Once the system identifies which parts of the state cannot fit directly into the context window, it compresses them.
We designed the summarization pipeline around a simple rule:
First try summarizing using the target model.
If the summary still does not fit, fall back to the larger model the conversation previously ran on to generate a more efficient summary.
This helps preserve as much information as possible while guaranteeing the final prompt fits inside the model’s context window.
All of this happens automatically in the runtime.
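The fallback rule can be sketched as follows. `summarize` and `count_tokens` are hypothetical callables standing in for the real model calls and tokenizer; only the control flow mirrors the rule described above.

```python
# A hedged sketch of the fallback rule: summarize with the target model
# first; if the result still overflows, retry with the larger model the
# conversation previously ran on. `summarize` and `count_tokens` are
# hypothetical callables, not a real API.
def compress(state_text, budget, target_model, source_model,
             summarize, count_tokens):
    summary = summarize(target_model, state_text, budget)
    if count_tokens(summary) <= budget:
        return summary
    # Fall back to the larger source model for a tighter summary.
    return summarize(source_model, state_text, budget)
```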
Avoiding Hard Context Failures
One of our goals was to make context exhaustion extremely rare.
Because the system runs continuously during requests and tool calls, the state is reshaped before the context window is fully consumed.
In practice this means applications rarely hit the absolute context limit of a model.
Developers do not have to constantly monitor token counts or worry about prompt overflow.
Making Context Usage Observable
Even though the system runs automatically, we wanted developers to see what was happening.
So we added context metrics directly to the API response.
Example:
"context_usage": {
"used_tokens": 1302,
"context_limit": 8191,
"percent": 19.9,
"summary_tokens": 0,
"model": "gpt-4"
}
This makes it easy to track:
- how much context is being used
- when summarization happens
- how close you are to a model's limit
- which model processed the request
For production systems, this visibility is useful for debugging and optimization.
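Consuming those metrics is straightforward. The field names below come from the example response; the warning threshold is an arbitrary choice for illustration.

```python
# Illustration of reading the context_usage metrics from a response
# payload. Field names match the example above; the 85% warning
# threshold is an arbitrary choice.
def check_context(response, warn_at=0.85):
    usage = response["context_usage"]
    fraction = usage["used_tokens"] / usage["context_limit"]
    if usage["summary_tokens"] > 0:
        print(f"summarization kicked in on {usage['model']}")
    if fraction >= warn_at:
        print(f"warning: {fraction:.0%} of {usage['model']} window used")
    return fraction
```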
Why We Think This Belongs in Infrastructure
A lot of AI applications now route between multiple models depending on cost, latency, or capability.
But context window management often ends up as application code.
Our view was that this is infrastructure responsibility, not application responsibility.
Developers should be able to move between models freely without rebuilding state management every time.
Adaptive Context Management
We ended up calling this system Adaptive Context Management.
Its job is simple:
Ensure the conversation state always fits the model being used.
No prompt surgery.
No manual truncation logic.
No context window surprises.
As AI systems move toward multi-model architectures, context management becomes one of the most important reliability problems.
Different models will always have different limits.
The goal is to make those differences invisible to developers.
If you are curious about the architecture behind this or how we tested summarization quality, ask away. I'd also love to hear how others are approaching context management in multi-model systems.
Adaptive Context Management is now available in Backboard and automatically enabled for users.