How we handle LLM context window limits without losing conversation quality

Every developer building on LLMs hits the same wall eventually. Your chatbot works beautifully for the first 10 turns, then starts forgetting things. Your agent ran a 30-step workflow and lost track of the original goal halfway through. Your RAG system stuffed so much context into the prompt that response quality dropped.

This is the context window problem, and it does not go away by switching to a model with a bigger window. We learned this the hard way while building an AI assistant for a travel booking platform. This post covers the strategies we actually use in production, with the trade-offs we hit.

Why bigger context windows are not the answer

Claude 3.5 Sonnet has a 200K token window. GPT-4o has 128K. Gemini 1.5 Pro has up to 2M. The temptation is to just throw everything in.

Three problems with that approach.

First, cost. Input tokens are not free. At 2M tokens per call, you are spending significant money on every request even before the model generates anything.

Second, latency. Processing a 200K-token prompt takes meaningfully longer than a 10K-token one. For a chat interface, this is the difference between instant and sluggish.

Third, and most importantly, quality degrades with length. Research from Anthropic and others has consistently shown that models pay less attention to content in the middle of very long contexts. This is called the "lost in the middle" problem. A fact placed at token 80,000 of a 150,000-token context has a real chance of being ignored.

So the question is not "how do we fit everything," it is "what actually needs to be in the prompt right now."

The four strategies we use

We combine four techniques depending on the use case. None of these are novel individually. The value is in knowing when to use which.

1. Sliding window with summarization

For chatbots and conversational agents, we keep the last N turns verbatim and summarize everything older. The key design decision is how often to summarize.

from typing import List
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str
    tokens: int

RECENT_TURNS = 6
SUMMARIZE_THRESHOLD = 20

# `summarize` is an LLM helper (defined elsewhere) that folds the evicted
# messages into the existing running summary.
def manage_context(messages: List[Message], summary: str) -> tuple[List[Message], str]:
    if len(messages) <= SUMMARIZE_THRESHOLD:
        return messages, summary

    # Keep the last N turns raw
    recent = messages[-RECENT_TURNS:]
    to_summarize = messages[:-RECENT_TURNS]

    # Incremental summarization: feed old summary + new messages
    new_summary = summarize(
        existing_summary=summary,
        new_messages=to_summarize
    )
    return recent, new_summary

We trigger summarization when the conversation exceeds 20 turns, not on every turn. Summarizing every turn is wasteful and introduces quality drift because you are summarizing summaries of summaries.

The trade-off: summaries lose specificity. If a user mentioned "I prefer aisle seats near the front" on turn 3 and you compressed that into "user discussed seat preferences" on turn 25, the agent may forget the actual preference. We mitigate this with strategy #3 below.
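The `summarize` helper above is left abstract. A minimal sketch of the incremental-summarization prompt it might build (the actual LLM call is omitted; the wording here is illustrative, not our production prompt):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    role: str
    content: str
    tokens: int

def build_summary_prompt(existing_summary: str, new_messages: List[Message]) -> str:
    """Build the incremental prompt: old summary + the turns being evicted."""
    transcript = "\n".join(f"{m.role}: {m.content}" for m in new_messages)
    return (
        "You maintain a running summary of a conversation.\n"
        f"Current summary:\n{existing_summary or '(none)'}\n\n"
        "New messages to fold into the summary:\n"
        f"{transcript}\n\n"
        "Return an updated summary. Preserve concrete facts "
        "(names, dates, stated preferences) verbatim."
    )
```

Feeding the old summary back in is what makes this incremental: each summarization pass sees one compact summary plus raw messages, rather than a chain of summaries of summaries.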

2. Relevance-based retrieval instead of full history

For long-running agents that make many tool calls, we do not send the entire tool call history back on every step. Instead, we embed each prior action and its result, and retrieve only the top-k most relevant to the current step.

import numpy as np

# `embed` and `Step` come from the surrounding codebase; `cosine_similarity`
# is e.g. sklearn.metrics.pairwise.cosine_similarity over 2-D arrays.
def build_agent_context(current_goal: str, all_steps: List[Step], k: int = 5) -> List[Step]:
    # Embed the current goal
    query_embedding = embed(current_goal)

    # Embed each step's action and result
    step_embeddings = [embed(f"{s.action}: {s.result}") for s in all_steps]

    # Retrieve top-k most relevant prior steps
    scores = cosine_similarity([query_embedding], step_embeddings)[0]
    top_k_indices = np.argsort(scores)[-k:]

    # Re-sort chronologically so the model sees steps in order
    relevant_steps = [all_steps[i] for i in sorted(top_k_indices)]
    return relevant_steps

This works well when agent steps are semantically diverse. It works poorly when every step is similar, because the embeddings cluster too tightly. For those cases we fall back to the sliding window.
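One way to detect the tight-clustering case at runtime is to check the spread of similarity scores and fall back to recency when retrieval has nothing to discriminate on. A self-contained sketch (the `min_spread` threshold of 0.05 is an illustrative number, not a tuned value):

```python
from typing import List, Sequence
import math

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def select_steps(query_emb: Sequence[float],
                 step_embs: List[Sequence[float]],
                 steps: List[str],
                 k: int = 5,
                 min_spread: float = 0.05) -> List[str]:
    """Relevance retrieval when scores discriminate; otherwise sliding window."""
    scores = [cosine(query_emb, e) for e in step_embs]
    if max(scores) - min(scores) < min_spread:
        # Embeddings cluster too tightly: fall back to the last k steps
        return steps[-k:]
    ranked = sorted(range(len(steps)), key=lambda i: scores[i], reverse=True)[:k]
    # Return selected steps in chronological order
    return [steps[i] for i in sorted(ranked)]
```

The fallback keeps behavior predictable: when every prior step looks equally "relevant," recency is a more honest signal than near-identical similarity scores.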

3. Structured memory for facts that must not be lost

Some information cannot be lost to summarization. User preferences, confirmed bookings, authentication context, critical constraints. We extract these into a structured memory object that travels with every prompt.

structured_memory = {
    "user_profile": {
        "name": "extracted_from_conversation",
        "preferences": ["aisle seat", "non-smoking", "high floor"],
    },
    "session_state": {
        "current_booking": {"destination": "Tokyo", "dates": "2026-06-12 to 2026-06-20"},
        "confirmed_steps": ["flight_selected", "hotel_searched"],
    },
    "hard_constraints": ["budget: $3000 max", "must arrive before June 14"]
}

The LLM does not write to this object freely. We use a dedicated extraction step after each turn, with a structured output schema, to pull out facts. This gives us deterministic memory instead of relying on the model to remember.
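A stdlib sketch of that extract-validate-retry loop (in production we validate against a Pydantic schema; `call_extractor` is an abstract stand-in for the cheap-model call, injected as a callable, and the key set mirrors the memory object above):

```python
import json
from typing import Any, Callable, Dict

REQUIRED_KEYS = {"user_profile", "session_state", "hard_constraints"}

def extract_memory(turn_text: str,
                   call_extractor: Callable[[str], str],
                   max_retries: int = 2) -> Dict[str, Any]:
    """Run the extraction model, validate the JSON, retry with a stricter prompt."""
    prompt = f"Extract user facts from this turn as JSON:\n{turn_text}"
    for attempt in range(max_retries + 1):
        raw = call_extractor(prompt)
        try:
            data = json.loads(raw)
            # Minimal schema check: must be an object with the expected keys
            if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
                return data
        except json.JSONDecodeError:
            pass
        # Validation failed: tighten the prompt and try again
        prompt = ("Return ONLY valid JSON with exactly these top-level keys: "
                  f"{sorted(REQUIRED_KEYS)}.\nTurn:\n{turn_text}")
    raise ValueError("extraction failed after retries")
```

Raising after exhausted retries matters: a turn with no extraction is recoverable on the next turn, while silently merging malformed output corrupts memory for the rest of the session.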

The Anthropic prompt caching documentation is worth reading if you go this route, because a stable memory block at the start of your prompt is an ideal cache target.
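As a sketch of what that looks like with the Anthropic Messages API (field names follow the prompt caching docs as we understand them; the model name is illustrative, so verify against the current API reference), the stable memory block goes first and carries a `cache_control` marker:

```python
import json

def build_request(system_prompt: str, structured_memory: dict, user_message: str) -> dict:
    """Place stable content first so the cached prefix is identical across turns."""
    return {
        "model": "claude-3-5-sonnet-latest",  # illustrative model name
        "max_tokens": 1024,
        "system": [
            # Stable across turns -> ideal cache prefix
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": "Known facts:\n" + json.dumps(structured_memory, indent=2),
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Only this part changes turn to turn
        "messages": [{"role": "user", "content": user_message}],
    }
```

The ordering is the point: anything that changes per turn (the user message, retrieved context) must come after the cached block, or the cache prefix never matches.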

4. Context compression for large retrieved documents

For RAG systems retrieving long documents, we compress before injection. Instead of pasting a 5000-word document into the context, we run a fast model (Haiku or GPT-4o-mini) to extract only the passages relevant to the user's query.

This is a two-model pipeline:

  1. Retrieval returns top-k documents (often 3-5 long docs)
  2. A fast, cheap model extracts relevant sections from each
  3. The main model sees only the compressed, relevant content

The extra inference call adds ~200ms of latency but typically reduces main prompt size by 70-85%. Net cost is lower and quality is usually higher because the main model is not distracted by irrelevant content.
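The pipeline can be sketched as a small function with the cheap-model call injected as a callable (`extract_relevant` is an assumption standing in for the Haiku/GPT-4o-mini extraction step):

```python
from typing import Callable, List

def compress_for_rag(query: str,
                     documents: List[str],
                     extract_relevant: Callable[[str, str], str]) -> str:
    """Two-model pipeline: a cheap model trims each doc before the main prompt."""
    compressed = []
    for doc in documents:
        excerpt = extract_relevant(query, doc)
        if excerpt.strip():  # drop documents with nothing relevant
            compressed.append(excerpt)
    # Separators keep document boundaries visible to the main model
    return "\n\n---\n\n".join(compressed)
```

Dropping empty extractions is a free win: a retrieved document with no relevant passages should not occupy any of the main prompt at all.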

When each strategy fails

Let's be specific about failure modes, because this is where blog posts usually wave their hands:

  • Sliding window fails when users reference something from far back in the conversation ("like that restaurant I mentioned earlier"). Always pair with structured memory.
  • Relevance retrieval fails when the current step has no good semantic overlap with prior relevant steps. For example, if step 30 needs information from step 2 but they use completely different vocabulary, retrieval misses it.
  • Structured memory fails when the extraction step produces low-quality outputs. Garbage in, garbage out. We validate extractions against a Pydantic schema and retry with a stricter prompt on validation failure.
  • Context compression fails when the query is ambiguous. If the user asks "tell me more about that," the compression model has no way to know what "that" refers to. We rewrite the query using recent conversation context before passing it to compression.
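The query rewrite in that last bullet is a small pre-step before compression. A sketch of the prompt we hand the rewriting model (the LLM call itself is left abstract, and the wording is illustrative):

```python
from typing import List, Tuple

def build_rewrite_prompt(query: str, recent_turns: List[Tuple[str, str]]) -> str:
    """Resolve references like "that" against recent conversation before compression."""
    history = "\n".join(f"{role}: {text}" for role, text in recent_turns)
    return (
        "Rewrite the user's question so it is self-contained, resolving any "
        "pronouns or references using the conversation below.\n\n"
        f"Conversation:\n{history}\n\n"
        f"Question: {query}\n"
        "Self-contained question:"
    )
```

Only the last few turns are needed here; the rewriter's job is reference resolution, not summarization, so a short recency window is enough.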

What changed when we combined all four

Before we had a structured context strategy, a 50-turn conversation in our travel agent would produce noticeably worse responses by turn 40. Users would need to re-state preferences. The agent would propose options the user had already rejected.

After combining sliding window + relevance retrieval + structured memory:

  • Average tokens per request dropped from ~18,000 to ~6,500, a 64% reduction
  • User-reported "the AI forgot what I said" complaints dropped significantly in internal testing
  • Response latency p95 improved from 4.2s to 2.1s

One thing we did not improve: cost per successful conversation. The reduction in tokens was offset by the extra inference calls for summarization and extraction. What we got was better quality at roughly the same cost, which for a production agent is the right trade.

Wrapping up

The context window is a constraint to design around, not a capacity to fill. A model with 2M tokens gives you more runway, but if you depend on stuffing everything in, your quality will still degrade and your costs will still climb.

Start with a sliding window for recent turns, structured memory for facts that matter, and retrieval for everything in between. Compression is the advanced move once the basics are in place.

If you are working on production AI systems and want deeper context on multi-step agent design, we have written previously about AI agent fallback chains and human-in-the-loop patterns, which pair well with this post. For background reading, Greg Kamradt's Needle in a Haystack benchmarks are a good way to see context window degradation empirically.


I work on AI platform engineering at Adamo Software, where we build custom AI systems for travel, healthcare, and enterprise clients.
