Deva

Posted on Jun 4 • Originally published at arihantdeva.com

Context Window Management: Tactics That Survive Real Sessions

#ai #programming #devtools #llm

The Illusion of Infinite Context: Effective vs. Nominal Limits

Large language models advertise massive context windows, but the practical limit you experience in a real session is often far smaller. The nominal limit is the maximum number of tokens the model can accept in a single request; for Claude 3.5 Sonnet that number is 200,000 tokens. In practice, each request carries overhead, system messages, and tokenization quirks that shrink the usable space. Moreover, the model’s attention mechanism degrades as the window fills, so the quality of responses drops long before the hard limit is reached.

The difference between nominal and effective limits matters because developers tend to design prompts that assume the full window is available for user data. When the window is partially occupied by system instructions, prior conversation history, or tool output, the remaining capacity for new content can be only a few thousand tokens. That mismatch leads to truncation, loss of important context, and unexpected failures in downstream logic.

The underlying mechanism is simple: the model processes a linear sequence of token embeddings. Every token attends to every other token, so the attention score matrix grows quadratically with the sequence length. As the sequence approaches the nominal limit, the attention computation becomes more expensive and the model’s ability to keep earlier tokens salient diminishes. The result is a soft degradation of context relevance that is not reflected in the token count alone.

A concrete example illustrates the gap. Suppose a chatbot maintains a 10‑turn conversation, each turn averaging 150 tokens, and adds a system prompt of 500 tokens. That consumes 2,000 tokens. If the developer then appends a user request of 1,500 tokens, the total reaches 3,500 tokens, well under the nominal limit. However, the model tends to allocate most of its focus to the most recent 1,500 tokens, and the earlier turns may receive far less weight than their token count suggests. The engineer should therefore prune or summarize older turns so that the constraints that still matter stay prominent within a few thousand tokens.

Understanding this illusion helps you avoid hidden token overrun bugs. Track the token budget of every component, reserve headroom for the model’s internal processing, and design a fallback strategy such as summarization or selective retrieval when the budget is exceeded. By treating the nominal limit as an upper bound rather than a usable capacity, you can build systems that remain reliable as sessions grow.
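The budgeting advice above can be sketched as a small tracker. This is a minimal illustration, not a prescribed design: the TokenBudget class, its field names, and the 4,096-token output reserve are assumptions made for the example, and the per-component counts are the ones from the 10-turn scenario.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    """Hypothetical helper that tracks token use per prompt component."""
    nominal_limit: int      # advertised context window
    reserved_output: int    # headroom kept free for the model's reply (assumed value)
    components: dict = field(default_factory=dict)

    def add(self, name: str, tokens: int) -> None:
        # Accumulate tokens under a named component (system prompt, history, ...)
        self.components[name] = self.components.get(name, 0) + tokens

    @property
    def used(self) -> int:
        return sum(self.components.values())

    @property
    def remaining(self) -> int:
        return self.nominal_limit - self.reserved_output - self.used

    def over_soft_limit(self, soft_limit: int) -> bool:
        # soft_limit is a self-imposed "effective" cap, well below nominal_limit
        return self.used > soft_limit


budget = TokenBudget(nominal_limit=200_000, reserved_output=4_096)
budget.add("system_prompt", 500)
budget.add("history", 10 * 150)
budget.add("user_request", 1_500)

print(budget.used)                    # 3500
print(budget.remaining)               # 192404
print(budget.over_soft_limit(3_000))  # True
```

When over_soft_limit returns True, that is the point to trigger the fallback strategy (summarization or selective retrieval) rather than waiting for the hard limit.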

Claude 3.5 Sonnet can accept up to 200,000 tokens in a single request. The Anthropic Models Overview lists this as the maximum context window, but practical usage rarely reaches that ceiling.

Prompt caching lets you reuse frequently used context, reducing latency and input-token cost. The Anthropic Developer Documentation explains that, when a request arrives with caching enabled, the API checks whether the prompt prefix is already cached from a recent request (Anthropic Developer Documentation, Prompt Caching).

Prompt Caching Under the Hood: Optimizing for Claude's 1024-Token Breakpoints

Claude’s prompt caching is organized around cache breakpoints. A breakpoint is a marker (a cache_control field on a content block) that tells the API to treat everything up to that point as a reusable prefix. Without caching, the model reprocesses the full prompt on every request; that cost is linear in the number of tokens and quickly dominates latency and token budget in long‑running sessions. Prompt caching sidesteps this by storing the processed prefix so it can be reused whenever a later request begins with the same content. The cache is matched on the exact prefix, so a cache hit lets the model skip reprocessing that text.

The cache works only when the prefix length meets the model’s minimum cacheable size. For Claude 3.5 Sonnet, that threshold is 1,024 tokens; Claude 3.5 Haiku requires 2,048. When the prefix is shorter, the request is processed without caching, erasing any performance gain. Structure the prompt so that the most expensive, static portion (often the system instructions, schema definitions, or domain knowledge) occupies at least that many tokens.

Claude 3.5 Sonnet’s cache trigger is 1,024 tokens, as documented in the Anthropic Prompt Caching Guide. This figure defines the smallest prefix that the runtime will store for reuse.

A concrete pattern is to pre‑assemble a “static block” that contains all immutable context, then append a “dynamic block” that holds the user‑specific conversation. The static block is cached once; the dynamic block is regenerated each turn. The following Python snippet shows how to implement this with the Anthropic SDK:

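import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-20241022"  # example model ID; substitute your own

# Immutable context. A real static block must tokenize to at least 1,024 tokens
# to be cacheable; this short placeholder would not qualify.
STATIC_PROMPT = """You are an AI assistant for a cloud-native platform.
The platform uses a microservice architecture, Kubernetes, and Istio.
All responses must be concise, use markdown, and include code examples."""

def static_block_is_cacheable(min_tokens=1024):
    # Count the system prompt plus a minimal placeholder user message
    count = client.messages.count_tokens(
        model=MODEL,
        system=STATIC_PROMPT,
        messages=[{"role": "user", "content": "ping"}],
    )
    return count.input_tokens >= min_tokens

def chat(messages):
    # messages: list of {"role": "user" | "assistant", "content": str}
    response = client.messages.create(
        model=MODEL,
        max_tokens=512,
        system=[{
            "type": "text",
            "text": STATIC_PROMPT,
            # Cache breakpoint: everything up to here is the reusable prefix
            "cache_control": {"type": "ephemeral"},
        }],
        messages=messages,  # dynamic block, sent fresh each turn
    )
    return response.content[0].text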

Failure modes appear when the static block drifts below the cache threshold, when a model or tokenizer change alters how the same text is counted, or when the prefix no longer matches the cached copy because of a difference as small as invisible whitespace. In practice, a cache miss shows up as a sudden jump in latency and in billed input tokens per turn. Monitoring latency spikes and token counts can surface these issues early.
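One way to monitor for these misses is to inspect the usage counters returned with each response. The sketch below assumes the usage data has been copied into a plain dict containing the cache_read_input_tokens and cache_creation_input_tokens fields that the Anthropic Messages API reports; the cache_status helper and its three labels are illustrative names, not part of any SDK.

```python
def cache_status(usage: dict) -> str:
    """Classify one response as a cache 'hit', 'write', or 'miss'.

    `usage` is assumed to be a dict copy of the response's usage block.
    """
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    if read > 0:
        return "hit"      # prefix was served from the cache
    if written > 0:
        return "write"    # prefix was cached on this request
    return "miss"         # nothing cached: prefix too short or caching not requested


first_turn = {"input_tokens": 40, "cache_creation_input_tokens": 1800,
              "cache_read_input_tokens": 0}
later_turn = {"input_tokens": 55, "cache_creation_input_tokens": 0,
              "cache_read_input_tokens": 1800}
no_cache = {"input_tokens": 1855}

print(cache_status(first_turn))  # write
print(cache_status(later_turn))  # hit
print(cache_status(no_cache))    # miss
```

A run of "write" results on consecutive turns is a useful alarm: it suggests the prefix is changing between requests and the cache is being rebuilt instead of reused.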

Compared to the naïve approach of re‑encoding the entire prompt each turn, caching reduces per‑turn compute by up to 30 % in typical 2,000‑token sessions. The trade‑off is added complexity in managing block boundaries and ensuring the static block stays cacheable.

Use prompt caching when many requests in a session share a sizable immutable prefix that meets the minimum cacheable length (1,024 tokens on Claude 3.5 Sonnet). If your workload consists of short, highly variable prompts, the overhead of maintaining a cache may outweigh the benefits. In those cases, stick with a straightforward recompute.

The Anthropic documentation lists the minimum cacheable prompt length per model: 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, and 2,048 tokens for Claude 3.5 Haiku and Claude 3 Haiku (Anthropic Developer Documentation, Prompt Caching Limits).

State Compaction vs. Session Handoff: Architectural Decision Boundaries

When a user session exceeds the model’s context window, engineers must decide whether to compress the accumulated state into a smaller representation (state compaction) or to split the conversation across multiple model invocations (session handoff). The choice determines latency, cost, and the fidelity of the user experience. In practice, the boundary between the two tactics is not a hard line; it is shaped by the model’s token budget, the volatility of the domain, and the need for deterministic behavior across calls.

State compaction works by summarizing or otherwise reducing the token count of prior messages. Techniques include lossy summarization, extracting key entities, or encoding the dialogue into a structured schema that the model can ingest as a single prompt. The advantage is that the entire conversation remains in a single inference, preserving the model’s internal state and avoiding cross‑call coordination. The downside is that any compression inevitably discards detail; subtle cues or rare facts may be lost, leading to degraded answers.

Session handoff, by contrast, treats the conversation as a series of independent windows. When the token budget is reached, the system stores the current state externally (e.g., in a database or a file) and starts a fresh prompt that includes a concise handoff summary. The next model call receives the handoff summary and continues as if the conversation never paused. This approach scales cleanly with long sessions and keeps the original transcript intact in storage, but it introduces latency for each handoff and requires reliable storage and retrieval logic.
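A minimal sketch of the handoff step, assuming a JSON file per session as the external store. The save_handoff and start_next_window helpers, the file layout, and the wording of the handoff preamble are all hypothetical choices for illustration; a production system would more likely use a database and a model-generated summary.

```python
import json
import tempfile
from pathlib import Path


def save_handoff(store_dir: Path, session_id: str, transcript: list, summary: str) -> Path:
    """Persist the full transcript plus a concise summary for the next window."""
    path = store_dir / f"{session_id}.json"
    path.write_text(json.dumps({"summary": summary, "transcript": transcript}))
    return path


def start_next_window(path: Path, new_user_msg: str) -> list:
    """Build the opening messages of a fresh window from the stored summary."""
    state = json.loads(path.read_text())
    opening = (
        "Handoff summary from the previous window:\n"
        f"{state['summary']}\n\n{new_user_msg}"
    )
    return [{"role": "user", "content": opening}]


with tempfile.TemporaryDirectory() as tmp:
    transcript = [
        {"role": "user", "content": "The payments pod keeps restarting."},
        {"role": "assistant", "content": "Check the liveness probe first."},
    ]
    path = save_handoff(
        Path(tmp), "session-42", transcript,
        "User is debugging a restarting payments pod; liveness probe suggested.",
    )
    messages = start_next_window(path, "The probe looks fine. What next?")
    stored = json.loads(path.read_text())

print(messages[0]["content"])
```

Only the summary is sent to the model; the full transcript stays in storage, which is what keeps the session auditable.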

A concrete example appears in the Leviathan terminal, where a multi‑step troubleshooting workflow exceeds the model’s 2 million‑token nominal limit. The system compacts the early steps into a JSON summary, then hands off to a fresh model instance for the remaining steps. This hybrid pattern lets the workflow stay within the window while preserving the logical flow across calls.

Failure modes surface when the handoff summary is too terse or when the compaction algorithm misrepresents the user’s intent. In the former case, the model may ask redundant clarification questions; in the latter, it may produce answers that ignore critical constraints. Monitoring token usage and validating handoff fidelity are essential safeguards.

If the application can tolerate occasional loss of nuance and values low latency, state compaction is the simpler choice. If the conversation must be auditable and span many thousands of tokens, session handoff provides a more robust foundation. The decision should be guided by the token budget, the importance of detail, and the operational overhead you are willing to accept.

The nominal context window for Gemini 1.5 Pro is 2,000,000 tokens. This figure comes from Google AI Studio Model Features and sets an upper bound for raw token consumption.

"Prompt caching is automatic for prompts longer than 1024 tokens. The cache is structured around prefixes, allowing reuse of system instructions and historical messages," explains the OpenAI API Reference - Prompt Caching. The source details how caching can reduce repeated token costs.

Scratchpads and File-Based Memory vs. Structured Tool Calls

When a conversation exceeds the model’s token window, engineers need a strategy to keep relevant information accessible without re‑sending the entire history. Two common patterns are scratchpads (in‑prompt notes) and file‑based memory (external storage of excerpts). Both rely on the model’s ability to read and write text, but they differ in how the data is presented to the model. A scratchpad is a short, mutable block inserted into the prompt each turn; a file‑based approach stores longer passages in a separate repository and retrieves them on demand. Structured tool calls, by contrast, let the model invoke a defined API that returns data in a predictable format, so bulky data stays outside the context window until the model asks for it.

The scratchpad pattern is simple to implement. After each turn the system appends a bullet list of key facts, decisions, or open questions. The list is kept under a few hundred tokens, so it fits comfortably in the context window. Because the model sees the list as part of the prompt, it can reason about it directly. Left uncapped, however, the list grows linearly with the number of turns and stale items accumulate, leading to token bloat, so a cap on the number of entries is essential. A typical implementation looks like this:

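def update_scratchpad(scratchpad, new_fact, max_items=10):
    # Keep only the most recent `max_items` facts, skipping empty lines
    items = [line for line in scratchpad.split("\n") if line.strip()]
    items.append(new_fact)
    return "\n".join(items[-max_items:])

# `session`, `model`, `prompt`, and `extract_fact` stand in for application code
scratchpad = ""
for turn in session:
    response = model.generate(prompt + "\n" + scratchpad)
    fact = extract_fact(response)
    scratchpad = update_scratchpad(scratchpad, fact)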

File‑based memory shifts the burden to an external store. The system writes long passages to a document store (e.g., a vector database) and retrieves the top‑k most relevant chunks when the model needs context. Retrieval is performed before each generation, and only the selected chunks are inserted into the prompt. This reduces token usage dramatically, but it introduces latency and requires a reliable similarity search. The retrieved text must be formatted consistently, otherwise the model may misinterpret headings or code blocks.
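The retrieve-then-insert step can be sketched without any external service. This example swaps the vector database for a bag-of-words cosine similarity so that it runs with the standard library alone; the retrieve and build_prompt helpers and the "[chunk N]" formatting are assumptions for illustration, and a real system would use embeddings rather than word counts.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    # Bag-of-words stand-in for an embedding
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query."""
    q = _vector(query)
    return sorted(chunks, key=lambda c: _cosine(q, _vector(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list, k: int = 2) -> str:
    # Consistent delimiters make it harder for the model to confuse chunks
    context = "\n\n".join(
        f"[chunk {i + 1}]\n{c}" for i, c in enumerate(retrieve(query, chunks, k))
    )
    return f"Context:\n{context}\n\nQuestion: {query}"


chunks = [
    "The project deadline is March 14 and the release branch freezes a week earlier.",
    "Istio sidecars are injected automatically in the payments namespace.",
    "Budget approvals go through the finance team every quarter.",
]
top = retrieve("When is the project deadline?", chunks, k=1)
print(top[0])
print(build_prompt("When is the project deadline?", chunks, k=1))
```

Only the selected chunk reaches the prompt, so token use depends on k and chunk size rather than on the size of the stored corpus.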

Structured tool calls avoid these pitfalls by exposing data through a defined interface. The model issues a call like search_documents(query="project deadline"), and the backend returns a JSON payload with the matching excerpts. Only that payload enters the context, as a tool result, rather than the whole corpus. This pattern scales well because the context grows by the payload size alone, and the tool contract makes the shape of the data predictable. The downside is that the tool schema must be supplied to the model with each request, and any change to the API requires updating the schema and any prompts that reference it.
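As a rough sketch of the backend side, the tool can be described with a JSON Schema and dispatched from a small handler. The name, description, and input_schema layout follows the shape Anthropic's tool-use format expects; the in-memory DOCS store, the keyword matching, and the handle_tool_call function are hypothetical stand-ins for a real search service.

```python
import json

# Tool definition the model would be given alongside the conversation
SEARCH_TOOL = {
    "name": "search_documents",
    "description": "Return excerpts that match a query.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# Hypothetical document store
DOCS = {
    "plan": "The project deadline is March 14.",
    "infra": "Istio sidecars run in every pod.",
}


def search_documents(query: str) -> list:
    # Naive keyword match; a real backend would use a search index
    terms = query.lower().split()
    return [
        {"id": doc_id, "excerpt": text}
        for doc_id, text in DOCS.items()
        if any(term in text.lower() for term in terms)
    ]


def handle_tool_call(call: dict) -> str:
    """Validate a tool call of the form {'name': ..., 'input': {...}} and run it."""
    if call["name"] != SEARCH_TOOL["name"]:
        raise ValueError(f"unknown tool: {call['name']}")
    missing = [k for k in SEARCH_TOOL["input_schema"]["required"] if k not in call["input"]]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return json.dumps(search_documents(**call["input"]))


payload = handle_tool_call({"name": "search_documents", "input": {"query": "project deadline"}})
print(payload)
```

The JSON string returned here is what would be sent back to the model as the tool result, which is why its size, not the size of DOCS, determines the token cost.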

The Model Context Protocol (MCP) provides a standardized way for developers to expose data and capabilities to AI models safely, managing context boundaries cleanly via tools and resources, as described in the Model Context Protocol Introduction.

When cost is a factor, the prompt cache read discount can be decisive.

Claude 3.5 Sonnet’s prompt cache reads cost 90 % less than base input, $0.30 versus $3.00 per million tokens (Anthropic API Pricing Page).
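To see how the discount plays out over a session, here is a back-of-the-envelope estimate. The base and cache-read prices are the ones quoted above; the cache-write price of $3.75 per million tokens is an assumption that should be checked against the current pricing page, and the model deliberately ignores history growth by treating the dynamic part of each turn as a constant size.

```python
BASE_INPUT_PER_MTOK = 3.00    # USD, from the pricing figure quoted above
CACHE_READ_PER_MTOK = 0.30    # USD, from the pricing figure quoted above
CACHE_WRITE_PER_MTOK = 3.75   # USD, ASSUMPTION: verify on the pricing page


def session_input_cost(prefix_tokens: int, dynamic_tokens: int, turns: int, cached: bool) -> float:
    """Estimate input cost in USD for `turns` (>= 1) requests sharing one static prefix."""
    def usd(tokens: int, price_per_mtok: float) -> float:
        return tokens * price_per_mtok / 1_000_000

    dynamic = usd(turns * dynamic_tokens, BASE_INPUT_PER_MTOK)
    if not cached:
        return dynamic + usd(turns * prefix_tokens, BASE_INPUT_PER_MTOK)
    write = usd(prefix_tokens, CACHE_WRITE_PER_MTOK)               # first turn stores the prefix
    reads = usd((turns - 1) * prefix_tokens, CACHE_READ_PER_MTOK)  # later turns reuse it
    return dynamic + write + reads


uncached = session_input_cost(prefix_tokens=2_000, dynamic_tokens=200, turns=50, cached=False)
cached = session_input_cost(prefix_tokens=2_000, dynamic_tokens=200, turns=50, cached=True)
print(f"uncached: ${uncached:.4f}, cached: ${cached:.4f}")
```

The estimate also assumes every turn after the first is a cache hit; if the cache expires between turns, each expiry adds another write at the higher price.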

Choose scratchpads for rapid prototyping or when the knowledge base is tiny and changes each turn. Opt for file‑based memory when you need to retain large, static documents and can tolerate a retrieval step. Prefer structured tool calls when you have a stable API surface and want predictable token usage across sessions. The right tactic depends on the size of the knowledge, latency tolerance, and how often the data schema evolves.

Lossy Summarization Patterns: Preserving Semantic Intent Over Raw Tokens

When a session exceeds the model’s context window, the naive approach is to truncate the oldest messages. Truncation keeps the raw token count low but discards the logical thread that the model has been following. Lossy summarization replaces a block of dialogue with a compact representation that retains the user’s goals, the system’s decisions, and any constraints that have been introduced. The result is a prompt that fits the window while still giving the model enough semantic scaffolding to continue the conversation coherently.

The pattern works in three steps. First, identify the segment that will be removed. Second, run a summarization pass that extracts intent, state changes, and open questions. Third, inject the summary back into the prompt as a single system message. Because the summary is a single message, it consumes far fewer tokens than the original exchange. At the same time, the model sees the same high‑level information it would have seen in the full transcript, so it can reason about next actions without re‑reading every line.

A practical implementation uses a small, cheap model to produce the summary. The small model is fast enough to run on every turn, and its output is short enough to stay well within the context budget. The larger model that powers the main assistant then receives the summary as part of its system prompt. This division of labor keeps cost low while preserving the conversation’s logical shape.

Below is a minimal Python example that demonstrates the pattern with the Anthropic API. The summarize_intent function calls a lightweight Claude model to produce a one‑sentence summary of the oldest portion of the history. The main chat function then builds a new prompt that contains the summary, the remaining recent messages, and the newest user query. The code assumes that ANTHROPIC_API_KEY is set in the environment.

import os
import httpx

ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
HEADERS = {
    "x-api-key": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic-version": "2023-06-01",  # required by the API
    "content-type": "application/json",
}
# Example model IDs; substitute the models available to your account
MAIN_MODEL = "claude-3-5-sonnet-20241022"
SUMMARY_MODEL = "claude-3-5-haiku-20241022"

def anthropic_message(messages, model=MAIN_MODEL, max_tokens=256, system=None):
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "messages": messages,
    }
    if system:
        payload["system"] = system  # system text is a top-level field, not a message role
    resp = httpx.post(ANTHROPIC_URL, headers=HEADERS, json=payload, timeout=60.0)
    resp.raise_for_status()
    return resp.json()["content"][0]["text"].strip()

def summarize_intent(messages):
    # Build a short prompt for the summarizer
    raw = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    prompt = (
        "Summarize the user's intent and any system state changes "
        f"in one sentence.\n\n{raw}"
    )
    return anthropic_message(
        [{"role": "user", "content": prompt}], model=SUMMARY_MODEL, max_tokens=64
    )

def chat(history, new_user_msg, window=8192):
    # Rough estimate: about 50 tokens per stored message plus the new message's words
    token_estimate = len(history) * 50 + len(new_user_msg.split())
    system = None
    if token_estimate > window:
        # Summarize roughly the oldest half of the history, extending the cut
        # so the remaining messages start with a user turn
        cut = len(history) // 2
        while cut < len(history) and history[cut]["role"] != "user":
            cut += 1
        summary = summarize_intent(history[:cut])
        # Replace the summarized segment with a single system-level summary
        system = f"Summary of the earlier conversation: {summary}"
        history = history[cut:]
    # Append the new user message without mutating the caller's list
    messages = history + [{"role": "user", "content": new_user_msg}]
    return anthropic_message(messages, model=MAIN_MODEL, max_tokens=512, system=system)

# Example usage
if __name__ == "__main__":
    past = [
        {"role": "user", "content": "I need a travel itinerary for a week in Japan."},
        {"role": "assistant", "content": "Sure, which cities are you interested in?"},
        {"role": "user", "content": "Tokyo, Kyoto, and Osaka."},
        {"role": "assistant", "content": "Do you have a budget limit?"},
    ]
    # A deliberately small window forces the summarization path for this short demo
    response = chat(past, "I prefer budget hotels and local food.", window=100)
    print(response)

The example shows how a short summary can replace a multi‑turn exchange without losing the essential request (“budget hotels and local food”) or the constraints already established (cities to visit). In practice, engineers tune the summarizer’s prompt, the length of the summary, and the token budget to match their application’s latency and cost targets. The pattern is most useful when the conversation is long, the model’s window is modest, and the user’s intent remains stable across many turns. It should be avoided in highly dynamic contexts where each turn introduces new entities that a one‑sentence summary cannot capture.

Measuring Context Degradation: Setting Up Automated Retrieval Evaluations

When a session exceeds the model’s context window, the system must decide which parts of the conversation to keep and which to discard. That decision is rarely binary; each pruning step introduces a small loss of information that can accumulate into a noticeable drop in answer quality. Measuring that loss in a repeatable way is the first step toward building reliable mitigation tactics. An automated retrieval evaluation provides a controlled loop: a known query is issued, the model answers using a progressively trimmed context, and the answer is compared against a reference that had full context. The difference quantifies degradation and highlights the point at which the chosen pruning strategy becomes harmful.

Defining the Retrieval Baseline

Start with a static corpus of prompts and expected completions. The corpus should cover the range of intents your application supports, such as clarification, code generation, and troubleshooting. For each entry, store the full conversation history that produced the reference answer. This “gold” context serves as the upper bound; any deviation from its answer signals loss.

Building the Evaluation Loop

The loop runs a series of experiments that simulate context shrinkage. A typical implementation follows these steps:

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-20241022"  # example model ID; substitute your own

def token_count(text):
    # Ask the API how many input tokens the text occupies as one user message
    result = client.messages.count_tokens(
        model=MODEL,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens

def evaluate(entry, truncate_at):
    # Work on a copy so truncation at one step does not leak into the next
    history = list(entry["history"])
    # Truncate the history to the last `truncate_at` tokens
    while history and token_count("\n".join(history)) > truncate_at:
        history.pop(0)  # drop oldest turn
    prompt = "\n".join(history + [entry["question"]])
    response = client.messages.create(
        model=MODEL,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def run_suite(corpus, steps, similarity):
    # `similarity(answer, reference)` is supplied by the caller and returns a float
    results = {}
    for step in steps:
        degradations = []
        for entry in corpus:
            answer = evaluate(entry, step)
            degradations.append(
                {"id": entry["id"], "score": similarity(answer, entry["reference"])}
            )
        results[step] = degradations
    return results

The similarity function can be a BLEU score, ROUGE-L, or a semantic embedding distance, depending on the domain. The steps list contains token thresholds that mimic realistic window pressures (e.g., 6 k, 5 k, 4 k tokens). By plotting the average similarity across steps, you obtain a degradation curve that directly shows how much answer quality is lost as context shrinks.
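As one possible similarity function, the sketch below computes a ROUGE-L style F1 score from the longest common subsequence of whitespace-separated tokens, then averages scores per step to produce the degradation curve. It is a simplified stand-in, not the reference ROUGE implementation, and the average_curve helper is a hypothetical name that assumes results shaped as a mapping from step to a list of {"id", "score"} records.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, using one rolling row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(answer: str, reference: str) -> float:
    """ROUGE-L style F1 over lowercase whitespace tokens, in [0, 1]."""
    a, r = answer.lower().split(), reference.lower().split()
    if not a or not r:
        return 0.0
    lcs = lcs_length(a, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def average_curve(results: dict) -> dict:
    """Collapse per-entry scores into one average similarity per token threshold."""
    return {
        step: sum(row["score"] for row in rows) / len(rows)
        for step, rows in results.items()
        if rows
    }


print(similarity("the deadline is friday", "the deadline is friday"))  # 1.0
print(similarity("a b c d", "a x c y"))                                # 0.5
print(average_curve({6000: [{"id": 1, "score": 1.0}, {"id": 2, "score": 0.5}]}))
```

Lexical overlap penalizes correct paraphrases, so for open-ended answers an embedding-based distance may track quality more faithfully than this score.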

Interpreting the Results

A steep drop after a particular threshold indicates that the pruning rule discards essential information too early. If the curve flattens, the rule is likely preserving the most salient turns. Compare multiple strategies (simple FIFO drop, semantic relevance ranking, or summarization) by running the same suite. The strategy with the highest similarity at the lowest token budget is the most robust.

Integrating with Continuous Integration

Treat the evaluation suite as a regression test. Store the degradation curve in a JSON artifact and compare it against a baseline from the previous commit. If a new change pushes the curve down by more than a predefined epsilon, the CI job fails. This guardrail prevents accidental introduction of aggressive context trimming that would harm downstream users.
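A sketch of that regression gate, assuming both curves are stored as JSON objects mapping a token threshold (as a string key) to an average similarity. The find_regressions function, the 0.02 epsilon, and the sample numbers are illustrative assumptions; in a CI job the two dicts would be loaded from the baseline and current artifacts and a non-empty result would exit with a non-zero status.

```python
import json


def find_regressions(baseline: dict, current: dict, epsilon: float = 0.02) -> list:
    """Return (step, baseline_score, current_score) for every step that regressed.

    A step regresses when it is missing from `current` or when its score
    fell by more than `epsilon` relative to the baseline.
    """
    failures = []
    for step, base_score in baseline.items():
        new_score = current.get(step)
        if new_score is None or base_score - new_score > epsilon:
            failures.append((step, base_score, new_score))
    return failures


# In CI these would come from json.load() on the stored artifacts
baseline = json.loads('{"6000": 0.91, "5000": 0.88, "4000": 0.80}')
current = json.loads('{"6000": 0.90, "5000": 0.84, "4000": 0.80}')

failures = find_regressions(baseline, current)
for step, old, new in failures:
    print(f"regression at {step} tokens: {old} -> {new}")
# A CI wrapper would call sys.exit(1) here when `failures` is non-empty
```

Choosing epsilon slightly above the run-to-run noise of the suite avoids failing builds on variance alone.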

When to Use Automated Retrieval Evaluations

Use this methodology whenever your product relies on long‑running conversations, especially when you plan to push the model toward its context limit. It is also valuable when experimenting with new summarization or compaction techniques; the evaluation will surface regressions before they reach production. If your workload stays well within the model’s nominal window, a full suite may be unnecessary, but a lightweight sanity check can still catch edge cases where a single long user turn triggers degradation.
