Sliding Window vs Summarization: How to Free Up AI Agent Context

Your context is full. You need to drop something.

Two options: slide the window or summarize.

Both have costs. Here's how to choose.

The problem

Turn 1:   "My name is Alice, I work on the payments team"
Turn 2:   "Can you help me debug the checkout flow?"
Turn 3:   [tool call: read checkout.js]
Turn 4:   "I see the issue, it's in the validation"
...
Turn 30:  Context is 95% full
Turn 31:  Need to make room

What do you drop?

Option 1: Sliding window

Keep the last N messages. Drop everything older.

Before (30 messages):
[1] [2] [3] [4] [5] ... [26] [27] [28] [29] [30]

After sliding (keep last 20):
                [11] [12] ... [26] [27] [28] [29] [30]

Dropped: Messages 1-10

Implementation

class SlidingWindow:
    def __init__(self, max_messages=20):
        self.messages = []
        self.max_messages = max_messages

    def add(self, message):
        self.messages.append(message)
        # Past the limit, keep only the newest max_messages entries
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self):
        return self.messages
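
Usage is a drop-in message buffer:

window = SlidingWindow(max_messages=20)

window.add({"role": "user", "content": "My name is Alice"})
window.add({"role": "assistant", "content": "Hi Alice!"})

# Always bounded: at most the last 20 messages
context = window.get_context()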

What you gain

Speed: O(1) operation. Just slice the list.

# Instant, no API calls
self.messages = self.messages[-20:]

Predictability: Always same size. No surprises.

# Context size is bounded
assert len(context.messages) <= 20

Simplicity: No LLM calls. No prompts. No failure modes.

Cost: Zero additional tokens.

What you lose

Early context: First messages disappear.

Turn 1: "My name is Alice"        ← GONE
Turn 2: "I'm on the payments team" ← GONE
...
Turn 25: "What's my name again?"
Agent: "I don't know your name"   ← Oops

Important setup: Key instructions vanish.

Turn 1: "Always respond in French" ← GONE
...
Turn 25: [Agent responds in English]

Accumulated knowledge: Facts learned early are lost.

Turn 3: [Reads config file]        ← GONE
Turn 5: [User explains architecture] ← GONE
...
Turn 25: Agent asks same questions again

Option 2: Summarization

Compress old messages into a summary. Keep recent messages verbatim.

Before:
[msg1] [msg2] [msg3] ... [msg28] [msg29] [msg30]

After summarizing:
[SUMMARY of msgs 1-20] [msg21] [msg22] ... [msg30]

Implementation

class SummarizingContext:
    def __init__(self, keep_recent=10, summarize_threshold=20):
        self.messages = []
        self.summary = ""
        self.keep_recent = keep_recent
        self.summarize_threshold = summarize_threshold

    def add(self, message):
        self.messages.append(message)

        if len(self.messages) > self.summarize_threshold:
            self.compress()

    def compress(self):
        to_summarize = self.messages[:-self.keep_recent]
        to_keep = self.messages[-self.keep_recent:]

        # Summarize old messages
        new_summary = self.create_summary(to_summarize)

        # Update state
        if self.summary:
            self.summary = f"{self.summary}\n\n{new_summary}"
        else:
            self.summary = new_summary

        self.messages = to_keep

    def create_summary(self, messages):
        messages_text = "\n".join([
            f"{m['role']}: {m['content'][:500]}"
            for m in messages
        ])

        # `llm` stands in for whatever chat-completions client you use
        response = llm.create(
            model="gpt-4o-mini",  # Fast, cheap model
            messages=[{
                "role": "user",
                "content": f"""Summarize this conversation, preserving:
- Key facts (names, preferences, decisions)
- Important context (what was discussed, what was decided)
- Any instructions or requirements mentioned

Conversation:
{messages_text}

Summary:"""
            }]
        )
        return response.content

    def get_context(self):
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{self.summary}"
            })
        context.extend(self.messages)
        return context
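
From the caller's side it behaves just like the sliding window (conversation here is an assumed list of {"role", "content"} dicts):

ctx = SummarizingContext(keep_recent=10, summarize_threshold=20)

for turn in conversation:
    ctx.add(turn)

# Recent messages verbatim, older ones folded into a system summary
messages = ctx.get_context()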

What you gain

Preserved knowledge: Key facts survive.

Summary: "User is Alice from payments team. Debugging checkout flow.
         Found validation issue in checkout.js line 47."

Turn 25: "What team am I on?"
Agent: "You're on the payments team"  ← Remembered!

Accumulated context: Learning compounds.

Summary includes:
- User preferences discovered
- Decisions made
- Problems solved
- Architecture understood

Better continuity: Conversation feels coherent.

What you lose

Latency: LLM call to summarize.

# 500ms - 2s per summarization
summary = llm.create(...)

Cost: Tokens to generate summary.

Summarizing 10 messages: ~500 input tokens + ~200 output tokens
At $0.01/1K tokens: ~$0.007 per summarization

Accuracy: Summarizer might miss things.

# Important detail in message 5
"Make sure to use UTC timestamps, not local time"

# Summarizer misses it
Summary: "User discussed timestamp handling"  ← Lost the UTC detail!

Complexity: More failure modes.

# What if summarization fails?
try:
    summary = llm.create(...)
except APIError:
    # Fall back to sliding window?
    # Retry?
    # Keep full context temporarily?
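
One reasonable answer, sketched on top of the SummarizingContext above (the hard limit of 40 is an arbitrary assumption): keep the full context and let the next add() retry, and only fall back to a plain slide when you are genuinely out of room.

class SafeSummarizingContext(SummarizingContext):
    HARD_LIMIT = 40  # assumption: absolute cap before we slide anyway

    def compress(self):
        try:
            super().compress()  # the LLM-backed path from above
        except Exception:
            # Keep everything for now; add() retries compression on the
            # next turn. Past the hard limit, slide as a last resort.
            if len(self.messages) > self.HARD_LIMIT:
                self.messages = self.messages[-self.keep_recent:]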

The trade-off matrix

Factor               Sliding Window      Summarization
-------------------  ------------------  ----------------
Speed                ✓ Instant           Slow (LLM call)
Cost                 ✓ Free              Costs tokens
Simplicity           ✓ Simple            Complex
Knowledge retention  Poor                ✓ Better
Accuracy             ✓ Perfect (recent)  May lose details
Continuity           Abrupt              ✓ Smooth
Failure modes        ✓ None              Several

When to use sliding window

Short tasks: User will complete within window.

# Task typically takes 5-10 turns
# Window of 20 is plenty
context = SlidingWindow(max_messages=20)

Stateless interactions: Each turn is independent.

# Q&A bot - each question stands alone
# No need to remember previous questions

High-throughput: Need speed, can't afford LLM latency.

# Processing 1000 requests/minute
# Can't add 1s latency for summarization

Budget-constrained: Every token counts.

# Free tier, limited API budget
# Can't afford summarization overhead

When to use summarization

Long sessions: Conversations span many turns.

# Coding session: 50+ turns over hours
# Must remember early context
context = SummarizingContext(keep_recent=15)

Personalization: User identity matters.

# Personal assistant
# Must remember: name, preferences, past interactions

Complex tasks: Building on previous work.

# Multi-step project
# Each step depends on previous decisions

Knowledge accumulation: Learning throughout conversation.

# Onboarding agent that learns your codebase
# Must retain discovered information

Hybrid approaches

Approach 1: Tiered retention

Recent: verbatim. Medium: summarized. Old: key facts only.

class TieredContext:
    def __init__(self):
        self.key_facts = {}           # Permanent
        self.summary = ""             # Compressed old
        self.recent = []              # Last 10 verbatim

    def get_context(self):
        facts = "\n".join(f"- {k}: {v}" for k, v in self.key_facts.items())
        return [
            {"role": "system", "content": f"Key facts:\n{facts}"},
            {"role": "system", "content": f"Previous context:\n{self.summary}"},
            *self.recent
        ]
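
The class above only shows the read path. A minimal write path might look like this (a sketch; summarize and extract_facts are assumed helpers along the lines sketched elsewhere in this post):

    # Inside TieredContext: demote messages down the tiers as they age
    def add(self, message, max_recent=10):
        self.recent.append(message)
        if len(self.recent) > max_recent:
            overflow = self.recent[:-max_recent]
            self.recent = self.recent[-max_recent:]
            # Medium tier: fold the overflow into the running summary
            self.summary = f"{self.summary}\n{summarize(overflow)}".strip()
            # Old tier: promote durable facts (names, decisions) for good
            self.key_facts.update(extract_facts(overflow))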

Approach 2: Smart sliding

Slide, but protect important messages.

class SmartSlidingWindow:
    def __init__(self, max_messages=20):
        self.messages = []
        self.protected = []  # Never dropped
        self.max_messages = max_messages

    def add(self, message, protect=False):
        if protect:
            self.protected.append(message)
        else:
            self.messages.append(message)
            if len(self.messages) > self.max_messages:
                self.messages.pop(0)

    def get_context(self):
        return self.protected + self.messages
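
This directly fixes the "Always respond in French" failure from earlier: pin standing instructions, let everything else slide.

window = SmartSlidingWindow(max_messages=20)

window.add({"role": "system", "content": "Always respond in French"}, protect=True)
window.add({"role": "user", "content": "Bonjour!"})

# Protected messages survive no matter how far the window slides
context = window.get_context()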

Approach 3: On-demand summarization

Only summarize when the context is actually about to overflow.

class OnDemandSummarizing:
    def __init__(self, max_tokens=50000):
        self.messages = []
        self.max_tokens = max_tokens

    def add(self, message):
        self.messages.append(message)

        # Only summarize when approaching limit
        if self.count_tokens() > self.max_tokens * 0.9:
            self.compress()

    def compress(self):
        # Keep last 10, summarize rest
        to_summarize = self.messages[:-10]
        summary = self.create_summary(to_summarize)

        self.messages = [
            {"role": "system", "content": f"Summary:\n{summary}"}
        ] + self.messages[-10:]
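
count_tokens is left undefined above (and create_summary can be the same one from SummarizingContext). A rough, dependency-free version, assuming ~4 characters per token:

    # Inside OnDemandSummarizing
    def count_tokens(self):
        # Crude heuristic: ~4 characters per English token.
        # Swap in a real tokenizer (e.g. tiktoken) for accuracy.
        return sum(len(m["content"]) for m in self.messages) // 4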

Approach 4: Extract-then-slide

Extract key facts, then slide.

import re

class ExtractAndSlide:
    def __init__(self, max_messages=20):
        self.messages = []
        self.facts = {}
        self.max_messages = max_messages

    def add(self, message):
        self.messages.append(message)

        # Extract facts from all messages
        self.extract_facts(message)

        # Slide normally
        if len(self.messages) > self.max_messages:
            self.messages.pop(0)

    def extract_facts(self, message):
        # Fast extraction (regex sketch here; a small model works too)
        content = message["content"]
        match = re.search(r"my name is (\w+)", content, re.IGNORECASE)
        if match:
            self.facts["name"] = match.group(1)
        match = re.search(r"i work on (?:the )?(.+?) team", content, re.IGNORECASE)
        if match:
            self.facts["team"] = match.group(1)

    def get_context(self):
        facts_str = "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
        return [
            {"role": "system", "content": f"Known facts:\n{facts_str}"},
            *self.messages
        ]

Decision flowchart

                    Start
                      │
                      ▼
         ┌──────────────────────┐
         │ Session > 20 turns?  │
         └──────────┬───────────┘
                    │
          ┌─────────┴─────────┐
          │                   │
         No                  Yes
          │                   │
          ▼                   ▼
    ┌──────────┐    ┌──────────────────┐
    │ Sliding  │    │ Need to remember │
    │ Window   │    │ early context?   │
    └──────────┘    └────────┬─────────┘
                             │
                   ┌─────────┴─────────┐
                   │                   │
                  No                  Yes
                   │                   │
                   ▼                   ▼
             ┌──────────┐      ┌─────────────┐
             │ Sliding  │      │ Can afford  │
             │ Window   │      │ latency?    │
             └──────────┘      └──────┬──────┘
                                      │
                            ┌─────────┴─────────┐
                            │                   │
                           No                  Yes
                            │                   │
                            ▼                   ▼
                     ┌────────────┐      ┌─────────────┐
                     │ Extract +  │      │ Summarize   │
                     │ Slide      │      │             │
                     └────────────┘      └─────────────┘
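
The same logic as a function, if you prefer code to boxes (the 20-turn threshold is the flowchart's; tune it to your workload):

def choose_strategy(expected_turns, needs_early_context, can_afford_latency):
    if expected_turns <= 20 or not needs_early_context:
        return "sliding window"
    return "summarization" if can_afford_latency else "extract + slide"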

Practical example with Gantz

Using Gantz Run tools with managed context:

class AgentWithManagedContext:
    def __init__(self, mcp_client):
        self.mcp = mcp_client

        # Hybrid: facts + summarization + recent
        self.facts = {}
        self.summary = ""
        self.recent = []

    def run(self, user_input):
        self.recent.append({"role": "user", "content": user_input})

        # Check if compression needed
        if len(self.recent) > 25:
            self.compress()

        # Build context
        context = self.build_context()

        # Run agent loop
        response = self.agent_loop(context)

        self.recent.append({"role": "assistant", "content": response})
        return response

    def compress(self):
        # Extract facts from old messages (as in ExtractAndSlide above)
        for msg in self.recent[:-10]:
            self.extract_facts(msg)

        # Summarize old messages (summarize() can wrap create_summary
        # from SummarizingContext above)
        old_messages = self.recent[:-10]
        new_summary = summarize(old_messages)

        if self.summary:
            self.summary = f"{self.summary}\n\n{new_summary}"
        else:
            self.summary = new_summary

        # Keep only recent
        self.recent = self.recent[-10:]

    def build_context(self):
        parts = []

        if self.facts:
            facts_str = "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
            parts.append({"role": "system", "content": f"Known facts:\n{facts_str}"})

        if self.summary:
            parts.append({"role": "system", "content": f"Context:\n{self.summary}"})

        parts.extend(self.recent)
        return parts

Summary

Sliding window:

  • Fast, free, simple
  • Loses early context
  • Best for: short tasks, stateless, high-throughput

Summarization:

  • Preserves knowledge
  • Costs latency and tokens
  • Best for: long sessions, personalization, complex tasks

Hybrid (usually best):

  • Extract key facts permanently
  • Summarize medium-old context
  • Keep recent verbatim

There's no perfect answer. Match your strategy to your use case.


Which approach do you use for context management?
