The Context Window Is RAM — Why Your Agent's SLIs Are Telling You It's Full

#ai #sre #devops #azure

The Microsoft team that built the Azure SRE Agent published something in January that I keep coming back to.
Six months into building it, they realized they weren't building an SRE agent. They were building a context engineering system that happens to do site reliability engineering. Better models were table stakes, but what moved the needle was what they controlled: disciplined context management. Kore.ai
That framing is exactly right. And it has a reliability implication that I haven't seen anyone write about directly.
The Problem
Your agent's context window is volatile working memory. Fast, expensive, and non-persistent. It's RAM, not storage. When the session ends, it's gone. When it fills up, quality degrades — not linearly, but in ways that are hard to predict and easy to miss.
As you fill the context window, model quality drops non-linearly. "Lost in the middle," "not adhering to my instructions," and long-context degradation show up well before the advertised token limits. More tokens don't just cost latency — they quietly erode accuracy. Kore.ai
That quiet erosion is the reliability failure mode. It doesn't throw an exception. It doesn't spike your error rate. Your agent keeps running. It just makes progressively worse decisions as the context fills.
And here's the part I want to be specific about: you already have the SLIs to catch this. You just haven't connected them to context state yet.
What Context Overflow Looks Like in Your SLIs
When an agent's context fills beyond its effective working range, three things happen in order:
DQR (Decision Quality Rate) drops first. The agent's decisions get worse because early instructions are now competing with thousands of tokens of recent tool output. An instruction from turn 3 gets buried under content that arrived after it. The agent isn't ignoring that instruction; it's attending more reliably to recent content as the session grows. This is a passive decay process, not a model bug. incident.io
RTD (Reasoning Trace Depth) climbs next. The agent re-plans more because its earlier context — what it already established about the problem — is partially decayed. It's not re-planning because something changed. It's re-planning because it partially forgot what it already figured out.
TIE (Tool Invocation Efficiency) degrades last. The agent starts calling tools to reconstruct context it already had. It queries the same data sources again. It re-fetches runbooks it already read. Tool call count per task climbs above baseline while task quality continues to fall.
By the time TIE is visibly elevated, you're already well into the degradation window. DQR was the earlier signal. And DQR dropping in a long-running session, without any external trigger, is your context overflow signature.
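One way to turn that signature into a check is to fit a trend of DQR against context size within a single session. The sketch below is a minimal illustration, not part of the agentsre library: the `TurnSample` record, the per-turn DQR score, the `external_trigger` flag, and the slope threshold are all assumptions you would replace with your own instrumentation and baselines.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnSample:
    """One observation per agent turn (hypothetical record shape)."""
    context_tokens: int             # tokens resident in the window at this turn
    dqr: float                      # decision quality score for this turn, 0.0 to 1.0
    external_trigger: bool = False  # e.g. a deploy or prompt change happened


def dqr_slope_per_1k_tokens(samples):
    """Least-squares slope of DQR against context size, per 1,000 tokens."""
    xs = [s.context_tokens / 1000 for s in samples]
    ys = [s.dqr for s in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # context size never changed, so there is no trend to fit
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def looks_like_context_overflow(samples, slope_threshold=-0.005, min_samples=4):
    """True when DQR falls as context grows and nothing external explains it.

    slope_threshold is an illustrative value, not a measured one.
    """
    if len(samples) < min_samples:
        return False
    if any(s.external_trigger for s in samples):
        return False
    return dqr_slope_per_1k_tokens(samples) <= slope_threshold
```

A flat DQR across a growing session returns False here; so does any session with an external trigger, because the drop then has another plausible explanation.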
The Architecture Fix
Mem0's 2026 benchmarks quantify the difference clearly: the full-context baseline (everything packed into the window) scored 72.9% accuracy using 26,000 tokens per query at a 17-second p95 latency. A two-layer memory architecture scored 91.6% accuracy using under 7,000 tokens at a 1.4-second p95 latency. That's an 18.7-point accuracy improvement while using roughly 4x fewer tokens and cutting latency by 91%. Yahoo Finance
The two-layer architecture is straightforward:
Working memory (context window): Only what's needed for the current decision. Active task state, recent tool results, current instructions. Managed actively — compressed, summarized, or paged out as the session grows.
Persistent memory (external store): Facts that persist across decisions and sessions. User preferences, established system state, prior investigation findings, runbook contents. Fetched into context when relevant, not kept resident the whole time.
The discipline is knowing what belongs in each layer and managing the boundary actively.
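A minimal sketch of that boundary, under stated assumptions: `TwoLayerMemory` and `estimate_tokens` are hypothetical names, the persistent layer is a plain dict standing in for an external store, and token counts use a crude four-characters-per-token guess rather than a real tokenizer. It only pages out the oldest entries; summarization or compression would slot into the same place.

```python
from collections import OrderedDict


def estimate_tokens(text):
    """Crude stand-in for a real tokenizer: about 4 characters per token."""
    return max(1, len(text) // 4)


class TwoLayerMemory:
    """Working memory with a token budget, backed by a persistent store."""

    def __init__(self, working_budget_tokens):
        self.working_budget_tokens = working_budget_tokens
        self.working = OrderedDict()  # what is resident in the context window
        self.persistent = {}          # stand-in for an external store

    def working_tokens(self):
        return sum(estimate_tokens(text) for text in self.working.values())

    def remember(self, key, text):
        """Add to working memory, paging out the oldest entries if over budget."""
        self.working[key] = text
        self.working.move_to_end(key)
        # Always keep at least the newest entry resident.
        while (self.working_tokens() > self.working_budget_tokens
               and len(self.working) > 1):
            old_key, old_text = self.working.popitem(last=False)
            self.persistent[old_key] = old_text

    def recall(self, key):
        """Fetch a fact into working memory only when it is needed."""
        if key in self.working:
            self.working.move_to_end(key)
            return self.working[key]
        if key in self.persistent:
            text = self.persistent[key]  # the persistent copy stays in place
            self.remember(key, text)
            return text
        return None
```

Evicting the oldest entry first is the simplest policy, not necessarily the right one. An agent whose turn-3 instructions still matter would want to pin them rather than page them out by age.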
Connecting This to Your Production Readiness Checklist
Before a long-running agent goes to production, two questions need answers:
What is the expected context budget for a typical session? Not the model's maximum. The budget at which you've measured DQR starting to degrade for this specific agent on this specific task class. That number is your operational ceiling, not the advertised token limit.
What happens when the agent approaches that ceiling? Does it compress? Summarize and page out? Escalate to human? Or does it silently continue with degrading accuracy until something downstream notices?
If the answer to the second question is "it keeps going," that's your reliability gap. The context ceiling needs the same circuit breaker thinking as your token budget ceiling from the cost post.
```python
# agentsre/context_budget.py

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ContextBudgetTracker:
    """
    Track context utilization against operational DQR ceiling.

    The model's advertised token limit is not your operational limit.
    Your operational limit is the token count at which DQR starts
    to degrade for this agent on this task class. Establish that
    baseline in shadow mode. Set your ceiling below it.

    Attributes:
        agent_id: Agent being tracked
        task_class: Task type (DQR ceiling varies by task complexity)
        operational_ceiling_tokens: Tokens at which DQR degrades
            for this agent/task combination. NOT the model's max.
        warning_threshold_pct: Fraction of ceiling triggering warning
        current_tokens: Current context utilization
        session_id: Session being tracked
        compression_events: Compressions recorded so far this session
    """
    agent_id: str
    task_class: str
    operational_ceiling_tokens: int
    warning_threshold_pct: float = 0.75
    current_tokens: int = 0
    session_id: str = ""
    compression_events: int = 0

    def __post_init__(self) -> None:
        # A non-positive ceiling would make utilization_pct divide by zero.
        if self.operational_ceiling_tokens <= 0:
            raise ValueError("operational_ceiling_tokens must be positive")

    @property
    def utilization_pct(self) -> float:
        """Current context utilization as fraction of operational ceiling."""
        return self.current_tokens / self.operational_ceiling_tokens

    @property
    def budget_status(self) -> str:
        """
        OK: within safe operating range
        WARNING: approaching DQR degradation ceiling
        CRITICAL: at or above operational ceiling, DQR degrading
        """
        u = self.utilization_pct
        if u < self.warning_threshold_pct:
            return "OK"
        elif u < 1.0:
            return "WARNING"
        return "CRITICAL"

    def update(self, current_tokens: int) -> dict:
        """
        Update current context utilization and return status record.
        Call this after each tool call or model response.

        Returns status record for logging to CloudWatch / Datadog.
        """
        self.current_tokens = current_tokens
        record = {
            "agent_id": self.agent_id,
            "session_id": self.session_id,
            "task_class": self.task_class,
            "current_tokens": self.current_tokens,
            "operational_ceiling": self.operational_ceiling_tokens,
            "utilization_pct": round(self.utilization_pct, 3),
            "budget_status": self.budget_status,
            "compression_events": self.compression_events,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        return record

    def record_compression(self) -> None:
        """Call when context compression or summarization fires."""
        self.compression_events += 1

    def should_compress(self) -> bool:
        """True when context is approaching DQR degradation ceiling."""
        return self.utilization_pct >= self.warning_threshold_pct

    def should_escalate(self) -> bool:
        """
        True when context is at or above operational ceiling.
        At this point DQR is actively degrading.
        Escalate to human or terminate session cleanly.
        """
        return self.utilization_pct >= 1.0
```
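
To show how those thresholds might drive an agent loop, here is a small self-contained simulation. It mirrors the tracker's compress and escalate rules in a standalone function so it runs on its own; `next_action`, `run_session`, and the idea that compression halves the context are illustrative assumptions, not part of the published module.

```python
def next_action(current_tokens, ceiling_tokens, warning_threshold=0.75):
    """Map context utilization to a circuit-breaker decision."""
    utilization = current_tokens / ceiling_tokens
    if utilization >= 1.0:
        return "escalate"   # at or above the operational ceiling
    if utilization >= warning_threshold:
        return "compress"   # approaching the ceiling
    return "continue"


def run_session(turn_token_costs, ceiling_tokens, compression_ratio=0.5):
    """Simulate a session and return (actions taken, final token count).

    compression_ratio is a made-up figure: the fraction of context that
    survives a compression pass. Measure your own before relying on it.
    """
    tokens = 0
    actions = []
    for cost in turn_token_costs:
        tokens += cost
        action = next_action(tokens, ceiling_tokens)
        actions.append(action)
        if action == "escalate":
            break  # hand off to a human or end the session cleanly
        if action == "compress":
            tokens = int(tokens * compression_ratio)
    return actions, tokens


actions, final_tokens = run_session([3000, 3000, 3000, 9000], ceiling_tokens=10000)
print(actions, final_tokens)
```

The loop checks escalation before compression, so a single oversized tool result that jumps straight past the ceiling stops the session instead of being compressed and carried on.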

The Practical Baseline Protocol
Before you can set an operational context ceiling, you need to know where DQR actually starts to degrade for your specific agent on your specific task class. The steps:
1. Run the agent in shadow mode on a representative sample of tasks.
2. Record DQR at 25%, 50%, 75%, and 100% of the model's advertised context limit.
3. Find the inflection point: the context size where DQR starts dropping.
4. Set your operational ceiling at 80% of that inflection point, and put your warning threshold below the ceiling (the tracker above defaults to 75% of it).
5. At the ceiling, trigger compression or escalation, not continuation.
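The arithmetic behind that protocol can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, "starts dropping" is taken to mean more than `max_drop` below the DQR at the smallest checkpoint, and the numbers in the example are invented to show the calculation, not measured results.

```python
def find_inflection_tokens(dqr_by_tokens, max_drop=0.02):
    """Smallest checkpoint where DQR is more than max_drop below the first one.

    dqr_by_tokens maps context size in tokens to the DQR measured there.
    Returns None when no checkpoint shows a drop that large.
    """
    checkpoints = sorted(dqr_by_tokens.items())
    baseline_dqr = checkpoints[0][1]
    for tokens, dqr in checkpoints[1:]:
        if baseline_dqr - dqr > max_drop:
            return tokens
    return None


def operational_ceiling(dqr_by_tokens, safety_factor=0.8, max_drop=0.02):
    """Ceiling at safety_factor times the inflection point (80% by default)."""
    inflection = find_inflection_tokens(dqr_by_tokens, max_drop)
    if inflection is None:
        # No measurable drop: fall back to the largest size actually tested.
        inflection = max(dqr_by_tokens)
    return round(inflection * safety_factor)


# Invented shadow-mode readings at 25/50/75/100% of a 128,000-token limit.
readings = {32000: 0.94, 64000: 0.93, 96000: 0.86, 128000: 0.78}
print(find_inflection_tokens(readings))  # 96000
print(operational_ceiling(readings))     # 76800
```

With only four checkpoints the inflection point is coarse. Adding intermediate checkpoints between the last good reading and the first bad one would tighten the estimate.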
This is the same baseline protocol as HER and RTD. Thirty days of shadow mode, measure the metric, set the threshold. The only difference is that context budget degradation is session-scoped rather than task-scoped.
Why This Post Belongs in This Series
Post 4 established DQR as your output quality SLI. Post 9 established token budget as a cost circuit breaker. Post 11 introduced RTD as your reasoning observability layer.
This post connects all three: context window mismanagement is the common cause that degrades DQR, elevates RTD, and burns your token budget simultaneously. Fix the memory architecture and you see improvement across all three SLIs. That's not a coincidence — they're measuring the same failure from different angles.
The code is in agentsre/context_budget.py on GitHub. MIT licensed, zero external dependencies.
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer
github.com/Ajay150313/agentsre
