J Schoemaker
Your AI Coding Session Is Degrading Silently — Here's How to Measure It

How driftguard-mcp Detects AI Context Degradation in Real Time

Long AI coding sessions degrade. Not gradually and gracefully — silently, until the model is already repeating itself, hedging on things it was confident about an hour ago, and producing code that contradicts what it wrote earlier in the same session.

Most developers don't catch this when it happens. They just feel like the AI is "having an off day" and keep pushing. The degradation compounds.

I built driftguard-mcp to measure this in real time and expose the score as MCP tools you can call mid-session. This article covers why the problem is hard to detect, what signals actually predict it, and how the implementation works under the hood.


Why Context Degradation Is Hard to Notice

The underlying mechanism is well-documented: academic benchmarks like NoLiMa (ICML 2025) show that at 32K tokens, 10 out of 12 models drop below 50% of their short-context performance — models that all claim to support at least 128K tokens. The same degradation pattern appears in coding sessions specifically. Engineers at Sourcegraph found Claude Code quality declining around 147,000–152,000 tokens, well before its advertised 200K limit. Practitioners running daily Claude Code and Cursor sessions have documented it starting as early as 20–40% context capacity. The failure mode is the same regardless of domain: the model doesn't error — it degrades.

Output gets shorter. It starts paraphrasing things it said 30 messages ago. It hedges more, qualifies more, and corrects itself on minor points rather than reasoning forward. None of this looks obviously broken. The model is still responding. It's still generating code. It just isn't the same model you were talking to at message 12.

The two most reliable signals are also the most invisible:

Context saturation accumulates incrementally. Each message pushes the window a little further. There's no threshold warning, no indicator. By the time you're at 88% token fill, the model has been operating under pressure for a while.

Repetition is equally invisible because developers don't read transcripts — they read current output. If the model recycled a code pattern from 20 messages ago, you'd have to actively compare to catch it.

The result: most people notice something is wrong at message 60+, well after the session became unreliable.

Sources: NoLiMa: Long-Context Evaluation Beyond Literal Matching (ICML 2025) · Lost in the Middle (Liu et al., TACL 2024) · Why Claude Code Sessions Keep Dying · Context Rot (Chroma Research)


Reading the Session Directly

driftguard-mcp reads session files on disk rather than intercepting API calls. This has a few advantages: it requires no proxy layer, no API key, no modified toolchain. It just watches the same JSONL files the CLI produces.

Claude Code writes session state to ~/.claude/projects/<hash>/<session-uuid>.jsonl. Each line is a typed message with role, content, and — critically — token counts from the API response. The usage field includes input_tokens and cache_read_input_tokens, which together give an accurate picture of what the model actually processed.

Gemini CLI writes to ~/.gemini/tmp/. Its format uses functionCall / functionResponse pairs for tool use, which required a separate adapter to normalise into the shared message structure.

Codex CLI uses ~/.codex/ with tool_calls / role:tool format. Token counts aren't available here, so context saturation falls back to a character-based estimate with a calibration factor.
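The character-based fallback can be sketched like this. The 4-characters-per-token baseline is a common English-text heuristic, and the calibration factor shown here is an illustrative assumption, not driftguard-mcp's actual constant:

```typescript
// Rough token estimate for sessions without API-reported counts.
// CHARS_PER_TOKEN and the default calibration are illustrative
// assumptions, not the tool's real values.
function estimateTokens(text: string, calibration = 1.1): number {
  const CHARS_PER_TOKEN = 4; // common heuristic for English text
  return Math.round((text.length / CHARS_PER_TOKEN) * calibration);
}
```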

All three adapters normalise to the same internal ParsedMessage[] structure before scoring:

```typescript
interface ParsedMessage {
  role: 'user' | 'assistant' | 'tool';
  content: string;
  tokenCount?: number;       // real API counts where available
  timestamp?: number;
}
```

One edge case worth noting: Claude Code compact boundaries. When Claude compacts mid-session, pre-compaction messages are dropped from its active context. driftguard-mcp detects this boundary in the JSONL and drops the same messages from scoring — the score only reflects what Claude actually remembers, not the full conversation history on disk.
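The boundary handling reduces to "keep only what follows the last compaction". A minimal sketch, where the `isCompactBoundary` marker field is a hypothetical stand-in for however the JSONL actually flags the boundary:

```typescript
// Sketch: drop everything before the most recent compact boundary.
// `isCompactBoundary` is a hypothetical field name, not the real
// JSONL schema.
interface RawEntry { isCompactBoundary?: boolean; [key: string]: unknown }

function activeWindow(entries: RawEntry[]): RawEntry[] {
  const lastBoundary = entries
    .map(e => !!e.isCompactBoundary)
    .lastIndexOf(true);
  return lastBoundary === -1 ? entries : entries.slice(lastBoundary + 1);
}
```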


The 6-Factor Scoring Model

The composite drift score (0–100) is a weighted sum of six factors. The weights reflect signal reliability, not equal contribution:

| Factor | Weight | Signal type |
|---|---|---|
| Context saturation | 37% | Quantitative — token fill % |
| Repetition | 37% | Statistical — 3-gram overlap |
| Response length collapse | 15% | Trend — rolling window |
| Goal distance | 8% | Semantic — TF-IDF cosine similarity |
| Uncertainty signals | 2% | Lexical — explicit self-corrections |
| Confidence drift | 1% | Lexical — hedging language trend |

Context saturation and repetition dominate at 74% combined. This is intentional — they're the most direct, measurable predictors of degradation. The lexical signals (uncertainty, confidence drift) contribute noise-reduction rather than primary signal, which is why they're weighted at 3% combined.
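The composite itself is just a weighted sum. A minimal sketch using the weights from the table, assuming each factor is already on a 0–100 scale (the proportional redistribution of the goal-distance weight when no goal is set is omitted here for brevity):

```typescript
// Weighted composite over the six factor scores (each 0–100).
// Weights taken from the table above; redistribution logic omitted.
const WEIGHTS = {
  contextSaturation: 0.37,
  repetition: 0.37,
  lengthCollapse: 0.15,
  goalDistance: 0.08,
  uncertainty: 0.02,
  confidenceDrift: 0.01,
} as const;

function compositeScore(factors: Record<keyof typeof WEIGHTS, number>): number {
  let score = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    score += factors[name as keyof typeof WEIGHTS] * weight;
  }
  return Math.round(score);
}
```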

Context Saturation (37%)

For Claude and Gemini, token counts come directly from the API response metadata in the session file. The saturation score is a calibrated curve against the model's known context window:

```typescript
function contextSaturationScore(tokenCount: number, maxTokens: number): number {
  const fill = tokenCount / maxTokens;
  // Smooth ramp: low penalty below 50%, steep above 75%
  if (fill < 0.5) return fill * 20;
  if (fill < 0.75) return 10 + (fill - 0.5) * 120;
  return 40 + (fill - 0.75) * 240;
}
```

This produces near-zero scores in healthy sessions and rapidly climbing scores as fill approaches capacity — matching actual model behaviour, which degrades non-linearly near the limit.

Repetition (37%)

Repetition is measured using a 3-gram sliding window across recent assistant responses. The algorithm extracts all 3-word sequences from the last N responses and measures overlap:

```typescript
function extractTrigrams(text: string): Set<string> {
  const words = text.toLowerCase().split(/\s+/);
  const trigrams = new Set<string>();
  for (let i = 0; i < words.length - 2; i++) {
    trigrams.add(`${words[i]} ${words[i+1]} ${words[i+2]}`);
  }
  return trigrams;
}

function repetitionScore(messages: ParsedMessage[]): number {
  const recent = messages.filter(m => m.role === 'assistant').slice(-10);
  if (recent.length < 3) return 0;

  const allTrigrams = recent.flatMap(m => [...extractTrigrams(m.content)]);
  const unique = new Set(allTrigrams);
  const overlapRatio = 1 - (unique.size / allTrigrams.length);

  return Math.min(100, overlapRatio * 180); // calibrated multiplier
}
```

3-grams at this window size are reliable enough to catch genuine repetition without false positives from incidental shared vocabulary (e.g., variable names appearing across multiple messages).

Tool noise filtering: Tool call messages — "Tool loaded.", "Calling bash...", etc. — are filtered from the user message stream before scoring. Without this, tool-heavy sessions score artificially high on repetition due to repeated tool invocation boilerplate.

Response Length Collapse (15%)

As sessions degrade, responses get shorter. The model starts truncating explanations, omitting context it would have included earlier. This is a reliable secondary signal.

The score measures the trend in response length across the last 15 assistant messages using a simple linear regression slope:

```typescript
function lengthCollapseScore(messages: ParsedMessage[]): number {
  const recent = messages
    .filter(m => m.role === 'assistant')
    .slice(-15)
    .map(m => m.content.length);

  if (recent.length < 5) return 0;

  const slope = linearRegressionSlope(recent);
  // Negative slope = shrinking responses
  return slope < 0 ? Math.min(100, Math.abs(slope) * 0.4) : 0;
}
```
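The `linearRegressionSlope` helper is not shown in the article's listing; one possible implementation is an ordinary least-squares fit over (index, value) pairs. This is a sketch, not necessarily driftguard-mcp's exact code:

```typescript
// Ordinary least-squares slope over equally spaced samples:
// slope = Σ (x - x̄)(y - ȳ) / Σ (x - x̄)², with x = 0..n-1.
function linearRegressionSlope(values: number[]): number {
  const n = values.length;
  if (n < 2) return 0;
  const xMean = (n - 1) / 2;
  const yMean = values.reduce((a, b) => a + b, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (values[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den;
}
```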

Goal Distance (8%)

This factor only activates when you pass a goal string to get_drift(). It measures vocabulary drift from your original objective using TF-IDF cosine similarity:

```typescript
get_drift({ goal: "implement JWT authentication with refresh token rotation" })
```

The goal string is vectorised against recent assistant responses. As the session drifts from the original task — handling edge cases, going down tangents, responding to follow-up questions — cosine similarity to the goal string decreases.

The threshold curve is calibrated so that similarity ≥ 0.5 returns a near-zero score, with penalty scaling steeply below 0.3. Without a goal param, this factor returns 0 and its 8% weight is redistributed proportionally.
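The core of this factor is a vector-space similarity between the goal string and recent output. A minimal term-frequency cosine similarity is sketched below; it stands in for the TF-IDF version (the IDF weighting is omitted here for brevity):

```typescript
// Bag-of-words cosine similarity. A simplification of TF-IDF:
// term frequency only, no inverse-document-frequency weighting.
function termCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const w of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    counts.set(w, (counts.get(w) ?? 0) + 1);
  }
  return counts;
}

function cosineSimilarity(a: string, b: string): number {
  const ca = termCounts(a), cb = termCounts(b);
  let dot = 0, na = 0, nb = 0;
  for (const [w, c] of ca) { na += c * c; dot += c * (cb.get(w) ?? 0); }
  for (const c of cb.values()) nb += c * c;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}
```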

Uncertainty Signals (2%) and Confidence Drift (1%)

These are intentionally low-weight. Uncertainty signals count explicit self-corrections ("I was wrong about", "let me correct that", "actually, I made an error") — not general hedging, which is too noisy. Confidence drift measures the trend in hedging language frequency (perhaps, might, could, I think) between the first third and last third of the session.

Both factors were originally weighted higher. In practice, hedging language is too context-dependent — a research session is supposed to have more hedging — and self-corrections are too rare to contribute meaningful signal in most sessions. Keeping them at 3% combined means they can nudge a borderline score without ever dominating it.
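The confidence-drift trend can be sketched as hedge-word frequency in the first third versus the last third of assistant responses. The word list and the bare rate difference are illustrative simplifications:

```typescript
// Hedge-word rate in early vs late responses. The HEDGES list is
// illustrative, not the tool's actual lexicon.
const HEDGES = /\b(perhaps|might|could|i think)\b/gi;

function hedgeRate(texts: string[]): number {
  const joined = texts.join(' ');
  const words = joined.split(/\s+/).filter(Boolean).length || 1;
  return (joined.match(HEDGES) ?? []).length / words;
}

function confidenceDrift(responses: string[]): number {
  const third = Math.floor(responses.length / 3);
  if (third === 0) return 0;
  const early = hedgeRate(responses.slice(0, third));
  const late = hedgeRate(responses.slice(-third));
  return Math.max(0, late - early); // positive = hedging increasing
}
```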


Score Thresholds and Output Design

Scores map to four states:

| Range | State |
|---|---|
| 0–29 | Fresh |
| 30–60 | Warming |
| 61–80 | Drifting |
| 81–100 | Polluted |
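The mapping is a simple threshold lookup, sketched here with the boundary values from the table:

```typescript
// Map a 0–100 composite score to one of the four states.
type DriftState = 'fresh' | 'warming' | 'drifting' | 'polluted';

function driftState(score: number): DriftState {
  if (score <= 29) return 'fresh';
  if (score <= 60) return 'warming';
  if (score <= 80) return 'drifting';
  return 'polluted';
}
```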

The get_drift() output leads with a plain-English recommendation rather than just the score. The score is a number — what most developers need is "should I start fresh right now or not":

```
⚠️  Start fresh now — context is full and responses are repeating heavily.

  Context depth         █████████░   88
  Repetition            ████████░░   72
  Length collapse       █████░░░░░   48

Score: 84/100 · 67 messages

→ Call get_handoff() to write handoff.md before starting fresh.
```

Factor bars only appear when they're contributing meaningfully to the score. A healthy session shows only the top two; a degraded session shows all contributing factors. This avoids surfacing noise in the common case.

Handoff trigger: The suggestion to call get_handoff() fires independently of the composite score — it triggers when context depth or repetition individually cross their thresholds. A session can have a composite score of 65 (drifting) and still get a handoff suggestion if repetition is at 78.
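That independent trigger amounts to a per-factor threshold check rather than a check on the composite. A sketch, where both threshold values are assumptions for illustration:

```typescript
// Handoff fires on individual factor thresholds, not the composite.
// The two threshold constants are illustrative assumptions.
function shouldSuggestHandoff(contextDepth: number, repetition: number): boolean {
  const CONTEXT_THRESHOLD = 85;    // illustrative
  const REPETITION_THRESHOLD = 75; // illustrative
  return contextDepth >= CONTEXT_THRESHOLD || repetition >= REPETITION_THRESHOLD;
}
```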


The Handoff Workflow

get_handoff() returns a structured prompt instructing the current AI session to write a handoff.md file. The AI generates the file using its full session context — which, crucially, still exists even in a degraded session. The model may be repeating itself, but it still has access to everything it did.

A typical handoff.md:

```markdown
## What we accomplished
Implemented JWT authentication with refresh token rotation. Added middleware,
updated the user model, wrote integration tests. All tests passing.

## Current state
Auth flow is working end-to-end. Rate limiting is stubbed but not implemented.
The `/refresh` endpoint has a known edge case with concurrent requests (see auth.ts:142).

## Files modified
- src/middleware/auth.ts — JWT verify + refresh logic
- src/models/user.ts — added refreshToken field + index
- src/routes/auth.ts — /login, /logout, /refresh endpoints
- tests/integration/auth.test.ts — 14 new tests

## Open questions / next steps
- Implement rate limiting on /login (5 attempts per 15 min)
- Fix concurrent refresh edge case
- Add token blacklist for logout

## Context for next session
Using jsonwebtoken@9, refresh tokens stored in DB. Access token TTL: 15min,
Refresh TTL: 7 days.
```

Load this at the start of the next session. You don't lose context — you lose the degraded session state while keeping the useful information.


Trend Tracking

get_trend() returns the full score history for the current session with a sparkline, peak, average, and trajectory annotation:

```
Session drift trend (18 snapshots)

  12 ▁▁▂▃▄▄▅▆▇▇█  84

  Peak: 84  ·  Avg: 47  ·  Trajectory: ↑ climbing

Snapshots: 12 → 18 → 24 → 31 → 38 → 42 → 51 → 58 → 63 → 70 → 76 → 84
```

Snapshots are persisted to ~/.driftcli/history/ as JSONL and survive session restarts. The sparkline starts appearing after 3 get_drift() calls. Trend data is per-session, keyed by session UUID.
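Persistence is append-only JSONL, one line per snapshot. A sketch of what that could look like; the record shape and per-session file naming are assumptions based on the trend output, not a documented schema:

```typescript
// Sketch: append one JSONL line per drift snapshot.
// The Snapshot shape is an assumed, illustrative schema.
import { appendFileSync } from 'fs';
import { join } from 'path';
import { homedir } from 'os';

interface Snapshot { sessionId: string; score: number; at: number }

function snapshotLine(snap: Snapshot): string {
  return JSON.stringify(snap) + '\n';
}

function persistSnapshot(snap: Snapshot, dir = join(homedir(), '.driftcli', 'history')): void {
  // One file per session UUID, so trend data stays per-session.
  appendFileSync(join(dir, `${snap.sessionId}.jsonl`), snapshotLine(snap));
}
```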


Configuration

driftguard-mcp merges global (~/.driftclirc) and per-project (.driftcli) config. Presets adjust factor weights without requiring manual override:

| Preset | Adjustment |
|---|---|
| coding | Default weights — emphasises context depth and repetition |
| research | Weights goal distance more heavily |
| brainstorm | Relaxes repetition and confidence drift penalties |
| strict | Equal weight across all six factors |

Custom weight overrides are supported on top of any preset:

```json
{
  "preset": "coding",
  "warnThreshold": 55,
  "weights": {
    "repetition": 0.45
  }
}
```
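The global/per-project merge can be sketched as a shallow merge with the `weights` object merged one level deeper, so a project override of a single weight doesn't discard the rest. This is an assumed merge strategy, not confirmed implementation detail:

```typescript
// Sketch: project config wins over global; `weights` merges per-key
// so partial overrides keep the other weights intact.
interface DriftConfig {
  preset?: string;
  warnThreshold?: number;
  weights?: Record<string, number>;
}

function mergeConfig(global: DriftConfig, project: DriftConfig): DriftConfig {
  return {
    ...global,
    ...project,
    weights: { ...global.weights, ...project.weights },
  };
}
```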

Install

```shell
npm install -g driftguard-mcp
driftguard-mcp setup
```

setup auto-configures Claude Code, Gemini CLI, Codex CLI, and Cursor. Restart your CLI after running it — the tools are live immediately.


What's Next

Current areas of active work: better token count estimation for Codex (the character-based fallback works but real counts would improve saturation accuracy), and a VSCode extension surface for teams that don't use CLI-first workflows.

The core scoring algorithm is intentionally conservative — better to miss a drifting session than to cry wolf on healthy ones. If you're running sessions and find the thresholds too tight or too loose for your workflow, the config system is designed for exactly that.
