DEV Community

Tiamat

LLM Drift Detection: Know When Your Model Stops Behaving

Your LLM passes every test in staging. You deploy it. Three weeks later, users are complaining about weird responses — more refusals, shorter answers, different tone. Something changed. You don't know when.

This is LLM drift. It's what happens when models get updated, fine-tuned, or swapped between providers without behavioral monitoring.

What Is LLM Behavioral Drift?

LLM drift is a shift in your model's behavioral distribution over time. Unlike software bugs, drift is statistical — any single response might look fine, but the population of responses has moved away from baseline.

Common causes:

  • Provider silently updates the model
  • Fine-tuning on new data
  • RLHF re-training that shifts safety thresholds
  • Temperature or sampling parameter changes

Four Signals That Actually Detect Drift

1. Response Length Distribution (30% weight)

Length is a strong proxy for behavioral change. Track mean and standard deviation, then use z-score distance from baseline:

length_drift = abs(current_avg - baseline_avg) / baseline_std
drift_score = min(length_drift / 3.0, 1.0)  # 3 sigma = 1.0
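The formula above, in runnable form — the helper name is illustrative, and a zero-variance baseline is guarded so the division can't blow up:

```python
import statistics

def length_drift_score(baseline_lengths, current_lengths):
    """Z-score distance of the current mean length from baseline, capped at 3 sigma."""
    baseline_avg = statistics.mean(baseline_lengths)
    baseline_std = statistics.stdev(baseline_lengths) or 1.0  # guard against zero variance
    current_avg = statistics.mean(current_lengths)
    length_drift = abs(current_avg - baseline_avg) / baseline_std
    return min(length_drift / 3.0, 1.0)  # 3 sigma maps to a full-strength signal
```

A 3-sigma shift in mean length saturates the signal at 1.0; smaller shifts scale linearly.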

2. Refusal Rate (40% weight — most sensitive)

A baseline refusal rate of 2% jumping to 15% is a critical drift event. RLHF re-training shows up here first:

import re

REFUSAL_PATTERNS = [
    r"i('m| am) (not able|unable) to",
    r"i (cannot|can't)",
    r"as an ai (language model|assistant)",
    r"i (won't|will not) (help|assist|provide|generate)",
    r"this (request |)(violates|goes against)",
]

is_refusal = any(re.search(p, text, re.IGNORECASE) for p in REFUSAL_PATTERNS)
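One way to turn window-level refusal rates into a normalized drift signal — the 10-point saturation scale here is an assumed tuning knob, not part of the article's detector:

```python
def refusal_drift_score(baseline_rate, current_rate, scale=0.10):
    """Map the absolute change in refusal rate to [0, 1].

    With scale=0.10 (an assumption), a 10-point jump saturates the signal,
    so the 2% -> 15% example above pegs at 1.0.
    """
    return min(abs(current_rate - baseline_rate) / scale, 1.0)
```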

3. Uncertainty Language Rate (20% weight)

Count responses with multiple uncertainty markers:

import re

UNCERTAINTY_RE = re.compile(
    r"\b(not sure|uncertain|unclear|unsure|may|might|possibly|perhaps|i think|i believe)\b",
    re.IGNORECASE
)
has_uncertainty = len(UNCERTAINTY_RE.findall(text)) >= 2
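At the window level, the per-response flag rolls up into a rate you can diff against baseline (the helper name is illustrative):

```python
import re

UNCERTAINTY_RE = re.compile(
    r"\b(not sure|uncertain|unclear|unsure|may|might|possibly|perhaps|i think|i believe)\b",
    re.IGNORECASE,
)

def uncertainty_rate(texts):
    """Fraction of responses containing two or more uncertainty markers."""
    if not texts:
        return 0.0
    flagged = sum(1 for t in texts if len(UNCERTAINTY_RE.findall(t)) >= 2)
    return flagged / len(texts)
```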

4. Vocabulary Jaccard Distance (10% weight)

Compare the top-20 content words between the baseline and the current window:

from collections import Counter
import re

def top_tokens(texts, n=20):
    words = re.findall(r'\b[a-z]{4,}\b', ' '.join(texts).lower())
    stopwords = {'that', 'this', 'with', 'have', 'from', 'they', 'will', 'been', 'would'}
    return Counter(w for w in words if w not in stopwords).most_common(n)

base_tokens = set(dict(top_tokens(baseline_responses)).keys())
curr_tokens = set(dict(top_tokens(recent_responses)).keys())
jaccard_distance = 1.0 - len(base_tokens & curr_tokens) / len(base_tokens | curr_tokens)

Composite Drift Score

def compute_drift_score(length_drift, refusal_drift, uncertainty_drift, vocab_drift):
    return (
        length_drift * 0.30 +
        refusal_drift * 0.40 +
        uncertainty_drift * 0.20 +
        vocab_drift * 0.10
    )

def severity(score):
    if score < 0.10: return 'none'
    if score < 0.25: return 'low'
    if score < 0.50: return 'moderate'  # alert your team
    if score < 0.75: return 'high'      # investigate immediately
    return 'critical'                   # rollback candidate
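For instance, a moderate refusal spike with mild drift elsewhere — the sub-scores below are illustrative, but the weights match the function above:

```python
WEIGHTS = {"length": 0.30, "refusal": 0.40, "uncertainty": 0.20, "vocab": 0.10}

def composite(scores):
    """Weighted sum of per-signal drift scores, each already normalized to [0, 1]."""
    return sum(scores[k] * w for k, w in WEIGHTS.items())

score = composite({"length": 0.20, "refusal": 0.60, "uncertainty": 0.10, "vocab": 0.05})
# 0.06 + 0.24 + 0.02 + 0.005 = 0.325 -> 'moderate' under the thresholds above
```

Because refusals carry 40% of the weight, a refusal spike alone can push the composite into "moderate" territory.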

Rolling Window Comparison

Don't compare individual responses to baseline. Compare rolling windows:

from datetime import datetime, timedelta

def get_window_observations(conn, model_id, hours=24):
    since = (datetime.utcnow() - timedelta(hours=hours)).isoformat()
    return conn.execute(
        """SELECT response_text, length, is_refusal, has_uncertainty
           FROM observations
           WHERE model_id=? AND observed_at > ?""",
        (model_id, since)
    ).fetchall()
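A sketch of collapsing those rows into the per-signal stats the detector compares — it assumes the column order from the query above:

```python
def summarize_window(rows):
    """Reduce raw observations to per-signal window stats.

    Assumes rows shaped like the query above:
    (response_text, length, is_refusal, has_uncertainty).
    """
    n = len(rows)
    if n == 0:
        return None
    lengths = [r[1] for r in rows]
    return {
        "count": n,
        "avg_length": sum(lengths) / n,
        "refusal_rate": sum(1 for r in rows if r[2]) / n,
        "uncertainty_rate": sum(1 for r in rows if r[3]) / n,
    }
```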

Minimum window size: 10 observations. Under 10, you don't have statistical power to distinguish drift from noise.
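A simple guard that enforces this — the threshold comes from the text; the wrapper itself is illustrative:

```python
MIN_OBSERVATIONS = 10

def drift_or_none(observations, score_fn):
    """Only score a window once it has enough observations to beat noise."""
    if len(observations) < MIN_OBSERVATIONS:
        return None  # not enough data: treat as 'no signal', not 'no drift'
    return score_fn(observations)
```

Returning `None` rather than 0.0 matters: an empty or tiny window is absence of evidence, not evidence of stability.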

Establishing Your Baseline

Capture 50-200 responses from your model in a known-good state:

import requests

baseline_responses = []
for test_prompt in your_test_suite:
    response = llm.complete(test_prompt)
    baseline_responses.append(response.text)

requests.post('https://the-service.live/api/drift/baseline', json={
    'model_id': 'gpt-4o-production',
    'responses': baseline_responses
})

Then in production, pipe every response through the observer as a fire-and-forget:

import threading

import requests

def wrap_llm_call(prompt):
    response = llm.complete(prompt)
    # Fire-and-forget: report the response off the request path
    threading.Thread(target=requests.post, kwargs={
        'url': 'https://the-service.live/api/drift/observe',
        'json': {'model_id': 'gpt-4o-production', 'responses': [response.text]}
    }, daemon=True).start()
    return response

Alerting on Drift

import time

import requests
import schedule

def check_drift():
    r = requests.get(
        'https://the-service.live/api/drift/score',
        params={'model_id': 'gpt-4o-production', 'window_hours': 6}
    )
    data = r.json()
    if data['drift_score'] > 0.50:
        send_alert(
            f"LLM DRIFT ALERT: {data['breakdown']['severity']}\n"
            f"Score: {data['drift_score']}\n"
            f"Based on {data['observation_count']} observations"
        )

schedule.every(1).hours.do(check_drift)

while True:  # schedule only fires jobs inside a polling loop
    schedule.run_pending()
    time.sleep(60)

What This Catches in Practice

Provider model updates: GPT-4o gets updated silently. Refusal rate jumps from 1% to 8%. Drift score: 0.35 (moderate). You find OpenAI updated safety filters two days ago.

Fine-tuning regression: You fine-tune on new data. Length distribution shifts significantly. You catch it in staging before deploying to production.

Rate limit degradation: Provider returns lower-quality responses under load. Length drops, uncertainty increases. Drift detector fires before user-facing impact.

Python SDK

One-liner integration:

from drift_sdk import DriftMonitor

monitor = DriftMonitor('gpt-4o-prod')

def call_llm(prompt):
    response = llm.complete(prompt)
    monitor.wrap(response.text)  # fire-and-forget, async
    return response

# Alert when behavior shifts:
monitor.check(window_hours=6)  # raises DriftAlert if score > 0.5

Free API

1000 observations/day, no signup.

# Establish baseline
curl -X POST https://the-service.live/api/drift/baseline \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "my-model", "responses": ["response1", "response2"]}'

# Add production observations
curl -X POST https://the-service.live/api/drift/observe \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "my-model", "responses": ["new response"]}'

# Get current drift score
curl 'https://the-service.live/api/drift/score?model_id=my-model'

For high-volume or on-premise deployments: tiamat@the-service.live


Drift detection catches what unit tests miss: not whether each response is correct, but whether the population of responses has moved.
