Your LLM passes every test in staging. You deploy it. Three weeks later, users are complaining about weird responses — more refusals, shorter answers, different tone. Something changed. You don't know when.
This is LLM drift. It's what happens when models get updated, fine-tuned, or swapped between providers without behavioral monitoring.
## What Is LLM Behavioral Drift?
LLM drift is a shift in your model's behavioral distribution over time. Unlike software bugs, drift is statistical — any single response might look fine, but the population of responses has moved away from baseline.
Common causes:
- Provider silently updates the model
- Fine-tuning on new data
- RLHF re-training that shifts safety thresholds
- Temperature or sampling parameter changes
## Four Signals That Actually Detect Drift

### 1. Response Length Distribution (30% weight)

Length is a strong proxy for behavioral change. Track the mean and standard deviation of response lengths, then use z-score distance from baseline:

```python
length_drift = abs(current_avg - baseline_avg) / baseline_std
drift_score = min(length_drift / 3.0, 1.0)  # 3 sigma caps the score at 1.0
```
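Assembled into a runnable function (a sketch; the sample lengths are made up, and `statistics.stdev` supplies the baseline standard deviation):

```python
import statistics

def length_drift_score(baseline_lengths, current_lengths):
    # Z-score distance of the current mean from baseline, capped at 1.0 (3 sigma).
    baseline_avg = statistics.mean(baseline_lengths)
    baseline_std = statistics.stdev(baseline_lengths)
    current_avg = statistics.mean(current_lengths)
    length_drift = abs(current_avg - baseline_avg) / baseline_std
    return min(length_drift / 3.0, 1.0)

# Baseline responses average ~100 chars; the current window is much shorter.
baseline = [95, 100, 105, 110, 90, 100, 98, 102]
current = [60, 55, 70, 65, 58, 62]
score = length_drift_score(baseline, current)  # well past 3 sigma -> 1.0
```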
### 2. Refusal Rate (40% weight — most sensitive)

A baseline refusal rate of 2% jumping to 15% is a critical drift event. RLHF re-training shows up here first:

```python
import re

REFUSAL_PATTERNS = [
    r"i('m| am) (not able|unable) to",
    r"i (cannot|can't)",
    r"as an ai (language model|assistant)",
    r"i (won't|will not) (help|assist|provide|generate)",
    r"this (request |)(violates|goes against)",
]

is_refusal = any(re.search(p, text, re.IGNORECASE) for p in REFUSAL_PATTERNS)
```
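Aggregated over a window, the same patterns give a refusal rate to compare against baseline (a minimal sketch with toy responses):

```python
import re

REFUSAL_PATTERNS = [
    r"i('m| am) (not able|unable) to",
    r"i (cannot|can't)",
    r"as an ai (language model|assistant)",
    r"i (won't|will not) (help|assist|provide|generate)",
    r"this (request |)(violates|goes against)",
]

def refusal_rate(responses):
    # Fraction of responses matching at least one refusal pattern.
    hits = sum(
        1 for text in responses
        if any(re.search(p, text, re.IGNORECASE) for p in REFUSAL_PATTERNS)
    )
    return hits / len(responses)

window = [
    "Here is the summary you asked for.",
    "I cannot help with that request.",
    "As an AI assistant, I won't provide that.",
    "Sure, the answer is 42.",
]
rate = refusal_rate(window)  # 2 of 4 responses are refusals -> 0.5
```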
### 3. Uncertainty Language Rate (20% weight)

Count responses that contain multiple uncertainty markers:

```python
import re

UNCERTAINTY_RE = re.compile(
    r"\b(not sure|uncertain|unclear|unsure|may|might|possibly|perhaps|i think|i believe)\b",
    re.IGNORECASE,
)

has_uncertainty = len(UNCERTAINTY_RE.findall(text)) >= 2
```
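The per-response flag rolls up into a window rate the same way (sketch with toy data):

```python
import re

UNCERTAINTY_RE = re.compile(
    r"\b(not sure|uncertain|unclear|unsure|may|might|possibly|perhaps|i think|i believe)\b",
    re.IGNORECASE,
)

def uncertainty_rate(responses):
    # Fraction of responses with two or more uncertainty markers.
    hedged = sum(1 for text in responses if len(UNCERTAINTY_RE.findall(text)) >= 2)
    return hedged / len(responses)

window = [
    "I think this might be a caching issue.",      # "i think" + "might" -> flagged
    "The answer is 4.",                            # no markers
    "Perhaps, but I'm not sure it applies here.",  # "perhaps" + "not sure" -> flagged
    "It may rain later.",                          # one marker, below threshold
]
rate = uncertainty_rate(window)  # 2 of 4 -> 0.5
```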
### 4. Vocabulary Jaccard Distance (10% weight)

Compare the top-20 content words between the baseline and the current window:

```python
from collections import Counter
import re

def top_tokens(texts, n=20):
    words = re.findall(r'\b[a-z]{4,}\b', ' '.join(texts).lower())
    stopwords = {'that', 'this', 'with', 'have', 'from', 'they', 'will', 'been', 'would'}
    return Counter(w for w in words if w not in stopwords).most_common(n)

base_tokens = set(dict(top_tokens(baseline_responses)).keys())
curr_tokens = set(dict(top_tokens(recent_responses)).keys())
jaccard_distance = 1.0 - len(base_tokens & curr_tokens) / len(base_tokens | curr_tokens)
```
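Wrapped into a self-contained helper (restating `top_tokens` so the sketch runs on its own), identical corpora score 0.0 and fully disjoint vocabularies score 1.0:

```python
from collections import Counter
import re

def top_tokens(texts, n=20):
    words = re.findall(r'\b[a-z]{4,}\b', ' '.join(texts).lower())
    stopwords = {'that', 'this', 'with', 'have', 'from', 'they', 'will', 'been', 'would'}
    return Counter(w for w in words if w not in stopwords).most_common(n)

def vocab_drift(baseline_texts, current_texts, n=20):
    # Jaccard distance between the two top-n content-word sets.
    base = {w for w, _ in top_tokens(baseline_texts, n)}
    curr = {w for w, _ in top_tokens(current_texts, n)}
    return 1.0 - len(base & curr) / len(base | curr)

same = vocab_drift(["alpha bravo charlie"], ["alpha bravo charlie"])     # 0.0
disjoint = vocab_drift(["alpha bravo charlie"], ["delta echo foxtrot"])  # 1.0
```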
## Composite Drift Score

```python
def compute_drift_score(length_drift, refusal_drift, uncertainty_drift, vocab_drift):
    return (
        length_drift * 0.30 +
        refusal_drift * 0.40 +
        uncertainty_drift * 0.20 +
        vocab_drift * 0.10
    )

def severity(score):
    if score < 0.10: return 'none'
    if score < 0.25: return 'low'
    if score < 0.50: return 'moderate'  # alert your team
    if score < 0.75: return 'high'      # investigate immediately
    return 'critical'                   # rollback candidate
```
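Plugging in example signal values shows how the weighting plays out: a strong refusal spike dominates even when the other signals stay quiet.

```python
def compute_drift_score(length_drift, refusal_drift, uncertainty_drift, vocab_drift):
    # Weighted sum using the per-signal weights from above.
    return (length_drift * 0.30 + refusal_drift * 0.40 +
            uncertainty_drift * 0.20 + vocab_drift * 0.10)

def severity(score):
    if score < 0.10: return 'none'
    if score < 0.25: return 'low'
    if score < 0.50: return 'moderate'
    if score < 0.75: return 'high'
    return 'critical'

# Refusal signal spiking (0.8) while the rest stay low:
score = compute_drift_score(0.2, 0.8, 0.1, 0.05)
# 0.06 + 0.32 + 0.02 + 0.005 = 0.405 -> 'moderate'
```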
## Rolling Window Comparison

Don't compare individual responses to baseline. Compare rolling windows:

```python
from datetime import datetime, timedelta, timezone

def get_window_observations(conn, model_id, hours=24):
    # Timezone-aware UTC; datetime.utcnow() is deprecated in Python 3.12+.
    since = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
    return conn.execute(
        """SELECT response_text, length, is_refusal, has_uncertainty
           FROM observations
           WHERE model_id=? AND observed_at > ?""",
        (model_id, since),
    ).fetchall()
```

Minimum window size: 10 observations. With fewer than 10, you don't have the statistical power to distinguish drift from noise.
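To make the query concrete, here's a self-contained run against an in-memory SQLite table; the `observations` schema here is an assumption inferred from the SELECT above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE observations (
    model_id TEXT, response_text TEXT, length INTEGER,
    is_refusal INTEGER, has_uncertainty INTEGER, observed_at TEXT)""")

now = datetime.now(timezone.utc)
rows = [
    ("gpt-4o-production", "ok", 2, 0, 0, (now - timedelta(hours=h)).isoformat())
    for h in (1, 5, 30)  # two inside the 24h window, one outside
]
conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?, ?, ?)", rows)

since = (now - timedelta(hours=24)).isoformat()
window = conn.execute(
    "SELECT response_text, length, is_refusal, has_uncertainty "
    "FROM observations WHERE model_id=? AND observed_at > ?",
    ("gpt-4o-production", since),
).fetchall()
# ISO-8601 strings with a fixed UTC offset compare correctly as text,
# so only the two observations from the last 24 hours come back.
```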
## Establishing Your Baseline

Capture 50-200 responses from your model in a known-good state:

```python
import requests

baseline_responses = []
for test_prompt in your_test_suite:
    response = llm.complete(test_prompt)
    baseline_responses.append(response.text)

requests.post('https://the-service.live/api/drift/baseline', json={
    'model_id': 'gpt-4o-production',
    'responses': baseline_responses
})
```
Then, in production, pipe every response through the observer as a fire-and-forget call:

```python
import threading

import requests

def wrap_llm_call(prompt):
    response = llm.complete(prompt)
    # daemon=True: monitoring threads never block interpreter shutdown.
    threading.Thread(target=requests.post, kwargs={
        'url': 'https://the-service.live/api/drift/observe',
        'json': {'model_id': 'gpt-4o-production', 'responses': [response.text]}
    }, daemon=True).start()
    return response
```
## Alerting on Drift

```python
import time

import requests
import schedule

def check_drift():
    r = requests.get(
        'https://the-service.live/api/drift/score',
        params={'model_id': 'gpt-4o-production', 'window_hours': 6}
    )
    data = r.json()
    if data['drift_score'] > 0.50:
        send_alert(
            f"LLM DRIFT ALERT: {data['breakdown']['severity']}\n"
            f"Score: {data['drift_score']}\n"
            f"Based on {data['observation_count']} observations"
        )

schedule.every(1).hours.do(check_drift)

while True:  # schedule only fires jobs when run_pending() is called
    schedule.run_pending()
    time.sleep(60)
```
## What This Catches in Practice

**Provider model updates:** GPT-4o gets updated silently. Refusal rate jumps from 1% to 8%. Drift score: 0.35 (moderate). You discover OpenAI updated its safety filters two days ago.

**Fine-tuning regression:** You fine-tune on new data. The length distribution shifts significantly. You catch it in staging before deploying to production.

**Rate limit degradation:** The provider returns lower-quality responses under load. Length drops, uncertainty increases. The drift detector fires before there's user-facing impact.
## Python SDK

Integration is a one-liner per call:

```python
from drift_sdk import DriftMonitor

monitor = DriftMonitor('gpt-4o-prod')

def call_llm(prompt):
    response = llm.complete(prompt)
    monitor.wrap(response.text)  # fire-and-forget, async
    return response

# Alert when behavior shifts:
monitor.check(window_hours=6)  # raises DriftAlert if score > 0.5
```
## Free API

1,000 observations/day, no signup.

```shell
# Establish baseline
curl -X POST https://the-service.live/api/drift/baseline \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "my-model", "responses": ["response1", "response2"]}'

# Add production observations
curl -X POST https://the-service.live/api/drift/observe \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "my-model", "responses": ["new response"]}'

# Get current drift score
curl 'https://the-service.live/api/drift/score?model_id=my-model'
```
For high-volume or on-premise deployments: tiamat@the-service.live
Drift detection catches what unit tests miss: not whether each response is correct, but whether the population of responses has moved.