Most LLM monitoring advice says "run evals in CI." What it doesn't say: how to structure those evals so you catch the class of failures that actually breaks production — format regressions, instruction compliance drift, punctuation changes in single-token outputs.
Here's a practical CI/CD setup using DriftWatch's free tier that catches behavioral drift before it reaches production.
## The Problem with Standard LLM CI
A typical LLM test in CI looks like this:
```python
def test_sentiment_classifier():
    response = call_llm("Classify: 'great product'. Return one word.")
    assert any(label in response.strip().lower()
               for label in ["positive", "negative", "neutral"])
```
This passes even when the model starts returning "Positive." instead of "positive" — a trailing period that breaks any downstream code doing exact-match comparison.
Unit tests verify your code. They don't verify whether the model is still behaving the same way it did when you wrote the code.
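To make the gap concrete, here's a minimal repro using hypothetical model outputs: the lenient CI-style assertion accepts both the old and the drifted response, while the exact-match comparison your downstream code relies on silently breaks.

```python
baseline = "positive"
drifted = "Positive."  # same label, new capitalization plus a trailing period

labels = ["positive", "negative", "neutral"]

# The CI-style substring assertion accepts both outputs...
assert any(label in baseline.lower() for label in labels)
assert any(label in drifted.lower() for label in labels)

# ...but downstream exact-match code only handles the baseline.
assert baseline == "positive"
assert drifted != "positive"  # the silent production break
```

The test suite stays green; the router, cache key, or switch statement keyed on the exact string does not.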
## What Drift Monitoring Adds
Drift monitoring compares live model behavior against a saved baseline:
- Establish baseline — run your production prompts, save outputs
- Run on schedule (or in CI) — same prompts, same parameters
- Score the delta — format compliance + semantic similarity + output length
- Alert on threshold — 0.3 = investigate, 0.5 = page
The key difference: you're comparing against previous model behavior, not against a hardcoded expected value.
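As a rough illustration of the idea (not DriftWatch's actual algorithm), a drift score can blend the three signals above: format compliance, text similarity, and length change. This sketch uses `difflib` as a cheap stand-in for semantic similarity, and the weights are made up:

```python
from difflib import SequenceMatcher

def drift_score(baseline: str, current: str, valid_labels=None) -> float:
    """Blend three deltas into one 0..1 drift score (illustrative weights)."""
    # 1. Format compliance: does the output still pass the same validator?
    if valid_labels is not None:
        format_delta = 0.0 if current.strip().lower() in valid_labels else 1.0
    else:
        format_delta = 0.0 if current == current.strip() else 1.0

    # 2. Textual similarity (a real monitor would use embeddings here)
    similarity = SequenceMatcher(None, baseline, current).ratio()

    # 3. Relative output-length change, capped at 1.0
    length_delta = min(abs(len(current) - len(baseline)) / max(len(baseline), 1), 1.0)

    return round(0.4 * format_delta + 0.4 * (1 - similarity) + 0.2 * length_delta, 3)
```

With this weighting, "Neutral." drifting against a "neutral" baseline lands above the 0.3 alert threshold because the validator breaks, even though the text barely changed.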
## Setting Up DriftWatch in 5 Minutes (Free Tier)
### Step 1: Register and get your API key

```bash
# Register
curl -X POST https://your-driftwatch-url/auth/register \
  -H "Content-Type: application/json" \
  -d '{"email": "your@email.com", "password": "yourpassword"}'

# Save the api_key from the response
API_KEY="dw_your_api_key_here"
```
### Step 2: Add your production prompt

```bash
curl -X POST https://your-driftwatch-url/prompts \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "sentiment-classifier",
    "prompt_text": "Classify the sentiment as exactly one word: positive, negative, or neutral. Review: \"The product works fine but packaging was damaged.\"",
    "model": "gpt-4o",
    "validators": ["single_word", "word_in:positive,negative,neutral"]
  }'
```
Free tier: 3 prompts, no card required.
### Step 3: Run a drift check in CI

```yaml
# .github/workflows/llm-drift-check.yml
name: LLM Drift Check

on:
  schedule:
    - cron: '0 * * * *'  # hourly
  push:
    branches: [main]

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - name: Run drift check
        run: |
          RESULT=$(curl -s -X POST https://your-driftwatch-url/monitor/run \
            -H "Authorization: Bearer ${{ secrets.DRIFTWATCH_API_KEY }}")
          MAX_DRIFT=$(echo "$RESULT" | jq '.summary.max_drift')
          echo "Max drift: $MAX_DRIFT"

          # Fail CI if drift exceeds threshold
          if (( $(echo "$MAX_DRIFT > 0.5" | bc -l) )); then
            echo "BREAKING CHANGE: drift score $MAX_DRIFT exceeds threshold"
            exit 1
          fi
          if (( $(echo "$MAX_DRIFT > 0.3" | bc -l) )); then
            echo "WARNING: drift score $MAX_DRIFT above alert threshold"
          fi
```
## What This Catches That Unit Tests Miss

| Failure Type | Unit Test | Drift Monitor |
|---|---|---|
| "Neutral." returned instead of "Neutral" | ❌ passes | ✅ catches |
| JSON whitespace format changed | ❌ passes (`json.loads` works) | ✅ catches |
| Response length shifted significantly | ❌ passes | ✅ catches |
| Semantic meaning changed subtly | ❌ passes | ✅ catches |
| Validator compliance: returns "neutral" not "Neutral" | ✅ catches | ✅ catches |
The first four are what your unit tests won't catch. They're format and behavioral consistency checks — not correctness checks.
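The JSON-whitespace row is worth spelling out, since it's the sneakiest case: the parsed data is identical, so any test that goes through `json.loads` passes, while a raw-string comparison against the baseline registers a change.

```python
import json

baseline = '{"name": "Acme", "qty": 2}'
current = '{"name":"Acme","qty":2}'  # same data, whitespace stripped

assert json.loads(baseline) == json.loads(current)  # unit test: passes
assert baseline != current                          # drift monitor: catches
```

Whether that raw-string change matters depends on your pipeline: it's harmless if you always parse, and a real break if you cache, hash, or diff raw responses.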
## Real Example: Why This Matters

In our own test run (same model, two consecutive calls, no update), we got:

- **inst-01** (single-word classifier): drift score 0.575. "Neutral." became "Neutral". Both pass the `word_in:positive,negative,neutral` validator, but `response.strip() == "Neutral."` is now false.
- **json-01** (JSON extraction): drift score 0.316. Whitespace was stripped and a trailing period removed from a value. `json.loads()` works; `baseline == current` does not.
These are natural variance samples — not even from a model update. When a model actually gets updated, drift scores are higher and more consistent.
## When to Alert vs When to Fail
| Score | Action |
|---|---|
| < 0.1 | Normal variance — no action |
| 0.1–0.29 | Log it — monitor trend |
| 0.3–0.49 | Alert — investigate before next deploy |
| 0.5+ | Fail CI — breaking change |
For most production pipelines, I'd set the CI failure threshold at 0.5 and the alert threshold at 0.3. You get a Slack/email notification at 0.3 to investigate, and CI blocks deploy at 0.5.
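If you want the same gate outside GitHub Actions (say, in a pre-deploy script), the threshold logic is only a few lines. This sketch mirrors the table above; treating the boundaries as half-open ranges is my assumption:

```python
def drift_action(max_drift: float) -> str:
    """Map a drift score to an action, per the thresholds above."""
    if max_drift >= 0.5:
        return "fail-ci"  # breaking change: block the deploy
    if max_drift >= 0.3:
        return "alert"    # notify, investigate before next deploy
    if max_drift >= 0.1:
        return "log"      # normal-ish, but watch the trend
    return "ok"           # normal variance, no action
```

For the earlier examples, `drift_action(0.575)` blocks the deploy and `drift_action(0.316)` raises an alert.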
## Getting Started
The free tier gives you 3 prompts and hourly monitoring — enough to protect your most critical LLM calls.
→ Set up your first drift check — 5 minutes, no card required.
The GitHub repo (including the drift detection algorithm) is at GenesisClawbot/llm-drift if you want to self-host.
What's the highest-risk LLM call in your production stack? For most teams it's either a JSON extraction or a classification prompt — those are the ones where small format changes have the biggest downstream impact.