LLM drift breaks production. Most teams don't notice until users report bugs. Here's how to catch it automatically.
## Why CI/CD for LLMs?
If you have unit tests for code, you should have drift tests for LLMs. The principle is the same: catch regressions before they reach production.
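At its core, a drift test just compares today's model output against a stored known-good output. A minimal sketch of what such a score could look like (a hypothetical helper using plain string similarity, not DriftWatch's actual implementation):

```python
from difflib import SequenceMatcher

def drift_score(baseline_output: str, current_output: str) -> float:
    """Return 0.0 for identical outputs, up to 1.0 for completely different ones."""
    return 1.0 - SequenceMatcher(None, baseline_output, current_output).ratio()
```

Real tools typically use embeddings or structural checks rather than raw string similarity, but the contract is the same: low score means stable, high score means drift.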
## The Setup (10 Minutes)
### Step 1: Create a drift test file
```python
# tests/test_llm_drift.py
import pytest
from driftwatch import Monitor


@pytest.fixture
def monitor():
    return Monitor(
        model="gpt-4o",
        baseline="tests/baseline_outputs.json",
    )


def test_json_format_drift(monitor):
    score = monitor.check_prompt("Give me a JSON object with user data")
    assert score < 0.1, f"Drift detected: {score}"


def test_classification_drift(monitor):
    score = monitor.check_prompt("Classify this email as spam or not spam")
    assert score < 0.1, f"Drift detected: {score}"
```
### Step 2: Add to GitHub Actions
```yaml
# .github/workflows/llm-drift.yml
name: LLM Drift Check
on: [push, pull_request]

jobs:
  drift-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install driftwatch
      - run: python -m pytest tests/test_llm_drift.py
```
### Step 3: Set up baseline (one time)
```bash
# Generate baseline on a known-good model version
python -c "from driftwatch import Monitor; m = Monitor('gpt-4o'); m.capture_baseline('tests/baseline_outputs.json')"

# Commit the baseline so CI has something to diff against
git add tests/baseline_outputs.json
git commit -m "LLM baseline for drift detection"
```
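The post doesn't show what the baseline file contains. One plausible shape (an assumption on my part, not DriftWatch's documented schema) is a simple prompt-to-output map, which is enough for the tests above to diff against:

```python
import json

# Hypothetical baseline layout: each test prompt maps to its known-good output
baseline = {
    "Give me a JSON object with user data": '{"name": "Ada", "id": 1}',
    "Classify this email as spam or not spam": "not spam",
}

# capture_baseline() would serialize something like this to tests/baseline_outputs.json
serialized = json.dumps(baseline, indent=2)
```

Whatever the real schema, committing the file to git is the important part: the baseline is versioned alongside the code, so a deliberate prompt change can update it in the same PR.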
## What Gets Checked
The workflow runs your drift tests on:
- Every push (the `on: [push]` trigger fires for any branch, not just main)
- Every pull request
- Optional: a nightly scheduled run (add a `schedule:` cron trigger to the workflow)
You get a ✅ or ❌ on every code change that affects LLM behavior.
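For the nightly run, the same check can live in a small standalone script rather than pytest, so the scheduled job fails visibly on drift. A sketch (the `Monitor.check_prompt` call is assumed from the test file above; `THRESHOLD` and `run_nightly` are hypothetical names):

```python
THRESHOLD = 0.1

def run_nightly(monitor, prompts) -> int:
    """Return a shell exit code: 0 if every prompt stays under threshold, else 1."""
    worst = max(monitor.check_prompt(p) for p in prompts)
    return 0 if worst < THRESHOLD else 1

# In CI: raise SystemExit(run_nightly(Monitor("gpt-4o", baseline=...), PROMPTS))
```

A nonzero exit code is all GitHub Actions needs to mark the scheduled run red.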
## The Alert
```python
# In your test or a GitHub Action step (sketch: `github` is an
# authenticated Checks API client, e.g. a PyGithub repository object)
if score > threshold:
    # A failing check run blocks the merge when the check is marked required
    github.create_check_run(
        name="LLM Drift Detection",
        status="completed",
        conclusion="failure",  # "drift_detected" is not a valid Checks API conclusion
        output={"title": f"Drift: {score}", "summary": "Review required"},
    )
```
## Real Results
After adding this to a production system:
- Caught GPT-4o format drift before a release → blocked the merge, saved an incident
- Caught Claude verbosity regression → fixed prompt before users noticed
- CI time increase: ~30 seconds → acceptable for the safety gain
The open source tool: DriftWatch on GitHub
Building production LLM systems means testing them like production software.