LLM drift breaks production. Most teams don't notice until users report bugs. Here's how to catch it automatically.
## Why CI/CD for LLMs?
If you have unit tests for code, you should have drift tests for LLMs. The principle is the same: catch regressions before they reach production.
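At its core, a drift test just compares today's model output against a stored known-good output. A minimal sketch of what such a score could look like (a hypothetical helper using plain string similarity, not DriftWatch's actual implementation):

```python
from difflib import SequenceMatcher

def drift_score(baseline_output: str, current_output: str) -> float:
    """Return 0.0 for identical outputs, up to 1.0 for completely different ones."""
    return 1.0 - SequenceMatcher(None, baseline_output, current_output).ratio()
```

Real tools typically use embeddings or structural checks rather than raw string similarity, but the contract is the same: low score means stable, high score means drift.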
## The Setup (10 Minutes)
### Step 1: Create a drift test file
```python
# tests/test_llm_drift.py
import pytest
from driftwatch import Monitor


@pytest.fixture
def monitor():
    return Monitor(
        model="gpt-4o",
        baseline="tests/baseline_outputs.json",
    )


def test_json_format_drift(monitor):
    score = monitor.check_prompt("Give me a JSON object with user data")
    assert score < 0.1, f"Drift detected: {score}"


def test_classification_drift(monitor):
    score = monitor.check_prompt("Classify this email as spam or not spam")
    assert score < 0.1, f"Drift detected: {score}"
```
### Step 2: Add to GitHub Actions
```yaml
# .github/workflows/llm-drift.yml
name: LLM Drift Check
on: [push, pull_request]

jobs:
  drift-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install driftwatch
      - run: python -m pytest tests/test_llm_drift.py
```
### Step 3: Set up baseline (one time)
```bash
# Generate baseline on a known-good model version
python -c "from driftwatch import Monitor; m = Monitor('gpt-4o'); m.capture_baseline('tests/baseline_outputs.json')"

# Commit the baseline so CI has something to diff against
git add tests/baseline_outputs.json
git commit -m "LLM baseline for drift detection"
```
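The post doesn't show what the baseline file contains. One plausible shape (an assumption on my part, not DriftWatch's documented schema) is a simple prompt-to-output map, which is enough for the tests above to diff against:

```python
import json

# Hypothetical baseline layout: each test prompt maps to its known-good output
baseline = {
    "Give me a JSON object with user data": '{"name": "Ada", "id": 1}',
    "Classify this email as spam or not spam": "not spam",
}

# capture_baseline() would serialize something like this to tests/baseline_outputs.json
serialized = json.dumps(baseline, indent=2)
```

Whatever the real schema, committing the file to git is the important part: the baseline is versioned alongside the code, so a deliberate prompt change can update it in the same PR.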
## What Gets Checked
The workflow runs your drift tests on:
- Every push (the `on: [push]` trigger fires for any branch, not just main)
- Every pull request
- Optional: a nightly scheduled run (add a `schedule:` cron trigger to the workflow)
You get a ✅ or ❌ on every code change that affects LLM behavior.
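For the nightly run, the same check can live in a small standalone script rather than pytest, so the scheduled job fails visibly on drift. A sketch (the `Monitor.check_prompt` call is assumed from the test file above; `THRESHOLD` and `run_nightly` are hypothetical names):

```python
THRESHOLD = 0.1

def run_nightly(monitor, prompts) -> int:
    """Return a shell exit code: 0 if every prompt stays under threshold, else 1."""
    worst = max(monitor.check_prompt(p) for p in prompts)
    return 0 if worst < THRESHOLD else 1

# In CI: raise SystemExit(run_nightly(Monitor("gpt-4o", baseline=...), PROMPTS))
```

A nonzero exit code is all GitHub Actions needs to mark the scheduled run red.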
## The Alert
```python
# In your test or a GitHub Action step (sketch: `github` is an
# authenticated Checks API client, e.g. a PyGithub repository object)
if score > threshold:
    # A failing check run blocks the merge when the check is marked required
    github.create_check_run(
        name="LLM Drift Detection",
        status="completed",
        conclusion="failure",  # "drift_detected" is not a valid Checks API conclusion
        output={"title": f"Drift: {score}", "summary": "Review required"},
    )
```
## Real Results
After adding this to a production system:
- Caught GPT-4o format drift before a release → blocked the merge, saved an incident
- Caught Claude verbosity regression → fixed prompt before users noticed
- CI time increase: ~30 seconds → acceptable for the safety gain
The open source tool: DriftWatch on GitHub
Building production LLM systems means testing them like production software.