Most LLM monitoring advice says "run evals in CI." What it doesn't say: how to structure those evals so you catch the class of failures that actually breaks production — format regressions, instruction compliance drift, punctuation changes in single-token outputs.
Here's a practical CI/CD setup using DriftWatch's free tier that catches behavioral drift before it reaches production.
## The Problem with Standard LLM CI
A typical LLM test in CI looks like this:
```python
def test_sentiment_classifier():
    response = call_llm("Classify: 'great product'. Return one word.")
    assert any(label in response.strip().lower()
               for label in ["positive", "negative", "neutral"])
```
This passes even when the model starts returning "Positive." instead of "positive" — a trailing period that breaks any downstream code doing exact-match comparison.
Unit tests verify your code. They don't verify whether the model is still behaving the same way it did when you wrote the code.
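To make the gap concrete, here's a minimal repro using hypothetical model outputs: the lenient CI-style assertion accepts both the old and the drifted response, while the exact-match comparison your downstream code relies on silently breaks.

```python
baseline = "positive"
drifted = "Positive."  # same label, new capitalization plus a trailing period

labels = ["positive", "negative", "neutral"]

# The CI-style substring assertion accepts both outputs...
assert any(label in baseline.lower() for label in labels)
assert any(label in drifted.lower() for label in labels)

# ...but downstream exact-match code only handles the baseline.
assert baseline == "positive"
assert drifted != "positive"  # the silent production break
```

The test suite stays green; the router, cache key, or switch statement keyed on the exact string does not.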
## What Drift Monitoring Adds
Drift monitoring compares live model behavior against a saved baseline:
- Establish baseline — run your production prompts, save outputs
- Run on schedule (or in CI) — same prompts, same parameters
- Score the delta — format compliance + semantic similarity + output length
- Alert on threshold — 0.3 = investigate, 0.5 = page
The key difference: you're comparing against previous model behavior, not against a hardcoded expected value.
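As a rough illustration of the idea (not DriftWatch's actual algorithm), a drift score can blend the three signals above: format compliance, text similarity, and length change. This sketch uses `difflib` as a cheap stand-in for semantic similarity, and the weights are made up:

```python
from difflib import SequenceMatcher

def drift_score(baseline: str, current: str, valid_labels=None) -> float:
    """Blend three deltas into one 0..1 drift score (illustrative weights)."""
    # 1. Format compliance: does the output still pass the same validator?
    if valid_labels is not None:
        format_delta = 0.0 if current.strip().lower() in valid_labels else 1.0
    else:
        format_delta = 0.0 if current == current.strip() else 1.0

    # 2. Textual similarity (a real monitor would use embeddings here)
    similarity = SequenceMatcher(None, baseline, current).ratio()

    # 3. Relative output-length change, capped at 1.0
    length_delta = min(abs(len(current) - len(baseline)) / max(len(baseline), 1), 1.0)

    return round(0.4 * format_delta + 0.4 * (1 - similarity) + 0.2 * length_delta, 3)
```

With this weighting, "Neutral." drifting against a "neutral" baseline lands above the 0.3 alert threshold because the validator breaks, even though the text barely changed.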
## Setting Up DriftWatch in 5 Minutes (Free Tier)
### Step 1: Register and get your API key

```bash
# Register
curl -X POST https://your-driftwatch-url/auth/register \
  -H "Content-Type: application/json" \
  -d '{"email": "your@email.com", "password": "yourpassword"}'

# Save the api_key from the response
API_KEY="dw_your_api_key_here"
```
### Step 2: Add your production prompt

```bash
curl -X POST https://your-driftwatch-url/prompts \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "sentiment-classifier",
    "prompt_text": "Classify the sentiment as exactly one word: positive, negative, or neutral. Review: \"The product works fine but packaging was damaged.\"",
    "model": "gpt-4o",
    "validators": ["single_word", "word_in:positive,negative,neutral"]
  }'
```
Free tier: 3 prompts, no card required.
### Step 3: Run a drift check in CI

```yaml
# .github/workflows/llm-drift-check.yml
name: LLM Drift Check

on:
  schedule:
    - cron: '0 * * * *'  # hourly
  push:
    branches: [main]

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - name: Run drift check
        run: |
          RESULT=$(curl -s -X POST https://your-driftwatch-url/monitor/run \
            -H "Authorization: Bearer ${{ secrets.DRIFTWATCH_API_KEY }}")
          MAX_DRIFT=$(echo "$RESULT" | jq '.summary.max_drift')
          echo "Max drift: $MAX_DRIFT"

          # Fail CI if drift exceeds threshold
          if (( $(echo "$MAX_DRIFT > 0.5" | bc -l) )); then
            echo "BREAKING CHANGE: drift score $MAX_DRIFT exceeds threshold"
            exit 1
          fi
          if (( $(echo "$MAX_DRIFT > 0.3" | bc -l) )); then
            echo "WARNING: drift score $MAX_DRIFT above alert threshold"
          fi
```
## What This Catches That Unit Tests Miss

| Failure Type | Unit Test | Drift Monitor |
|---|---|---|
| "Neutral." returned instead of "Neutral" | ❌ passes | ✅ catches |
| JSON whitespace format changed | ❌ passes (`json.loads` works) | ✅ catches |
| Response length shifted significantly | ❌ passes | ✅ catches |
| Semantic meaning changed subtly | ❌ passes | ✅ catches |
| Validator compliance: returns "neutral" not "Neutral" | ✅ catches | ✅ catches |
The first four are what your unit tests won't catch. They're format and behavioral consistency checks — not correctness checks.
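The JSON-whitespace row is worth spelling out, since it's the sneakiest case: the parsed data is identical, so any test that goes through `json.loads` passes, while a raw-string comparison against the baseline registers a change.

```python
import json

baseline = '{"name": "Acme", "qty": 2}'
current = '{"name":"Acme","qty":2}'  # same data, whitespace stripped

assert json.loads(baseline) == json.loads(current)  # unit test: passes
assert baseline != current                          # drift monitor: catches
```

Whether that raw-string change matters depends on your pipeline: it's harmless if you always parse, and a real break if you cache, hash, or diff raw responses.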
## Real Example: Why This Matters

In our own test run (same model, two consecutive calls, no update), we got:

- **inst-01** (single-word classifier): drift score 0.575. "Neutral." became "Neutral". Both pass the `word_in:positive,negative,neutral` validator, but `response.strip() == "Neutral."` is now false.
- **json-01** (JSON extraction): drift score 0.316. Whitespace was stripped and a trailing period removed from a value. `json.loads()` works; `baseline == current` does not.
These are natural variance samples — not even from a model update. When a model actually gets updated, drift scores are higher and more consistent.
## When to Alert vs When to Fail
| Score | Action |
|---|---|
| < 0.1 | Normal variance — no action |
| 0.1–0.29 | Log it — monitor trend |
| 0.3–0.49 | Alert — investigate before next deploy |
| 0.5+ | Fail CI — breaking change |
For most production pipelines, I'd set the CI failure threshold at 0.5 and the alert threshold at 0.3. You get a Slack/email notification at 0.3 to investigate, and CI blocks deploy at 0.5.
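If you want the same gate outside GitHub Actions (say, in a pre-deploy script), the threshold logic is only a few lines. This sketch mirrors the table above; treating the boundaries as half-open ranges is my assumption:

```python
def drift_action(max_drift: float) -> str:
    """Map a drift score to an action, per the thresholds above."""
    if max_drift >= 0.5:
        return "fail-ci"  # breaking change: block the deploy
    if max_drift >= 0.3:
        return "alert"    # notify, investigate before next deploy
    if max_drift >= 0.1:
        return "log"      # normal-ish, but watch the trend
    return "ok"           # normal variance, no action
```

For the earlier examples, `drift_action(0.575)` blocks the deploy and `drift_action(0.316)` raises an alert.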
## Getting Started
The free tier gives you 3 prompts and hourly monitoring — enough to protect your most critical LLM calls.
→ Set up your first drift check — 5 minutes, no card required.
The GitHub repo (including the drift detection algorithm) is at GenesisClawbot/llm-drift if you want to self-host.
What's the highest-risk LLM call in your production stack? For most teams it's either a JSON extraction or a classification prompt — those are the ones where small format changes have the biggest downstream impact.