DEV Community

Jamie Cole

I Built an LLM Drift Detector — It Caught GPT-4o Changing Behaviour in Production

Three weeks into running GPT-4o in production, my JSON parser broke. Not gradually — overnight. The model had changed and nobody told me.

I built a drift detector to catch this. Here's what happened.

What LLM Drift Actually Means

When people talk about "AI changing," they usually mean:

  1. Model updates (GPT-4 → GPT-4o, Claude 3 → Claude 3.5)
  2. Behavioral drift (same model, different outputs over time)
  3. Prompt sensitivity (small prompt changes → big output differences)

Type 1 is visible. Types 2 and 3 will silently break your product.

The Detector I Built

I run 20 standardized prompts against GPT-4o every 6 hours. Each response gets embedded and compared against a stored baseline embedding using cosine similarity. When the drift score exceeds a threshold, I get an alert.

def detect_drift(prompt: str, model: str = "gpt-4o") -> float:
    baseline = get_baseline(prompt)                  # stored embedding of the baseline response
    current = embed(llm.call(prompt, model=model))   # embed the fresh response for comparison
    similarity = cosine_similarity(baseline, current)
    return 1 - similarity  # higher = more drift
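The comparison itself is nothing exotic. Here's a minimal sketch of the similarity math in plain Python (the `drift_score` name is mine for illustration; the real repo may structure this differently):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def drift_score(baseline_vec: list[float], current_vec: list[float]) -> float:
    """Drift = 1 - similarity, so identical outputs score 0."""
    return 1.0 - cosine_similarity(baseline_vec, current_vec)
```

Because cosine similarity ignores vector magnitude, a response that says the same thing at the same length as the baseline scores near 0 drift, while a structurally different response climbs toward 1.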

Simple. The power is in running it continuously.
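The continuous part is just a scheduled loop over the prompt set with a threshold check. A sketch of one check cycle, assuming hypothetical `detect_drift` and `send_alert` callables (the 0.2 threshold is an illustrative value, not the one I ship with):

```python
DRIFT_THRESHOLD = 0.2  # illustrative; tune against your own baseline noise

def run_check_cycle(prompts, detect_drift, send_alert, threshold=DRIFT_THRESHOLD):
    """Score every prompt against its baseline and alert on anything over threshold."""
    alerts = []
    for prompt in prompts:
        score = detect_drift(prompt)
        if score > threshold:
            alerts.append((prompt, score))
    for prompt, score in alerts:
        send_alert(f"Drift {score:.2f} on prompt: {prompt[:60]}")
    return alerts
```

Hook this up to cron, a scheduler, or a simple `while True` loop with a sleep, and you get the every-6-hours cadence described above.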

What 30 Days of Monitoring Taught Me

Week 1: The Claude Baseline Shift

I was testing Claude 3 Haiku as a baseline comparison. Day 5: drift score jumped 0.23. Looking at the outputs, Claude had started adding apologetic preambles to certain response types. Nobody announced this. I caught it by accident.

Week 2: The GPT-4o Format Drift

Day 12: My JSON parser prompt hit a 0.31 drift score. The model had started adding "Based on the analysis..." prefixes to responses that previously came back as clean JSON. Fixed by updating the prompt. If this had been running in production from day one, it would have caught the change before roughly 200 users hit parsing errors.
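The prompt update was the real fix, but a defensive parser is cheap insurance against exactly this failure mode: scan past any prose prefix and decode from the first JSON-looking character. A sketch (this is a general technique, not code from my detector):

```python
import json

def extract_json(text: str):
    """Parse JSON even when the model prepends prose like 'Based on the analysis...'.

    Tries to decode starting from each '{' or '[' until one succeeds.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                obj, _ = decoder.raw_decode(text, i)
                return obj
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON object found in response")
```

`raw_decode` is handy here because it tolerates trailing text after the JSON, so a chatty suffix doesn't break parsing either.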

Week 3: The Quiet Week

Days 18-22: All models stable. This is what you want to see — baseline confidence that things are working.

The Pattern Nobody Talks About

LLM providers update models constantly. Small changes that don't make headlines but break specific prompt patterns. Without automated monitoring, you find out when users report bugs.

By then it's damage control.

The Stack

  • Detection engine: Python + Anthropic SDK + cosine similarity
  • Storage: SQLite (lightweight, fine for this scale)
  • Frontend: Vanilla JS (the UI is just for demo — core logic is the detector)
  • Backend: FastAPI + Railway (deployable in minutes)

The whole thing is open source if you want to run your own.
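For a sense of how little storage this needs: two tables cover it, baselines keyed by prompt plus a log of drift scores. A minimal SQLite sketch (table and column names are my guesses for illustration, not necessarily what the repo uses):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS baselines (
    prompt     TEXT PRIMARY KEY,
    response   TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS drift_log (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    prompt     TEXT NOT NULL,
    model      TEXT NOT NULL,
    score      REAL NOT NULL,
    checked_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

def init_db(path: str = "drift.db") -> sqlite3.Connection:
    """Open the database and create tables if they don't exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def log_drift(conn: sqlite3.Connection, prompt: str, model: str, score: float) -> None:
    """Append one drift measurement to the log."""
    conn.execute(
        "INSERT INTO drift_log (prompt, model, score) VALUES (?, ?, ?)",
        (prompt, model, score),
    )
    conn.commit()
```

At 20 prompts every 6 hours that's 80 rows a day, which is why SQLite is more than enough at this scale.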

If You're Running LLMs in Production

You need one of three things:

  1. An automated drift detector (like what I built)
  2. A comprehensive eval suite you run before every deploy
  3. The willingness to find out about breaking changes from your users

I built the detector because option 3 was embarrassing me.


This started as a personal tool. Now it has 36 developers using it. If you want early access to the hosted version: DriftWatch
