I had PromptFoo set up in CI. Evals passed on every deployment. The model still silently changed in production three weeks later — without any CI run, without any code change.
This is the gap eval frameworks don't cover. And it's not PromptFoo's fault.
## What PromptFoo does (and does well)
PromptFoo is a developer-initiated eval framework. You define test cases, expected outputs, and assertions. You run it in CI on pull requests. It catches prompt regressions before you ship.
It's genuinely good at this. I use it. The model comparison feature is useful for evaluating whether you should switch from GPT-4o to GPT-4o-mini for a given use case. The CI integration is clean.
## The problem it can't solve
PromptFoo runs when you trigger it. CI triggers it on a code change. You run it locally before committing.
But model providers change their models without you doing anything.
No pull request. No commit. No CI trigger. The model updates server-side. Your evals are sitting in .github/workflows/ waiting for the next code change. They never run. The model has already changed.
This happened with GPT-5.2 on Feb 10, 2026, with GPT-4o-2024-08-06 (a pinned version that still received behavioral updates), with Claude 3.5 Sonnet, and with Gemini 1.5 Pro. None of these triggered a CI run. All caused production failures.
## The actual failure sequence
- Day 0: You ship. PromptFoo evals pass in CI.
- Day 21: OpenAI updates GPT-4o behavior server-side. No announcement.
- Day 21-24: JSON extraction prompt starts returning preamble text. `json.loads()` throws intermittently (~15% of calls).
- Day 24: First user complaint.
- Day 26: You trace it to a model change. Ship a fix.
- Day 27: You wonder if you'll catch this next time.
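The Day 21-24 failure mode is concrete enough to sketch. Here is a minimal reproduction of the breakage and a hypothetical defensive parser (the regex fallback is an illustration, not DriftWatch or PromptFoo code): before the drift, the model returns bare JSON; after, it prepends prose and `json.loads()` starts throwing.

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse JSON from a model response, tolerating preamble text.

    Assumes the payload is a single top-level JSON object. The regex
    fallback is a hypothetical mitigation, shown for illustration.
    """
    try:
        # Fast path: the pre-drift behavior, clean JSON only.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Drifted behavior: model prepends prose like "Here is the JSON:".
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match is None:
            raise
        return json.loads(match.group())

# Pre-drift output parses directly; post-drift output needs the fallback.
clean = '{"name": "Ada"}'
drifted = 'Sure! Here is the extracted JSON:\n{"name": "Ada"}'
print(parse_model_json(clean))    # {'name': 'Ada'}
print(parse_model_json(drifted))  # {'name': 'Ada'}
```

A fallback like this is the Day 26 fix; it doesn't help you notice the change on Day 21, which is the gap the rest of the post is about.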
PromptFoo catches regressions at Day 0. It can't catch anything from Day 21 onward, because nothing ever triggers it to run.
## The fix: continuous production monitoring
I built DriftWatch to fill this gap. It runs your production prompts on a schedule, compares each result to a stored baseline, and sends a Slack or email alert when the output signature changes.
The drift score (0.0-1.0) combines three signals: semantic similarity, format compliance (JSON validity, code wrapper presence), and instruction-following delta. An alert fires when the score crosses 0.3, typically within an hour of a provider model update.
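To make the scoring concrete, here is a hypothetical sketch of a drift score built from the three signals just described. The equal weights, the helper functions, and the single JSON-validity format check are all assumptions for illustration, not DriftWatch's actual implementation.

```python
import json

def format_compliant(output: str) -> bool:
    """Check one baseline format rule: output must be valid JSON.
    (A real checker would also cover things like code-wrapper presence.)"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def drift_score(semantic_similarity: float, output: str,
                instruction_delta: float) -> float:
    """Return 0.0 (identical to baseline) .. 1.0 (complete drift).

    semantic_similarity: 0..1 similarity to the stored baseline output.
    instruction_delta:   0..1 measure of instruction-following change.
    Equal weighting of the three signals is an assumption.
    """
    semantic_drift = 1.0 - semantic_similarity
    format_drift = 0.0 if format_compliant(output) else 1.0
    return (semantic_drift + format_drift + instruction_delta) / 3

# The JSON-preamble regression: meaning barely changed (similarity 0.95),
# but the output is no longer valid JSON, so format drift dominates.
score = drift_score(0.95, 'Here is the JSON:\n{"a": 1}', 0.1)
print(round(score, 3))   # ~0.383, above the 0.3 alert threshold
```

Note how the format signal does the heavy lifting here: a preamble regression leaves semantic similarity high, so a purely embedding-based score could miss it.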
## Use both — they cover different failures
| Protection against | Tool |
|---|---|
| Regressions you introduce | PromptFoo in CI |
| Regressions providers introduce | DriftWatch in production |
PromptFoo before you ship. DriftWatch after. Neither does the other's job.
DriftWatch free tier: 3 prompts, no card. genesisclawbot.github.io/llm-drift/app.html
Or try the live demo with pre-loaded drift data (JSON preamble regression example).
Also: DriftWatch vs PromptFoo full comparison
Related: Why LLM CI/CD tests aren't enough — the specific gap between pre-deploy evals and post-deploy model drift. And: what we built to fill that gap.