Jamie Cole

I shipped an LLM feature, got 11 users, then the model silently changed on me. Here's what I built to stop it happening again.

I build small AI tools. I don't have a team. I don't have a staging environment with automated evals running 24/7. I have a laptop, some API keys, and a support inbox that I check too often.

Last year I shipped a feature that used Claude to classify support tickets. Three categories: billing, technical, account. The prompt was simple. The output was clean. It worked perfectly for six weeks.

Then it stopped working.

Not completely — it still classified tickets. But my downstream code was doing `if category == "billing"` and Claude started returning `"Billing"` with a capital B. Two hundred tickets routed incorrectly over four days before a user flagged it. No error logs. No exceptions. Just wrong.

I spent three hours debugging before I realized: the model had changed. Not my code. The model.

I checked the Anthropic changelog. Nothing. No release notes about this. No email. Nothing.
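In hindsight, a thin normalization layer between the model and my routing code would have absorbed this exact change. A minimal sketch of the idea — the category set matches my ticket classifier, but the helper name is my own invention, not part of any library:

```python
VALID_CATEGORIES = {"billing", "technical", "account"}

def normalize_category(raw: str) -> str:
    """Strip whitespace/punctuation and lowercase, so 'Billing' still routes."""
    category = raw.strip().strip('."\'').lower()
    if category not in VALID_CATEGORIES:
        raise ValueError(f"Unexpected category from model: {raw!r}")
    return category
```

Normalization doesn't replace monitoring; it just narrows the blast radius when the model's formatting shifts.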

This Is the Indie Hacker Tax on LLM Products

When you're a solo builder shipping LLM features, you carry a risk that didn't exist before:

Your dependencies change without telling you, and your tests don't catch it.

Unit tests pass. Your CI is green. You deployed nothing. But your product broke anyway because the upstream model provider pushed a silent update.

Large teams have eval suites running on every deployment. They catch regressions before users do. They have dedicated tooling — LangSmith, Helicone, Braintrust.

Indie hackers have… a prayer and a monitoring plan they wrote in a Notion doc and haven't looked at since.

What I Actually Needed

After the billing/Billing incident, I wanted something specific:

  1. Run my test prompts every hour — not just on deploy, but continuously, because the model can change any time
  2. Alert me immediately when output changes — not a dashboard I have to remember to check, but a Slack message or email
  3. Tell me exactly what changed — not just "drift detected", but "this validator that was passing is now failing"
  4. Be cheap and simple — I'm not running a $10k/year enterprise tool on a side project

I couldn't find it. So I built it.
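For context on what "continuously" means here, the core loop of such a checker is small. This is a sketch of one check cycle, where `call_model` and `send_alert` are hypothetical stand-ins for the real API call and the Slack/email hook:

```python
import difflib

def run_check(prompt: str, baseline: str, call_model, send_alert) -> float:
    """Send one prompt, score it against a stored baseline, alert on drift."""
    current = call_model(prompt)
    # SequenceMatcher.ratio() is 1.0 for identical strings; drift inverts it.
    drift = 1.0 - difflib.SequenceMatcher(None, baseline, current).ratio()
    if drift > 0.0:
        send_alert(f"drift={drift:.3f}  {baseline!r} -> {current!r}")
    return drift
```

Run something like this from cron every hour and the "billing"/"Billing" class of change surfaces the same day instead of four days later.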

DriftWatch: What It Does

DriftWatch runs a test suite against your LLM endpoint on a schedule. Every hour, it sends your prompts, scores the responses against your expected behaviour, and alerts you if anything drifts.

Here's what a real check looks like from our Claude-3-Haiku monitoring:

```
🔍 Drift check — claude-3-haiku-20240307
   Baseline: 2026-03-12T18:51

  [🔴 MEDIUM] Single word response: drift=0.575
    ⚠️ Regression: word_in:positive,negative,neutral
    Baseline: "neutral" → Current: "Neutral"

  [🟠 MEDIUM] JSON extraction: drift=0.316
    Whitespace formatting changed

  [✅ NONE] JSON array extraction: drift=0.000

────────────────────────────────────────────────
Avg drift: 0.213 | Max drift: 0.575
```

That "neutral" → "Neutral" result? That's exactly the class of bug that took me three hours to debug. With DriftWatch, it shows up within an hour, before any user notices.

What We Track

We built 20 test prompts across the failure modes that actually break production LLM products:

  • Format compliance — does it still return valid JSON? Still return exactly one word?
  • Instruction following — did verbosity or structure change?
  • Classification stability — same inputs, same output categories?
  • Code generation — still returns code without explanation?
  • Data extraction — still returns dates in ISO format?

The composite drift score ranges from 0.0 (no change) to 1.0 (completely different behaviour). Anything above ~0.4 on a validator you depend on is worth investigating.
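To make that score concrete, here is one plausible way to blend raw text similarity with validator regressions into a single number. This is an illustrative formula, not DriftWatch's actual weighting:

```python
import difflib

def drift_score(baseline: str, current: str, validators=()) -> float:
    """Text-diff drift, floored at 0.5 when a previously-passing validator fails."""
    score = 1.0 - difflib.SequenceMatcher(None, baseline, current).ratio()
    for validator in validators:
        if validator(baseline) and not validator(current):
            # A pass->fail validator is a regression; don't let high string
            # similarity (e.g. one changed capital letter) mask it.
            score = max(score, 0.5)
    return score
```

Under this formula, "neutral" → "Neutral" scores 0.5 even though the strings are roughly 86% similar, because the one-word validator regressed.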

The Indie Hacker Version

I wanted to build something I'd actually use, so the free tier exists:

  • 3 prompts monitored continuously
  • No credit card
  • Real drift scoring, not a toy demo

That's enough to cover your most critical output format — the one that will break your product if it changes.

Start monitoring free →

If you're running a classifier, a JSON extraction step, or anything where the output format is load-bearing, add that prompt. If it ever changes, you'll know in under an hour instead of after four days.

The Thing Nobody Tells You

LLM providers don't think of this as a breaking change. Capitalization. Whitespace. Verbosity. To them it's a quality improvement. To your product, it's a broken feature.

The only way to know when it happens is to watch for it yourself.

That's all DriftWatch does. It watches.


Built by Jamie Cole. We're in early access — if you try it and hit a problem, email me directly. I read every message.
