
Printo Tom

The AI system that worked in staging destroyed us in production. Here's what we missed.

I've been a software and enterprise architect for over twelve years. I've shipped pricing platforms, fraud detection systems, and order management infrastructure at scale — most recently at one of the UK's largest retailers. I say that not to flex, but to explain why I'm writing this post with a specific kind of frustration.

Because almost every article I read about AI in enterprise sounds like it was written by someone who has never been paged at 2am because an LLM-backed pricing rule priced 40,000 product lines at zero.

So here's what actually happens when you put AI into systems where the decisions have consequences.

The staging trap

Staging environments lie. They lie about load, they lie about data shape, and — critically for AI systems — they lie about context drift.

Context drift is when the world changes between the moment you assembled the input to your model and the moment the model's output takes effect. In a pricing engine, that gap can be milliseconds. In those milliseconds: a competitor might have repriced, a promotional rule might have fired, a stock threshold might have been crossed.

What this looks like in practice: your orchestrator assembles context — product cost, margin floor, competitor price, stock level — and sends it to the model. The model reasons and returns a recommended price. Validation passes. But by the time you write to the pricing store, the stock level has changed and the margin floor has been updated by a concurrent batch job. The model's recommendation was correct for a world that no longer exists.

The fix isn't faster models. It's a snapshot contract: a bounded, versioned, immutable view of state captured at orchestration time and passed all the way through to the action layer. Every downstream system confirms against the snapshot version before committing. If the snapshot is stale, you abort and re-orchestrate.
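Here's a minimal sketch of what that contract can look like (Python, purely for illustration). The snapshot fields, the state store, and the commit_if_version compare-and-write helper are assumptions I'm making for the example, not code from the platform I described.

# Sketch of a snapshot contract. Names like PricingSnapshot, state_store and
# commit_if_version are illustrative; the shape of the flow is the point.
from dataclasses import dataclass
import time
import uuid

@dataclass(frozen=True)  # immutable: downstream code cannot mutate the captured view
class PricingSnapshot:
    snapshot_id: str
    version: int          # version of the source state at capture time
    product_id: str
    cost: float
    margin_floor: float
    competitor_price: float
    stock_level: int
    captured_at: float

def orchestrate_price(product_id, state_store, model, pricing_store, max_retries=3):
    for _ in range(max_retries):
        # 1. Capture a bounded, versioned, immutable view of the world at orchestration time.
        state = state_store.read(product_id)
        snapshot = PricingSnapshot(
            snapshot_id=f"snap_{uuid.uuid4().hex}",
            version=state.version,
            product_id=product_id,
            cost=state.cost,
            margin_floor=state.margin_floor,
            competitor_price=state.competitor_price,
            stock_level=state.stock_level,
            captured_at=time.time(),
        )

        # 2. The model only ever sees the snapshot, never live state.
        recommended_price = model.recommend(snapshot)

        # 3. Commit only if the world still matches the snapshot version we reasoned about.
        if pricing_store.commit_if_version(product_id, recommended_price,
                                           expected_version=snapshot.version):
            return recommended_price
        # Stale snapshot: the world moved underneath us. Abort and re-orchestrate.
    raise RuntimeError(f"could not price {product_id}: state kept changing under us")

The important property is that the model never reads live state, and the action layer never writes without checking the version it reasoned about.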

This pattern is borrowed directly from event sourcing. Most AI architects I've met have never heard of it.

Fraud signals don't behave like pricing signals — and that matters architecturally

One of the most useful things I've done is build both a fraud detection system and a pricing platform, because the contrast forces architectural clarity.

Fraud signals are high-frequency, low-latency, and the error costs are asymmetric — you can recover from a false positive (apologise to a good customer) but you can't unwind a fraudulent transaction. This pushes the architecture toward fail-closed defaults: when confidence is low, decline and escalate.

Pricing signals are lower frequency, higher context, and the cost structure is different — a bad price for 10 minutes on a low-velocity SKU costs less than a declined checkout. This pushes toward fail-open defaults with aggressive post-hoc monitoring.
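To make the asymmetry concrete, here's a rough sketch of the two postures side by side (Python for illustration; the confidence floors, queue names, and model_output fields are invented for the example, not production values).

# Illustrative only: thresholds and routing targets are made up for the example.
FRAUD_CONFIDENCE_FLOOR = 0.90    # fail-closed: below this, decline and escalate
PRICING_CONFIDENCE_FLOOR = 0.60  # fail-open: below this, fall back and monitor

def decide_fraud(model_output):
    # A fraudulent transaction can't be unwound, so low confidence means
    # decline now and let a human review the decline.
    if model_output.confidence < FRAUD_CONFIDENCE_FLOOR:
        return {"action": "decline", "route": "manual_review_queue"}
    return {"action": model_output.label, "route": "auto"}

def decide_price(model_output, current_price):
    # A bad price on a low-velocity SKU is recoverable, so low confidence means
    # keep the current price and flag the SKU for post-hoc review.
    if model_output.confidence < PRICING_CONFIDENCE_FLOOR:
        return {"price": current_price, "route": "post_hoc_monitoring"}
    return {"price": model_output.recommended_price, "route": "auto"}

Same ingredients, opposite defaults, because the failure modes cost different things.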

The point is that "AI system" is not a single architecture. The trust posture of your validation layer, your fallback strategy, your human-in-the-loop gates — all of these should be derived from the asymmetry of your failure modes, not from a generic best-practice blog post (including this one).

Before you design the system, map your failure modes. A false positive in fraud is not the same as a false positive in pricing. Your architecture should know the difference.

The prompt is a contract. Treat it like one.

Your codebase versions your APIs. It versions your database schemas. It does not version your prompts — and that is a production incident waiting to happen.

We learned this the hard way. A well-intentioned tweak to the system prompt of a fraud classification model changed the output structure enough to break the downstream parser. Silently. For six hours. Because the validation layer was checking for the presence of a field, not its semantic content.

Prompt versioning isn't complicated. It's a git-tracked file, a version identifier injected into every API call, and a log entry that records which version produced which output.

{
  "prompt_version": "fraud-classifier-v2.4.1",
  "model": "claude-sonnet-4-20250514",
  "input_snapshot_id": "snap_01JV...",
  "output": { ... },
  "validation_result": "pass",
  "action_taken": "flag_for_review"
}

Every LLM-influenced decision that touches production state should produce a record like this. Not for debugging — for auditability. In retail, in finance, in any regulated domain, the question "why did the system do that?" will be asked by someone whose salary is higher than yours. You want a clean answer.
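If it helps, the mechanics really are small. A rough sketch of the versioning and record-building side (the directory layout, field values, and helper names are assumptions for illustration, not the exact code we run):

# Sketch only: directory layout, field values and helper names are assumptions.
import json
import pathlib

PROMPT_DIR = pathlib.Path("prompts/fraud_classifier")   # one git-tracked file per version

def load_prompt(version: str) -> str:
    # e.g. prompts/fraud_classifier/fraud-classifier-v2.4.1.txt, committed with the code that uses it
    return (PROMPT_DIR / f"{version}.txt").read_text()

def audit_record(prompt_version: str, model_name: str, snapshot_id: str,
                 output: dict, validation: str, action: str) -> str:
    # One record per LLM-influenced decision, written to an append-only store.
    return json.dumps({
        "prompt_version": prompt_version,
        "model": model_name,
        "input_snapshot_id": snapshot_id,
        "output": output,
        "validation_result": validation,
        "action_taken": action,
    })

The version identifier goes into the API call and into this record, so six months later you can line up an odd decision with the exact prompt text that produced it.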

The layer nobody builds until they need it

Teams build the orchestration layer. They build the reasoning layer (the model call). They often skip the trust and validation layer, tell themselves they'll add it later, and then spend six months retrofitting it after their first production incident.

The trust layer is not a safety net. It's load-bearing infrastructure. It includes three things, sketched in code after the list:

  • Schema enforcement — structured output validation before anything downstream sees the result. Not "does the JSON parse" but "does this output satisfy the business constraints it was supposed to satisfy."
  • Confidence routing — when the model signals uncertainty, the output should not go to production. Route to a fallback rule, a human queue, or a conservative default.
  • Semantic drift detection — over time, the distribution of what your model produces drifts. Not because the model changed, but because the world feeding it changed. Monitor output distributions the same way you'd monitor latency percentiles.
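A minimal sketch of all three, assuming Python and pydantic for the schema half; the thresholds, queue names, and field names are invented for the example:

# Illustrative sketch: library choice, thresholds and field names are assumptions.
from collections import deque
from pydantic import BaseModel, Field, ValidationError

# 1. Schema enforcement: parse the output, then check the business constraint.
class PriceRecommendation(BaseModel):
    sku: str
    recommended_price: float = Field(gt=0)
    confidence: float = Field(ge=0.0, le=1.0)

def enforce(raw_output: dict, margin_floor: float) -> PriceRecommendation | None:
    try:
        rec = PriceRecommendation.model_validate(raw_output)
    except ValidationError:
        return None                      # "does the JSON parse" is the easy half
    if rec.recommended_price < margin_floor:
        return None                      # the business constraint is the half that bites
    return rec

# 2. Confidence routing: uncertain or invalid output never touches production state.
def route(rec, current_price: float, confidence_floor: float = 0.7) -> dict:
    if rec is None:
        return {"price": current_price, "route": "fallback_rule"}
    if rec.confidence < confidence_floor:
        return {"price": current_price, "route": "human_review_queue"}
    return {"price": rec.recommended_price, "route": "auto_apply"}

# 3. Semantic drift detection: watch the output distribution, not just error rates.
recent_confidences = deque(maxlen=5000)

def record_and_check_drift(rec, baseline_mean: float, tolerance: float = 0.15) -> bool:
    recent_confidences.append(rec.confidence)
    rolling_mean = sum(recent_confidences) / len(recent_confidences)
    return abs(rolling_mean - baseline_mean) > tolerance   # True means "page someone"

The drift check here is deliberately crude (a rolling mean against a baseline); in practice you'd monitor the full output distribution the same way you'd monitor latency percentiles.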

What I'd tell myself three years ago

The model is not the system. The model is one component inside a system that has to earn the right to touch production state. It earns that right through versioned contracts, explicit validation, bounded context, and audit trails.

Every shortcut you take on those four things will come back as a production incident. I know because I've taken most of them.

Build boring AI systems. Your on-call rotation will thank you.

If you've been through something similar — or disagree with any of this — I'd genuinely like to hear it in the comments.
