Printo Tom

Posted on May 14

The AI system that worked in staging destroyed us in production. Here's what we missed.

#ai #architecture #llm #lessonslearned

I've been a software and enterprise architect for over twelve years. I've shipped pricing platforms, fraud detection systems, and order management infrastructure at scale — most recently at one of the UK's largest retailers. I say that not to flex, but to explain why I'm writing this post with a specific kind of frustration.

Because almost every article I read about AI in enterprise sounds like it was written by someone who has never been paged at 2am because an LLM-backed pricing rule marked 40,000 product lines as zero.

So here's what actually happens when you put AI into systems where the decisions have consequences.

The staging trap

Staging environments lie. They lie about load, they lie about data shape, and — critically for AI systems — they lie about context drift.

Context drift is when the world changes between the moment you assemble the input to your model and the moment the model's output takes effect. In a pricing engine, that gap can be as short as milliseconds, and with a model call in the loop it is usually longer. In that window, a competitor might have repriced, a promotional rule might have fired, or a stock threshold might have been crossed.

What this looks like in practice: your orchestrator assembles context — product cost, margin floor, competitor price, stock level — and sends it to the model. The model reasons and returns a recommended price. Validation passes. But by the time you write to the pricing store, the stock level has changed and the margin floor has been updated by a concurrent batch job. The model's recommendation was correct for a world that no longer exists.

The fix isn't faster models. It's a snapshot contract: a bounded, versioned, immutable view of state captured at orchestration time and passed all the way through to the action layer. Before committing, every downstream system checks that live state still matches the snapshot version. If the snapshot is stale, you abort and re-orchestrate.

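This pattern is borrowed directly from event sourcing, where a write typically carries the stream version it expects and is rejected if that version has moved on (optimistic concurrency). Most AI architects I've met have never heard of it.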
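Here's a minimal sketch of a snapshot contract in Python, under some loud assumptions: `PricingStore`, `Snapshot`, `StaleSnapshotError` and `reprice` are hypothetical names, the store is an in-memory stand-in, and `recommend` is a placeholder for the model call. It only illustrates the shape of the check, not a production design.

```python
from __future__ import annotations

from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping


class StaleSnapshotError(Exception):
    """Live state has moved past the version the snapshot was taken at."""


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    version: int                  # store version at capture time
    context: Mapping[str, float]  # read-only: cost, margin floor, stock, ...


class PricingStore:
    """Hypothetical in-memory store; every state change bumps the version."""

    def __init__(self, state: dict[str, float]) -> None:
        self._state = dict(state)
        self.version = 1
        self.prices: dict[str, float] = {}

    def take_snapshot(self, snapshot_id: str) -> Snapshot:
        # Copy the state so later writes cannot leak into the snapshot.
        frozen = MappingProxyType(dict(self._state))
        return Snapshot(snapshot_id, self.version, frozen)

    def update_state(self, key: str, value: float) -> None:
        # Stands in for a concurrent batch job or stock feed.
        self._state[key] = value
        self.version += 1

    def commit_price(self, sku: str, price: float, snapshot: Snapshot) -> None:
        # In a real store this check and the write must be one atomic operation.
        if snapshot.version != self.version:
            raise StaleSnapshotError(
                f"{snapshot.snapshot_id}: snapshot v{snapshot.version}, "
                f"store v{self.version}"
            )
        self.prices[sku] = price


def reprice(
    store: PricingStore,
    sku: str,
    recommend: Callable[[Snapshot], float],
    max_attempts: int = 3,
) -> float:
    """Snapshot, ask the model, commit; re-orchestrate if the world moved."""
    for attempt in range(1, max_attempts + 1):
        snapshot = store.take_snapshot(f"snap-{sku}-{attempt}")
        price = recommend(snapshot)  # stands in for the model call
        try:
            store.commit_price(sku, price, snapshot)
            return price
        except StaleSnapshotError:
            continue  # abort this attempt and rebuild context
    raise StaleSnapshotError(f"{sku}: still stale after {max_attempts} attempts")
```

The retry cap matters: on a hot SKU the state may never sit still long enough to commit, and at that point you want a loud failure and a fallback rule rather than an endless loop.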

Fraud signals don't behave like pricing signals — and that matters architecturally

One of the most useful things I've done is build both a fraud detection system and a pricing platform, because the contrast forces architectural clarity.

Fraud signals are high-frequency and low-latency, and the error costs are asymmetric: you can recover from a false positive (apologise to a good customer), but you can't unwind a fraudulent transaction that a false negative let through. This pushes the architecture toward fail-closed defaults: when confidence is low, decline and escalate.

Pricing signals are lower frequency and higher context, and the cost structure runs the other way: a bad price for 10 minutes on a low-velocity SKU costs less than a declined checkout. This pushes toward fail-open defaults with aggressive post-hoc monitoring.

The point is that "AI system" is not a single architecture. The trust posture of your validation layer, your fallback strategy, your human-in-the-loop gates — all of these should be derived from the asymmetry of your failure modes, not from a generic best-practice blog post (including this one).

Before you design the system, map your failure modes. A false positive in fraud is not the same as a false positive in pricing. Your architecture should know the difference.
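One way to make that mapping explicit is to write the costs down and derive the default from them. This is a rough sketch, not a method: `Posture`, `FailureCosts` and the action names are hypothetical, and the cost figures are made-up placeholders to show the shape of the comparison.

```python
from dataclasses import dataclass
from enum import Enum


class Posture(Enum):
    FAIL_CLOSED = "fail_closed"  # when unsure, block and escalate
    FAIL_OPEN = "fail_open"      # when unsure, proceed and monitor


@dataclass(frozen=True)
class FailureCosts:
    wrongly_allowed: float  # cost of acting on a bad model output
    wrongly_blocked: float  # cost of blocking a good one


def default_posture(costs: FailureCosts) -> Posture:
    """Pick the low-confidence default from whichever mistake costs more."""
    if costs.wrongly_allowed > costs.wrongly_blocked:
        return Posture.FAIL_CLOSED
    return Posture.FAIL_OPEN


def low_confidence_action(posture: Posture) -> str:
    if posture is Posture.FAIL_CLOSED:
        return "decline_and_escalate"
    return "apply_and_monitor"


# Illustrative numbers only; plug in your own estimates.
fraud = FailureCosts(wrongly_allowed=500.0, wrongly_blocked=20.0)
pricing = FailureCosts(wrongly_allowed=15.0, wrongly_blocked=60.0)

print(default_posture(fraud).value)    # fail_closed
print(default_posture(pricing).value)  # fail_open
```

The comparison itself is trivial. The value is in being forced to put two numbers next to each other for every decision type before any validation code gets written.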

The prompt is a contract. Treat it like one.

Your codebase versions your APIs. It versions your database schemas. It probably does not version your prompts, and that is a production incident waiting to happen.

We learned this the hard way. A well-intentioned tweak to the system prompt of a fraud classification model changed the output structure enough to break the downstream parser. Silently. For six hours. Because the validation layer was checking for the presence of a field, not its semantic content.

Prompt versioning isn't complicated. It's a git-tracked file, a version identifier injected into every API call, and a log entry that records which version produced which output.

{
  "prompt_version": "fraud-classifier-v2.4.1",
  "model": "claude-sonnet-4-20250514",
  "input_snapshot_id": "snap_01JV...",
  "output": { ... },
  "validation_result": "pass",
  "action_taken": "flag_for_review"
}

Every LLM-influenced decision that touches production state should produce a record like this. Not for debugging — for auditability. In retail, in finance, in any regulated domain, the question "why did the system do that?" will be asked by someone whose salary is higher than yours. You want a clean answer.
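A small sketch of how such a record could be produced, assuming the field names from the example above. `build_decision_record` and `to_log_line` are hypothetical helpers, and `prompt_sha256` and `recorded_at` are extra fields added here as an assumption: the hash is a cheap way to catch a prompt edit that shipped without a version bump.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any


def build_decision_record(
    *,
    prompt_version: str,
    prompt_text: str,
    model: str,
    input_snapshot_id: str,
    output: dict[str, Any],
    validation_result: str,
    action_taken: str,
) -> dict[str, Any]:
    """Assemble one audit record for an LLM-influenced decision."""
    return {
        "prompt_version": prompt_version,
        # Assumption: hash the exact prompt text so silent edits show up.
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "model": model,
        "input_snapshot_id": input_snapshot_id,
        "output": output,
        "validation_result": validation_result,
        "action_taken": action_taken,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def to_log_line(record: dict[str, Any]) -> str:
    # One JSON object per line keeps the log easy to grep and to load later.
    return json.dumps(record, sort_keys=True)
```

If two records share a `prompt_version` but differ in `prompt_sha256`, someone changed the prompt without bumping the version, which is exactly the kind of tweak that caused the six-hour incident.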

The layer nobody builds until they need it

Teams build the orchestration layer. They build the reasoning layer (the model call). They often skip the trust and validation layer, tell themselves they'll add it later, and then spend six months retrofitting it after their first production incident.

The trust layer is not a safety net. It's load-bearing infrastructure. It includes:

Schema enforcement — structured output validation before anything downstream sees the result. Not "does the JSON parse" but "does this output satisfy the business constraints it was supposed to satisfy."
Confidence routing — when the model signals uncertainty, the output should not go to production. Route to a fallback rule, a human queue, or a conservative default.
Semantic drift detection — over time, the distribution of what your model produces drifts. Not because the model changed, but because the world feeding it changed. Monitor output distributions the same way you'd monitor latency percentiles.
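The first two items can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions: `PriceProposal`, `parse_proposal`, `violations` and `route` are hypothetical names, the margin floor is treated as a minimum markup over cost, and the model is assumed to return a confidence score in [0, 1] that you have reason to trust.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceProposal:
    sku: str
    price: float
    confidence: float  # assumed to be a usable score in [0, 1]


def parse_proposal(raw: dict) -> PriceProposal:
    """Structural check: right fields, right types, finite numbers."""
    sku, price, confidence = raw["sku"], raw["price"], raw["confidence"]
    if not isinstance(sku, str):
        raise TypeError("sku must be a string")
    for name, value in (("price", price), ("confidence", confidence)):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number")
        if not math.isfinite(value):
            raise ValueError(f"{name} must be finite")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return PriceProposal(sku, float(price), float(confidence))


def violations(p: PriceProposal, *, cost: float, margin_floor: float) -> list[str]:
    """Business check: the JSON parsing is not the same as the price being sane."""
    problems = []
    if p.price <= 0:
        problems.append("price must be positive")
    min_price = cost * (1 + margin_floor)  # assumption: floor is a markup on cost
    if p.price < min_price:
        problems.append(f"price {p.price} is below margin floor {min_price:.2f}")
    return problems


def route(raw: dict, *, cost: float, margin_floor: float,
          min_confidence: float = 0.8) -> str:
    """Decide where a model output goes before anything downstream sees it."""
    try:
        proposal = parse_proposal(raw)
    except (KeyError, TypeError, ValueError):
        return "fallback_rule"
    if violations(proposal, cost=cost, margin_floor=margin_floor):
        return "fallback_rule"
    if proposal.confidence < min_confidence:
        return "human_queue"
    return "apply"
```

A zero price with 0.99 confidence parses cleanly and still lands on `fallback_rule`, which is the difference between "does the JSON parse" and "does this satisfy the business constraints".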

What I'd tell myself three years ago

The model is not the system. The model is one component inside a system that has to earn the right to touch production state. It earns that right through versioned contracts, explicit validation, bounded context, and audit trails.

Every shortcut you take on those four things will come back as a production incident. I know because I've taken most of them.

Build boring AI systems. Your on-call rotation will thank you.

If you've been through something similar — or disagree with any of this — I'd genuinely like to hear it in the comments.

Top comments (6)

Gilder Miller • May 14

The snapshot contract pattern is one of those things that seems obvious in hindsight but gets missed constantly. I watched a recommendation engine burn through a weekend because inventory shifted between context assembly and model inference. The model was technically correct. The world had just moved on.

The fail-closed vs fail-open distinction is underrated. Most teams copy their validation strategy from whatever framework they found first. Your fraud example nails the asymmetry. A false positive costs an apology. A false negative costs real money. The architecture should reflect that.

One thing I would add on audit trails: they become critical faster than people expect, not just for regulatory reasons but for debugging confidence drift. You see the model output changed. You need to know whether the world changed, the prompt changed, or the model behavior changed. Without that trail, you are guessing.

Have you seen semantic drift detection work well in practice? Most monitoring setups I encounter track latency and error rates but stop short of output distribution analysis. The tooling gap there feels real.

Printo Tom • May 15

Totally agree — the “technically correct but context-shifted” scenario is the one that bites hardest in production. That’s why the snapshot contract feels so load‑bearing once you’ve lived through it.

Your point on audit trails is spot on too. They stop being a compliance checkbox and start being the only way to untangle whether drift came from the model, the prompt, or the world itself. Without that lineage, you’re debugging blind.

On semantic drift detection — I’ve seen it work when teams treat outputs like any other metric distribution. Simple histograms of classification confidence or output ranges, tracked over time, can flag when the model is “answering the same question differently.” The tooling gap is real though. Most monitoring stacks are built for latency/errors, not semantics. Feels like the next frontier of MLOps: making distribution monitoring as turnkey as latency dashboards.

Curious if you’ve seen any lightweight approaches that don’t require a full data science team to maintain? That’s the piece I think most enterprises struggle with.

Gilder Miller • May 15

The lightweight answer is simpler than most teams expect.

Track your embedding centroids. If the average embedding of your outputs drifts past a threshold, flag it. No data science team required. Just cosine similarity and a cron job.
PSI on classification confidence scores is another low-lift win. Banks have used it for decades. If it works for regulatory scrutiny, it works for production monitoring.
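For anyone who hasn't met PSI (population stability index), a rough sketch of the confidence-score check follows. It assumes scores in [0, 1] and ten equal-width bins; the function names, the bin count, and the small floor for empty bins are assumptions for illustration. The 0.1 and 0.25 cut-offs in the comment are a common rule of thumb, not a standard.

```python
import math


def bin_fractions(scores, n_bins=10, floor=1e-6):
    """Share of scores in each equal-width bin over [0, 1]."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * n_bins
    for s in scores:
        # Clamp so that 1.0 (or a slightly out-of-range score) lands in an end bin.
        idx = max(0, min(int(s * n_bins), n_bins - 1))
        counts[idx] += 1
    # Floor empty bins so the log ratio below stays defined.
    return [max(c / len(scores), floor) for c in counts]


def psi(baseline, current, n_bins=10):
    """Population stability index between two samples of confidence scores."""
    expected = bin_fractions(baseline, n_bins)
    actual = bin_fractions(current, n_bins)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Rule of thumb: below 0.1 stable, 0.1 to 0.25 worth a look, above 0.25 alert.
last_month = [0.9] * 80 + [0.5] * 20
this_week = [0.9] * 40 + [0.5] * 60
print(psi(last_month, last_month))  # 0.0
print(psi(last_month, this_week) > 0.25)  # True
```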

The real blocker is not complexity. It is that nobody owns the output space. Latency lives in infra. Errors live in platform. Semantics lives in a gray zone where no one has clear ownership.
The hard part is the meeting to decide who owns it, not the implementation.

Printo Tom • May 19

That’s such a sharp way to put it — “the hard part is the meeting to decide who owns it.” Couldn’t agree more. The mechanics of drift detection are lightweight compared to the organizational friction of assigning responsibility. Latency has a home in infra, errors in platform, but semantics sits in that no‑man’s‑land where nobody feels accountable until something breaks.

Your embedding centroid idea is elegant precisely because it lowers the barrier. Cosine similarity plus a cron job is the kind of boring, reliable check that enterprises can actually sustain. Same with PSI on confidence scores — decades of regulatory precedent make it hard to argue against. The blocker isn’t technical; it’s governance.

Feels like the next frontier of MLOps isn’t just tooling, but ownership models. Who gets paged when the outputs start drifting? Until that’s answered, even the simplest monitoring setup risks being shelfware.

I’d be curious — have you seen any orgs successfully carve out “semantic SRE” responsibilities, or does it always end up bolted onto data science by default?

Gilder Miller • May 19

Totally agree. The mechanics are easy, the politics are the real blocker. It usually defaults to data science because they own the model, but they are busy training, not monitoring.
I haven't seen a dedicated semantic SRE role yet. Most orgs just bolt it onto data science until someone gets annoyed. It's hard to page someone when they are heads down.
Who do you think should actually get the pager for drift alerts?
Really appreciate the sharp take on the ownership issue.

Printo Tom • May 20

I’d keep it simple: the pager should sit with Platform/SRE, since they already handle runtime anomalies and know how to triage fast. From there, Data Science gets pulled in when it’s clearly a drift issue, and Product decides if it’s business‑critical.

That way, incidents get contained quickly without waking up data scientists mid‑training loop. The real challenge isn’t the mechanics — it’s agreeing that “semantic reliability” belongs in the SRE charter instead of floating in no‑man’s‑land.