
Scarlett Attensil

Originally published at launchdarkly.com

If You Can Survive a Toddler, You Can Ship LLMs in Production

A few years back I was running a time-series pipeline that scored incoming product reviews on a 1-10 scale. The scorer was an LLM. Reviews rolled in continuously, and ratings flowed into a dashboard the product team checked every Monday morning. Everything ran clean for months. Then one Monday the chart had a step in it.

Reviews from the prior week averaged 6.4. The current week averaged 7.6. Same product. Same customers. The reviews themselves, when I went back to read them, looked indistinguishable from what we had been getting all year.

The model had changed. The provider had pushed a quiet update to the weights, and the LLM that gave us 6.4-equivalent scores last week was now giving 7.6-equivalent scores for the same content. Every historical comparison in that dashboard was silently invalid. The cleanup took a week. The harder conversation was about how much of our reporting had been real in the first place.

That kind of failure is the default behavior of LLMs in production. Trying to engineer it away with tighter parameters or pinned versions is a losing fight. The job is to design for it. I learned the lesson twice. Once from the reviews pipeline. Once from raising two kids.

What parents of small children already know about non-determinism

If you have lived through the toddler years, you have run this experiment a few hundred times without calling it one. The lunch you packed all last week, the one that came home empty every day, suddenly gets pushed off the table on Tuesday with full commitment. The bedtime story that worked for six straight nights stops working on the seventh. The nap routine the babysitter swore was solid breaks the moment you start calling it a "rule."

Experienced parents eventually stop trying to force determinism on the kid. Patterns and trends still matter. But you stop expecting any individual input to produce any individual output, and you build a system that absorbs the variance instead of fighting it. This is the same shift production AI engineers make, usually after their first calibration regression.

The LLM-as-judge can drift too

The reviews pipeline taught me that the judge can be the most fragile thing in the system. The model being evaluated can drift. The model doing the evaluating can drift too. Without something stable to anchor against, you cannot tell which one moved.

The pattern that works is a small held-out set of inputs with known, human-validated scores, and the habit of re-running it on a regular cadence. Call it the calibration set. 20 to 50 examples is plenty. You re-score the calibration set first. If the average jumps from 6.4 to 7.6 with no other changes, you know the judge moved, not the data. Without that anchor, the same diagnosis takes weeks of reading individual reviews and arguing about what changed.
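Here is roughly what that check looks like in code. This is a sketch, not the pipeline I ran: the score_review judge call, the example reviews, and the 0.5 drift threshold are all placeholders.

```python
# Minimal calibration-set check, assuming you already have a
# score_review(text) -> float function that calls your LLM judge.
# Example reviews and the drift threshold are illustrative.
import statistics

CALIBRATION_SET = [
    # (review_text, human_validated_score) -- 20 to 50 of these is plenty
    ("Arrived late and the lid was cracked.", 3.0),
    ("Does exactly what it says. Would buy again.", 8.0),
]

DRIFT_THRESHOLD = 0.5  # how far the mean can move before you stop trusting the judge


def check_judge_calibration(score_review) -> bool:
    expected = statistics.mean(score for _, score in CALIBRATION_SET)
    observed = statistics.mean(score_review(text) for text, _ in CALIBRATION_SET)
    drift = observed - expected
    print(f"expected mean {expected:.2f}, observed mean {observed:.2f}, drift {drift:+.2f}")
    return abs(drift) <= DRIFT_THRESHOLD
```

Run this before trusting any week-over-week comparison. If it fails, the judge moved, and everything downstream is suspect until you re-anchor.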

This is where offline evaluations in AgentControl earn their keep. You upload your calibration set as a dataset, point a judge at it, and re-run on a cadence or before any variation change. The discipline I had to learn the hard way (keep the judge anchored, keep its inputs comparable, watch the distribution rather than any single response) becomes a property of the configuration instead of a script someone has to remember to run.

The parenting version is the pencil marks on a doorframe. The doorframe does not move. Every few months you put the kid against it, shoes off and back to the wall. If the line jumps three inches and you realize the kid is wearing sneakers, you take the shoes off and measure again before believing any of it. The doorframe is your held-out set. The shoes-off rule is the discipline that keeps re-runs comparable.

Temperature zero is a comfort blanket

Setting temperature to zero feels like it should make a model deterministic. Within a single model version it mostly does. The catch is that determinism within a version buys you nothing across versions.

My reviews pipeline had been running at temperature zero the whole time. The provider's swap underneath did not care. The judge changed, and greedy sampling kept producing the new, shifted scores with the same false confidence as before. Temperature zero compresses the variance you can see during testing, which makes you feel safer. It does nothing about the variance that actually breaks production. Design as if the model can produce a different valid output every time, because eventually it will.

Build the fallback before the happy path

The last move gets cut for time most often, which is why it matters. Before you ship the model that does the new thing well, ship the path that runs when the model does the new thing badly, slowly, or not at all.

For an LLM endpoint, that usually looks like a cached response for known-bad inputs, a secondary model behind the primary that can take traffic when the primary fails an inline check, a circuit breaker that routes around the model entirely if error rates cross a threshold, and a logging path that captures failure cases instead of returning a stack trace to the user. None of this is exotic. All of it assumes the model will misbehave at some point and defines what good behavior looks like when it does.
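In code, that shape is a thin wrapper around the model call. This is a sketch under assumed names (the primary and secondary scorers, the cache, the thresholds), not a production implementation.

```python
# Sketch of the fallback shape: cache for known inputs, inline sanity check,
# rolling-window circuit breaker, secondary model, and logged failures.
import logging

logger = logging.getLogger("review_scorer")


class ScoringService:
    def __init__(self, primary, secondary, cache, error_threshold=0.2, window=100):
        self.primary, self.secondary, self.cache = primary, secondary, cache
        self.error_threshold, self.window = error_threshold, window
        self.recent_errors = []  # rolling window of 0/1 failure outcomes

    def _circuit_open(self) -> bool:
        if len(self.recent_errors) < self.window:
            return False
        return sum(self.recent_errors) / len(self.recent_errors) > self.error_threshold

    def _record(self, failed: bool) -> None:
        self.recent_errors.append(1 if failed else 0)
        self.recent_errors = self.recent_errors[-self.window:]

    def score(self, review: str) -> float:
        if review in self.cache:          # known inputs skip the model entirely
            return self.cache[review]
        if self._circuit_open():          # too many recent failures: route around the primary
            return self.secondary(review)
        try:
            score = self.primary(review)
            if not 1.0 <= score <= 10.0:  # inline check on the primary's output
                raise ValueError(f"score out of range: {score}")
            self._record(failed=False)
            return score
        except Exception:
            logger.exception("primary scorer failed; falling back")
            self._record(failed=True)
            return self.secondary(review)
```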

The stronger version of the same idea is making the fallback adaptive instead of static. A static fallback still needs a human to notice something is wrong and pull the lever. An adaptive system watches the production signal itself and switches over without anyone in the loop. This is what configuration-driven LLM tooling is built for. With AgentControl by LaunchDarkly, model variations live as configuration rather than code, traffic shifts between them without a deploy, and a guarded rollout can tie an online evaluation score, or any AgentControl metric you care about, directly to whether a variation advances, pauses, or reverts. When the judge sees scores regress past a threshold, the rollout reverses itself. The fallback stops being a piece of code someone wrote in case of trouble. It becomes the architecture, watching itself.
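To be clear about the mechanics rather than the product: this is not the AgentControl API, just the adaptive decision reduced to plain Python, with illustrative names and thresholds.

```python
# Sketch of a guarded-rollout decision: an online eval score gates whether
# the new variation keeps serving traffic or reverts to the baseline.
def guarded_rollout_step(active_variation, baseline_variation,
                         recent_scores, baseline_mean, regression_threshold=0.5):
    """Return the variation that should serve traffic next."""
    if not recent_scores:
        return active_variation
    observed_mean = sum(recent_scores) / len(recent_scores)
    if baseline_mean - observed_mean > regression_threshold:
        # the judge says quality regressed past the threshold: revert, no human in the loop
        return baseline_variation
    return active_variation
```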

Parents already operate this way. The grocery store meltdown will happen. The school will call at 11am about a "low-grade fever." You have a snack pre-stuffed in your bag and a backup babysitter on text. The fallback is the architecture. The happy path is the bonus.

What changed for me

After the reviews pipeline incident, my work changed. I started logging the model version returned in every API response. I built a calibration set. I stopped trusting any single eval run as a verdict. The boring fallback path now ships before the impressive demo path.
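The version logging is the cheapest of those habits. A minimal sketch, assuming an OpenAI-style client (openai 1.x); other providers expose similar metadata on the response object, and the model name and prompt here are only placeholders.

```python
# Log the resolved model version on every call, not just the one you asked for.
import logging
from openai import OpenAI

logger = logging.getLogger("review_scorer")
client = OpenAI()


def score_review(review: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system", "content": "Score this product review from 1 to 10. Reply with only the number."},
            {"role": "user", "content": review},
        ],
    )
    # response.model is the resolved version, which can differ from the alias you requested
    logger.info("judge_model=%s", response.model)
    return float(response.choices[0].message.content.strip())
```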

None of this is harder than the alternative. It is mostly a matter of accepting at the architecture stage what every parent of a small kid already knows. The interesting unit of measurement is the distribution, not the sample.

Top comments (1)

Gilder Miller

This framing is perfect. Expecting variance instead of fighting it is the mindset shift most teams miss šŸ˜…
The calibration set pattern solves the real problem. Without that anchor, diagnosing drift takes forever.
Temperature zero as a comfort blanket is spot on. It feels like control, but does not help when providers swap models underneath.

One question. For catching semantic drift that does not move your calibration scores, do you rely on any secondary signals, or is manual review still the fallback?