DEV Community

Thousand Miles AI

The boring engineering you skipped is the headline you'll wear

The graduates at an Arizona high school stood up when their names were supposed to be called. The AI name-reader skipped them. They sat back down. Their families, holding up phones, didn't get the moment. Reporters caught it on video.

A few weeks earlier, a Pizza Hut franchisee filed suit alleging $100 million in damages from a cascading AI failure that knocked out ordering, inventory, and staffing across multiple locations. The complaint reads like an incident report someone forgot to write before going live.

I've been collecting these stories for about eighteen months now. Air Canada's chatbot. Municipal benefits adjudicators. A handful of healthcare triage tools that I won't name here because the litigation is active. The pattern is the same one every time, and it is not the pattern people keep telling each other it is.

What people say went wrong

The hot-take version is that the models are unreliable. Probabilistic systems. Hallucinations. The technology isn't ready. You hear this from model critics, from journalists who cover AI, and — quietly — from executives who don't want to explain why their org pushed a thing into production without a rollback plan.

The defensive version, from vendors, is that the customer integrated it wrong. Didn't follow the deployment guide. Didn't tune the prompts. Should have used the enterprise tier with the human-review SLA.

Both versions are doing work. Neither is the load-bearing explanation.

What actually went wrong

The name-reader at the graduation had no fallback. There was no human standing by with a printed list, no "if this fails, the principal reads the next ten names manually" plan. The system didn't fail unusually badly — it failed once, and the failure was visible to two thousand people because nothing caught it.

The Pizza Hut system, from what's in the public complaint, went live across many locations on a short timeline, with no canary deployment, no staged rollout, and — critically — no way to revert without losing the day's orders. When the cascade started, the franchise couldn't stop it. They could only watch.

Air Canada was held liable after its chatbot told a passenger about a refund policy that didn't exist. The tribunal's reasoning was that Air Canada deployed the chatbot, and deploying organizations are responsible for what they deploy. The model didn't fail in some exotic way. It said a wrong thing, the way a poorly trained employee might say a wrong thing. The difference is that an employee has a manager who can intercept the wrong thing. The chatbot did not.

In every case I've looked at, the post-mortem — when one was written — points at the same gaps:

  • No staged rollout. The system went from "works in a demo" to "runs in production for everyone" with no intermediate state.
  • No defined degradation path. When the model produced a bad output or no output, there was no defined behavior for the surrounding system.
  • No human-in-the-loop checkpoint at the moment that mattered. Sometimes there was one in the design doc, but it had been removed for latency or cost reasons before go-live.
  • No rollback. Whatever the system did was committed and visible before anyone could intervene.

These are not AI problems. These are the same deployment problems we had with non-AI software, in 2010, in 2005, in 1998. Change management has known about staged rollouts and canaries for decades. We have not collectively decided to apply that knowledge to AI features.

The resolution, where there was one

The organizations that came out of these incidents intact did the same boring things, after the fact, that they should have done before.

They added a human in the loop at the moment of consequence — the moment something irreversible happens to a person. (For the name-reader, that's the moment before the name is announced; for an order system, the moment before stock decrements; for a chatbot, the moment a policy claim is asserted.)
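A minimal sketch of that checkpoint, in Python. Everything here is illustrative: `ProposedAction`, `run_with_checkpoint`, and the toy `reviewer` are hypothetical names, not taken from any of the systems described above. The only idea it shows is that an irreversible action waits for a person, while a reversible one goes straight through.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """Something the model wants to do (hypothetical structure)."""
    description: str
    irreversible: bool


def run_with_checkpoint(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    execute: Callable[[ProposedAction], str],
) -> str:
    """Run reversible actions directly; hold irreversible ones for a human."""
    if action.irreversible and not approve(action):
        return f"held for review: {action.description}"
    return execute(action)


# Toy stand-in for a human reviewer: rejects any unverified refund claim.
def reviewer(action: ProposedAction) -> bool:
    return "refund" not in action.description


def send(action: ProposedAction) -> str:
    return f"sent: {action.description}"


claim = ProposedAction("tell customer the refund can be claimed later", irreversible=True)
greeting = ProposedAction("greet customer", irreversible=False)

print(run_with_checkpoint(claim, reviewer, send))
print(run_with_checkpoint(greeting, reviewer, send))
```

The design choice worth noticing is that the gate sits on the action, not on the model. The model can be as wrong as it likes; the wrong thing just doesn't reach a person unreviewed.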

They defined a degradation path in plain English. "If the model is unavailable, the system shows X." "If the model returns low-confidence, the system routes to a human." "If the system can't decide, it does nothing." The fact that you can write the sentence is the test that the path exists.
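Those three sentences translate almost line for line into code. The sketch below is an assumption-heavy illustration: the `decide` function, the `0.8` confidence floor, and the convention that the model returns an `(answer, confidence)` pair or `None` are all made up for the example. Real systems will have their own signals.

```python
from typing import Callable, Optional, Tuple

CONFIDENCE_FLOOR = 0.8  # assumed threshold, tune for your own system

ModelCall = Callable[[str], Optional[Tuple[str, float]]]


def decide(call_model: ModelCall, request: str) -> str:
    """Map every model outcome to a defined behavior for the surrounding system."""
    try:
        result = call_model(request)
    except (TimeoutError, ConnectionError):
        # "If the model is unavailable, the system shows X."
        return "show_fallback"
    if result is None:
        # "If the system can't decide, it does nothing."
        return "do_nothing"
    answer, confidence = result
    if confidence < CONFIDENCE_FLOOR:
        # "If the model returns low-confidence, the system routes to a human."
        return "route_to_human"
    return f"answer:{answer}"


# Stub models standing in for the real thing.
def model_down(request: str) -> Optional[Tuple[str, float]]:
    raise TimeoutError("model did not respond")


def model_unsure(request: str) -> Optional[Tuple[str, float]]:
    return ("maybe", 0.4)


def model_confident(request: str) -> Optional[Tuple[str, float]]:
    return ("yes", 0.95)


print(decide(model_down, "q"))
print(decide(model_unsure, "q"))
print(decide(model_confident, "q"))
```

Every branch returns something the caller can act on. There is no path where a model failure falls through as an unhandled exception or an empty response.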

They staged rollouts geographically or by user cohort. The Pizza Hut situation, in particular, would have been a contained incident at one franchise instead of a $100M complaint if the rollout had moved one location at a time with a kill switch between each step.
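One location at a time with a kill switch between steps is a short loop. This is a sketch under stated assumptions, not a description of any real deployment tooling: `staged_rollout`, the `enable`/`disable`/`healthy` callbacks, and the store names are hypothetical, and a real health check would watch the location for a while before moving on.

```python
from typing import Callable, List


def staged_rollout(
    locations: List[str],
    enable: Callable[[str], None],
    disable: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Enable one location at a time; stop at the first unhealthy one.

    Returns the locations left live. Locations after the failure are
    never touched.
    """
    live: List[str] = []
    for location in locations:
        enable(location)
        if not healthy(location):
            # Kill switch: turn the failing location off and go no further.
            disable(location)
            break
        live.append(location)
    return live


# Toy wiring: a set tracks which stores have the feature on,
# and "store-2" is pretended to be the one that misbehaves.
enabled = set()
result = staged_rollout(
    ["store-1", "store-2", "store-3"],
    enable=enabled.add,
    disable=enabled.discard,
    healthy=lambda location: location != "store-2",
)
print(result)
print(sorted(enabled))
```

In this toy run the failure at the second store is contained there: the first store stays live, the second is switched off, and the third never sees the change.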

They wrote down who would be paged when the system misbehaved, and what that person was authorized to do. (You'd be surprised how often the on-call person for the AI system is the data scientist who built it, who has no production access and no authority to roll anything back.)

None of these mitigations are novel. They are what enterprise software ops has done for thirty years.

The lesson, said once

If I could tell past-me one thing about this category of failure, it would be this: the model is almost never the thing that fails publicly. The integration around the model is the thing that fails publicly, and the model is just what gets blamed because it's the new and shiny part.

The legal system is starting to figure this out. The tribunal in the Air Canada case and regulators looking at the Pizza Hut complaint are placing liability on the deploying organization, not the model vendor, which is consistent with how software liability has always worked. You shipped it. You're responsible for what it did.

This is good news, actually. It means the conversation about AI risk is moving back toward something we know how to manage: change discipline, staging, rollback, observability, human accountability at the moments that matter. We have decades of practice with these things. The hard part is admitting that the AI feature does not get a pass on them just because it is the AI feature.

The graduates sat back down because there was no Plan B. Plan B is not glamorous. It does not get a launch announcement. Nobody writes a thought-leadership post about the printed backup list the principal kept in her jacket pocket. But the printed list is the difference between a story that gets told at the next reunion as a funny near-miss, and a story that gets told on local news as the day the school humiliated its kids with a chatbot.

The boring engineering you skipped is the headline you'll wear.
