The graduation AI that skipped names — and what it taught us about silent failure

#ai #watercooler #programming #agents

The first name it skipped was in the M's. I was watching the livestream from a hotel room three states away, only half paying attention, when my partner said "wait, did they just skip someone?" and I rewound the feed. They had. A graduate had crossed the stage, shaken the dean's hand, taken the diploma, and walked off — and the PA system had said nothing. No name. No pause. Just the next student's name a beat too early, like the audio had hiccuped.

It happened again two names later. Then four more in the next ten minutes. By the end of the ceremony, eleven graduates had walked across the stage in silence. The local paper covered it the next day. The vendor put out a statement. Nobody got fired, as far as I know. But I spent the next week reading the Puget Press writeup and the Reddit threads about the Pizza Hut franchise litigation — same week, different stakes, same shape — and I couldn't stop thinking about how I'd built systems with the exact same hole in them.

The stakes were boring until they weren't

For the graduation, the stakes looked small on paper. A name-reader. TTS over a roster. The kind of system you ship in a sprint and forget about. The vendor had probably demo'd it a dozen times — clean roster, quiet room, the engineer's own kid as the test name. It worked fine. Why wouldn't it work at the real ceremony?

It worked fine until the roster hit a name with a diacritic the TTS model hadn't seen in its training data. Or a hyphenated last name that got parsed as two separate fields. Or a name whose audio file failed to generate, at which point the pipeline silently moved on to the next entry rather than alerting anyone. We don't know which; there's no postmortem, only the symptom. But the shape of the failure is unmistakable: a happy-path system meeting a long tail it wasn't built for, in front of 4,000 people, with no human in the loop.

The Pizza Hut case lands on the same shape with three more zeroes. Franchise operators are claiming roughly $100M in damages from an AI system — order routing, staffing, supply chain, the reporting is fuzzy — whose outputs propagated through dependent systems without any circuit breaker. One model's bad call became a cascade. There's no published postmortem there either, just the legal filings. But the structure of the complaint is recognizable to anyone who has shipped automation into a system-of-systems and watched it walk into a doom loop.

Both systems were built with the assumption they would be right. That's the assumption I want to talk about, because it's the one I keep making.

The resolution: there isn't one yet

I don't have a tidy resolution for the graduation story because I wasn't there to fix it. The ceremony ended. People went to brunch. The vendor shipped a patch, probably — a hardcoded fallback to a human reader on any name that the TTS pipeline can't render with confidence above some threshold. The next ceremony will go better. The graduates who got skipped got a public apology and presumably a re-do at some private event, which is the kind of thing that helps and also doesn't.
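
A minimal sketch of what that kind of fallback could look like, in Python. This is a guess at the shape, not the vendor's code: `synthesize`, `Announcement`, and `min_confidence` are hypothetical names, and the idea that the TTS engine hands back a confidence score at all is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Announcement:
    name: str
    reader: str              # "tts" or "human"
    audio: Optional[bytes]   # None when a person reads the name


def announce(
    name: str,
    synthesize: Callable[[str], Tuple[Optional[bytes], float]],
    min_confidence: float = 0.9,  # hypothetical threshold
) -> Announcement:
    """Return TTS audio if it is usable, otherwise route the name to a human."""
    try:
        audio, confidence = synthesize(name)
    except Exception:
        # A crash in synthesis is treated the same as unusable audio.
        audio, confidence = None, 0.0

    if audio and confidence >= min_confidence:
        return Announcement(name, "tts", audio)

    # Never skip: every name that cannot be rendered goes to a person.
    return Announcement(name, "human", None)
```

The point of the design is that there is no code path that returns nothing. Every name comes out the other side assigned to either the machine or a person.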

What I did do, the week after, was go back and audit two production pipelines I own at work. Both of them are agent-driven workflows that process a long tail of inputs: documents, user-submitted prompts, the usual. Both of them had the same hole. The agent would do its thing, and if it failed in a way the orchestrator didn't anticipate — empty output, malformed JSON, a tool call that timed out without raising — the pipeline would move on to the next item. Silent skip. Just like the graduation.

We shipped three changes that week:

1. Empty output is an error, not a success. If an agent returns nothing for an item that should produce something, the orchestrator raises and the item goes to a dead-letter queue for human review. This is two lines of code. It took us six months to write them because nobody had asked "what happens when the model just... doesn't say anything?" until I came back from watching the graduation feed.
2. Every batch run reports both the count it produced AND the count it skipped. If those two numbers don't sum to the input count, the run is marked failed. Previously, a run that processed 994 of 1,000 items reported success.
3. The blast radius for any new pipeline is capped at 1% of traffic for the first 72 hours. No exceptions, including "but the demo went great." Especially then.
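
The first two changes can be sketched roughly like this in Python. It is an illustration under assumptions, not the actual orchestrator: `run_agent`, `BatchReport`, `EmptyOutputError`, and the in-memory `dead_letters` list are hypothetical stand-ins for whatever the real pipeline uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


class EmptyOutputError(Exception):
    """An agent returned nothing for an item that should have produced output."""


@dataclass
class BatchReport:
    input_count: int
    produced: int = 0
    skipped: int = 0
    # Stand-in for a real dead-letter queue: (item, reason) pairs for human review.
    dead_letters: List[Tuple[str, str]] = field(default_factory=list)

    @property
    def succeeded(self) -> bool:
        # Every input must be accounted for as either produced or skipped.
        return self.produced + self.skipped == self.input_count


def check_output(item: str, output: Optional[str]) -> str:
    # Treat None or whitespace-only output as a failure, not a success.
    if output is None or not output.strip():
        raise EmptyOutputError(f"no output for {item!r}")
    return output


def run_batch(
    items: List[str], run_agent: Callable[[str], Optional[str]]
) -> BatchReport:
    report = BatchReport(input_count=len(items))
    for item in items:
        try:
            check_output(item, run_agent(item))
            report.produced += 1
        except Exception as exc:
            # Broad on purpose: unanticipated failures also go to review.
            report.skipped += 1
            report.dead_letters.append((item, repr(exc)))
    return report
```

Under this check, a report built from an upstream stage that produced 994 of 1,000 items and recorded nothing as skipped fails: `BatchReport(input_count=1000, produced=994).succeeded` is `False`.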

None of those are clever. The Pizza Hut filing, if you read between the lines of the practitioner commentary, suggests they were missing the third one most acutely — no scope-gating before full rollout, no circuit breaker once the cascade started. The graduation system was missing the first.
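
For the third change and the missing circuit breaker, here is one possible shape, again as a hedged Python sketch. The names `in_rollout` and `CircuitBreaker` are hypothetical, the hash-bucketing scheme is just one common way to pick a stable slice of traffic, and opening to 100% once the 72 hours are up is an assumption (a real rollout would likely ramp in stages).

```python
import hashlib
from datetime import datetime, timedelta


def in_rollout(
    item_id: str,
    launched_at: datetime,
    now: datetime,
    cap: float = 0.01,                      # 1% of traffic
    window: timedelta = timedelta(hours=72),
) -> bool:
    """Decide whether this item is allowed through the new pipeline."""
    if now - launched_at >= window:
        return True  # assumption: full traffic after the capped window
    # Hash the id to a stable number in [0, 1) so the same item always
    # lands on the same side of the cap.
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < cap


class CircuitBreaker:
    """Trips after `max_failures` consecutive failures and stays open."""

    def __init__(self, max_failures: int) -> None:
        self.max_failures = max_failures
        self.failures = 0
        self.open = False

    def record(self, success: bool) -> None:
        if self.open:
            return  # only a human reset closes the breaker again
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.max_failures:
            self.open = True

    def reset(self) -> None:
        self.failures = 0
        self.open = False
```

The breaker deliberately does not close itself. Once it trips, downstream systems stop receiving output until a person calls `reset()`, which is the "place for a human to say stop" in code form.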

What I'd tell past-me

The thing I'd say to the version of me who shipped those pipelines six months ago is: the demo is a lie, and you know it's a lie, and you have to behave as if you know it's a lie. The demo runs on the roster you cleaned. The demo runs in the room with the good acoustics. The demo runs against the model version that was current the morning you recorded it. Production runs on the roster as it actually exists, with the diacritics and the typos and the kid whose legal name is just one letter, in the room with the feedback loop in the PA, against whatever version of the model the upstream API decided to route you to today.

The failure mode you're most likely to ship is the one your tests don't cover because you didn't think to write them. The way you find those failures before your users do is to ask, for every output your system produces: what does it look like when this is wrong, and who notices? If the answer to "who notices" is "nobody, until someone complains," the system is one bad input away from a graduation ceremony.

The graduates who got skipped — I keep coming back to this — were not failed by a model. They were failed by an architecture that didn't have a place for a human to say stop. That's the part that's on us, not the model. The model is going to be wrong. The model has always been going to be wrong. The question every production AI deployment has to answer, before it goes live, is what happens in the seconds between being wrong and somebody noticing. If the answer is "nothing," the system isn't ready.

I think about that hotel room a lot. The thing that bothered me wasn't the bug. It was how calm the audio sounded as it skipped them. No stutter, no error tone, no pause. Just the cheerful next-name as if nothing had happened. That's the sound of a system that was never built with the assumption it could be wrong.