Why did one day of AI cost more than a month of servers?

#sre #devops #llm #ai

Same old story: I'm running the SaaS our CFO shipped to production in two days. A non-engineer exec builds something fast with Claude Code, and the engineer (me) goes through the back end one piece at a time. Every time I look, something crawls out.

This time it wasn't "where the secrets live," and it wasn't "there isn't a single test." This time, money burned.

One day I was staring at the LLM API cost graph, and there was a single day sticking up like Mount Fuji. Every other day hugs the floor; that one day pokes the sky. Roughly half of the whole month's bill landed on that one day.

I'll be honest, my stomach dropped when I saw the number. Because that single day of AI usage alone cost more than a full month of servers. Running the entire server fleet for a month is cheaper than letting the AI talk for one day. How is that a thing?

So I go ask the person who built it (the CFO): "What did you do that day?"

The answer:

"Honestly, I don't remember what I did."

Come on.

But this isn't a story about blame (well, half of it isn't). The deeper I dug, the more I landed on: of course they don't remember. It wasn't a human that burned the money. It was the retry machinery.

The hunt. At first I assumed they'd just hammered it all day

My first read was, roughly: "You built a bunch of features that day, tested them in prod over and over, and hit the expensive LLM every time. Death by a thousand cuts."

And it looked plausible. The commit history for that day was packed from morning to evening, with twenty-plus changes around the AI generation flow. So "slow burn from human repetition" had a face.

Then I actually dug into the app-side logs (task queue, DB, requests), and the picture was completely different. It wasn't a slow burn. The same heavy batch was being re-run, in full, by a machine, over and over. For a single tenant, a job that normally runs once had run 21 times.

A human doesn't press the same button 21 times in a day. The thing pressing the button wasn't human.

The scariest part was "it succeeds, then it falls over"

This is the core of the whole incident, so let me go slow.

The batch called several LLMs in sequence and saved the results to the DB. The flow, roughly:

1. Fire a pile of queries at several LLMs (this is where the money goes)
2. Write the returned results to the DB

The problem was in step 2: the write referenced a column that was supposed to have been added but wasn't there yet. The DB didn't have the column, so it threw "column does not exist" and the job returned a 500.

When you hear "it failed," you naturally picture "the call bombed and wasted a shot." Nope. Every LLM call succeeded. All 200s. Which means every one of them was billed, properly. You paid, you got the result back, and then it tripped on the very last step — the save.
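Here is a minimal sketch of that shape, so the order of events is concrete. It uses an in-memory SQLite table as a stand-in for the real database and a fake `call_llm` that only counts billable calls. Every name in it (`results`, `summary_v2`, `call_llm`, `run_batch`) and the three-prompt batch size are made up for illustration; the real batch and schema look different.

```python
import sqlite3

billed_calls = 0  # stand-in for the provider's usage meter


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call: always succeeds, always gets billed."""
    global billed_calls
    billed_calls += 1
    return f"answer to {prompt}"


def run_batch(conn: sqlite3.Connection, prompts: list[str]) -> None:
    # Step 1: all the paid work happens here, and it succeeds.
    answers = [call_llm(p) for p in prompts]
    # Step 2: the save references a column the migration never added.
    conn.executemany(
        "INSERT INTO results (prompt, summary_v2) VALUES (?, ?)",
        zip(prompts, answers),
    )


conn = sqlite3.connect(":memory:")
# Old schema: summary_v2 is missing, as if the migration had not run yet.
conn.execute("CREATE TABLE results (prompt TEXT, summary TEXT)")

failures = 0
for _ in range(21):  # what the queue did on our behalf
    try:
        run_batch(conn, ["q1", "q2", "q3"])
    except sqlite3.OperationalError:
        failures += 1  # fails the same way on every run

saved_rows = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(billed_calls, failures, saved_rows)  # 63 21 0
```

Sixty-three billed calls, twenty-one failures, zero rows saved. The paid step never fails, so nothing in the billing data looks like an error.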

If I put it in restaurant terms: you finish the full course, you pay the check, and right as you go to say "thanks for the meal," you trip, fall, and lose your memory. You come to, back at your seat, and start eating the same full course again. Twenty-one times. What you ate (= what you were billed for) doesn't un-happen, but every round starts from zero.

There's a term, "retry storm." Usually you picture it as "the call fails, fails again, fails again" — a flurry of misses. But this wasn't misses. It was a storm of throwing away the hits (the successes) and drawing a fresh hit each time. That's the counterintuitive part, and the scariest.

How did this happen? There were two culprits

The machine repeated it 21 times because of two pitfalls working together.

Pitfall 1: the deploy order was backwards.
The code shipped to production assuming a new column existed, but the migration that adds that column hadn't been applied to prod yet. Code first, schema second. In that order, the code reaches for a column that isn't there and fails deterministically. And "deterministically" is the kicker — it's the kind of failure that never fixes itself no matter how many times you retry.
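One cheap guard against this ordering mistake is a preflight check: before the job spends any money, confirm the columns the code expects are actually there. The sketch below is an assumption on my part, not what the app does today. It uses SQLite's `PRAGMA table_info` because that runs anywhere; on another database you would query its own catalog instead. `REQUIRED_COLUMNS`, `results`, and `summary_v2` are hypothetical names.

```python
import sqlite3

# Hypothetical: the columns this version of the code writes to.
REQUIRED_COLUMNS = {"results": {"prompt", "summary_v2"}}


def missing_columns(conn, required):
    """Return {table: missing column names} for anything the code expects but the DB lacks."""
    missing = {}
    for table, columns in required.items():
        # Table names come from the trusted constant above, never from user input.
        # PRAGMA table_info is SQLite-specific; row[1] is the column name.
        present = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        gap = columns - present
        if gap:
            missing[table] = gap
    return missing


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (prompt TEXT, summary TEXT)")  # code shipped, schema didn't

before = missing_columns(conn, REQUIRED_COLUMNS)
conn.execute("ALTER TABLE results ADD COLUMN summary_v2 TEXT")  # the migration, applied late
after = missing_columns(conn, REQUIRED_COLUMNS)

print(before, after)  # {'results': {'summary_v2'}} {}
```

If `missing_columns` returns anything, the job can refuse to start, which costs nothing, instead of failing at the save, which costs a full round of LLM calls.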

Pitfall 2: when it fails, the task queue kindly re-runs it.
A managed task queue sees a job die with a 500 and goes "oh, that failed, let me run it again for you," automatically. For a transient network blip, that's the correct kindness. But this failure was "the column doesn't exist." No amount of re-running grows the column. It kept repeating an unfixable failure, again and again, out of kindness.
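The distinction the queue couldn't make can be sketched in a few lines. This is a toy retry loop, not the queue's real API: how you tell a managed queue "don't retry this one" depends on the product, and I'm not showing that here. The two exception classes, `run_with_policy`, and the ceiling of 3 are all illustrative assumptions.

```python
class TransientError(Exception):
    """Might succeed if tried again (timeout, connection reset)."""


class DeterministicError(Exception):
    """Will fail the same way every time (missing column, bad request)."""


MAX_ATTEMPTS = 3  # arbitrary ceiling for the sketch


def run_with_policy(job, max_attempts=MAX_ATTEMPTS):
    """Run job() and return (outcome, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return "ok", attempt
        except DeterministicError:
            return "aborted", attempt  # retrying cannot help, so stop now
        except TransientError:
            continue  # worth another try, until the ceiling
    return "gave_up", max_attempts


def missing_column_job():
    raise DeterministicError("column does not exist")


flaky_calls = {"n": 0}


def flaky_job():
    # Fails twice, then succeeds on the third call.
    flaky_calls["n"] += 1
    if flaky_calls["n"] < 3:
        raise TransientError("connection reset")


def always_down_job():
    raise TransientError("timeout")


outcomes = [
    run_with_policy(missing_column_job),
    run_with_policy(flaky_job),
    run_with_policy(always_down_job),
]
print(outcomes)  # [('aborted', 1), ('ok', 3), ('gave_up', 3)]
```

The missing-column job stops after one attempt, the flaky one gets its second and third chance, and even the transient case gives up at the ceiling.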

And because the batch wasn't idempotent (it didn't skip already-processed work), every re-run starts over from the top. So every round carries the full LLM bill.
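For contrast, here is roughly what "skip already-processed work" could look like. It's a sketch under assumptions: the `checkpoint` dict stands in for some durable store that survives a crash and is not the table the broken save writes to, and `call_llm`, `run_batch`, `broken_save`, and the `(tenant, prompt)` key are hypothetical.

```python
billed_calls = 0
checkpoint = {}  # stand-in for a durable store keyed by (tenant, prompt)


def call_llm(prompt):
    """Hypothetical paid call: every invocation is billed."""
    global billed_calls
    billed_calls += 1
    return f"answer to {prompt}"


def run_batch(tenant, prompts, save):
    answers = []
    for prompt in prompts:
        key = (tenant, prompt)
        if key not in checkpoint:
            # Pay once, and record the result before anything else can fail.
            checkpoint[key] = call_llm(prompt)
        answers.append(checkpoint[key])
    save(answers)  # may still blow up, but the paid results are kept


def broken_save(answers):
    raise RuntimeError("column does not exist")


for _ in range(21):  # the same 21 re-runs
    try:
        run_batch("tenant-a", ["q1", "q2", "q3"], broken_save)
    except RuntimeError:
        pass

print(billed_calls)  # 3
```

The save still fails 21 times, but only the first run pays. A re-run becomes a redo of the cheap step instead of a repurchase of the expensive one.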

Deterministic failure × automatic retry × non-idempotent. When those three mesh, money burns quietly. No wonder the person doesn't remember — they didn't do anything. The thing holding down the button was the queue.

When I laid it out, the CFO scrunched their face: "Hmmmm?" (For a non-engineer, "it succeeded, you got billed, and then it threw the success away" is a genuinely hard pill to swallow.)

What I took away: retry is not kindness

Let me write the lessons down for myself, because they'll land for anyone in the same seat (anyone who's inherited someone else's running production).

A deterministic failure doesn't get better when you retry it. Schema mismatches, 4xx-class "you're the one who's wrong" errors — throw them as many times as you like, same result. Treat these as immediate "abort," and always put a retry ceiling on things. Retry is not a universal insurance policy.
The costlier the side effect, the more it needs to be idempotent. Any batch that runs cost-bearing work (billing APIs, LLM calls) needs "skip what's already done" from day one. Without it, a re-run isn't a "redo," it's "double billing."
Deploy in the order "schema, then code." Apply the DB change first, then ship the code that uses it. Do it backwards and you mass-produce deterministic errors in the gap.
If cost isn't observable, you only notice "after it's burned." The only reason we caught this at all was that I happened to look at the cost graph. Without smoke detectors — separate keys for prod and test, budget alerts — nobody notices until the invoice arrives.
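On that last point, even a crude smoke detector would have caught the Mount Fuji day. The sketch below flags any day whose cost is far above the median day. The numbers are invented (I'm not publishing the real bill), `spike_days` and the factor of 5 are arbitrary choices of mine, and in practice the provider's own budget alerts are the first thing to turn on.

```python
from statistics import median


def spike_days(daily_costs, factor=5.0):
    """Return indices of days whose cost exceeds factor x the median daily cost."""
    baseline = median(daily_costs)
    if baseline <= 0:
        return []  # no meaningful baseline to compare against
    return [i for i, cost in enumerate(daily_costs) if cost > factor * baseline]


# Made-up daily LLM spend: five days hugging the floor, one poking the sky.
costs = [2.0, 3.0, 2.5, 80.0, 2.0, 3.5]
flagged = spike_days(costs)
print(flagged)  # [3]
```

The median is used rather than the mean so that the spike itself doesn't drag the baseline up and hide.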

Vibe coding really did lower the bar for a non-engineer to build production. But seeing "how it can break" and "how it can get expensive" is still a separate skill. That part is still the job of the engineer who inherits it.

You can build the feature in two days. Preventing the moment where "gracefully retry the failure" mutates into "throw away the success and double-bill" — that doesn't come in two days.

Retry isn't always kindness.

I never want to see an AI invoice that's bigger than the server bill again. So I'm leaving this here, as a warning to myself.