What engineers actually do when an OpenAI or Anthropic call fails in production

I've spent the last few weeks talking to engineers who personally handle production incidents involving OpenAI and Anthropic APIs. Not people who read about it — people who got paged at 2am, opened their dashboards, and had to figure out what broke.

Here are a few patterns that keep coming up. I'd love to hear whether these match your experience, or where your incidents looked completely different.

The failure that doesn't show up in your dashboard

The most frustrating incident pattern I keep hearing about: the LLM call completes cleanly. 200, tokens logged, no errors. But the customer output is wrong. The failure happened upstream — in context assembly, retrieval, or how the prompt was built — and the model responded correctly to a broken input. Your trace shows a green success. Your customer is angry.

For voice pipelines, the equivalent is even more hidden: an endpointer fires early, half the user's input never reaches the model, and the LLM logs look perfect.
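One way to make this class of failure visible is to check the assembled inputs before the call and attach the result to the trace. Here's a minimal sketch of that idea. Everything in it is an assumption for illustration: the function name `check_context`, the problem labels, and the thresholds are hypothetical, and the right checks depend on your pipeline.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical pattern for template slots that were never filled, e.g. "{context}".
_UNFILLED = re.compile(r"\{[A-Za-z_]\w*\}")


@dataclass
class ContextReport:
    """Outcome of checking the assembled inputs before the model call."""
    ok: bool
    problems: List[str] = field(default_factory=list)


def check_context(user_input: str, retrieved_docs: List[str], prompt: str,
                  min_docs: int = 1, max_prompt_chars: int = 48_000) -> ContextReport:
    """Flag input problems that a 200 response from the model would hide."""
    problems = []
    if not user_input.strip():
        problems.append("empty_user_input")
    if len(retrieved_docs) < min_docs:
        problems.append("too_few_retrieved_docs")
    if any(not doc.strip() for doc in retrieved_docs):
        problems.append("blank_retrieved_doc")
    if _UNFILLED.search(prompt):
        problems.append("unfilled_placeholder")
    if len(prompt) > max_prompt_chars:
        # Assumed threshold; oversized prompts are a common truncation risk.
        problems.append("prompt_too_long")
    return ContextReport(ok=not problems, problems=problems)


report = check_context("What is the refund policy?", [], "Answer using {context}")
print(report.ok, report.problems)
```

Logging `report.problems` next to the LLM span means a "green" trace at least carries a note that retrieval came back empty or the template was not filled in.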

The two 429s that require completely different fixes

Rate-limit hits (RPM/TPM) and quota exhaustion both return 429. Engineers consistently lose time debugging because they treat the two the same. Exponential backoff clears a rate limit in seconds, but it does nothing for quota exhaustion, which will not clear until the quota is raised or resets. The signal is in the error body (on OpenAI, for example, rate_limit_exceeded vs. insufficient_quota), but most error-handling code only checks the status code.
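A minimal sketch of branching on the error body instead of the status code alone. The code sets below are assumptions, not an authoritative list: check your provider's error reference for the exact strings. The names `classify_429`, `Action`, and `backoff_delay` are hypothetical helpers, and the body shape (`{"error": {"code": ..., "type": ...}}`) is assumed.

```python
import json
import random
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    FAIL_FAST = "fail_fast"
    NOT_A_429 = "not_a_429"


# Assumed identifiers for quota exhaustion; verify against your provider's docs.
QUOTA_CODES = {"insufficient_quota", "tokens_quota_exceeded"}


def classify_429(status: int, body: str) -> Action:
    """Decide whether a 429 is worth retrying, based on the error body."""
    if status != 429:
        return Action.NOT_A_429
    try:
        err = json.loads(body).get("error", {})
    except (ValueError, AttributeError):
        err = {}
    if not isinstance(err, dict):
        err = {}
    signals = {v for v in (err.get("code"), err.get("type")) if isinstance(v, str)}
    if signals & QUOTA_CODES:
        # Backoff will not help; alert a human or switch keys/providers.
        return Action.FAIL_FAST
    # Unknown or rate-limit 429s: retry, but with a capped number of attempts.
    return Action.RETRY_WITH_BACKOFF


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter, in seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


print(classify_429(429, '{"error": {"code": "insufficient_quota"}}'))
```

Defaulting unrecognized 429s to a capped retry is a design choice: it keeps the common rate-limit case working while making sure quota exhaustion pages someone instead of looping.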

When the model alias changes under you

Hardcoding a non-versioned model alias is a source of quiet regressions. The provider repoints the alias to a newer model, output behavior shifts, and no deployment has happened on your side. In the cases I heard about, the regression took days to find because nobody thought to check the model version.
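A minimal sketch of catching this earlier, assuming the response body carries a `model` field naming the model that actually served the request, and that pinned snapshots end in a date stamp. The helpers `is_pinned` and `check_model` and the warning labels are hypothetical, and the date-suffix rule is a heuristic that will miss snapshot names using other schemes.

```python
import re
from typing import List, Optional

# Heuristic (assumption): a pinned snapshot ends in YYYY-MM-DD or YYYYMMDD.
_DATED = re.compile(r"(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model: str) -> bool:
    """True if the model name looks like a dated snapshot rather than an alias."""
    return bool(_DATED.search(model))


def check_model(requested: str, served: str,
                last_served: Optional[str] = None) -> List[str]:
    """Return warnings to log alongside the request trace."""
    warnings = []
    if not is_pinned(requested):
        warnings.append("unpinned_alias")
    if last_served is not None and served != last_served:
        # The model behind the request changed with no deploy on our side.
        warnings.append("served_model_changed")
    return warnings


# `served` would come from the response body; `last_served` from your own store.
print(check_model("gpt-4o", "gpt-4o-2024-08-06", "gpt-4o-2024-05-13"))
```

Recording the served model on every trace turns "nobody thought to check the model version" into a field you can filter on when output quality shifts.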

If you've personally debugged production AI API failures in the last few months, I'm curious:

What was the first symptom that alerted you something was wrong?
What did you check first, and was that the right call?
How long from first alert to a working fix?

I'm mapping the actual diagnosis workflow engineers use — what they try, in what order, and what slows them down. Not the docs version of debugging. The 2am version.

If you've been through one of these, drop a comment with the failure class and roughly how long it took. Even a one-liner helps.
