Sol

Posted on Jul 1

What actually breaks when OpenAI and Anthropic APIs fail in production (and what to check first)

#finops #devops #aiops #llm

I've spent the last few months collecting patterns from production incidents involving the OpenAI and Anthropic APIs. These are the failure classes that keep appearing — and what to check first when you're on-call and something breaks at 2am.

The 6 failure classes engineers hit most

1. 429 Rate Limits — and two meters engineers confuse

OpenAI has two separate rate limit meters: RPM (requests per minute) and TPM (tokens per minute). It's easy to stare at the wrong one while the other is the one throttling you.

If you're getting 429s on low-volume requests that use large prompts or long outputs, you've hit TPM — the response header x-ratelimit-remaining-tokens tells you. If you're getting 429s despite low token counts, it's RPM — check x-ratelimit-remaining-requests.

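Anthropic has the same structure under its own header names: anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining report separate meters, and input and output tokens are also tracked separately. The fix for each is different (spacing requests out vs. sending fewer tokens per request), so misdiagnosing this can easily double your resolution time.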
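Here's a minimal sketch of that check, using the OpenAI header names quoted above. The function name and the "lowest fraction remaining" heuristic are my own assumptions, not anything the providers document, so treat the result as a hint about where to look first.

```python
from typing import Mapping, Optional


def likely_exhausted_meter(headers: Mapping[str, str]) -> Optional[str]:
    """Guess which meter ("requests" or "tokens") is closest to exhaustion.

    Heuristic only: compares remaining/limit for each meter and returns the
    one with the smaller fraction left. Returns None if headers are missing.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}

    def fraction_left(kind: str) -> Optional[float]:
        try:
            limit = int(lowered[f"x-ratelimit-limit-{kind}"])
            remaining = int(lowered[f"x-ratelimit-remaining-{kind}"])
        except (KeyError, ValueError):
            return None
        return remaining / limit if limit > 0 else None

    known = {}
    for kind in ("requests", "tokens"):
        fraction = fraction_left(kind)
        if fraction is not None:
            known[kind] = fraction
    if not known:
        return None
    return min(known, key=known.get)
```

Log the result next to every 429 so the on-call engineer sees which meter to look at without opening a dashboard.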

2. Quota exhaustion vs. rate limiting

Both produce 429 responses, but mean completely different things:

Rate limit hit: you're going too fast; exponential backoff with jitter will clear it in seconds
Quota exhaustion: you've used your monthly/daily allocation; no retry strategy helps — you need to upgrade tier or wait for the reset

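The most reliable signal is the error body, not the status code alone. Parse the error.type and error.code fields: quota exhaustion and throttling are reported with distinct strings. OpenAI, for example, reports quota exhaustion as insufficient_quota and throttling as rate_limit_exceeded. Check your own logs for the exact strings your provider returns.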
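A rough sketch of that parsing step follows. The body shape (a top-level "error" object with "type" and "code" fields) and the marker strings in QUOTA_MARKERS are assumptions, so verify both against real error bodies from your own logs before relying on them.

```python
import json

# Assumed quota markers; extend this set with whatever your provider returns.
QUOTA_MARKERS = frozenset({"insufficient_quota", "tokens_quota_exceeded"})


def classify_429(body: str) -> str:
    """Classify a 429 response body as "quota", "rate_limit", or "unknown"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return "unknown"
    # Look at both fields, since providers differ on where the label lives.
    labels = {
        value
        for value in (error.get("type"), error.get("code"))
        if isinstance(value, str)
    }
    if labels & QUOTA_MARKERS:
        return "quota"  # retrying will not help
    return "rate_limit"  # back off and retry
```

Branching on the result keeps a retry loop from hammering an account that has simply run out of quota.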

3. Provider overload errors (OpenAI 500s, Anthropic 529s)

OpenAI's 500-503 and Anthropic's 529 ("API temporarily overloaded") are provider-side capacity issues — not your code. The right response is exponential backoff with jitter.

But: if you see these at consistent times of day, it's a recurring capacity pattern, not random noise. Flag it to your team, because it affects when you schedule heavy workloads and deployments.
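For reference, a minimal sketch of exponential backoff with "full jitter" (a random delay between zero and the exponential ceiling). The call argument is a hypothetical function returning a (status, body) tuple, and the base, cap, and attempt counts are illustrative defaults, not provider recommendations.

```python
import random
import time
from typing import Callable, Tuple

# Provider-side overload statuses mentioned above.
RETRYABLE = frozenset({500, 502, 503, 529})


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: uniform random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    call: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Invoke `call` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        # Do not sleep after the final attempt; just return the last failure.
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return status, body
```

The sleep parameter is injectable so the loop can be tested without real waiting.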

4. Streaming connection drops mid-response

SSE streams can stop delivering tokens without an explicit error. The client sees no failure; it just stops receiving. Catching this requires a separate idle timeout on the stream consumer (the maximum gap you'll tolerate between chunks), not just a timeout on the initial connection. Most incident reports I've seen here come from treating the stream timeout and the connection timeout as the same thing. They're not.
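One way to sketch a consumer-side idle timeout, assuming your client exposes the stream as an async iterator of chunks. The names with_idle_timeout and StreamStalled are hypothetical and don't belong to any SDK.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


class StreamStalled(Exception):
    """Raised when no chunk arrives within the idle timeout."""


async def with_idle_timeout(
    stream: AsyncIterator[T], idle_timeout: float
) -> AsyncIterator[T]:
    """Yield chunks from `stream`, failing if the gap between chunks is too long."""
    iterator = stream.__aiter__()
    while True:
        try:
            # The timer restarts for every chunk, so long responses are fine
            # as long as tokens keep arriving.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_timeout)
        except StopAsyncIteration:
            return  # stream finished normally
        except asyncio.TimeoutError:
            raise StreamStalled(f"no chunk received for {idle_timeout}s") from None
        yield chunk
```

The timeout applies per chunk rather than to the whole response, which is the distinction that connection-level timeouts miss.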

5. Model alias deprecation surprises

OpenAI model aliases such as gpt-4-turbo-preview, gpt-4-turbo, and gpt-4o can be repointed to newer snapshots, so behavior can change silently even when you haven't changed any code. If an AI feature's output quality changes without a deployment, check whether a model alias was migrated. In production, pin to explicit versioned (dated) model IDs.
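If you can't pin everywhere, one cheap safeguard is to record which model actually served each request. This sketch assumes the API response carries the resolved model identifier (for example in a model field) and that your code passes it in. AliasWatcher and the model names in the usage lines are hypothetical.

```python
from typing import Dict


class AliasWatcher:
    """Track the resolved model per requested alias and flag changes."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, str] = {}

    def observe(self, requested: str, resolved: str) -> bool:
        """Record one response; return True if the alias now resolves differently."""
        previous = self._last_seen.get(requested)
        self._last_seen[requested] = resolved
        return previous is not None and previous != resolved


# Usage with made-up identifiers:
watcher = AliasWatcher()
watcher.observe("example-alias", "example-alias-2024-01-01")
changed = watcher.observe("example-alias", "example-alias-2024-06-01")
```

Emitting a log line or metric when observe returns True gives you a timestamp to correlate with a quality regression.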

6. Context window overflow in long sessions

An invalid_request_error caused by exceeding the model's context window (the prompt tokens plus the requested max_tokens) almost never shows up in testing, where sessions are short, but surfaces under real usage when users get deep into extended conversations. This one tends to appear in customer support at inconvenient times.
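A common mitigation is trimming older turns before each request. The sketch below assumes messages are dicts with role and content keys and uses a crude four-characters-per-token estimate. That estimate is an assumption for illustration only, so swap in your provider's tokenizer for real budgets.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (about 4 characters per token). Assumption only."""
    return max(1, len(text) // 4)


def trim_history(
    messages: List[Message],
    budget: int,
    count: Callable[[str], int] = estimate_tokens,
) -> List[Message]:
    """Keep system messages plus the most recent turns that fit in `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count(m["content"]) for m in system)

    kept: List[Message] = []
    # Walk backwards from the newest turn and stop at the first one that
    # would push the total over budget.
    for message in reversed(rest):
        cost = count(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))
```

Set the budget below the model's context window minus the max_tokens you request, so the reply still has room.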

What helps — and what doesn't

What engineers tell me they reach for first: app logs for the raw API error body, then vendor status pages, then SDK source to decode the error type. What slows them down most: ambiguous error messages that don't distinguish rate limit type from quota type, and vendor dashboards that aggregate too coarsely to pinpoint which call failed.

I'm researching this

I'm currently conducting a short research study on how teams that ship customer-facing AI features handle production API incidents. I want to understand the actual debugging workflow: what you check first, how long resolution takes, what tools are in the loop.

If you've debugged any of the above in production and are willing to share 20 minutes, drop a comment or reach me at the contact in my profile. I'm particularly interested in TypeScript and Python teams using OpenAI or Anthropic in production.

No pitch. Trying to map the real pain so we can build something that actually helps.

DEV Community