Production AI API failures by category: what 429s, 529s, and timeouts are actually telling you

#openai #debugging #production #ai

When an LLM feature pages your team at 2am, the error message is rarely the whole story. After running OpenAI and Anthropic integrations in production for the past year across several SaaS products, I've started categorizing failures not by HTTP status code but by what the failure pattern means for your debugging path.

Here's the taxonomy that's saved us the most time.

Category 1: Capacity failures (429 and 529)

The 429 from OpenAI and the 529 from Anthropic both look like "slow down and retry," but they mean different things in ways that matter:

OpenAI 429 comes in three flavors that share the same status code:

Rate limit on requests-per-minute (RPM): recovers in seconds
Rate limit on tokens-per-minute (TPM): usually recovers within about 60s; the limit itself depends on your usage tier and model
Monthly quota exhaustion: doesn't recover until the billing cycle resets or you add credits

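All three return 429. The error.code field separates the two rate limits (rate_limit_exceeded) from quota exhaustion (insufficient_quota), and the error message usually says whether the limit you hit was on requests or tokens. Treating a quota exhaustion with exponential backoff will burn your on-call for 45 minutes before you check the dashboard.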

Anthropic 529 specifically signals overload rather than your quota. Your retry logic should treat it identically to a 503 — the provider is saturated, not you. Backoff + alert, but it's not your problem to fix.

Debugging path: Check error.code and error.type before deciding whether to backoff, alert, or escalate. Don't let a unified 429 handler mask a billing problem.
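Here's a minimal sketch of that check. It assumes you already have the status code and the parsed JSON error body in hand; the Action names and the classify_capacity_error helper are made up for illustration, and the body shapes are what I'd expect from each provider's documented error format, so verify them against real responses before relying on this.

```python
from enum import Enum


class Action(Enum):
    BACKOFF = "backoff"                      # transient: retry with exponential backoff
    ALERT_BILLING = "alert_billing"          # will not recover on its own
    PROVIDER_OVERLOAD = "provider_overload"  # back off and alert; nothing to fix locally
    UNKNOWN = "unknown"


def classify_capacity_error(status: int, body: dict) -> Action:
    """Map a 429/529 response to an action using the error body, not the status alone."""
    error = body.get("error") or {}
    code = error.get("code")
    err_type = error.get("type")

    if status == 429:
        # Quota exhaustion shares the 429 status but never clears with backoff.
        if "insufficient_quota" in (code, err_type):
            return Action.ALERT_BILLING
        return Action.BACKOFF
    if status == 529 or err_type == "overloaded_error":
        return Action.PROVIDER_OVERLOAD
    return Action.UNKNOWN
```

The point of returning an action rather than a boolean "retryable" flag is that the billing case gets routed to a human immediately instead of disappearing into the retry loop.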

Category 2: Invalid request failures (400)

These are the most embarrassing in incident retros because they're always our fault. But they're harder to catch than they look:

Model version mismatches: You updated the model name in one place but not in the retry handler. Or the model you were calling was deprecated last month, and what you expected to surface as a 404 now comes back as a 400 with a confusing message.
Context window overflow: The request built up too much conversation history. The error says context_length_exceeded but the root cause is usually upstream — a broken truncation step or a user with a very long session.
Schema validation failures for structured outputs: With function calling and JSON mode, the schema you're sending is sometimes rejected for subtle reasons (recursive references, unsupported types). These are hard to reproduce locally.
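For the context window case, it helps to have a truncation step small enough to unit test. The sketch below is one possible shape, not how any particular SDK does it: truncate_history and estimate_tokens are hypothetical helpers, and the 4-characters-per-token estimate is a rough placeholder you'd replace with a real tokenizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def truncate_history(
    messages: List[Message],
    max_tokens: int,
    count: Callable[[str], int] = estimate_tokens,
) -> List[Message]:
    """Keep system messages plus the most recent turns that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m["content"]) for m in system)

    kept: List[Message] = []
    # Walk backwards so the newest turns are kept first.
    for message in reversed(rest):
        cost = count(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))
```

Passing the counting function in as a parameter makes it easy to test the logic with len and to swap in the provider's tokenizer in production.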

Debugging path: Log the full request payload on 400 errors (after redacting any user PII). The response body tells you exactly what field failed validation. The challenge is getting the failing payload, not reading the error.
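A minimal sketch of redacted payload logging, assuming a chat-style payload with a "messages" list. The redact_payload and log_bad_request helpers and the logger name are hypothetical; this version only blanks message content, so adjust it if PII can appear elsewhere in your requests (tool arguments, metadata fields).

```python
import copy
import json
import logging

logger = logging.getLogger("llm.requests")


def redact_payload(payload: dict) -> dict:
    """Return a copy of a chat-style payload with message content replaced by a placeholder."""
    redacted = copy.deepcopy(payload)
    for message in redacted.get("messages", []):
        content = message.get("content")
        if isinstance(content, str):
            # Keep the length: it is often the clue for context-window problems.
            message["content"] = f"<redacted {len(content)} chars>"
        elif content is not None:
            message["content"] = "<redacted non-text content>"
    return redacted


def log_bad_request(payload: dict, response_body: dict) -> None:
    """Log the structure of a rejected request next to the provider's error body."""
    logger.error(
        "400 from provider: error=%s payload=%s",
        json.dumps(response_body),
        json.dumps(redact_payload(payload)),
    )
```

Model name, parameters, and any tool or schema definitions survive redaction, which is usually what you need to reproduce a 400.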

Category 3: Timeout failures (read timeout vs connect timeout)

Timeouts are where most teams' observability breaks down because the failure is silent from the provider's perspective — the request was processing, and we interrupted it.

Connect timeout: The TCP connection (and, depending on your HTTP client, the TLS handshake) didn't complete within your timeout. This often happens during provider brownouts that precede a full outage, or due to DNS/networking issues on your side. Check provider status AND your outbound network.
Read timeout: The model started responding but didn't finish. For streaming responses, this may mean partial output was delivered. Your application needs to handle the difference between "timed out before first token" and "timed out mid-stream." They have different UX implications.
Gateway timeout (504): Your proxy or load balancer timed out before your configured timeout. The request may still be processing at the provider. Don't retry without deduplication.

Debugging path: Separate your connect timeout from your read timeout in your HTTP client config. Log the request start time, the time-to-first-token, and the total duration. Comparing them tells you whether the latency is in setup or in generation.
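On the config side, most Python HTTP clients already let you set these separately: requests accepts a (connect, read) tuple for timeout, and httpx.Timeout takes connect and read keyword arguments. For the measurement side, here's a rough sketch that wraps any iterator of streamed chunks. StreamResult and consume_stream are hypothetical names, and the sketch assumes read timeouts surface as the built-in TimeoutError; real clients raise their own exception types, so adapt the except clause.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class StreamResult:
    chunks: List[str] = field(default_factory=list)
    time_to_first_token: Optional[float] = None  # seconds; None if nothing arrived
    total_time: float = 0.0
    outcome: str = "completed"  # or "timeout_before_first_token" / "timeout_mid_stream"


def consume_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> StreamResult:
    """Drain a stream, recording time-to-first-token and where a timeout hit."""
    result = StreamResult()
    start = clock()
    try:
        for chunk in stream:
            if result.time_to_first_token is None:
                result.time_to_first_token = clock() - start
            result.chunks.append(chunk)
    except TimeoutError:
        # Partial output means the user may already have seen text.
        result.outcome = (
            "timeout_mid_stream" if result.chunks else "timeout_before_first_token"
        )
    result.total_time = clock() - start
    return result
```

Keeping the partial chunks on the result lets the caller decide whether to show what arrived, retry from scratch, or both.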

Category 4: Server errors (500, 503)

These are provider failures. Actionable steps are limited, but how you handle them determines your user experience:

A 500 from OpenAI or Anthropic rarely repeats on a retry. In our experience, a single retry after 1-2 seconds resolves roughly 70% of them.
A 503 means the service is degraded. Check the status page. If status.openai.com or status.anthropic.com shows an incident, your retry logic is just adding load — switch to circuit-breaker mode.
Document which model/endpoint you were hitting when the 500 occurred. OpenAI has different reliability profiles across gpt-4o, o3, and the older models.
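"Circuit-breaker mode" can be as small as a counter and a timestamp. The class below is a simplified sketch, not a library API: the CircuitBreaker name, the threshold of 5, and the 30-second cooldown are all arbitrary assumptions, and a production version would also need thread safety and a stricter half-open state.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling the provider after repeated failures, then probe again after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: only let requests through again once the cooldown has elapsed.
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            # (Re)open the circuit and restart the cooldown.
            self._opened_at = self._clock()
```

Call allow_request() before each provider call, and record_success() or record_failure() after it. When the breaker is open, fail fast or serve a fallback instead of adding load to a degraded provider.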

The debugging question no one asks first

Every time I'm on call for an AI API incident, the first question should be: is this my code, my configuration, or the provider? The categories above help answer that in under 60 seconds.

Most teams skip straight to logs → docs → Slack, which burns 15-20 minutes before they realize the OpenAI status page has been showing "degraded" for the past hour.

The tools I've found most useful in order: (1) the SDK's error type/code fields, (2) the provider status page, (3) your own request logs with full error bodies, (4) the provider's playground to test if the model is responsive.

I'm currently researching how production teams actually handle this in practice — what's your mental model when you see a failing trace? Drop a comment or reach out if you're willing to share your on-call runbook.