DEV Community

Void Stitch
5 things I wish I knew before my first LLM API incident in production

When an LLM API call fails in production, most engineers find the same problem: the error in the logs doesn't tell you what to actually do next.

I've been collecting patterns from OpenAI and Anthropic production incidents. Here are the five things that consistently slow resolution — and what we should be logging instead.

1. Log the error type, not just the status code

A 429 from OpenAI can mean at least three different things: RPM rate limit hit, TPM rate limit hit, or quota exhausted. Each requires a different fix. Logging only status: 429 is like logging HTTP 500 for every server error.

Log the error.type and error.code fields from the response body. Values such as rate_limit_exceeded and insufficient_quota are machine-readable and separate "slow down and retry" from "fix billing, retrying won't help".
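A minimal sketch of what that could look like, assuming the provider returns a JSON body shaped like {"error": {"type": ..., "code": ...}}. The function name classify_429 and the labels it returns are made up for illustration, and the specific strings it matches (insufficient_quota, rate_limit_exceeded, rate_limit_error) are the values I have seen in practice, so check them against your provider's current error reference.

```python
from typing import Any, Mapping


def classify_429(body: Mapping[str, Any]) -> str:
    """Return a coarse, loggable label for a 429 based on its JSON error body.

    Hypothetical helper: the labels returned here are our own, not the provider's.
    """
    err = body.get("error") or {}
    code = err.get("code") or ""
    err_type = err.get("type") or ""

    # Quota exhaustion is a billing problem: retrying will not help.
    if "insufficient_quota" in (code, err_type):
        return "quota_exhausted"

    # Rate limits are transient: back off and retry.
    # "rate_limit_error" is the type string assumed for Anthropic-style bodies.
    if code == "rate_limit_exceeded" or err_type == "rate_limit_error":
        return "rate_limited"

    # Anything else: log the raw body and look at it by hand.
    return "unknown_429"
```

Logging this label next to the raw error.type and error.code keeps the original values available if the mapping turns out to be wrong.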

2. Know the difference between RPM and TPM rate limits

OpenAI has two separate meters. Engineers almost always look at the wrong one.

  • If you're hitting 429s on low-volume requests with large prompts, check x-ratelimit-remaining-tokens: you've hit TPM (fix: shrink the prompt, cap output length, or batch differently).
  • If you're hitting 429s despite low token counts, check x-ratelimit-remaining-requests — you've hit RPM (fix: reduce request frequency).

Anthropic has the same structure, but under its own header names: anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining. Log both headers on every failed call.
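One way to make "log both headers" a habit is a small helper that pulls the rate-limit headers out of any failed response. This is a sketch, not a client integration: ratelimit_snapshot is a hypothetical name, the input is assumed to be a plain mapping of header names to strings, and the Anthropic names listed are the anthropic-ratelimit-* headers as I understand their docs, so verify them before relying on this.

```python
from typing import Dict, Mapping, Optional

# Header names are assumptions based on provider docs; adjust to what you observe.
RATE_LIMIT_HEADERS = (
    "x-ratelimit-remaining-requests",          # OpenAI, RPM meter
    "x-ratelimit-remaining-tokens",            # OpenAI, TPM meter
    "anthropic-ratelimit-requests-remaining",  # Anthropic, request meter
    "anthropic-ratelimit-tokens-remaining",    # Anthropic, token meter
    "retry-after",                             # standard HTTP backoff hint
)


def ratelimit_snapshot(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Extract rate-limit headers (case-insensitively) for structured logging.

    Headers that are absent come back as None so the log schema stays stable.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered.get(name) for name in RATE_LIMIT_HEADERS}
```

Values are kept as raw strings on purpose: parsing them can fail in interesting ways during an incident, and the raw value is what you want in the log.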

3. Treat streaming connection drops separately

An SSE stream can stop delivering tokens without closing cleanly. The client receives no exception, no HTTP error — it just stops. If your timeout logic only covers the initial connection, you'll miss mid-stream drops entirely.

You need two separate timeouts: one for the initial connection, and one for idle time between received tokens. Most incident reports I've seen here involve the second one being missing.
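The idle timeout is the one that usually has to be written by hand, since the connection timeout is typically a setting on the HTTP client. Here is a minimal asyncio sketch of it, assuming the stream is exposed as an async iterator of chunks. with_idle_timeout and StreamIdleTimeout are names invented for this example, not part of any SDK.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


class StreamIdleTimeout(Exception):
    """Raised when no chunk arrives within the idle window."""


async def with_idle_timeout(
    stream: AsyncIterator[T], idle_seconds: float
) -> AsyncIterator[T]:
    """Re-yield chunks from `stream`, failing if the gap between chunks is too long."""
    iterator = stream.__aiter__()
    while True:
        try:
            # The timer restarts for every chunk, so this bounds idle time,
            # not total stream duration.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_seconds)
        except StopAsyncIteration:
            return  # the stream closed cleanly
        except asyncio.TimeoutError:
            raise StreamIdleTimeout(f"no data for {idle_seconds}s") from None
        yield chunk
```

Usage would be along the lines of `async for chunk in with_idle_timeout(sdk_stream, 20.0): ...`, with StreamIdleTimeout logged as its own error type so mid-stream stalls show up separately from connection failures.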

4. Pin explicit model versions in production

OpenAI's model aliases (gpt-4-turbo-preview, gpt-4-turbo) have silently pointed at different underlying versions over time. If an AI feature's output quality or cost changes without a deployment, the first thing to check is whether a model alias was migrated.

For production workloads, pin to explicit versioned model IDs. Leave alias resolution to dev/staging.
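A cheap guard for this is to compare the model ID you requested with the one reported back in the response, and to refuse anything that is not on an explicit allowlist. The sketch below assumes the response payload carries a model field naming the version that actually served the request. check_model_drift is a hypothetical helper, and the IDs in the usage note are only illustrative.

```python
from typing import FrozenSet, List


def check_model_drift(
    requested: str, resolved: str, pinned: FrozenSet[str]
) -> List[str]:
    """Return human-readable warnings about alias use or version drift.

    `requested` is the model ID sent in the request, `resolved` is the model ID
    reported in the response, and `pinned` is the allowlist of explicit versions.
    """
    warnings = []
    if requested not in pinned:
        warnings.append(f"requested model {requested!r} is not a pinned version")
    if resolved != requested:
        warnings.append(f"requested {requested!r} but was served by {resolved!r}")
    return warnings
```

For example, requesting gpt-4-turbo-preview with only gpt-4-0125-preview pinned would produce two warnings, one for the alias and one for the mismatch, while requesting the pinned ID directly produces none. Emitting these as log lines or metrics gives you a timestamp for when an alias moved.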

5. Track fast-fail error rate separately from latency

Anthropic's 529 (API overloaded) returns in milliseconds — so your overall latency p50/p95 might look fine while every AI feature is degraded. Monitor error_rate_by_type (not just overall error rate), and flag fast failures above a threshold as a separate alert.
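As a rough sketch of the bookkeeping involved, the class below counts errors by type and separately counts failures that returned faster than a threshold. ErrorStats, the 200 ms default, and the error-type strings in the usage note are assumptions for illustration. In a real system these would be counters in your metrics backend, not an in-memory object.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ErrorStats:
    """In-memory tally of LLM call outcomes (illustrative only)."""

    fast_fail_ms: float = 200.0  # assumed threshold; tune to your traffic
    total: int = 0
    fast_failures: int = 0
    by_type: Counter = field(default_factory=Counter)

    def record(self, duration_ms: float, error_type: Optional[str] = None) -> None:
        """Record one call; pass error_type=None for a success."""
        self.total += 1
        if error_type is None:
            return
        self.by_type[error_type] += 1
        # A failure that returns almost instantly barely moves latency
        # percentiles, so it is counted on its own.
        if duration_ms < self.fast_fail_ms:
            self.fast_failures += 1

    def error_rate_by_type(self) -> Dict[str, float]:
        """Fraction of all calls that failed, broken down by error type."""
        if self.total == 0:
            return {}
        return {name: count / self.total for name, count in self.by_type.items()}

    def fast_fail_rate(self) -> float:
        """Fraction of all calls that failed faster than the threshold."""
        return self.fast_failures / self.total if self.total else 0.0
```

Recording one success, two quick overloaded_error failures, and one slow rate_limit_error gives a fast-fail rate of half of all calls, even though the latency of the successful call looks normal. That is the signal worth alerting on.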


Running a research project on this: I'm trying to understand where the actual pain is for engineers debugging LLM provider incidents. If you ship AI features (TypeScript or Python, OpenAI or Anthropic in your production path) and you've personally handled at least two provider-side incidents in the last 90 days — I'd like to hear about it. 15 minutes, no pitch, just trying to understand where existing tooling falls short before building anything.

Drop a comment with your stack and whether you're open to a short chat.
