plasma

Posted on Jul 2

Stop Treating LLM API Errors Like Normal HTTP Errors

#llm #ai #api #webdev

Most backend engineers already know how to handle HTTP errors.

400 means the request is bad.

401 means auth failed.

429 means rate limited.

500 means something broke upstream.

Retry a few times, add exponential backoff, log the response body, move on.

That works fine for many APIs.

It works badly for LLM APIs.

LLM providers may use normal HTTP status codes, but the operational meaning behind those errors is different enough that treating them like ordinary REST failures can make your app slower, more expensive, and harder to debug.

The mistake I kept making

Early on, I handled LLM failures the same way I handled every other external API:

if (response.status === 429 || response.status >= 500) {
  retryWithBackoff();
}

Simple. Familiar. Dangerous.

That logic misses the actual question your app needs to answer:

What kind of LLM failure happened, and what should the product do next?

Because an LLM API failure is rarely just "one HTTP request failed."

It can break:

a user-facing chat response
a background agent run
a document generation job
a tool-calling workflow
a batch evaluation pipeline
a structured JSON generation step

And each one needs different handling.

Not all 429s mean the same thing

For a normal API, 429 Too Many Requests usually means:

Slow down and retry later.

With LLM APIs, 429 can mean several different things.

It might be a temporary rate limit:

{
  "error": {
    "message": "Rate limit reached",
    "type": "rate_limit_error"
  }
}

Retrying with backoff may help here.

But it might also mean quota exhaustion:

{
  "error": {
    "message": "You exceeded your current quota",
    "type": "insufficient_quota"
  }
}

Retrying this does not help. It just adds latency, noisy logs, and a worse user experience.

It could also be model-specific pressure. One model may be overloaded while another model from the same provider, or a different provider, would work fine.

So your handler should distinguish between:

temporary rate limit
hard quota exhaustion
model-level capacity issue
billing or account issue

Those should not all go through the same retry path.
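As a rough sketch of that split, the function below sorts a 429 body into separate buckets. The `RateLimitKind` names, the `ProviderErrorBody` shape, and the message keywords are assumptions modeled on the JSON examples above, not any provider's documented contract, so treat the matching rules as placeholders to tune per provider.

```typescript
// Hypothetical internal buckets for a 429; the names are illustrative.
type RateLimitKind =
  | "temporary_rate_limit"
  | "quota_exhausted"
  | "model_overloaded"
  | "billing_issue"
  | "unknown_429";

// Assumed error body shape, modeled on the JSON examples above.
type ProviderErrorBody = {
  error?: { message?: string; type?: string };
};

function classify429(body: ProviderErrorBody): RateLimitKind {
  const type = body.error?.type ?? "";
  const message = (body.error?.message ?? "").toLowerCase();

  // Prefer the machine-readable type; fall back to keyword matching.
  if (type === "insufficient_quota" || message.includes("quota")) {
    return "quota_exhausted";
  }
  if (message.includes("billing") || message.includes("payment")) {
    return "billing_issue";
  }
  if (message.includes("overloaded") || message.includes("capacity")) {
    return "model_overloaded";
  }
  if (type === "rate_limit_error" || message.includes("rate limit")) {
    return "temporary_rate_limit";
  }
  return "unknown_429";
}
```

Only `temporary_rate_limit` is an obvious candidate for backoff. The other buckets usually need a different path, such as alerting, queueing, or switching models.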

Retrying can multiply the damage

Retries are one of the easiest ways to make an LLM system worse while thinking you made it more reliable.

Imagine this:

A user asks your app to generate a report.
Your backend calls an LLM.
The request times out after 30 seconds.
Your code retries three times.

That sounds reasonable.

But the first request may still be running provider-side. The second request may generate a duplicate answer. The third request may hit a rate limit. The user waits longer. Your token bill goes up. Your logs now show four attempts for one user action.

For simple completions, this is annoying.

For agent workflows, it can be dangerous.

If the LLM call can trigger tools, send emails, update tickets, write to a database, or call external systems, blind retries can create duplicate side effects.

Your retry policy needs to know what kind of operation it is retrying.

type LlmOperationType =
  | "simple_completion"
  | "streaming_chat"
  | "tool_calling_agent"
  | "background_batch_job"
  | "structured_output";

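function shouldRetry(
  operation: LlmOperationType,
  status: number,
  errorType?: string
): boolean {
  // Never retryable: exhausted quota, auth failures, malformed requests.
  if (errorType === "insufficient_quota") return false;
  if (status === 401 || status === 403) return false;
  if (status === 400) return false;

  // Agents and streams may already have produced side effects or partial
  // output, so retry them only on provider-side (5xx) failures, not on 429.
  if (operation === "tool_calling_agent" || operation === "streaming_chat") {
    return status >= 500;
  }

  // Everything else can retry on rate limits and provider errors.
  return status === 429 || status >= 500;
}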

This is still simplified, but the idea matters:

retry policy should understand the LLM task, not just the HTTP status code.
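Deciding whether to retry is only half of it. The other half is how long to wait and when to give up. Here is a minimal sketch of capped exponential backoff with full jitter. `BackoffPolicy`, `nextRetryDelayMs`, and the idea of passing in an already parsed `Retry-After` value are assumptions for illustration, not part of any SDK.

```typescript
// Hypothetical policy knobs; tune them per operation type.
type BackoffPolicy = {
  baseDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number;
};

// Returns the delay before the next retry, or null when the retry budget
// is spent. `attempt` is 1 for the first retry. `retryAfterSeconds` is the
// provider's Retry-After value, already parsed by the caller if present.
function nextRetryDelayMs(
  policy: BackoffPolicy,
  attempt: number,
  retryAfterSeconds?: number,
  random: () => number = Math.random
): number | null {
  if (attempt > policy.maxAttempts) return null;

  // Exponential ceiling: base * 2^(attempt - 1), capped at maxDelayMs.
  const ceiling = Math.min(
    policy.maxDelayMs,
    policy.baseDelayMs * 2 ** (attempt - 1)
  );

  // Full jitter: pick a delay uniformly below the ceiling so that many
  // clients do not retry in lockstep.
  const jittered = Math.floor(random() * ceiling);

  // Never retry sooner than the provider asked.
  const providerFloorMs = (retryAfterSeconds ?? 0) * 1000;
  return Math.max(jittered, providerFloorMs);
}
```

The `random` parameter exists only so the function can be tested deterministically. A small `maxAttempts` keeps one user action from fanning out into many billed requests.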

Context length errors are product errors

A context length error often comes back as a 400.

For a normal API, a 400 often means the developer sent a malformed request.

For an LLM app, context length errors are often product-level failures.

They mean your app did not manage the prompt budget properly.

Possible fixes include:

summarize older messages
trim retrieved documents
reduce tool output size
switch to a larger-context model
ask the user to narrow the task
split the job into multiple calls
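To make the first fix concrete, here is a minimal sketch of trimming a conversation to a token budget before the call is made. The 4-characters-per-token estimate is a crude assumption (real tokenizers differ by model), and `ChatMessage`, `estimateTokens`, and `trimToBudget` are hypothetical names.

```typescript
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
};

// Rough assumption: about 4 characters per token. Use the provider's
// tokenizer when accuracy matters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps every system message plus the newest other messages that still
// fit inside maxTokens. Older turns are dropped first.
function trimToBudget(
  messages: ChatMessage[],
  maxTokens: number
): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");

  let used = system.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  // Walk from newest to oldest so recent turns survive.
  for (const message of [...rest].reverse()) {
    const cost = estimateTokens(message.content);
    if (used + cost > maxTokens) break;
    used += cost;
    kept.unshift(message);
  }
  return [...system, ...kept];
}
```

Dropping old turns is the bluntest option. Summarizing them instead would preserve more context at the cost of an extra model call.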

The worst response is showing the raw provider error to the user.

A better product response is something like:

This request is too large to process in one pass. I can summarize the source material first, then continue.

At the system level, context errors should be classified separately from normal bad requests.

function classifyLlmError(status: number, message: string) {
  const text = message.toLowerCase();

  if (status === 400 && text.includes("context length")) {
    return "context_window_exceeded";
  }

  if (status === 429 && text.includes("quota")) {
    return "quota_exceeded";
  }

  if (status === 429) {
    return "rate_limited";
  }

  if (status >= 500) {
    return "provider_unavailable";
  }

  return "unknown_llm_error";
}

Once you classify this correctly, the rest of the app can respond correctly.

Streaming failures are partial product states

Streaming makes error handling more complicated.

With a normal request, the API either succeeds or fails.

With streaming, the model may produce 80% of the answer and then disconnect.

Now what?

You need to decide:

Should the partial answer be shown?
Should the UI mark it as incomplete?
Should the user be offered a continuation?
Should the backend retry from scratch?
Should the retry include the partial output?
Should partial structured output be discarded?

For chat products, partial output may still be useful.

For code generation, partial output may be misleading.

For JSON generation, partial output may be invalid and should not be consumed directly.

I like tracking stream state explicitly:

type StreamState = {
  requestId: string;
  startedAt: number;
  receivedTokens: number;
  partialText: string;
  completed: boolean;
  error?: string;
};

Then the app can make different decisions:

if no tokens were received, retry automatically
if some tokens were received, show an incomplete marker
if JSON was expected, discard or repair the output
if tools were involved, stop and ask for confirmation before continuing
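Those rules can be sketched as a single decision function. `StreamSnapshot`, its `expectsJson` and `toolsInvolved` flags, and the action names are assumptions. The ordering is also a judgment call: tools are checked first here because side effects may already have happened.

```typescript
// Minimal snapshot of the stream fields this decision needs.
type StreamSnapshot = {
  receivedTokens: number;
  completed: boolean;
  expectsJson: boolean; // assumption: the caller knows the output contract
  toolsInvolved: boolean; // assumption: the caller knows whether tools ran
};

type StreamAction =
  | "none"
  | "retry_automatically"
  | "show_incomplete_marker"
  | "discard_or_repair"
  | "ask_for_confirmation";

function decideAfterDisconnect(s: StreamSnapshot): StreamAction {
  if (s.completed) return "none";

  // Side effects may already exist, so stop and let a human decide.
  if (s.toolsInvolved) return "ask_for_confirmation";

  // Nothing reached the user yet, so a retry is invisible to them.
  if (s.receivedTokens === 0) return "retry_automatically";

  // Partial JSON is likely invalid and should not be consumed directly.
  if (s.expectsJson) return "discard_or_repair";

  return "show_incomplete_marker";
}
```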

A streaming disconnect is not just a network failure.

It is a partial product state.

Model fallback is not just retrying another URL

A common reliability pattern is:

If model A fails, try model B.

That can work, but only if you define safe fallback rules.

Different models may have different:

context windows
pricing
tool calling support
JSON reliability
latency profiles
refusal behavior
output style
reasoning quality

If you silently fall back from a stronger model to a smaller one, your request may "succeed" while the product quality quietly drops.

A better fallback strategy starts with task classes.

const fallbackModels: Record<string, string[]> = {
  high_reasoning: [
    "gpt-4.1",
    "claude-3-5-sonnet",
    "gemini-1.5-pro"
  ],
  fast_chat: [
    "gpt-4.1-mini",
    "claude-3-haiku",
    "gemini-1.5-flash"
  ],
  cheap_batch: [
    "gpt-4.1-mini",
    "gemini-1.5-flash"
  ]
};

Fallback should depend on what the task needs.

For example:

summarization can usually tolerate fallback
legal or financial analysis may not
tool-calling agents need models with compatible tool support
structured output workflows need models that follow schemas reliably
user-selected models should not be silently replaced
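One way to encode those rules is to filter fallback candidates by what the task requires. This is a sketch under assumptions: `ModelProfile`, `TaskRequirements`, and `pickFallback` are hypothetical, and the capability values would come from your own configuration rather than anything shown here.

```typescript
// Hypothetical capability record, filled in from your own config.
type ModelProfile = {
  name: string;
  contextWindow: number;
  supportsTools: boolean;
  supportsJsonSchema: boolean;
};

type TaskRequirements = {
  minContextWindow: number;
  needsTools: boolean;
  needsJsonSchema: boolean;
  userSelectedModel: boolean;
};

// Returns the first candidate that satisfies the task, or null when no
// safe fallback exists. Candidates are assumed to be in preference order.
function pickFallback(
  failedModel: string,
  candidates: ModelProfile[],
  task: TaskRequirements
): ModelProfile | null {
  // Never silently replace a model the user chose.
  if (task.userSelectedModel) return null;

  const match = candidates.find(
    (m) =>
      m.name !== failedModel &&
      m.contextWindow >= task.minContextWindow &&
      (!task.needsTools || m.supportsTools) &&
      (!task.needsJsonSchema || m.supportsJsonSchema)
  );
  return match ?? null;
}
```

Returning `null` is deliberate. "No safe fallback" is a valid outcome, and the caller should surface it rather than degrade quietly.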

If you use an OpenAI-compatible routing layer, this is where it can help. You can centralize fallback and routing behavior instead of scattering provider-specific rules across the codebase.

This is one reason I like using a gateway such as TokenBay for multi-model apps: the API surface stays familiar, but you can still make model choice, fallback, and routing explicit.

Log the fields you will actually need later

For normal APIs, status code, endpoint, latency, and response body may be enough.

For LLM APIs, that is usually not enough.

Useful fields include:

{
  "provider": "openai",
  "model": "gpt-4.1-mini",
  "operation": "support_ticket_summary",
  "status": 429,
  "error_category": "rate_limited",
  "input_tokens_estimated": 18420,
  "output_tokens_received": 0,
  "streaming": false,
  "retry_count": 2,
  "fallback_used": false,
  "latency_ms": 12833
}

These fields help answer the questions you will care about during an incident:

Are failures concentrated on one provider?
Is one model causing most context errors?
Are retries helping or just increasing latency?
Are streaming failures happening after output starts?
Are background jobs consuming too much quota?
Are fallbacks hiding a provider outage?
Are structured output failures tied to one model?
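Once events carry those fields, most of these questions reduce to a group-and-count. Here is a small sketch, assuming events are already parsed into objects. `LlmLogEvent` and `countFailures` are illustrative names covering only a subset of the fields.

```typescript
// Subset of the log fields shown above.
type LlmLogEvent = {
  provider: string;
  model: string;
  error_category: string;
  retry_count: number;
};

// Counts events per "provider/model/error_category" key so that
// concentrated failures stand out.
function countFailures(events: LlmLogEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    const key = `${e.provider}/${e.model}/${e.error_category}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

In practice this would be a query in your logging backend. The point is that the grouping keys have to exist in the event first.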

Without this data, you end up debugging LLM reliability with screenshots, anecdotes, and vibes.

I have done that. Not fun.

Build an internal LLM error taxonomy

Instead of handling raw HTTP status codes everywhere, create your own internal error categories.

type LlmErrorCategory =
  | "auth_error"
  | "rate_limited"
  | "quota_exceeded"
  | "context_window_exceeded"
  | "content_refused"
  | "provider_unavailable"
  | "stream_interrupted"
  | "invalid_tool_call"
  | "malformed_structured_output"
  | "unknown";

type LlmError = {
  category: LlmErrorCategory;
  provider: string;
  model: string;
  status?: number;
  retryable: boolean;
  userVisibleMessage: string;
  rawMessage?: string;
};

Then map provider-specific errors into your own categories.
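A minimal sketch of that mapping, using a shortened category list to keep it readable. `RawProviderError` is an assumed normalized input that each provider adapter would have to produce, and the user-facing strings and retry defaults are placeholders.

```typescript
type LlmErrorCategory =
  | "auth_error"
  | "rate_limited"
  | "quota_exceeded"
  | "context_window_exceeded"
  | "provider_unavailable"
  | "unknown";

type LlmError = {
  category: LlmErrorCategory;
  provider: string;
  model: string;
  status?: number;
  retryable: boolean;
  userVisibleMessage: string;
  rawMessage?: string;
};

// Assumed normalized input; each provider needs its own adapter.
type RawProviderError = {
  provider: string;
  model: string;
  status: number;
  type?: string;
  message: string;
};

// Placeholder defaults per category.
const CATEGORY_DEFAULTS: Record<
  LlmErrorCategory,
  { retryable: boolean; userVisibleMessage: string }
> = {
  auth_error: { retryable: false, userVisibleMessage: "The AI service is misconfigured." },
  rate_limited: { retryable: true, userVisibleMessage: "The AI service is busy. Retrying shortly." },
  quota_exceeded: { retryable: false, userVisibleMessage: "The AI service has reached its usage limit." },
  context_window_exceeded: { retryable: false, userVisibleMessage: "This request is too large to process in one pass." },
  provider_unavailable: { retryable: true, userVisibleMessage: "The AI service is temporarily unavailable." },
  unknown: { retryable: false, userVisibleMessage: "Something went wrong." },
};

function categorize(raw: RawProviderError): LlmErrorCategory {
  const text = raw.message.toLowerCase();
  if (raw.status === 401 || raw.status === 403) return "auth_error";
  if (raw.type === "insufficient_quota") return "quota_exceeded";
  if (raw.status === 429) return "rate_limited";
  if (raw.status === 400 && text.includes("context length")) {
    return "context_window_exceeded";
  }
  if (raw.status >= 500) return "provider_unavailable";
  return "unknown";
}

function toLlmError(raw: RawProviderError): LlmError {
  const category = categorize(raw);
  return {
    category,
    provider: raw.provider,
    model: raw.model,
    status: raw.status,
    rawMessage: raw.message,
    ...CATEGORY_DEFAULTS[category],
  };
}
```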

This gives the rest of your product one consistent way to decide:

whether to retry
whether to fall back
what to show the user
what to log
whether to alert the team
whether to pause a workflow

The goal is not to hide HTTP.

The goal is to stop pretending HTTP status codes contain all the operational context your LLM app needs.

A simple handling matrix

Here is a practical starting point.

Error category	Retry?	Fallback?	Product behavior
`auth_error`	No	No	Ask admin to check API key
`quota_exceeded`	No	Maybe	Explain capacity or billing issue
`rate_limited`	Yes, with backoff	Maybe	Queue or show short delay
`context_window_exceeded`	No	Maybe	Summarize, trim, or split task
`provider_unavailable`	Yes	Yes	Retry or switch model
`stream_interrupted`	Depends	Depends	Show incomplete state
`content_refused`	No	No	Ask user to revise request
`malformed_structured_output`	Yes, constrained	Maybe	Repair or regenerate
`invalid_tool_call`	Maybe	No	Validate schema and retry carefully
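The same matrix can live in code as a lookup table, so every call site reads one source of truth. This is a sketch: the `Decision` values, `HandlingRule` shape, and the fail-closed default for unknown categories are assumptions.

```typescript
type Decision = "yes" | "no" | "maybe" | "depends";

type HandlingRule = {
  retry: Decision;
  fallback: Decision;
  productBehavior: string;
  retryNote?: string; // qualifier from the Retry? column, if any
};

// Mirrors the table above, row for row.
const handlingMatrix = new Map<string, HandlingRule>([
  ["auth_error", { retry: "no", fallback: "no", productBehavior: "Ask admin to check API key" }],
  ["quota_exceeded", { retry: "no", fallback: "maybe", productBehavior: "Explain capacity or billing issue" }],
  ["rate_limited", { retry: "yes", fallback: "maybe", productBehavior: "Queue or show short delay", retryNote: "with backoff" }],
  ["context_window_exceeded", { retry: "no", fallback: "maybe", productBehavior: "Summarize, trim, or split task" }],
  ["provider_unavailable", { retry: "yes", fallback: "yes", productBehavior: "Retry or switch model" }],
  ["stream_interrupted", { retry: "depends", fallback: "depends", productBehavior: "Show incomplete state" }],
  ["content_refused", { retry: "no", fallback: "no", productBehavior: "Ask user to revise request" }],
  ["malformed_structured_output", { retry: "yes", fallback: "maybe", productBehavior: "Repair or regenerate", retryNote: "constrained" }],
  ["invalid_tool_call", { retry: "maybe", fallback: "no", productBehavior: "Validate schema and retry carefully" }],
]);

// Unknown categories fail closed: no retry, no fallback.
function ruleFor(category: string): HandlingRule {
  return (
    handlingMatrix.get(category) ?? {
      retry: "no",
      fallback: "no",
      productBehavior: "Fail closed and alert the team",
    }
  );
}
```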

Your matrix will vary by product, but even a rough version is much better than this:

if (status >= 500 || status === 429) {
  retry();
}

Final thought

LLM APIs use HTTP, but LLM failures are not just HTTP failures.

A 429 might mean slow down, upgrade quota, switch models, queue the job, or stop retrying entirely.

A 400 might mean your app failed to manage context.

A streaming disconnect might mean the user already received half an answer.

A fallback might save the request, or quietly degrade quality.

If you are building anything serious on top of LLMs, create an error layer that understands LLM-specific failure modes.

Your users will see fewer broken states.

Your logs will become easier to reason about.

And your retry logic will stop turning small provider issues into expensive production incidents.

Top comments (6)

Sol • Jul 2

Good breakdown. One nuance worth adding: OpenAI's 429 sends x-ratelimit-reset-requests and x-ratelimit-reset-tokens in response headers (letting you distinguish TPM vs RPM hits), while Anthropic's 429 gives a more generic retry-after. Teams that wire up Anthropic after building against OpenAI often copy the header-parsing code and get silent misclassifications — Anthropic's 529 (overloaded) ends up retried the same way as 429, which compounds the incident.

Been looking at how teams actually debug these failure classes during live incidents — specifically how the resolution path unfolds. Quick question: when you hit a 429 in production, how long does it typically take to identify whether it's TPM or RPM you're hitting? And what's the first log or signal you reach for?

plasma • Jul 7

That header mismatch is a really good callout. It is easy to accidentally write “OpenAI-shaped” error handling and assume every other provider exposes the same recovery signals. Then 429, overload, token pressure, and request pressure all collapse into the same retry behavior, which is exactly how small incidents get louder.

For 429s, the first thing I want in the log is the provider’s raw rate-limit metadata plus my own estimated input/output token shape for the operation.

Ideally one log event tells me:

{
  "provider": "openai",
  "model": "gpt-4.1-mini",
  "operation": "support_ticket_summary",
  "status": 429,
  "provider_error_type": "rate_limit_exceeded",
  "reset_requests": "...",
  "reset_tokens": "...",
  "input_tokens_estimated": 18000,
  "retry_count": 1
}

If that is already captured, TPM vs RPM should be visible in a minute or two. If it is not captured, the incident turns into dashboard archaeology: checking token volume, request rate, queue depth, model-specific limits, and recent deploys. That is usually when the debugging gets slow.

Sol • Jul 7

Keeping the classification next to the client and the human context in a runbook is a clean split. In the last three provider incidents you debugged on TokenBay, which one took the longest from first symptom to working mitigation, and what first symptom sent you in the wrong direction?

Sol • Jul 7

That makes sense. The "dashboard archaeology" failure mode is exactly the expensive part.

One narrow follow-up: in the last 90 days, which OpenAI or Anthropic incident took the longest to stabilize, what was the first visible symptom, and roughly how long did it take to get to a working mitigation? Even rough numbers are useful.

Sol • Jul 2

Really clear framing on the 429 disambiguation — the RPM vs TPM distinction is the one that gets most engineers the first time. The part I keep hitting is that even when you parse error.type correctly, the downstream decision tree isn't in the SDK error handling; it lives in a runbook someone wrote after the first bad incident. Have you found any pattern for capturing that institutional knowledge closer to the actual error handling code?

plasma • Jul 7

Yeah, that is exactly the gap I keep running into. The SDK can tell you what happened at the request boundary, but the “what do we do now?” part usually lives in somebody’s head or in a runbook nobody opens until the incident is already messy.

The pattern I like is to keep the first decision close to the LLM client, but not the whole runbook.

Something like:

{
  category: "rate_limited_tokens",
  action: "queue",
  runbook: "llm-token-rate-limit"
}

So the code owns the classification and first move: retry, queue, fallback, fail closed, alert, etc. Then the runbook owns the human context: which dashboard to check, who owns quota, whether fallback models are approved, and what degradation is acceptable.

That keeps institutional knowledge connected to the error path without turning the application code into a wiki.