DEV Community

plasma

Stop Treating LLM API Errors Like Normal HTTP Errors

Most backend engineers already know how to handle HTTP errors.

400 means the request is bad.

401 means auth failed.

429 means rate limited.

500 means something broke upstream.

Retry a few times, add exponential backoff, log the response body, move on.

That works fine for many APIs.

It works badly for LLM APIs.

LLM providers may use normal HTTP status codes, but the operational meaning behind those errors is different enough that treating them like ordinary REST failures can make your app slower, more expensive, and harder to debug.

The mistake I kept making

Early on, I handled LLM failures the same way I handled every other external API:

if (response.status === 429 || response.status >= 500) {
  retryWithBackoff();
}

Simple. Familiar. Dangerous.

That logic misses the actual question your app needs to answer:

What kind of LLM failure happened, and what should the product do next?

Because an LLM API failure is rarely just "one HTTP request failed."

It can break:

  • a user-facing chat response
  • a background agent run
  • a document generation job
  • a tool-calling workflow
  • a batch evaluation pipeline
  • a structured JSON generation step

And each one needs different handling.

Not all 429s mean the same thing

For a normal API, 429 Too Many Requests usually means:

Slow down and retry later.

With LLM APIs, 429 can mean several different things.

It might be a temporary rate limit:

{
  "error": {
    "message": "Rate limit reached",
    "type": "rate_limit_error"
  }
}

Retrying with backoff may help here.

But it might also mean quota exhaustion:

{
  "error": {
    "message": "You exceeded your current quota",
    "type": "insufficient_quota"
  }
}

Retrying this does not help. It just adds latency, noisy logs, and a worse user experience.

It could also be model-specific pressure. One model may be overloaded while another model from the same provider, or a different provider, would work fine.

So your handler should distinguish between:

  • temporary rate limit
  • hard quota exhaustion
  • model-level capacity issue
  • billing or account issue

Those should not all go through the same retry path.
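
A minimal sketch of that split, assuming the error body looks like the JSON examples above. The `RateLimitKind` and `NextStep` names are hypothetical, and the "billing" and "overloaded" keywords are guesses rather than anything a specific provider documents:

```typescript
// Assumed error body shape, modeled on the JSON examples above.
interface ProviderErrorBody {
  error?: { message?: string; type?: string };
}

// Hypothetical internal labels for the four cases listed above.
type RateLimitKind =
  | "temporary_rate_limit"
  | "quota_exhausted"
  | "model_overloaded"
  | "billing_issue";

type NextStep = "retry_with_backoff" | "try_another_model" | "stop_and_alert";

function classify429(body: ProviderErrorBody): RateLimitKind {
  const type = (body.error?.type ?? "").toLowerCase();
  const message = (body.error?.message ?? "").toLowerCase();

  // Keyword matching is a fallback; prefer a structured error type when present.
  if (type === "insufficient_quota" || message.includes("quota")) return "quota_exhausted";
  if (message.includes("billing")) return "billing_issue";
  if (type.includes("overloaded") || message.includes("overloaded")) return "model_overloaded";
  return "temporary_rate_limit";
}

// Each kind of 429 gets its own path instead of one shared retry loop.
function nextStepFor429(kind: RateLimitKind): NextStep {
  switch (kind) {
    case "temporary_rate_limit":
      return "retry_with_backoff";
    case "model_overloaded":
      return "try_another_model";
    case "quota_exhausted":
    case "billing_issue":
      return "stop_and_alert";
  }
}
```

Only one of the four branches ends in a retry. The others either route elsewhere or stop and tell a human.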

Retrying can multiply the damage

Retries are one of the easiest ways to make an LLM system worse while thinking you made it more reliable.

Imagine this:

  1. A user asks your app to generate a report.
  2. Your backend calls an LLM.
  3. The request times out after 30 seconds.
  4. Your code retries three times.

That sounds reasonable.

But the first request may still be running provider-side. The second request may generate a duplicate answer. The third request may hit another rate limit. The user waits longer. Your token bill goes up. Your logs now show four attempts for one user action.

For simple completions, this is annoying.

For agent workflows, it can be dangerous.

If the LLM call can trigger tools, send emails, update tickets, write to a database, or call external systems, blind retries can create duplicate side effects.

Your retry policy needs to know what kind of operation it is retrying.

type LlmOperationType =
  | "simple_completion"
  | "streaming_chat"
  | "tool_calling_agent"
  | "background_batch_job"
  | "structured_output";

function shouldRetry(
  operation: LlmOperationType,
  status: number,
  errorType?: string
): boolean {
  // Quota, auth, and bad-request failures will not succeed on a retry.
  if (errorType === "insufficient_quota") return false;
  if (status === 401 || status === 403) return false;
  if (status === 400) return false;

  // Agents and streams can leave side effects or partial output behind,
  // so retry them only on provider-side (5xx) failures, never on 429.
  if (operation === "tool_calling_agent" || operation === "streaming_chat") {
    return status >= 500;
  }

  return status === 429 || status >= 500;
}

This is still simplified, but the idea matters:

Retry policy should understand the LLM task, not just the HTTP status code.
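
Once a retry is allowed, the wait between attempts matters too. Here is a rough sketch of exponential backoff with full jitter. The `backoffDelayMs` name and the default values are my own placeholders, not numbers any provider recommends:

```typescript
// Exponential backoff with "full jitter": wait a random time between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable so the function
// can be tested deterministically.
function backoffDelayMs(
  attempt: number, // 0 for the first retry, 1 for the second, and so on
  baseMs: number = 500,
  capMs: number = 20000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Example: upper bounds for the first four retries are 500, 1000, 2000, 4000 ms.
const exampleDelays = [0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt));
```

The jitter keeps many clients that failed at the same moment from retrying at the same moment. The cap keeps a user-facing request from waiting minutes on a late attempt.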

Context length errors are product errors

A context length error often comes back as a 400.

For a normal API, a 400 often means the developer sent a malformed request.

For an LLM app, a context length error is usually a product-level failure.

It means your app did not manage the prompt budget properly.

Possible fixes include:

  • summarize older messages
  • trim retrieved documents
  • reduce tool output size
  • switch to a larger-context model
  • ask the user to narrow the task
  • split the job into multiple calls
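
As one example, trimming older messages to a budget could look like the sketch below. The `ChatMessage` shape is assumed, and `estimateTokens` is a crude four-characters-per-token guess, not a real tokenizer:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Very rough estimate (about 4 characters per token). An assumption only;
// swap in the provider's tokenizer for real budgets.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep every system message, then as many of the newest turns as fit.
function trimToBudget(messages: ChatMessage[], maxInputTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");

  let used = system.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  // Walk from newest to oldest and stop at the first message that does not fit,
  // so the surviving history stays contiguous.
  for (const message of [...rest].reverse()) {
    const cost = estimateTokens(message.content);
    if (used + cost > maxInputTokens) break;
    used += cost;
    kept.unshift(message);
  }

  return [...system, ...kept];
}
```

Running this before the call means the context error never reaches the provider. The dropped messages are also the natural input for a summarization step.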

The worst response is showing the raw provider error to the user.

A better product response is something like:

This request is too large to process in one pass. I can summarize the source material first, then continue.

At the system level, context errors should be classified separately from normal bad requests.

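function classifyLlmError(status: number, message: string) {
  const text = message.toLowerCase();

  // Matching on message text is brittle: providers word this error
  // differently, so prefer a structured error code where one exists.
  if (status === 400 && text.includes("context length")) {
    return "context_window_exceeded";
  }

  // Auth failures need a config fix, not a retry.
  if (status === 401 || status === 403) {
    return "auth_error";
  }

  // Check quota before the generic 429 case: both share the status code.
  if (status === 429 && text.includes("quota")) {
    return "quota_exceeded";
  }

  if (status === 429) {
    return "rate_limited";
  }

  if (status >= 500) {
    return "provider_unavailable";
  }

  return "unknown_llm_error";
}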

Once you classify this correctly, the rest of the app can respond correctly.

Streaming failures are partial product states

Streaming makes error handling more complicated.

With a non-streaming request, the call either succeeds or fails.

With streaming, the model may produce 80% of the answer and then disconnect.

Now what?

You need to decide:

  • Should the partial answer be shown?
  • Should the UI mark it as incomplete?
  • Should the user be offered a continuation?
  • Should the backend retry from scratch?
  • Should the retry include the partial output?
  • Should partial structured output be discarded?

For chat products, partial output may still be useful.

For code generation, partial output may be misleading.

For JSON generation, partial output may be invalid and should not be consumed directly.

I like tracking stream state explicitly:

type StreamState = {
  requestId: string;
  startedAt: number;
  receivedTokens: number;
  partialText: string;
  completed: boolean;
  error?: string;
};

Then the app can make different decisions:

  • if no tokens were received, retry automatically
  • if some tokens were received, show an incomplete marker
  • if JSON was expected, discard or repair the output
  • if tools were involved, stop and ask for confirmation before continuing
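
Those rules fit in one small function. This is a sketch under my own assumptions: `toolsInvolved` is a hypothetical flag that is not part of the StreamState type above, and the action names are placeholders:

```typescript
type ExpectedOutput = "chat_text" | "code" | "json";

// Trimmed-down view of the stream; `toolsInvolved` is a hypothetical extra flag.
interface StreamSnapshot {
  receivedTokens: number;
  completed: boolean;
  toolsInvolved: boolean;
}

type StreamAction =
  | "none"
  | "retry_automatically"
  | "show_incomplete_marker"
  | "discard_or_repair"
  | "ask_for_confirmation";

function decideAfterDisconnect(
  snapshot: StreamSnapshot,
  expected: ExpectedOutput
): StreamAction {
  if (snapshot.completed) return "none";

  // Checked first on purpose: a tool may already have run, so never auto-retry.
  if (snapshot.toolsInvolved) return "ask_for_confirmation";

  // Nothing reached the user yet, so a silent retry is invisible to them.
  if (snapshot.receivedTokens === 0) return "retry_automatically";

  // Partial JSON is not safe to hand to downstream code.
  if (expected === "json") return "discard_or_repair";

  return "show_incomplete_marker";
}
```

Putting the tool check ahead of the zero-token check is a deliberately cautious ordering. If your tools are idempotent, you may want to swap the two.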

A streaming disconnect is not just a network failure.

It is a partial product state.

Model fallback is not just retrying another URL

A common reliability pattern is:

If model A fails, try model B.

That can work, but only if you define safe fallback rules.

Different models may have different:

  • context windows
  • pricing
  • tool calling support
  • JSON reliability
  • latency profiles
  • refusal behavior
  • output style
  • reasoning quality

If you silently fall back from a stronger model to a smaller one, your request may "succeed" while the product quality quietly drops.

A better fallback strategy starts with task classes.

const fallbackModels: Record<string, string[]> = {
  high_reasoning: [
    "gpt-4.1",
    "claude-3-5-sonnet",
    "gemini-1.5-pro"
  ],
  fast_chat: [
    "gpt-4.1-mini",
    "claude-3-haiku",
    "gemini-1.5-flash"
  ],
  cheap_batch: [
    "gpt-4.1-mini",
    "gemini-1.5-flash"
  ]
};

Fallback should depend on what the task needs.

For example:

  • summarization can usually tolerate fallback
  • legal or financial analysis may not
  • tool-calling agents need models with compatible tool support
  • structured output workflows need models that follow schemas reliably
  • user-selected models should not be silently replaced
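
One way to encode those rules is to filter candidates by what the task needs before picking one. Everything here is illustrative: `ModelProfile` and `TaskNeeds` are made-up shapes, and you would fill in the capability values from each provider's documentation:

```typescript
// Hypothetical capability record; populate it from provider docs, not from guesses.
interface ModelProfile {
  id: string;
  contextWindow: number;
  supportsTools: boolean;
  supportsJsonSchema: boolean;
}

interface TaskNeeds {
  inputTokens: number;
  needsTools: boolean;
  needsJsonSchema: boolean;
  userSelectedModel: boolean; // the user explicitly picked this model
  allowFallback: boolean; // false for tasks such as legal or financial analysis
}

// Returns candidates in their original order, so list them by preference.
function eligibleFallbacks(
  failedModelId: string,
  candidates: ModelProfile[],
  needs: TaskNeeds
): ModelProfile[] {
  // Never silently replace a model the user chose, or a task that forbids it.
  if (needs.userSelectedModel || !needs.allowFallback) return [];

  return candidates.filter(
    (model) =>
      model.id !== failedModelId &&
      model.contextWindow >= needs.inputTokens &&
      (!needs.needsTools || model.supportsTools) &&
      (!needs.needsJsonSchema || model.supportsJsonSchema)
  );
}
```

An empty result is a valid answer. It means the right move is to fail visibly instead of degrading quietly.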

If you use an OpenAI-compatible routing layer, this is where it can help. You can centralize fallback and routing behavior instead of scattering provider-specific rules across the codebase.

This is one reason I like using a gateway such as TokenBay for multi-model apps: the API surface stays familiar, but you can still make model choice, fallback, and routing explicit.

Log the fields you will actually need later

For normal APIs, status code, endpoint, latency, and response body may be enough.

For LLM APIs, that is usually not enough.

Useful fields include:

{
  "provider": "openai",
  "model": "gpt-4.1-mini",
  "operation": "support_ticket_summary",
  "status": 429,
  "error_category": "rate_limited",
  "input_tokens_estimated": 18420,
  "output_tokens_received": 0,
  "streaming": false,
  "retry_count": 2,
  "fallback_used": false,
  "latency_ms": 12833
}

These fields help answer the questions you will care about during an incident:

  • Are failures concentrated on one provider?
  • Is one model causing most context errors?
  • Are retries helping or just increasing latency?
  • Are streaming failures happening after output starts?
  • Are background jobs consuming too much quota?
  • Are fallbacks hiding a provider outage?
  • Are structured output failures tied to one model?
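
With those fields in place, several of these questions become a small aggregation. A sketch, assuming log records shaped like the JSON above with `error_category` set to null on success (that null convention is my assumption):

```typescript
// Subset of the log fields shown above; null error_category means success.
interface LlmCallLog {
  provider: string;
  model: string;
  error_category: string | null;
  retry_count: number;
  fallback_used: boolean;
  output_tokens_received: number;
  streaming: boolean;
}

interface ProviderSummary {
  calls: number;
  failures: number;
  totalRetries: number;
  midStreamFailures: number; // failed after some output had already arrived
  fallbacks: number;
}

function summarizeByProvider(logs: LlmCallLog[]): Map<string, ProviderSummary> {
  const summaries = new Map<string, ProviderSummary>();

  for (const log of logs) {
    const summary = summaries.get(log.provider) ?? {
      calls: 0,
      failures: 0,
      totalRetries: 0,
      midStreamFailures: 0,
      fallbacks: 0,
    };

    summary.calls += 1;
    summary.totalRetries += log.retry_count;
    if (log.fallback_used) summary.fallbacks += 1;

    if (log.error_category !== null) {
      summary.failures += 1;
      if (log.streaming && log.output_tokens_received > 0) {
        summary.midStreamFailures += 1;
      }
    }

    summaries.set(log.provider, summary);
  }

  return summaries;
}
```

A high `fallbacks` count next to a low visible failure rate is the pattern to watch for. That is what a hidden provider outage looks like.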

Without this data, you end up debugging LLM reliability with screenshots, anecdotes, and vibes.

I have done that. Not fun.

Build an internal LLM error taxonomy

Instead of handling raw HTTP status codes everywhere, create your own internal error categories.

type LlmErrorCategory =
  | "auth_error"
  | "rate_limited"
  | "quota_exceeded"
  | "context_window_exceeded"
  | "content_refused"
  | "provider_unavailable"
  | "stream_interrupted"
  | "invalid_tool_call"
  | "malformed_structured_output"
  | "unknown";

type LlmError = {
  category: LlmErrorCategory;
  provider: string;
  model: string;
  status?: number;
  retryable: boolean;
  userVisibleMessage: string;
  rawMessage?: string;
};

Then map provider-specific errors into your own categories.

This gives the rest of your product one consistent way to decide:

  • whether to retry
  • whether to fall back
  • what to show the user
  • what to log
  • whether to alert the team
  • whether to pause a workflow
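
The mapping step itself is usually one small adapter per provider. The two payload shapes below are invented stand-ins, not real provider formats. They only show two different raw errors landing in the same internal category:

```typescript
// Trimmed copy of the category list so this sketch compiles on its own.
type Category = "rate_limited" | "quota_exceeded" | "provider_unavailable" | "unknown";

// Two made-up payload shapes standing in for real providers.
interface ProviderAError {
  status: number;
  error: { type: string };
}

interface ProviderBError {
  httpCode: number;
  reason: string;
}

function fromProviderA(e: ProviderAError): Category {
  if (e.error.type === "insufficient_quota") return "quota_exceeded";
  if (e.status === 429) return "rate_limited";
  if (e.status >= 500) return "provider_unavailable";
  return "unknown";
}

function fromProviderB(e: ProviderBError): Category {
  if (e.reason === "QUOTA_EXHAUSTED") return "quota_exceeded";
  if (e.reason === "TOO_MANY_REQUESTS") return "rate_limited";
  if (e.httpCode >= 500) return "provider_unavailable";
  return "unknown";
}

// Callers only ever see the internal category, never the raw payload.
function isRetryable(category: Category): boolean {
  return category === "rate_limited" || category === "provider_unavailable";
}
```

Adding a third provider means writing one more adapter. Nothing downstream of `isRetryable` has to change.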

The goal is not to hide HTTP.

The goal is to stop pretending HTTP status codes contain all the operational context your LLM app needs.

A simple handling matrix

Here is a practical starting point.

| Error category | Retry? | Fallback? | Product behavior |
| --- | --- | --- | --- |
| auth_error | No | No | Ask admin to check API key |
| quota_exceeded | No | Maybe | Explain capacity or billing issue |
| rate_limited | Yes, with backoff | Maybe | Queue or show short delay |
| context_window_exceeded | No | Maybe | Summarize, trim, or split task |
| provider_unavailable | Yes | Yes | Retry or switch model |
| stream_interrupted | Depends | Depends | Show incomplete state |
| content_refused | No | No | Ask user to revise request |
| malformed_structured_output | Yes, constrained | Maybe | Repair or regenerate |
| invalid_tool_call | Maybe | No | Validate schema and retry carefully |
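
The same matrix can live in code as data. This sketch flattens the nuance: "Depends" becomes "maybe", and "Yes, with backoff" and "Yes, constrained" both become "yes". The `HandlingRule` shape is my own placeholder:

```typescript
// The nine categories that appear as rows in the matrix.
type MatrixCategory =
  | "auth_error"
  | "quota_exceeded"
  | "rate_limited"
  | "context_window_exceeded"
  | "provider_unavailable"
  | "stream_interrupted"
  | "content_refused"
  | "malformed_structured_output"
  | "invalid_tool_call";

type Decision = "yes" | "no" | "maybe";

interface HandlingRule {
  retry: Decision;
  fallback: Decision;
  behavior: string;
}

// Record<MatrixCategory, ...> makes the compiler reject a missing row.
const handlingMatrix: Record<MatrixCategory, HandlingRule> = {
  auth_error: { retry: "no", fallback: "no", behavior: "Ask admin to check API key" },
  quota_exceeded: { retry: "no", fallback: "maybe", behavior: "Explain capacity or billing issue" },
  rate_limited: { retry: "yes", fallback: "maybe", behavior: "Queue or show short delay" },
  context_window_exceeded: { retry: "no", fallback: "maybe", behavior: "Summarize, trim, or split task" },
  provider_unavailable: { retry: "yes", fallback: "yes", behavior: "Retry or switch model" },
  stream_interrupted: { retry: "maybe", fallback: "maybe", behavior: "Show incomplete state" },
  content_refused: { retry: "no", fallback: "no", behavior: "Ask user to revise request" },
  malformed_structured_output: { retry: "yes", fallback: "maybe", behavior: "Repair or regenerate" },
  invalid_tool_call: { retry: "maybe", fallback: "no", behavior: "Validate schema and retry carefully" },
};

// "maybe" means the caller must look at more context before acting.
function mayAutoRetry(category: MatrixCategory): boolean {
  return handlingMatrix[category].retry === "yes";
}
```

Keeping the rules in one table means a policy change is a one-line diff, and a new category will not compile until someone decides how to handle it.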

Your matrix will vary by product, but even a rough version is much better than this:

if (status >= 500 || status === 429) {
  retry();
}

Final thought

LLM APIs use HTTP, but LLM failures are not just HTTP failures.

A 429 might mean slow down, upgrade quota, switch models, queue the job, or stop retrying entirely.

A 400 might mean your app failed to manage context.

A streaming disconnect might mean the user already received half an answer.

A fallback might save the request, or quietly degrade quality.

If you are building anything serious on top of LLMs, create an error layer that understands LLM-specific failure modes.

Your users will see fewer broken states.

Your logs will become easier to reason about.

And your retry logic will stop turning small provider issues into expensive production incidents.
