plasma

Posted on Jun 29

The LLM API Timeout Playbook I Wish I Had Before Production

The first time an LLM API timeout hurt us in production, I blamed the provider.

That was emotionally satisfying for about five minutes.

Then I looked at our own code and realized we had done the usual thing: one request, one timeout value, one retry policy, one optimistic assumption that every model call would behave like a normal REST API.

LLM APIs do not behave like normal REST APIs.

They can be slow because the model is busy, because your prompt is huge, because streaming is misconfigured, because the provider is degraded, because your own network path is weird, or because you asked a reasoning model to do something that takes real compute.

This is the timeout playbook I wish I had before shipping LLM calls into production.

The Problem

Most teams start with something like this:

const response = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Summarize this document..." }],
  }),
});

It works in local testing.

Then production happens.

A few requests take 30 seconds. Some hang until the platform kills the request. Users refresh the page. Your queue backs up. Retries accidentally double your cost. Logs show "timeout" but not why.

The mistake is treating "timeout" as one problem.

In production, there are at least five different timeout problems:

connection timeout
first-token timeout
total request timeout
idle stream timeout
background job timeout

Each one needs a different response.

Timeout Types You Actually Need

1. Connection Timeout

This is how long you are willing to wait to establish the HTTP connection.

If this fails, the provider probably never received the request body. Retrying is usually safe.

Typical range: 2-5 seconds.

2. First-Token Timeout

For streaming requests, this is how long you wait before the first token arrives.

This is the timeout users feel most directly. If the UI is blank for 20 seconds, people assume the app is broken.

Typical range: 5-15 seconds, depending on model and task.

3. Total Request Timeout

This is the maximum time the entire request is allowed to run.

For short chat completions, maybe 20-60 seconds. For long analysis, code generation, or reasoning-heavy tasks, this may need to be much higher.

Typical range: 30-180 seconds.

4. Idle Stream Timeout

For streaming responses, this is how long you wait between chunks.

The request may have started successfully, but if no tokens arrive for a while, you need to decide whether to keep waiting or abort.

Typical range: 10-30 seconds.
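To make the two stream-level timeouts concrete, here is a minimal sketch that enforces a first-chunk deadline and an idle deadline around any reader with `read()` and `cancel()` methods, such as the one returned by `response.body.getReader()`. The names `readWithDeadlines` and `StreamTimeoutError` are mine, not part of any library. Note that it measures the first chunk, which is only an approximation of the first token.

```javascript
// Hypothetical error type so callers can tell which deadline fired.
class StreamTimeoutError extends Error {
  constructor(kind, limitMs) {
    super(`Stream ${kind} timeout: no data for ${limitMs}ms`);
    this.name = "StreamTimeoutError";
    this.kind = kind; // "first-chunk" or "idle"
    this.limitMs = limitMs;
  }
}

// Reads until the stream ends, applying firstChunkMs before the first chunk
// and idleMs between later chunks. Resolves with the number of chunks read.
async function readWithDeadlines(reader, { firstChunkMs, idleMs, onChunk }) {
  let chunkCount = 0;

  while (true) {
    const kind = chunkCount === 0 ? "first-chunk" : "idle";
    const limitMs = chunkCount === 0 ? firstChunkMs : idleMs;

    let timer;
    const deadline = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new StreamTimeoutError(kind, limitMs)), limitMs);
    });

    try {
      const { done, value } = await Promise.race([reader.read(), deadline]);
      if (done) return chunkCount;
      chunkCount += 1;
      onChunk(value);
    } catch (error) {
      // Release the connection; a failure inside cancel() is not interesting here.
      try {
        await reader.cancel();
      } catch {
        // ignored
      }
      throw error;
    } finally {
      clearTimeout(timer);
    }
  }
}
```

Because the error records which deadline fired, a "first-chunk" failure (the model never started answering) can be handled differently from an "idle" failure (the answer stalled halfway).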

5. Background Job Timeout

If the LLM call is part of a background workflow, the job itself needs a timeout too.

This protects your queue. Even if the HTTP client behaves badly, your worker should not be stuck forever.

Typical range: depends on the workflow, but always set one.
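As a rough illustration of a job-level deadline, the sketch below races the job against a timer and hands the job an `AbortSignal` so it can cancel its own HTTP calls. `runJobWithDeadline` and `JobTimeoutError` are hypothetical names, and a real queue library may already offer its own job timeout setting that you should prefer.

```javascript
// Hypothetical error type for a job that exceeded its deadline.
class JobTimeoutError extends Error {
  constructor(deadlineMs) {
    super(`Job exceeded its ${deadlineMs}ms deadline`);
    this.name = "JobTimeoutError";
    this.deadlineMs = deadlineMs;
  }
}

// Runs job(signal) and rejects with JobTimeoutError if it takes too long.
// The signal is aborted at the deadline so the job can stop in-flight work.
async function runJobWithDeadline(job, deadlineMs) {
  const controller = new AbortController();

  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new JobTimeoutError(deadlineMs));
    }, deadlineMs);
  });

  try {
    return await Promise.race([job(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

The race frees the worker even if the job ignores the signal, which is the point: the queue is protected whether or not the HTTP client cooperates.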

A Minimal Timeout Wrapper

Here is a small Node.js wrapper using AbortController.

It works with any OpenAI-compatible chat completions endpoint. Set LLM_BASE_URL to your provider's base URL and LLM_API_KEY to your key.

// timeout-chat.js
const DEFAULT_BASE_URL = "https://api.openai.com/v1";

class LLMTimeoutError extends Error {
  constructor(message, details = {}) {
    super(message);
    this.name = "LLMTimeoutError";
    this.details = details;
  }
}

async function chatCompletion({
  baseUrl = process.env.LLM_BASE_URL || DEFAULT_BASE_URL,
  apiKey = process.env.LLM_API_KEY,
  model = "gpt-4o-mini",
  messages,
  timeoutMs = 45_000,
}) {
  if (!apiKey) {
    throw new Error("Missing LLM_API_KEY");
  }

  const controller = new AbortController();
  const timeout = setTimeout(() => {
    controller.abort();
  }, timeoutMs);

  const startedAt = Date.now();

  try {
    const response = await fetch(`${baseUrl}/chat/completions`, {
      method: "POST",
      signal: controller.signal,
      headers: {
        "Authorization": `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model,
        messages,
        temperature: 0.2,
      }),
    });

    const elapsedMs = Date.now() - startedAt;

    if (!response.ok) {
      const errorText = await response.text();
      throw new Error(
        `LLM request failed: ${response.status} ${response.statusText} after ${elapsedMs}ms\n${errorText}`
      );
    }

    const data = await response.json();

    return {
      data,
      // Measured after the body is read, so it covers the full request.
      elapsedMs: Date.now() - startedAt,
    };
  } catch (error) {
    const elapsedMs = Date.now() - startedAt;

    if (error.name === "AbortError") {
      throw new LLMTimeoutError(`LLM request timed out after ${elapsedMs}ms`, {
        timeoutMs,
        elapsedMs,
        model,
      });
    }

    throw error;
  } finally {
    clearTimeout(timeout);
  }
}

async function main() {
  const result = await chatCompletion({
    messages: [
      {
        role: "user",
        content: "Give me a 5-bullet checklist for debugging LLM API timeouts.",
      },
    ],
    timeoutMs: 30_000,
  });

  console.log(result.data.choices[0].message.content);
  console.log(`Completed in ${result.elapsedMs}ms`);
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});

Run it:

export LLM_API_KEY="your_api_key"
export LLM_BASE_URL="https://api.openai.com/v1"
node timeout-chat.js

If you use an OpenAI-compatible gateway, change only the base URL and model name.

That small abstraction already gives you three useful things:

every request has a deadline
timeout errors are distinguishable from provider errors
timeout errors carry the model and elapsed time, ready to log

But this is still only the first layer.

Do Not Retry Every Timeout

The most expensive timeout bug is the blind retry.

Something times out at 45 seconds, your app retries, the retry also runs for 45 seconds, and now you have doubled latency and maybe doubled spend.

Before retrying, ask one question:

Did the provider probably receive and start processing the request?

If the answer is "probably no," retrying is reasonable.

If the answer is "probably yes," retrying may create duplicate work.

A practical retry policy:

function shouldRetryLLMError(error) {
  if (error.name === "LLMTimeoutError") {
    const elapsed = error.details?.elapsedMs ?? 0;

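    // The abort fires at timeoutMs, so elapsedMs is roughly the configured deadline:
    // this only retries calls with a short timeout, where little work is likely lost.
    // Longer timeouts may mean the model was already processing.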
    return elapsed < 5_000;
  }

  // Read the status code from the start of the message that chatCompletion builds.
  // A substring check such as message.includes("500") would also match
  // "after 1500ms" or digits inside the provider's error body.
  const match = /^LLM request failed: (\d{3})\b/.exec(String(error.message || ""));
  const status = match ? Number(match[1]) : null;

  return [429, 500, 502, 503, 504].includes(status);
}

This is intentionally conservative.

For user-facing chat, I would rather show a clear failure than silently retry a long-running request and make the user wait 90 seconds.

For background jobs, I may retry more aggressively, but only with idempotency and backoff.
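One way to get idempotency on the application side is to key each unit of work and refuse to run the same key twice. The sketch below is an assumption about how that could look, not a provider feature: `requestKey` and `runOnce` are hypothetical helpers, and the in-memory maps stand in for whatever shared store your workers actually use.

```javascript
const { createHash } = require("node:crypto");

// Derive a stable key from the job and the inputs that determine its result.
// Assumes messages are built with a consistent property order.
function requestKey({ jobId, model, messages }) {
  return createHash("sha256")
    .update(JSON.stringify({ jobId, model, messages }))
    .digest("hex");
}

// In-memory stores for illustration only; they do not survive a restart
// and are not shared between workers.
const completed = new Map();
const inFlight = new Map();

// Runs execute(request) at most once per key. Concurrent and later callers
// with the same key get the same result instead of a second LLM call.
async function runOnce(request, execute) {
  const key = requestKey(request);

  if (completed.has(key)) return completed.get(key);
  if (inFlight.has(key)) return inFlight.get(key);

  const promise = execute(request)
    .then((result) => {
      completed.set(key, result);
      return result;
    })
    .finally(() => inFlight.delete(key));

  inFlight.set(key, promise);
  return promise;
}
```

A failed attempt is removed from the in-flight map without being cached, so a later retry with the same key still runs.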

Add Backoff With Jitter

If a provider is overloaded, retrying immediately is the worst possible move.

Use exponential backoff with jitter:

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

function backoffMs(attempt) {
  const base = 500 * Math.pow(2, attempt);
  const jitter = Math.floor(Math.random() * 250);
  return Math.min(base + jitter, 8_000);
}

async function withRetries(fn, { maxRetries = 2 } = {}) {
  let lastError;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;

      if (attempt === maxRetries || !shouldRetryLLMError(error)) {
        throw error;
      }

      await sleep(backoffMs(attempt));
    }
  }

  throw lastError;
}

Usage:

const result = await withRetries(() =>
  chatCompletion({
    messages: [{ role: "user", content: "Explain API retries in one paragraph." }],
    timeoutMs: 30_000,
  })
);

For production, also log every retry attempt. A retry that saves a user request is good. A retry policy that hides provider degradation is not.
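A minimal sketch of what logging each attempt could look like follows. It is a variant of the retry loop with the retry decision, the delay, and the logger passed in; `withLoggedRetries` and the `llm_request_attempt_failed` event name are assumptions for illustration.

```javascript
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Retries fn and emits one structured log line per failed attempt.
// shouldRetry(error) and delayMs(attempt) are supplied by the caller.
async function withLoggedRetries(
  fn,
  { maxRetries = 2, shouldRetry, delayMs, log = console.error }
) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      const willRetry = attempt < maxRetries && shouldRetry(error);
      const waitMs = willRetry ? delayMs(attempt) : 0;

      // Log the error type only; messages may contain provider response bodies.
      log(
        JSON.stringify({
          event: "llm_request_attempt_failed",
          attempt,
          will_retry: willRetry,
          wait_ms: waitMs,
          error_type: error.name,
        })
      );

      if (!willRetry) throw error;
      await sleep(waitMs);
    }
  }
}
```

With the helpers from this article, the call would look like `withLoggedRetries(() => chatCompletion({ messages }), { shouldRetry: shouldRetryLLMError, delayMs: backoffMs })`. Counting these log lines per minute gives you a retry rate you can alert on.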

Stream When Humans Are Waiting

For user-facing requests, streaming is usually better than waiting for the full response.

Streaming does not make the model faster, but it makes the experience feel alive. More importantly, it lets you measure first-token latency separately from total latency.

Here is a minimal streaming example:

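// stream-chat.js
// Top-level await requires an ES module: set "type": "module" in package.json
// or save the file with an .mjs extension.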
const baseUrl = process.env.LLM_BASE_URL || "https://api.openai.com/v1";
const apiKey = process.env.LLM_API_KEY;

if (!apiKey) {
  throw new Error("Missing LLM_API_KEY");
}

const controller = new AbortController();
const totalTimeout = setTimeout(() => controller.abort(), 60_000);

const startedAt = Date.now();
let firstChunkAt = null;

try {
  const response = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    signal: controller.signal,
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: process.env.LLM_MODEL || "gpt-4o-mini",
      stream: true,
      messages: [
        {
          role: "user",
          content: "Write a short explanation of first-token latency.",
        },
      ],
    }),
  });

  if (!response.ok || !response.body) {
    throw new Error(`Streaming request failed: ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();

    if (done) break;

    if (!firstChunkAt) {
      firstChunkAt = Date.now();
      console.error(`First chunk in ${firstChunkAt - startedAt}ms`);
    }

    const chunk = decoder.decode(value, { stream: true });
    process.stdout.write(chunk);
  }

  console.error(`\nTotal time: ${Date.now() - startedAt}ms`);
} finally {
  clearTimeout(totalTimeout);
}

Run it:

export LLM_API_KEY="your_api_key"
node stream-chat.js

A production streaming implementation should parse Server-Sent Events properly instead of printing raw chunks, but the point is the measurement:

time to first byte
time to first token
total completion time
stream interruptions
idle gaps between chunks

Those numbers tell you much more than "request timed out."
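For the SSE parsing mentioned above, here is a small sketch of an incremental parser. It assumes the OpenAI-style chat stream format: one JSON object per `data:` line, text in `choices[0].delta.content`, and a final `data: [DONE]` line. `createSSEParser` is a hypothetical helper, and it does not handle multi-line `data:` fields from the full SSE specification.

```javascript
// Returns a feed(text) function that accepts decoded chunks of any size.
// onDelta receives each piece of generated text; onDone fires on [DONE].
function createSSEParser({ onDelta, onDone }) {
  let buffer = "";

  return function feed(text) {
    buffer += text;

    const lines = buffer.split(/\r?\n/);
    // The last element is an incomplete line (or ""); keep it for the next chunk.
    buffer = lines.pop();

    for (const line of lines) {
      // Skip blank separators, comments, and fields other than data.
      if (!line.startsWith("data:")) continue;

      const payload = line.slice(5).trim();
      if (!payload) continue;

      if (payload === "[DONE]") {
        onDone();
        continue;
      }

      // Throws on malformed JSON; a caller may prefer to catch and log instead.
      const event = JSON.parse(payload);
      const delta = event.choices?.[0]?.delta?.content;
      if (delta) onDelta(delta);
    }
  };
}
```

In the streaming loop, `feed(chunk)` would take the place of `process.stdout.write(chunk)`, and the first `onDelta` call is a closer measure of time to first token than the first raw chunk.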

Set Different Timeouts By Task

One timeout value for every LLM call is usually wrong.

A better pattern is to define timeout classes:

const TIMEOUTS = {
  autocomplete: 8_000,
  chat: 45_000,
  documentSummary: 120_000,
  backgroundExtraction: 180_000,
};

Then make the caller choose intentionally:

await chatCompletion({
  model: "gpt-4o-mini",
  messages,
  timeoutMs: TIMEOUTS.chat,
});

This forces the product decision into code.

If a feature needs a 180-second timeout, it probably should not block an HTTP request from the browser. Put it in a job queue and notify the user when it is done.
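That rule can live in code too. The sketch below extends the timeout classes into per-task profiles and picks an execution mode from the total budget. The `totalMs` values are the ones from this article; the first-token and idle values, the 60-second inline limit, and the `planRequest` name are assumptions you should tune for your own product.

```javascript
// Assumed limit: anything longer should not block a browser request.
const MAX_INLINE_MS = 60_000;

// Per-task budgets. firstTokenMs and idleMs are illustrative guesses.
const TIMEOUT_PROFILES = {
  autocomplete: { firstTokenMs: 3_000, idleMs: 5_000, totalMs: 8_000 },
  chat: { firstTokenMs: 15_000, idleMs: 30_000, totalMs: 45_000 },
  documentSummary: { firstTokenMs: 15_000, idleMs: 30_000, totalMs: 120_000 },
  backgroundExtraction: { firstTokenMs: 15_000, idleMs: 30_000, totalMs: 180_000 },
};

// Returns the task's budgets plus where the call should run:
// "inline" in the request handler or "job" in a background queue.
function planRequest(task) {
  const profile = TIMEOUT_PROFILES[task];
  if (!profile) {
    throw new Error(`Unknown task: ${task}`);
  }

  return {
    ...profile,
    mode: profile.totalMs > MAX_INLINE_MS ? "job" : "inline",
  };
}
```

An unknown task throws instead of falling back to a default, so nobody ships a new feature without choosing a budget.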

Use Fallbacks Carefully

Fallbacks are useful, but they can also make debugging harder.

A fallback from one model to another may change:

output quality
latency
cost
context window
tool-calling behavior
JSON reliability

So fallback should be explicit.

Example:

const MODEL_CHAIN = [
  { model: "primary-model", timeoutMs: 45_000 },
  { model: "fallback-model", timeoutMs: 30_000 },
];

async function runWithFallback(messages) {
  const errors = [];

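  for (const [index, option] of MODEL_CHAIN.entries()) {
    try {
      const result = await chatCompletion({
        model: option.model,
        messages,
        timeoutMs: option.timeoutMs,
      });

      // Report which model answered so fallback usage can be logged and tracked.
      return { ...result, model: option.model, fallbackUsed: index > 0 };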
    } catch (error) {
      errors.push({
        model: option.model,
        name: error.name,
        message: error.message,
      });
    }
  }

  throw new Error(`All model attempts failed: ${JSON.stringify(errors)}`);
}

Do not silently fall back on every request without tracking it.

If your fallback rate jumps from 2% to 25%, that is an incident signal.
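To watch that rate, something as small as a rolling window is enough to start with. This is a sketch under assumed numbers: `FallbackTracker`, the window size, and the alert threshold are all hypothetical, and in practice you would probably compute the same thing in your metrics system.

```javascript
// Tracks the share of recent requests that were served by a fallback model.
class FallbackTracker {
  constructor({ windowSize = 200, minSamples = 50, alertRate = 0.1 } = {}) {
    this.windowSize = windowSize;
    this.minSamples = minSamples;
    this.alertRate = alertRate;
    this.window = [];
  }

  // Call once per completed request.
  record(usedFallback) {
    this.window.push(usedFallback ? 1 : 0);
    if (this.window.length > this.windowSize) {
      this.window.shift(); // drop the oldest sample
    }
  }

  // Fraction of requests in the window that used a fallback (0 when empty).
  rate() {
    if (this.window.length === 0) return 0;
    const used = this.window.reduce((sum, flag) => sum + flag, 0);
    return used / this.window.length;
  }

  // True once there are enough samples and the rate reaches the threshold.
  shouldAlert() {
    return this.window.length >= this.minSamples && this.rate() >= this.alertRate;
  }
}
```

The `minSamples` guard keeps a single early fallback from paging anyone when traffic is low.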

Log The Right Fields

This is the minimum I want in logs for every LLM request:

{
  "event": "llm_request_completed",
  "provider": "openai-compatible",
  "model": "gpt-4o-mini",
  "timeout_ms": 45000,
  "elapsed_ms": 12842,
  "streaming": false,
  "retry_count": 0,
  "fallback_used": false,
  "status": "success"
}

For failures:

{
  "event": "llm_request_failed",
  "provider": "openai-compatible",
  "model": "gpt-4o-mini",
  "timeout_ms": 45000,
  "elapsed_ms": 45021,
  "streaming": false,
  "retry_count": 1,
  "fallback_used": false,
  "status": "timeout",
  "error_type": "LLMTimeoutError"
}

Do not log raw prompts by default.

Prompts can contain user data, private business context, credentials, internal code, or documents. If you need prompt logging for debugging, put it behind explicit controls and retention rules.
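Here is one possible shape for a log builder that follows both rules: it emits the fields shown above and records only the size of the prompt, never its content. `buildLLMLogEntry`, `prompt_chars`, and `message_count` are names I made up for this sketch.

```javascript
// Builds a structured log entry for one LLM request.
// messages is used only to measure the prompt; its content is never copied.
function buildLLMLogEntry({
  model,
  timeoutMs,
  elapsedMs,
  streaming = false,
  retryCount = 0,
  fallbackUsed = false,
  error = null,
  messages = [],
}) {
  const promptChars = messages.reduce(
    (total, message) => total + String(message.content ?? "").length,
    0
  );
  const isTimeout = error?.name === "LLMTimeoutError";

  const entry = {
    event: error ? "llm_request_failed" : "llm_request_completed",
    provider: "openai-compatible",
    model,
    timeout_ms: timeoutMs,
    elapsed_ms: elapsedMs,
    streaming,
    retry_count: retryCount,
    fallback_used: fallbackUsed,
    status: error ? (isTimeout ? "timeout" : "error") : "success",
    message_count: messages.length,
    prompt_chars: promptChars,
  };

  // Record the error type only; messages can embed provider response bodies.
  if (error) entry.error_type = error.name;

  return entry;
}
```

Prompt size is worth keeping even without the text, because very large prompts are one of the usual reasons a request ran long.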

My Production Checklist

Before I ship an LLM API call now, I check:

Does this request have a total timeout?
Is the timeout appropriate for this task?
Is there a separate first-token timeout for streaming UX?
Do retries use backoff and jitter?
Are long timeouts moved to background jobs?
Are timeout errors distinguishable from provider errors?
Do logs include model, elapsed time, retry count, and fallback status?
Is fallback behavior explicit and observable?
Are prompts and responses excluded from default logs?
Does the UI tell the user what happened when a timeout occurs?

The last one matters more than people think.

A good timeout message is not:

Something went wrong.

A better one is:

The model took too long to respond. You can try again, or shorten the input and run it again.

That gives the user a next move.
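A small mapping function keeps those messages consistent across the app. In this sketch the timeout text is the one above; the other messages and the `userMessageFor` name are my own assumptions, and the status code is read from the error message format that `chatCompletion` produces earlier in this article.

```javascript
// Maps an error from an LLM call to a message the user can act on.
function userMessageFor(error) {
  if (error?.name === "LLMTimeoutError") {
    return "The model took too long to respond. You can try again, or shorten the input and run it again.";
  }

  // chatCompletion errors start with "LLM request failed: <status>".
  const match = /^LLM request failed: (\d{3})\b/.exec(String(error?.message ?? ""));
  const status = match ? Number(match[1]) : null;

  if (status === 429) {
    return "The service is handling too many requests right now. Please wait a moment and try again.";
  }
  if (status !== null && status >= 500) {
    return "The model provider is having problems. Please try again in a few minutes.";
  }

  return "The request could not be completed. Please check your input and try again.";
}
```

Every branch ends with something the user can do next, which is the property that matters more than the exact wording.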

Where Gateways Help

One reason teams use OpenAI-compatible gateways is operational flexibility.

If your app talks to one OpenAI-style interface, you can route across different models or providers without rewriting every integration. That does not remove the need for timeouts, retries, logging, and fallbacks. It just makes those controls easier to centralize.

At TokenBay, this is one of the problems we care about: giving developers one API key and one OpenAI-compatible interface across multiple model providers.

But gateway or no gateway, the production rule is the same:

Do not let LLM calls be mysterious black boxes.

Give them deadlines. Measure them. Retry carefully. Stream when humans are waiting. Move long work into jobs. Make fallbacks visible.

Your future incident review will be much shorter.

DEV Community