Retries look simple until an LLM call is allowed to do something.
For a normal read-only API request, retrying is usually boring:
if (status === 429 || status >= 500) {
  retryWithBackoff();
}
But LLM APIs are often sitting inside workflows that are not purely read-only.
A failed LLM call might be part of:
- sending an email
- creating a support ticket
- updating a CRM record
- calling a tool
- writing to a database
- charging credits
- generating a document
- triggering another automation step
In those cases, retrying blindly can turn one user action into two, three, or four real-world actions.
That is the bug I try hardest to avoid.
The mistake
The first retry setup I used for LLM APIs was basically copied from normal HTTP APIs:
async function callWithRetry(fn: () => Promise<Response>) {
  let lastError: unknown;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      await sleep(2 ** attempt * 1000);
    }
  }
  throw lastError;
}
This works fine for some LLM calls.
For example:
- summarize this paragraph
- classify this ticket
- rewrite this title
- extract entities from this text
If the first attempt fails, retrying is probably fine.
But it gets risky once the LLM call is part of a workflow with side effects.
Imagine this agent step:
1. Read customer message
2. Decide whether refund is needed
3. Call issueRefund
4. Write reply draft
5. Send email
If the request times out after step 3, your retry code may run the whole thing again.
Now you might issue two refunds.
That is not a model quality problem.
That is a retry design problem.
My rule: retry transport, not user intent
The main principle I use now is:
Retry technical failures, but do not replay user intent unless the operation is idempotent.
A user clicked "Send invoice" once.
That user action should have one logical operation ID.
Even if the system retries internally, the outside world should still see one invoice send attempt, not three.
So before thinking about retry count or backoff, I separate LLM work into operation types.
type LlmOperationKind =
  | "read_only_generation"
  | "structured_extraction"
  | "streaming_chat"
  | "tool_planning"
  | "tool_execution"
  | "external_side_effect";
Then I treat them differently.
function canRetryOperation(kind: LlmOperationKind) {
  switch (kind) {
    case "read_only_generation":
    case "structured_extraction":
      return true;
    case "streaming_chat":
      return "only_if_no_tokens_received";
    case "tool_planning":
      return true;
    case "tool_execution":
    case "external_side_effect":
      return false;
  }
}
This is the core distinction.
Retrying the model's planning step is usually fine.
Retrying the step that actually sends the email, charges the card, or updates the external system is not fine unless that step is idempotent.
Use one operation ID per user action
Every user-triggered workflow gets an operation ID.
type LlmOperation = {
  operationId: string;
  userId: string;
  kind: LlmOperationKind;
  createdAt: number;
};
That operationId follows the request through:
- LLM call logs
- retry attempts
- tool calls
- database writes
- external API calls
- user-visible status updates
A retry is not a new operation.
It is another attempt inside the same operation.
type LlmAttemptLog = {
  operationId: string;
  attempt: number;
  provider: string;
  model: string;
  startedAt: number;
  finishedAt?: number;
  status: "success" | "failed";
  errorCategory?: string;
};
This makes debugging much easier.
Instead of seeing four unrelated LLM failures, I can see:
One user action created one operation, which made four attempts.
That difference matters a lot during incidents.
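As a rough illustration of that grouping, here is a minimal sketch that folds a flat list of attempt logs into one summary per operation. `summarizeAttempts`, `AttemptLog`, and `OperationSummary` are hypothetical names, and the log shape is a trimmed copy of the attempt log above, not a fixed schema.

```typescript
// Trimmed copy of the attempt log shape used in this post.
type AttemptLog = {
  operationId: string;
  attempt: number;
  status: "success" | "failed";
  errorCategory?: string;
};

// Hypothetical summary shape: one entry per user action.
type OperationSummary = {
  operationId: string;
  attempts: number;
  succeeded: boolean;
  errorCategories: string[];
};

function summarizeAttempts(logs: AttemptLog[]): OperationSummary[] {
  const byOperation = new Map<string, OperationSummary>();
  for (const log of logs) {
    let summary = byOperation.get(log.operationId);
    if (!summary) {
      summary = {
        operationId: log.operationId,
        attempts: 0,
        succeeded: false,
        errorCategories: []
      };
      byOperation.set(log.operationId, summary);
    }
    // Every log line counts as one attempt inside the same operation.
    summary.attempts += 1;
    if (log.status === "success") summary.succeeded = true;
    if (log.errorCategory) summary.errorCategories.push(log.errorCategory);
  }
  return Array.from(byOperation.values());
}
```

With this shape, four failed calls that share one `operationId` show up as a single entry with `attempts: 4` instead of four separate incidents.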
Make tool calls idempotent
If an LLM can call tools, the tools need idempotency keys.
For example, do not let this happen:
await issueRefund({
  customerId,
  amount
});
Prefer this:
await issueRefund({
  customerId,
  amount,
  idempotencyKey: `refund:${operationId}:${customerId}:${amount}`
});
Same for external actions:
await sendEmail({
  to,
  subject,
  body,
  idempotencyKey: `email:${operationId}:${to}`
});
The exact key depends on the action, but the idea is the same:
If the system retries, the external action should still happen at most once.
This is especially important because timeouts are ambiguous.
A timeout does not always mean "nothing happened."
It often means:
The client stopped waiting, but the provider or tool may still be working.
That ambiguity is where duplicate actions come from.
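To make "at most once" concrete, here is a minimal sketch of what the receiving side of an idempotency key might do. `runOnce` and the in-memory `Map` are assumptions for illustration only. Real payment and email providers implement their own deduplication, and a production version would need a durable store (for example a database row with a unique key) so the record survives restarts.

```typescript
// Hypothetical in-memory record of actions that are running or finished.
const recordedActions = new Map<string, Promise<unknown>>();

// Runs `action` the first time a key is seen. Any later call with the same
// key gets the original promise back instead of triggering the action again.
function runOnce<T>(
  idempotencyKey: string,
  action: () => Promise<T>
): Promise<T> {
  const existing = recordedActions.get(idempotencyKey);
  if (existing) {
    return existing as Promise<T>;
  }
  const result = action();
  // Record before the result is known, so a retry that arrives while the
  // first call is still in flight does not start a second one.
  recordedActions.set(idempotencyKey, result);
  return result;
}
```

One design choice in this sketch: a failed action stays recorded, so a retry with the same key sees the same failure rather than re-running. That is deliberate for the ambiguous cases, where the caller should check what happened before trying again under a new key.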
Split planning from execution
For agents, I try to split the workflow into two phases:
- LLM decides what should happen.
- Application code executes the action.
The planning step can be retried.
The application controls execution.
Example:
type PlannedAction =
  | {
      type: "send_email";
      to: string;
      subject: string;
      body: string;
    }
  | {
      type: "create_ticket";
      title: string;
      priority: "low" | "medium" | "high";
    }
  | {
      type: "no_action";
      reason: string;
    };
The LLM returns a plan.
Then the app validates and executes it.
async function executePlannedAction(
  action: PlannedAction,
  operationId: string
) {
  if (action.type === "send_email") {
    return sendEmail({
      to: action.to,
      subject: action.subject,
      body: action.body,
      idempotencyKey: `send_email:${operationId}:${action.to}`
    });
  }
  if (action.type === "create_ticket") {
    return createTicket({
      title: action.title,
      priority: action.priority,
      idempotencyKey: `create_ticket:${operationId}:${action.title}`
    });
  }
  return { skipped: true, reason: action.reason };
}
This keeps retry logic away from irreversible side effects.
If the LLM plan fails due to a transient error, I can retry the planning step.
If execution starts, I stop treating the whole workflow as freely retryable.
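The validation half of "validates and executes" could look something like the sketch below. `parsePlannedAction` and `isNonEmptyString` are hypothetical helpers, and the assumption is that the model returns the plan as a JSON string. Anything that does not match a known action shape is rejected before application code gets near a side effect.

```typescript
type PlannedAction =
  | { type: "send_email"; to: string; subject: string; body: string }
  | { type: "create_ticket"; title: string; priority: "low" | "medium" | "high" }
  | { type: "no_action"; reason: string };

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.trim().length > 0;
}

// Returns a validated PlannedAction, or null if the model output is unusable.
function parsePlannedAction(rawModelOutput: string): PlannedAction | null {
  let data: unknown;
  try {
    data = JSON.parse(rawModelOutput);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const { type, to, subject, body, title, priority, reason } =
    data as Record<string, unknown>;

  switch (type) {
    case "send_email":
      if (isNonEmptyString(to) && isNonEmptyString(subject) && isNonEmptyString(body)) {
        return { type: "send_email", to, subject, body };
      }
      return null;
    case "create_ticket":
      if (
        isNonEmptyString(title) &&
        (priority === "low" || priority === "medium" || priority === "high")
      ) {
        return { type: "create_ticket", title, priority };
      }
      return null;
    case "no_action":
      return isNonEmptyString(reason) ? { type: "no_action", reason } : null;
    default:
      return null; // unknown action types are never executed
  }
}
```

A `null` result here is a planning failure, not an execution failure, so nothing has touched the outside world yet and asking the model for a new plan is still safe.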
Retry only known transient failures
I do not retry every LLM failure.
I usually retry:
- temporary rate limits
- provider 5xx errors
- network timeouts
- connection resets
- stream interrupted before any useful output
- malformed structured output, sometimes
I usually do not retry:
- auth errors
- quota exhaustion
- context length errors
- content refusals
- invalid request shape
- completed tool execution
- user-cancelled operations
A simple classifier helps.
type LlmErrorCategory =
  | "rate_limited"
  | "provider_unavailable"
  | "timeout"
  | "network_error"
  | "stream_interrupted"
  | "malformed_structured_output"
  | "quota_exceeded"
  | "auth_error"
  | "context_window_exceeded"
  | "content_refused"
  | "invalid_request"
  | "unknown";
function isTransient(error: LlmErrorCategory) {
  return [
    "rate_limited",
    "provider_unavailable",
    "timeout",
    "network_error",
    "stream_interrupted"
  ].includes(error);
}
Then retry logic becomes more explicit.
function shouldRetryLlmCall(params: {
  operationKind: LlmOperationKind;
  error: LlmErrorCategory;
  attempt: number;
  receivedTokens?: number;
}) {
  if (params.attempt >= 3) return false;
  if (!isTransient(params.error)) return false;
  // Steps that touch the outside world are never replayed automatically.
  if (
    params.operationKind === "tool_execution" ||
    params.operationKind === "external_side_effect"
  ) {
    return false;
  }
  // A stream the user has already partly seen is not restarted silently.
  if (
    params.operationKind === "streaming_chat" &&
    (params.receivedTokens ?? 0) > 0
  ) {
    return false;
  }
  return true;
}
This is not fancy.
But it prevents a lot of expensive nonsense.
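Both functions above assume the raw failure has already been mapped to an `LlmErrorCategory`. A minimal sketch of that mapping follows. The `RawLlmError` shape and the `providerType` strings are assumptions, not any provider's real API: SDKs expose status codes and error types differently, so a small adapter per provider would have to fill this shape in. The `code` values are the standard Node.js network error codes.

```typescript
type LlmErrorCategory =
  | "rate_limited" | "provider_unavailable" | "timeout" | "network_error"
  | "stream_interrupted" | "malformed_structured_output" | "quota_exceeded"
  | "auth_error" | "context_window_exceeded" | "content_refused"
  | "invalid_request" | "unknown";

// Hypothetical normalized error shape, filled in by a per-provider adapter.
type RawLlmError = {
  status?: number;       // HTTP status, if a response was received
  code?: string;         // Node-style network code such as "ECONNRESET"
  providerType?: string; // placeholder for a provider-specific error type
  midStream?: boolean;   // true if the failure happened while streaming
};

function classifyLlmError(error: RawLlmError): LlmErrorCategory {
  if (error.midStream) return "stream_interrupted";
  if (error.code === "ETIMEDOUT") return "timeout";
  if (error.code === "ECONNRESET" || error.code === "ECONNREFUSED") {
    return "network_error";
  }
  if (error.status === 401 || error.status === 403) return "auth_error";
  if (error.status === 429) {
    // Assumption: the adapter marks exhausted quota with this placeholder
    // string. A plain 429 is treated as a temporary rate limit.
    return error.providerType === "quota_exceeded"
      ? "quota_exceeded"
      : "rate_limited";
  }
  if (error.providerType === "context_window_exceeded") {
    return "context_window_exceeded";
  }
  if (error.providerType === "content_refused") return "content_refused";
  if (error.status === 400 || error.status === 422) return "invalid_request";
  if (error.status !== undefined && error.status >= 500) {
    return "provider_unavailable";
  }
  // Anything unrecognized stays "unknown", which is not retried.
  return "unknown";
}
```

Defaulting to `"unknown"` is the conservative choice: an error nobody has classified yet is not retried until someone decides it is transient.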
Streaming retries need a different rule
Streaming responses are a special case.
If a stream fails before any tokens arrive, retrying is usually fine.
If a stream fails after the user has already seen half the answer, retrying from scratch gives a confusing result.
The user might see:
- partial answer
- sudden reset
- different answer
- duplicate content
For chat UI, I track whether any content was received.
type StreamAttempt = {
  operationId: string;
  attempt: number;
  receivedTokens: number;
  partialText: string;
  completed: boolean;
  error?: LlmErrorCategory;
};
Then the behavior is:
- no tokens received: retry automatically
- some tokens received: show incomplete state
- structured output expected: discard or repair
- tool call involved: stop and ask for confirmation
For user-visible chat, I would rather show:
The response was interrupted. Continue?
than silently retry and produce a second answer.
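The four behaviors listed above can be sketched as one small decision function. `decideStreamRecovery` and `InterruptedStream` are hypothetical names, and the order of the checks is an assumption: the rules do not say which one wins when several apply, so this sketch lets the tool-call case take priority because it is the only one that can duplicate a side effect.

```typescript
type StreamRecovery =
  | "retry_automatically"
  | "show_incomplete"
  | "discard_or_repair"
  | "ask_for_confirmation";

// Hypothetical input: the facts known about an interrupted stream.
type InterruptedStream = {
  receivedTokens: number;
  expectsStructuredOutput: boolean;
  toolCallInvolved: boolean;
};

function decideStreamRecovery(stream: InterruptedStream): StreamRecovery {
  // Assumed priority: anything that may have triggered a tool wins.
  if (stream.toolCallInvolved) return "ask_for_confirmation";
  // Nothing reached the user, so a fresh attempt is invisible to them.
  if (stream.receivedTokens === 0) return "retry_automatically";
  // Partial JSON is useless as-is: throw it away or try to repair it.
  if (stream.expectsStructuredOutput) return "discard_or_repair";
  // Partial chat text was already shown, so surface the interruption.
  return "show_incomplete";
}
```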
Backoff still matters
Once I know a retry is safe, I still use normal backoff.
function retryDelayMs(attempt: number) {
  const base = Math.min(1000 * 2 ** attempt, 10000);
  const jitter = Math.floor(Math.random() * 500);
  return base + jitter;
}
I also respect provider retry hints when available.
function getRetryDelay(error: {
  retryAfterMs?: number;
}, attempt: number) {
  if (error.retryAfterMs) {
    return error.retryAfterMs;
  }
  return retryDelayMs(attempt);
}
The important part is that backoff is not the whole strategy.
Backoff answers:
How long should I wait before retrying?
It does not answer:
Is retrying safe?
That second question is where most LLM workflow bugs live.
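One place a `retryAfterMs` hint can come from is the standard HTTP `Retry-After` response header, which carries either a number of seconds or an HTTP date. Whether a given provider sends it is something to check in its documentation. The sketch below shows one way to parse it. `parseRetryAfterMs`, `clampRetryDelay`, and the 60-second cap are assumptions added for illustration.

```typescript
// Parses an HTTP Retry-After header value into milliseconds to wait.
// Returns undefined when the value is missing or cannot be parsed.
function parseRetryAfterMs(
  headerValue: string | null | undefined,
  nowMs: number
): number | undefined {
  if (!headerValue) return undefined;
  const trimmed = headerValue.trim();
  // Form 1: a delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }
  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  // A date in the past means "retry now", not a negative wait.
  return Math.max(0, dateMs - nowMs);
}

// Hypothetical cap so a very large hint cannot stall a worker for hours.
const MAX_RETRY_DELAY_MS = 60000;

function clampRetryDelay(delayMs: number): number {
  return Math.min(delayMs, MAX_RETRY_DELAY_MS);
}
```

Passing `nowMs` in rather than calling `Date.now()` inside keeps the function deterministic, which makes the date branch easy to test.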
Keep attempts visible to the user when needed
For background jobs, retries can be invisible.
For user-facing workflows, I prefer exposing state.
Example states:
type OperationStatus =
  | "queued"
  | "running"
  | "retrying"
  | "waiting_for_confirmation"
  | "completed"
  | "failed";
If an operation is safe to retry automatically, the UI can say:
Still working. Retrying after a temporary model error.
If the operation may cause a duplicate side effect, the UI should ask:
The email may have already been sent. Do you want to check status before trying again?
That sounds less smooth, but it is much better than accidentally sending two emails.
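A rough sketch of how a failed attempt might map onto those states is below. `statusAfterFailure`, `FailureContext`, and `userMessage` are hypothetical, and apart from the "Still working" text, the message strings are placeholders rather than recommended copy.

```typescript
type OperationStatus =
  | "queued"
  | "running"
  | "retrying"
  | "waiting_for_confirmation"
  | "completed"
  | "failed";

// Hypothetical facts the app knows right after an attempt fails.
type FailureContext = {
  retryable: boolean;
  sideEffectMayHaveStarted: boolean;
};

function statusAfterFailure(ctx: FailureContext): OperationStatus {
  // A possible duplicate side effect always goes to the user first.
  if (ctx.sideEffectMayHaveStarted) return "waiting_for_confirmation";
  return ctx.retryable ? "retrying" : "failed";
}

function userMessage(status: OperationStatus): string {
  switch (status) {
    case "retrying":
      return "Still working. Retrying after a temporary model error.";
    case "waiting_for_confirmation":
      return "This action may have already happened. Check status before trying again?";
    case "failed":
      return "The operation failed and was not retried.";
    default:
      return ""; // queued, running, completed: no failure message needed
  }
}
```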
Log retries as part of one workflow
One of the worst debugging experiences is seeing retries as unrelated logs.
I want logs grouped under one operation.
{
  "operation_id": "op_8f23",
  "user_id": "user_123",
  "operation_kind": "tool_planning",
  "attempt": 2,
  "max_attempts": 3,
  "provider": "openai",
  "model": "gpt-4.1-mini",
  "error_category": "rate_limited",
  "retryable": true,
  "next_retry_ms": 2400,
  "external_side_effect_started": false
}
For side effects, I log the idempotency key too.
{
  "operation_id": "op_8f23",
  "action": "send_email",
  "idempotency_key": "send_email:op_8f23:customer@example.com",
  "status": "completed"
}
This gives me a clear answer to the scary question:
Did the user action happen once, multiple times, or not at all?
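Given logs shaped like the side-effect entry above, that question can be answered mechanically. The sketch below is one way to do it. `sideEffectOutcome` is a hypothetical helper, and it assumes two things the example log does not show: that a `"started"` entry is written before the external call, and that a `"failed"` entry is written when the call definitely did not go through.

```typescript
// Assumed log shape: the post only shows "completed". The "started" and
// "failed" statuses are additions for this sketch.
type SideEffectLog = {
  operation_id: string;
  action: string;
  idempotency_key: string;
  status: "started" | "completed" | "failed";
};

type Outcome = "not_at_all" | "once" | "multiple_times" | "unknown";

function sideEffectOutcome(
  logs: SideEffectLog[],
  operationId: string,
  action: string
): Outcome {
  const relevant = logs.filter(
    (log) => log.operation_id === operationId && log.action === action
  );
  const count = (status: SideEffectLog["status"]) =>
    relevant.filter((log) => log.status === status).length;

  const completed = count("completed");
  if (completed > 1) return "multiple_times";
  if (completed === 1) return "once";
  // A start with no recorded end is the ambiguous timeout case:
  // the external system may or may not have acted.
  return count("started") > count("failed") ? "unknown" : "not_at_all";
}
```

The `"unknown"` result is the one that should route to a status check or a confirmation prompt instead of an automatic retry.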
My retry setup
The setup I use now is basically this:
- Create one operation ID per user action.
- Classify the LLM operation type.
- Classify the error.
- Retry only known transient failures.
- Do not retry external side effects unless they are idempotent.
- Use idempotency keys for tools and external APIs.
- Split LLM planning from application execution.
- Treat streaming partial output as a special state.
- Log all attempts under the same operation.
- Ask for confirmation when the system cannot know whether an action already happened.
In code, the retry wrapper looks less like a generic HTTP helper and more like an operation-aware executor.
async function runLlmAttempt<T>(params: {
  operationId: string;
  operationKind: LlmOperationKind;
  maxAttempts: number;
  call: () => Promise<T>;
}) {
  let attempt = 0;
  let lastError: unknown;
  while (attempt < params.maxAttempts) {
    try {
      return await params.call();
    } catch (error) {
      lastError = error;
      const category = classifyLlmError(error);
      // Retry only if the error qualifies and another attempt remains,
      // so the final failure is thrown without a pointless sleep.
      const retryable =
        attempt + 1 < params.maxAttempts &&
        shouldRetryLlmCall({
          operationKind: params.operationKind,
          error: category,
          attempt
        });
      logLlmAttempt({
        operationId: params.operationId,
        operationKind: params.operationKind,
        attempt,
        errorCategory: category,
        retryable
      });
      if (!retryable) {
        throw error;
      }
      await sleep(retryDelayMs(attempt));
      attempt += 1;
    }
  }
  throw lastError;
}
This is more code than a simple retry helper.
But it has saved me from treating dangerous operations like harmless API reads.
Final thought
Retries do not automatically add reliability.
For LLM apps, retries can improve reliability when the operation is read-only, transient, and safe to repeat.
They can create serious bugs when the model is part of a workflow that changes the outside world.
The question I ask now is not just:
Did the LLM call fail?
It is:
What has already happened because of this user action?
If nothing happened, retry.
If something may have happened, check, dedupe, or ask for confirmation.
That one distinction makes LLM retry logic a lot less scary.
For multi-model workflows, centralizing retries, fallback, and routing also helps keep this logic consistent. I use TokenBay for that kind of setup because it keeps the API surface familiar while making model/provider choices easier to manage in one place.