DEV Community

plasma

The Retry Setup I Use for LLM APIs Without Accidentally Duplicating User Actions

Retries look simple until an LLM call is allowed to do something.

For a normal read-only API request, retrying is usually boring:

if (status === 429 || status >= 500) {
  retryWithBackoff();
}

But LLM APIs are often sitting inside workflows that are not purely read-only.

A failed LLM call might be part of:

  • sending an email
  • creating a support ticket
  • updating a CRM record
  • calling a tool
  • writing to a database
  • charging credits
  • generating a document
  • triggering another automation step

In those cases, retrying blindly can turn one user action into two, three, or four real-world actions.

That is the bug I try hardest to avoid.

The mistake

The first retry setup I used for LLM APIs was basically copied from normal HTTP APIs:

async function callWithRetry(fn: () => Promise<Response>) {
  let lastError: unknown;

  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      await sleep(2 ** attempt * 1000);
    }
  }

  throw lastError;
}

This works fine for some LLM calls.

For example:

  • summarize this paragraph
  • classify this ticket
  • rewrite this title
  • extract entities from this text

If the first attempt fails, retrying is probably fine.

But it gets risky once the LLM call is part of a workflow with side effects.

Imagine this agent step:

  1. Read customer message
  2. Decide whether refund is needed
  3. Call issueRefund
  4. Write reply draft
  5. Send email

If the request times out after step 3, your retry code may run the whole thing again.

Now you might issue two refunds.

That is not a model quality problem.

That is a retry design problem.

My rule: retry transport, not user intent

The main principle I use now is:

Retry technical failures, but do not replay user intent unless the operation is idempotent.

A user clicked "Send invoice" once.

That user action should have one logical operation ID.

Even if the system retries internally, the outside world should still see one invoice send attempt, not three.

So before thinking about retry count or backoff, I separate LLM work into operation types.

type LlmOperationKind =
  | "read_only_generation"
  | "structured_extraction"
  | "streaming_chat"
  | "tool_planning"
  | "tool_execution"
  | "external_side_effect";

Then I treat them differently.

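// Returns true/false for most kinds; streaming gets a conditional answer
// because it depends on whether any tokens have already arrived.
function canRetryOperation(
  kind: LlmOperationKind
): boolean | "only_if_no_tokens_received" {
  switch (kind) {
    case "read_only_generation":
    case "structured_extraction":
      return true;

    case "streaming_chat":
      return "only_if_no_tokens_received";

    case "tool_planning":
      return true;

    // These touch the outside world, so they are never retried blindly.
    case "tool_execution":
    case "external_side_effect":
      return false;
  }
}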

This is the core distinction.

Retrying the model's planning step is usually fine.

Retrying the step that actually sends the email, charges the card, or updates the external system is not fine unless that step is idempotent.

Use one operation ID per user action

Every user-triggered workflow gets an operation ID.

type LlmOperation = {
  operationId: string;
  userId: string;
  kind: LlmOperationKind;
  createdAt: number;
};

That operationId follows the request through:

  • LLM call logs
  • retry attempts
  • tool calls
  • database writes
  • external API calls
  • user-visible status updates

A retry is not a new operation.

It is another attempt inside the same operation.

type LlmAttemptLog = {
  operationId: string;
  attempt: number;
  provider: string;
  model: string;
  startedAt: number;
  finishedAt?: number;
  status: "success" | "failed";
  errorCategory?: string;
};

This makes debugging much easier.

Instead of seeing four unrelated LLM failures, I can see:

One user action created one operation, which made four attempts.

That difference matters a lot during incidents.
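
As a rough illustration of "one operation, many attempts", here is a minimal in-memory sketch. The OperationTracker class and its method names are hypothetical and not part of any library; a real system would persist these records rather than keep them in a Map.

```typescript
// Hypothetical in-memory tracker, for illustration only.
type AttemptRecord = {
  operationId: string;
  attempt: number;
  status: "success" | "failed";
  errorCategory?: string;
};

class OperationTracker {
  private attempts = new Map<string, AttemptRecord[]>();

  // Records one attempt under an existing operation ID. The attempt number
  // is derived from how many attempts that operation already has.
  recordAttempt(
    operationId: string,
    status: "success" | "failed",
    errorCategory?: string
  ): AttemptRecord {
    const existing = this.attempts.get(operationId) ?? [];
    const record: AttemptRecord = {
      operationId,
      attempt: existing.length + 1,
      status,
      errorCategory
    };
    this.attempts.set(operationId, [...existing, record]);
    return record;
  }

  // One line per operation instead of several unrelated failure logs.
  summarize(operationId: string): string {
    const records = this.attempts.get(operationId) ?? [];
    const failed = records.filter((r) => r.status === "failed").length;
    return `${operationId}: ${records.length} attempts, ${failed} failed`;
  }
}
```

Every retry calls recordAttempt with the same operationId, so the summary always describes one user action.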

Make tool calls idempotent

If an LLM can call tools, the tools need idempotency keys.

For example, do not let this happen:

await issueRefund({
  customerId,
  amount
});

Prefer this:

await issueRefund({
  customerId,
  amount,
  idempotencyKey: `refund:${operationId}:${customerId}:${amount}`
});

Same for external actions:

await sendEmail({
  to,
  subject,
  body,
  idempotencyKey: `email:${operationId}:${to}`
});

The exact key depends on the action, but the idea is the same:

If the system retries, the external action should still happen at most once.

This is especially important because timeouts are ambiguous.

A timeout does not always mean "nothing happened."

It often means:

The client stopped waiting, but the provider or tool may still be working.

That ambiguity is where duplicate actions come from.
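
To make the "at most once" idea concrete, here is a minimal sketch of what the receiving side of an idempotency key might look like. IdempotentExecutor is a hypothetical name, and the in-memory Map stands in for what would normally be a database table with a unique constraint on the key. The sketch also assumes that a thrown error means the action did not take effect; for ambiguous timeouts you would keep the key and check status instead.

```typescript
// Hypothetical in-memory idempotency store, for illustration only.
type StoredResult<T> =
  | { status: "in_progress" }
  | { status: "completed"; result: T };

class IdempotentExecutor<T> {
  private store = new Map<string, StoredResult<T>>();

  async run(
    idempotencyKey: string,
    action: () => Promise<T>
  ): Promise<{ result: T; replayed: boolean }> {
    const existing = this.store.get(idempotencyKey);
    if (existing !== undefined) {
      if (existing.status === "completed") {
        // Same key seen before: return the stored result, do not act again.
        return { result: existing.result, replayed: true };
      }
      // Another attempt is still running; do not start a second one.
      throw new Error(`Action ${idempotencyKey} is still in progress`);
    }

    this.store.set(idempotencyKey, { status: "in_progress" });
    try {
      const result = await action();
      this.store.set(idempotencyKey, { status: "completed", result });
      return { result, replayed: false };
    } catch (error) {
      // Assumption: a thrown error means nothing happened, so a retry may run.
      this.store.delete(idempotencyKey);
      throw error;
    }
  }
}
```

With this in place, a retried call that reuses a key like `refund:${operationId}:${customerId}:${amount}` gets the first result back instead of issuing a second refund.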

Split planning from execution

For agents, I try to split the workflow into two phases:

  1. LLM decides what should happen.
  2. Application code executes the action.

The LLM can retry the planning step.

The application controls execution.

Example:

type PlannedAction =
  | {
      type: "send_email";
      to: string;
      subject: string;
      body: string;
    }
  | {
      type: "create_ticket";
      title: string;
      priority: "low" | "medium" | "high";
    }
  | {
      type: "no_action";
      reason: string;
    };

The LLM returns a plan.

Then the app validates and executes it.

async function executePlannedAction(
  action: PlannedAction,
  operationId: string
) {
  if (action.type === "send_email") {
    return sendEmail({
      to: action.to,
      subject: action.subject,
      body: action.body,
      idempotencyKey: `send_email:${operationId}:${action.to}`
    });
  }

  if (action.type === "create_ticket") {
    return createTicket({
      title: action.title,
      priority: action.priority,
      idempotencyKey: `create_ticket:${operationId}:${action.title}`
    });
  }

  return { skipped: true, reason: action.reason };
}

This keeps retry logic away from irreversible side effects.

If the LLM plan fails due to a transient error, I can retry the planning step.

If execution starts, I stop treating the whole workflow as freely retryable.
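
The validation half of "validates and executes" could look roughly like this. It is a sketch under assumptions: parsePlannedAction is a hypothetical helper, the checks (for example, requiring an "@" in the address) are deliberately crude, and the PlannedAction type is repeated here only so the example stands alone.

```typescript
type PlannedAction =
  | { type: "send_email"; to: string; subject: string; body: string }
  | { type: "create_ticket"; title: string; priority: "low" | "medium" | "high" }
  | { type: "no_action"; reason: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Turns raw model output (already JSON-parsed) into a PlannedAction,
// or returns null if the shape is not one the app is willing to execute.
function parsePlannedAction(raw: unknown): PlannedAction | null {
  if (!isRecord(raw)) return null;

  switch (raw["type"]) {
    case "send_email": {
      const { to, subject, body } = raw;
      if (typeof to !== "string" || !to.includes("@")) return null;
      if (typeof subject !== "string" || typeof body !== "string") return null;
      return { type: "send_email", to, subject, body };
    }
    case "create_ticket": {
      const { title, priority } = raw;
      if (typeof title !== "string" || title.trim() === "") return null;
      if (priority === "low" || priority === "medium" || priority === "high") {
        return { type: "create_ticket", title, priority };
      }
      return null;
    }
    case "no_action": {
      const { reason } = raw;
      return typeof reason === "string" ? { type: "no_action", reason } : null;
    }
    default:
      // Unknown action types are rejected rather than guessed at.
      return null;
  }
}
```

A null result is a planning failure, which is safe to retry because nothing has been executed yet.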

Retry only known transient failures

I do not retry every LLM failure.

I usually retry:

  • temporary rate limits
  • provider 5xx errors
  • network timeouts
  • connection resets
  • stream interrupted before any useful output
  • malformed structured output, sometimes

I usually do not retry:

  • auth errors
  • quota exhaustion
  • context length errors
  • content refusals
  • invalid request shape
  • completed tool execution
  • user-cancelled operations

A simple classifier helps.

type LlmErrorCategory =
  | "rate_limited"
  | "provider_unavailable"
  | "timeout"
  | "network_error"
  | "stream_interrupted"
  | "malformed_structured_output"
  | "quota_exceeded"
  | "auth_error"
  | "context_window_exceeded"
  | "content_refused"
  | "invalid_request"
  | "unknown";

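function isTransient(error: LlmErrorCategory): boolean {
  // Note: "malformed_structured_output" is not listed here, so the
  // "sometimes retry" case has to be handled separately.
  return [
    "rate_limited",
    "provider_unavailable",
    "timeout",
    "network_error",
    "stream_interrupted"
  ].includes(error);
}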

Then retry logic becomes more explicit.

function shouldRetryLlmCall(params: {
  operationKind: LlmOperationKind;
  error: LlmErrorCategory;
  attempt: number;
  receivedTokens?: number;
}) {
  if (params.attempt >= 3) return false;
  if (!isTransient(params.error)) return false;

  // Steps that touch the outside world are never retried automatically.
  if (
    params.operationKind === "tool_execution" ||
    params.operationKind === "external_side_effect"
  ) {
    return false;
  }

  // A stream that already produced output is not restarted silently.
  if (
    params.operationKind === "streaming_chat" &&
    (params.receivedTokens ?? 0) > 0
  ) {
    return false;
  }

  return true;
}

This is not fancy.

But it prevents a lot of expensive nonsense.
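
The error categories have to come from somewhere. Here is one way a classifier might map a raw error onto them. Treat it as a sketch: the HttpLikeError shape is an assumption about what your client throws, and the provider code strings ("insufficient_quota", "context_length_exceeded") are placeholders to adjust for your provider.

```typescript
type LlmErrorCategory =
  | "rate_limited"
  | "provider_unavailable"
  | "timeout"
  | "network_error"
  | "stream_interrupted"
  | "malformed_structured_output"
  | "quota_exceeded"
  | "auth_error"
  | "context_window_exceeded"
  | "content_refused"
  | "invalid_request"
  | "unknown";

// Assumed error shape: an HTTP status and/or a string code. Not a real SDK type.
type HttpLikeError = { status?: number; code?: string; name?: string };

function classifyLlmError(error: unknown): LlmErrorCategory {
  if (typeof error !== "object" || error === null) return "unknown";
  const { status, code, name } = error as HttpLikeError;

  // Client-side aborts and socket-level failures carry no HTTP status.
  if (name === "AbortError" || code === "ETIMEDOUT") return "timeout";
  if (code === "ECONNRESET" || code === "ECONNREFUSED") return "network_error";

  if (status === 401 || status === 403) return "auth_error";
  if (status === 429) {
    // Placeholder code string: quota exhaustion is not worth retrying.
    return code === "insufficient_quota" ? "quota_exceeded" : "rate_limited";
  }
  if (status === 400 || status === 422) {
    return code === "context_length_exceeded"
      ? "context_window_exceeded"
      : "invalid_request";
  }
  if (typeof status === "number" && status >= 500) return "provider_unavailable";

  // Anything unrecognized stays "unknown", which is not in the transient list.
  return "unknown";
}
```

Defaulting to "unknown" keeps the retry logic conservative: an error is only retried once someone has decided it is transient.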

Streaming retries need a different rule

Streaming responses are a special case.

If a stream fails before any tokens arrive, retrying is usually fine.

If a stream fails after the user has already seen half the answer, retrying from scratch may be weird.

The user might see:

  1. partial answer
  2. sudden reset
  3. different answer
  4. duplicate content

For chat UI, I track whether any content was received.

type StreamAttempt = {
  operationId: string;
  attempt: number;
  receivedTokens: number;
  partialText: string;
  completed: boolean;
  error?: LlmErrorCategory;
};

Then the behavior is:

  • no tokens received: retry automatically
  • some tokens received: show incomplete state
  • structured output expected: discard or repair
  • tool call involved: stop and ask for confirmation

For user-visible chat, I would rather show:

The response was interrupted. Continue?

than silently retry and produce a second answer.
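
The four behaviors listed above can be folded into one small decision function. This is a sketch with hypothetical names, and the precedence is my assumption: a tool call wins over everything else, then the "no tokens" case, then structured output.

```typescript
type StreamFailureContext = {
  receivedTokens: number;
  expectsStructuredOutput: boolean;
  toolCallInvolved: boolean;
};

type StreamRecovery =
  | "retry_automatically"
  | "show_incomplete_state"
  | "discard_or_repair"
  | "ask_for_confirmation";

function decideStreamRecovery(ctx: StreamFailureContext): StreamRecovery {
  // A tool call may already have had an effect, so a human decides.
  if (ctx.toolCallInvolved) return "ask_for_confirmation";

  // Nothing reached the user yet, so a retry is invisible.
  if (ctx.receivedTokens === 0) return "retry_automatically";

  // Half a JSON object is useless to the caller: throw it away or repair it.
  if (ctx.expectsStructuredOutput) return "discard_or_repair";

  // The user has seen partial text: keep it and mark it as interrupted.
  return "show_incomplete_state";
}
```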

Backoff still matters

Once I know a retry is safe, I still use normal backoff.

function retryDelayMs(attempt: number) {
  const base = Math.min(1000 * 2 ** attempt, 10000);
  const jitter = Math.floor(Math.random() * 500);
  return base + jitter;
}

I also respect provider retry hints when available.

function getRetryDelay(error: {
  retryAfterMs?: number;
}, attempt: number) {
  if (error.retryAfterMs) {
    return error.retryAfterMs;
  }

  return retryDelayMs(attempt);
}

The important part is that backoff is not the whole strategy.

Backoff answers:

How long should I wait before retrying?

It does not answer:

Is retrying safe?

That second question is where most LLM workflow bugs live.
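
For the provider retry hints mentioned above, the standard HTTP Retry-After header is the usual source. It carries either a number of seconds or an HTTP date. A minimal sketch of turning it into the retryAfterMs value, with parseRetryAfterMs as a hypothetical helper name:

```typescript
// Parses an HTTP Retry-After header value into milliseconds.
// Returns undefined when the header is missing or unparseable, so the
// caller can fall back to its own backoff.
function parseRetryAfterMs(
  headerValue: string | null,
  nowMs: number
): number | undefined {
  if (headerValue === null) return undefined;
  const trimmed = headerValue.trim();

  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;

  // A date in the past means "retry now", not a negative delay.
  return Math.max(0, dateMs - nowMs);
}
```

Passing the current time in as nowMs keeps the function easy to test. It may also be worth capping the result, since a provider can ask for a longer wait than a user-facing request should tolerate.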

Keep attempts visible to the user when needed

For background jobs, retries can be invisible.

For user-facing workflows, I prefer exposing state.

Example states:

type OperationStatus =
  | "queued"
  | "running"
  | "retrying"
  | "waiting_for_confirmation"
  | "completed"
  | "failed";

If an operation is safe to retry automatically, the UI can say:

Still working. Retrying after a temporary model error.

If the operation may cause a duplicate side effect, the UI should ask:

The email may have already been sent. Do you want to check status before trying again?

That sounds less smooth, but it is much better than accidentally sending two emails.
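
Picking the next state after a failure can be a very small function. The sketch below uses hypothetical names and assumes the caller already knows two things: whether the error is retryable, and whether a side effect may have started.

```typescript
type OperationStatus =
  | "queued"
  | "running"
  | "retrying"
  | "waiting_for_confirmation"
  | "completed"
  | "failed";

function statusAfterFailure(params: {
  retryable: boolean;
  sideEffectMayHaveStarted: boolean;
}): OperationStatus {
  // Possible duplicate side effect: stop and hand the decision to the user,
  // even if the error itself looks transient.
  if (params.sideEffectMayHaveStarted) return "waiting_for_confirmation";

  return params.retryable ? "retrying" : "failed";
}
```

The side-effect check comes first on purpose: a transient error does not make a possibly sent email safe to send again.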

Log retries as part of one workflow

One of the worst debugging experiences is seeing retries as unrelated logs.

I want logs grouped under one operation.

{
  "operation_id": "op_8f23",
  "user_id": "user_123",
  "operation_kind": "tool_planning",
  "attempt": 2,
  "max_attempts": 3,
  "provider": "openai",
  "model": "gpt-4.1-mini",
  "error_category": "rate_limited",
  "retryable": true,
  "next_retry_ms": 2400,
  "external_side_effect_started": false
}

For side effects, I log the idempotency key too.

{
  "operation_id": "op_8f23",
  "action": "send_email",
  "idempotency_key": "send_email:op_8f23:customer@example.com",
  "status": "completed"
}

This gives me a clear answer to the scary question:

Did the user action happen once, multiple times, or not at all?
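
With logs shaped like the side-effect entry above, that question can be answered in code. This is a sketch with hypothetical names. It also assumes a "started" entry is written before each side effect and a "failed" entry on a definite failure; the log example above only shows "completed".

```typescript
type SideEffectLog = {
  operation_id: string;
  action: string;
  idempotency_key: string;
  status: "started" | "completed" | "failed"; // "started"/"failed" are assumed
};

type Occurrence = "not_at_all" | "once" | "multiple_times" | "unknown";

function howOftenDidItHappen(
  logs: SideEffectLog[],
  operationId: string,
  action: string
): Occurrence {
  const relevant = logs.filter(
    (log) => log.operation_id === operationId && log.action === action
  );

  const completed = relevant.filter((log) => log.status === "completed").length;
  if (completed > 1) return "multiple_times";
  if (completed === 1) return "once";

  // No completion logged. If some key was started but never reached a
  // terminal entry, the action may or may not have happened.
  const finishedKeys = new Set(
    relevant
      .filter((log) => log.status !== "started")
      .map((log) => log.idempotency_key)
  );
  const dangling = relevant.some(
    (log) => log.status === "started" && !finishedKeys.has(log.idempotency_key)
  );

  return dangling ? "unknown" : "not_at_all";
}
```

The "unknown" result is the case where asking the user, or checking the external system, beats retrying.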

My retry setup

The setup I use now is basically this:

  1. Create one operation ID per user action.
  2. Classify the LLM operation type.
  3. Classify the error.
  4. Retry only known transient failures.
  5. Do not retry external side effects unless they are idempotent.
  6. Use idempotency keys for tools and external APIs.
  7. Split LLM planning from application execution.
  8. Treat streaming partial output as a special state.
  9. Log all attempts under the same operation.
  10. Ask for confirmation when the system cannot know whether an action already happened.

In code, the retry wrapper looks less like a generic HTTP helper and more like an operation-aware executor.

async function runLlmAttempt<T>(params: {
  operationId: string;
  operationKind: LlmOperationKind;
  maxAttempts: number;
  call: () => Promise<T>;
}): Promise<T> {
  let lastError: unknown;

  for (let attempt = 0; attempt < params.maxAttempts; attempt++) {
    try {
      return await params.call();
    } catch (error) {
      lastError = error;
      const category = classifyLlmError(error);

      // Retryable only if the error is safe to retry AND an attempt is left.
      // Without the second check, the final failure would be logged as
      // retryable and followed by a pointless sleep.
      const retryable =
        attempt + 1 < params.maxAttempts &&
        shouldRetryLlmCall({
          operationKind: params.operationKind,
          error: category,
          attempt
        });

      logLlmAttempt({
        operationId: params.operationId,
        operationKind: params.operationKind,
        attempt,
        errorCategory: category,
        retryable
      });

      if (!retryable) {
        throw error;
      }

      await sleep(retryDelayMs(attempt));
    }
  }

  // Only reached when maxAttempts is 0 or less.
  throw lastError;
}

This is more code than a simple retry helper.

But it has saved me from treating dangerous operations like harmless API reads.

Final thought

Retries do not automatically mean reliability.

For LLM apps, retries can improve reliability when the operation is read-only, transient, and safe to repeat.

They can create serious bugs when the model is part of a workflow that changes the outside world.

The question I ask now is not just:

Did the LLM call fail?

It is:

What has already happened because of this user action?

If nothing happened, retry.

If something may have happened, check, dedupe, or ask for confirmation.

That one distinction makes LLM retry logic a lot less scary.

For multi-model workflows, centralizing retries, fallback, and routing also helps keep this logic consistent. I use TokenBay for that kind of setup because it keeps the API surface familiar while making model/provider choices easier to manage in one place.
