plasma

Posted on Jul 3

The Retry Setup I Use for LLM APIs Without Accidentally Duplicating User Actions

#llm #ai #api #webdev

Retries look simple until an LLM call is allowed to do something.

For a normal read-only API request, retrying is usually boring:

if (status === 429 || status >= 500) {
  retryWithBackoff();
}

But LLM APIs are often sitting inside workflows that are not purely read-only.

A failed LLM call might be part of:

sending an email
creating a support ticket
updating a CRM record
calling a tool
writing to a database
charging credits
generating a document
triggering another automation step

In those cases, retrying blindly can turn one user action into two, three, or four real-world actions.

That is the bug I try hardest to avoid.

The mistake

The first retry setup I used for LLM APIs was basically copied from normal HTTP APIs:

async function callWithRetry(fn: () => Promise<Response>) {
  let lastError: unknown;

  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      await sleep(2 ** attempt * 1000);
    }
  }

  throw lastError;
}

This works fine for some LLM calls.

For example:

summarize this paragraph
classify this ticket
rewrite this title
extract entities from this text

If the first attempt fails, retrying is probably fine.

But it gets risky once the LLM call is part of a workflow with side effects.

Imagine this agent step:

1. Read customer message
2. Decide whether refund is needed
3. Call issueRefund
4. Write reply draft
5. Send email

If the request times out after step 3, your retry code may run the whole thing again.

Now you might issue two refunds.

That is not a model quality problem.

That is a retry design problem.

My rule: retry transport, not user intent

The main principle I use now is:

Retry technical failures, but do not replay user intent unless the operation is idempotent.

A user clicked "Send invoice" once.

That user action should have one logical operation ID.

Even if the system retries internally, the outside world should still see one invoice send attempt, not three.

So before thinking about retry count or backoff, I separate LLM work into operation types.

type LlmOperationKind =
  | "read_only_generation"
  | "structured_extraction"
  | "streaming_chat"
  | "tool_planning"
  | "tool_execution"
  | "external_side_effect";

Then I treat them differently.

type RetryPolicy = "always" | "only_if_no_tokens_received" | "never";

function canRetryOperation(kind: LlmOperationKind): RetryPolicy {
  switch (kind) {
    case "read_only_generation":
    case "structured_extraction":
    case "tool_planning":
      return "always";

    case "streaming_chat":
      return "only_if_no_tokens_received";

    case "tool_execution":
    case "external_side_effect":
      // Only safe to repeat if the step itself is idempotent.
      return "never";
  }
}

This is the core distinction.

Retrying the model's planning step is usually fine.

Retrying the step that actually sends the email, charges the card, or updates the external system is not fine unless that step is idempotent.

Use one operation ID per user action

Every user-triggered workflow gets an operation ID.

type LlmOperation = {
  operationId: string;
  userId: string;
  kind: LlmOperationKind;
  createdAt: number;
};

That operationId follows the request through:

LLM call logs
retry attempts
tool calls
database writes
external API calls
user-visible status updates

A retry is not a new operation.

It is another attempt inside the same operation.

type LlmAttemptLog = {
  operationId: string;
  attempt: number;
  provider: string;
  model: string;
  startedAt: number;
  finishedAt?: number;
  status: "success" | "failed";
  errorCategory?: string;
};

This makes debugging much easier.

Instead of seeing four unrelated LLM failures, I can see:

One user action created one operation, which made four attempts.

That difference matters a lot during incidents.
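As a rough sketch of how attempts roll up under one operation, the snippet below groups attempt logs by `operationId`. The `AttemptLog` type is a trimmed-down version of the log type above, and `summarizeAttempts` with its summary shape is a hypothetical helper for illustration, not part of any library.

```typescript
// Trimmed-down attempt log, assumed for this sketch.
type AttemptLog = {
  operationId: string;
  attempt: number;
  status: "success" | "failed";
  errorCategory?: string;
};

type OperationSummary = {
  operationId: string;
  attempts: number;
  succeeded: boolean;
  errors: string[];
};

// Hypothetical helper: one summary per operation, however many attempts it took.
function summarizeAttempts(logs: AttemptLog[]): OperationSummary[] {
  const byOperation = new Map<string, OperationSummary>();

  for (const log of logs) {
    const summary: OperationSummary = byOperation.get(log.operationId) ?? {
      operationId: log.operationId,
      attempts: 0,
      succeeded: false,
      errors: []
    };

    summary.attempts += 1;
    if (log.status === "success") summary.succeeded = true;
    if (log.errorCategory) summary.errors.push(log.errorCategory);

    byOperation.set(log.operationId, summary);
  }

  return Array.from(byOperation.values());
}

const summaries = summarizeAttempts([
  { operationId: "op_1", attempt: 0, status: "failed", errorCategory: "timeout" },
  { operationId: "op_1", attempt: 1, status: "failed", errorCategory: "rate_limited" },
  { operationId: "op_1", attempt: 2, status: "failed", errorCategory: "timeout" },
  { operationId: "op_1", attempt: 3, status: "success" }
]);
```

Here the four log lines collapse into a single summary for `op_1`, which is the view I want during an incident.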

Make tool calls idempotent

If an LLM can call tools, the tools need idempotency keys.

For example, do not let this happen:

await issueRefund({
  customerId,
  amount
});

Prefer this:

await issueRefund({
  customerId,
  amount,
  idempotencyKey: `refund:${operationId}:${customerId}:${amount}`
});

Same for external actions:

await sendEmail({
  to,
  subject,
  body,
  idempotencyKey: `email:${operationId}:${to}`
});

The exact key depends on the action, but the idea is the same:

If the system retries, the external action should still happen at most once.

This is especially important because timeouts are ambiguous.

A timeout does not always mean "nothing happened."

It often means:

The client stopped waiting, but the provider or tool may still be working.

That ambiguity is where duplicate actions come from.
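On the receiving side, honoring an idempotency key can be as simple as remembering the result of the first call. The sketch below uses an in-memory `Map`. `runOnce` is a hypothetical helper, and a real system would likely need a durable store with a uniqueness constraint instead.

```typescript
// Hypothetical in-memory idempotency guard, for illustration only.
// The first call for a key runs the action; later calls reuse its promise.
const resultsByKey = new Map<string, Promise<unknown>>();

function runOnce<T>(
  idempotencyKey: string,
  action: () => Promise<T>
): Promise<T> {
  const existing = resultsByKey.get(idempotencyKey);
  if (existing !== undefined) {
    // Same key seen before: do not execute the side effect again.
    return existing as Promise<T>;
  }

  // A rejected promise stays cached too, so an ambiguous failure is not
  // silently re-executed. The caller has to decide what to do next.
  const pending = action();
  resultsByKey.set(idempotencyKey, pending);
  return pending;
}

let refundsIssued = 0;

const issueRefundOnce = () =>
  runOnce("refund:op_8f23:cust_1:500", async () => {
    refundsIssued += 1; // stands in for the real side effect
    return { refunded: true };
  });

void issueRefundOnce(); // first attempt runs the action
void issueRefundOnce(); // retry with the same key reuses the stored result
```

Caching failures as well as successes is a deliberate choice in this sketch: after a timeout, the safe default is to check status, not to run the action again.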

Split planning from execution

For agents, I try to split the workflow into two phases:

LLM decides what should happen.
Application code executes the action.

The LLM can retry the planning step.

The application controls execution.

Example:

type PlannedAction =
  | {
      type: "send_email";
      to: string;
      subject: string;
      body: string;
    }
  | {
      type: "create_ticket";
      title: string;
      priority: "low" | "medium" | "high";
    }
  | {
      type: "no_action";
      reason: string;
    };

The LLM returns a plan.

Then the app validates and executes it.

async function executePlannedAction(
  action: PlannedAction,
  operationId: string
) {
  if (action.type === "send_email") {
    return sendEmail({
      to: action.to,
      subject: action.subject,
      body: action.body,
      idempotencyKey: `send_email:${operationId}:${action.to}`
    });
  }

  if (action.type === "create_ticket") {
    return createTicket({
      title: action.title,
      priority: action.priority,
      idempotencyKey: `create_ticket:${operationId}:${action.title}`
    });
  }

  return { skipped: true, reason: action.reason };
}

This keeps retry logic away from irreversible side effects.

If the LLM plan fails due to a transient error, I can retry the planning step.

If execution starts, I stop treating the whole workflow as freely retryable.
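The validation step is not shown above, so here is one minimal way it could look. `parsePlannedAction` is a hypothetical helper that takes the model's raw JSON text and returns a checked action or `null`. The specific checks are assumptions and would need tightening for real use (for example, validating the email address).

```typescript
type PlannedAction =
  | { type: "send_email"; to: string; subject: string; body: string }
  | { type: "create_ticket"; title: string; priority: "low" | "medium" | "high" }
  | { type: "no_action"; reason: string };

// Hypothetical validator: returns null instead of throwing, so the caller can
// treat a bad plan as malformed structured output and re-plan.
function parsePlannedAction(raw: string): PlannedAction | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof value !== "object" || value === null) return null;

  const { type, to, subject, body, title, priority, reason } =
    value as Record<string, unknown>;
  const isText = (x: unknown): x is string =>
    typeof x === "string" && x.trim() !== "";

  if (type === "send_email" && isText(to) && isText(subject) && isText(body)) {
    return { type: "send_email", to, subject, body };
  }

  if (
    type === "create_ticket" &&
    isText(title) &&
    (priority === "low" || priority === "medium" || priority === "high")
  ) {
    return { type: "create_ticket", title, priority };
  }

  if (type === "no_action" && isText(reason)) {
    return { type: "no_action", reason };
  }

  return null; // unknown action type or missing fields
}

const plan = parsePlannedAction(
  '{"type":"create_ticket","title":"Login fails","priority":"high"}'
);
```

A `null` result means nothing has been executed yet, so re-running the planning step is still safe at that point.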

Retry only known transient failures

I do not retry every LLM failure.

I usually retry:

temporary rate limits
provider 5xx errors
network timeouts
connection resets
stream interrupted before any useful output
malformed structured output, sometimes

I usually do not retry:

auth errors
quota exhaustion
context length errors
content refusals
invalid request shape
completed tool execution
user-cancelled operations

A simple classifier helps.

type LlmErrorCategory =
  | "rate_limited"
  | "provider_unavailable"
  | "timeout"
  | "network_error"
  | "stream_interrupted"
  | "malformed_structured_output"
  | "quota_exceeded"
  | "auth_error"
  | "context_window_exceeded"
  | "content_refused"
  | "invalid_request"
  | "unknown";

function isTransient(error: LlmErrorCategory) {
  return [
    "rate_limited",
    "provider_unavailable",
    "timeout",
    "network_error",
    "stream_interrupted"
  ].includes(error);
}
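The retry wrapper later in this post calls a `classifyLlmError` function that is not shown. A minimal sketch might look like the following. The error shape (`status`, `code`, `name`) is an assumption, since provider SDKs expose these details differently, and the `"insufficient_quota"` code is only an example value. Context length and refusal errors are left out because how they are reported varies by provider.

```typescript
type LlmErrorCategory =
  | "rate_limited"
  | "provider_unavailable"
  | "timeout"
  | "network_error"
  | "stream_interrupted"
  | "malformed_structured_output"
  | "quota_exceeded"
  | "auth_error"
  | "context_window_exceeded"
  | "content_refused"
  | "invalid_request"
  | "unknown";

// Assumed error shape for this sketch; adapt it to the client you use.
type RawLlmError = { status?: number; code?: string; name?: string };

function classifyLlmError(error: unknown): LlmErrorCategory {
  const e: RawLlmError =
    typeof error === "object" && error !== null ? (error as RawLlmError) : {};

  // Transport-level failures (Node.js-style error codes, aborted requests).
  if (e.name === "AbortError" || e.code === "ETIMEDOUT") return "timeout";
  if (e.code === "ECONNRESET" || e.code === "ECONNREFUSED") return "network_error";

  // HTTP-level failures.
  if (e.status === 401 || e.status === 403) return "auth_error";
  if (e.status === 429) {
    // Assumption: exhausted quota arrives as a 429 with a distinguishing code.
    return e.code === "insufficient_quota" ? "quota_exceeded" : "rate_limited";
  }
  if (e.status === 400 || e.status === 422) return "invalid_request";
  if (e.status !== undefined && e.status >= 500) return "provider_unavailable";

  // Anything unrecognized is "unknown", which is not treated as transient.
  return "unknown";
}
```

Defaulting to `"unknown"` keeps the retry policy conservative: an error is only retried once it has been positively identified as transient.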

Then retry logic becomes more explicit.

function shouldRetryLlmCall(params: {
  operationKind: LlmOperationKind;
  error: LlmErrorCategory;
  attempt: number;
  receivedTokens?: number;
}) {
  if (params.attempt >= 3) return false;
  if (!isTransient(params.error)) return false;

  // Steps that touch the outside world are never retried automatically.
  if (
    params.operationKind === "tool_execution" ||
    params.operationKind === "external_side_effect"
  ) {
    return false;
  }

  // A stream that already delivered tokens should not silently restart.
  if (
    params.operationKind === "streaming_chat" &&
    (params.receivedTokens ?? 0) > 0
  ) {
    return false;
  }

  return true;
}

This is not fancy.

But it prevents a lot of expensive nonsense.

Streaming retries need a different rule

Streaming responses are a special case.

If a stream fails before any tokens arrive, retrying is usually fine.

If a stream fails after the user has already seen half the answer, retrying from scratch is confusing.

The user might see:

partial answer
sudden reset
different answer
duplicate content

For chat UI, I track whether any content was received.

type StreamAttempt = {
  operationId: string;
  attempt: number;
  receivedTokens: number;
  partialText: string;
  completed: boolean;
  error?: LlmErrorCategory;
};

Then the behavior is:

no tokens received: retry automatically
some tokens received: show incomplete state
structured output expected: discard or repair
tool call involved: stop and ask for confirmation

For user-visible chat, I would rather show:

The response was interrupted. Continue?

than silently retry and produce a second answer.
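The four behaviors listed above can be written as one small decision function. This is only a sketch: `decideAfterStreamFailure` and the context flags are hypothetical names, and the order of the checks (tool calls first) is my assumption about which rule should win when several apply.

```typescript
type StreamFailureAction =
  | "retry_automatically"
  | "show_incomplete"
  | "discard_or_repair"
  | "ask_for_confirmation";

// Hypothetical context flags, assumed for this sketch.
type StreamFailureContext = {
  receivedTokens: number;
  expectsStructuredOutput: boolean;
  toolCallInvolved: boolean;
};

function decideAfterStreamFailure(
  ctx: StreamFailureContext
): StreamFailureAction {
  // Possible side effects take priority over everything else.
  if (ctx.toolCallInvolved) return "ask_for_confirmation";

  // Nothing reached the user yet, so a fresh attempt is invisible.
  if (ctx.receivedTokens === 0) return "retry_automatically";

  // Partial structured output is unusable as-is: drop it or try a repair.
  if (ctx.expectsStructuredOutput) return "discard_or_repair";

  // Plain chat text: keep what arrived and let the user choose to continue.
  return "show_incomplete";
}
```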

Backoff still matters

Once I know a retry is safe, I still use normal backoff.

function retryDelayMs(attempt: number) {
  const base = Math.min(1000 * 2 ** attempt, 10000);
  const jitter = Math.floor(Math.random() * 500);
  return base + jitter;
}

I also respect provider retry hints when available.

function getRetryDelay(error: {
  retryAfterMs?: number;
}, attempt: number) {
  if (error.retryAfterMs) {
    return error.retryAfterMs;
  }

  return retryDelayMs(attempt);
}
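Where `retryAfterMs` comes from is left open above. If the provider sends a standard HTTP `Retry-After` header, which carries either a number of seconds or an HTTP date, it could be converted roughly like this. `parseRetryAfterMs` is a hypothetical helper, and the 60-second cap is an assumption to avoid waiting indefinitely on a bad hint.

```typescript
// Hypothetical helper: converts a Retry-After header value to milliseconds.
function parseRetryAfterMs(
  headerValue: string | null,
  nowMs: number = Date.now(),
  maxMs: number = 60000 // assumed upper bound for this sketch
): number | undefined {
  if (!headerValue) return undefined;
  const trimmed = headerValue.trim();

  // Form 1: delay in seconds, e.g. "Retry-After: 5"
  if (/^\d+$/.test(trimmed)) {
    return Math.min(Number(trimmed) * 1000, maxMs);
  }

  // Form 2: HTTP date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;

  // A date in the past means "retry now", so clamp at zero.
  return Math.min(Math.max(dateMs - nowMs, 0), maxMs);
}
```

With a Fetch-style response this would be called as `parseRetryAfterMs(response.headers.get("retry-after"))`. An unparseable header yields `undefined`, so the normal backoff applies.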

The important part is that backoff is not the whole strategy.

Backoff answers:

How long should I wait before retrying?

It does not answer:

Is retrying safe?

That second question is where most LLM workflow bugs live.

Keep attempts visible to the user when needed

For background jobs, retries can be invisible.

For user-facing workflows, I prefer exposing state.

Example states:

type OperationStatus =
  | "queued"
  | "running"
  | "retrying"
  | "waiting_for_confirmation"
  | "completed"
  | "failed";

If an operation is safe to retry automatically, the UI can say:

Still working. Retrying after a temporary model error.

If the operation may cause a duplicate side effect, the UI should ask:

The email may have already been sent. Do you want to check status before trying again?

That sounds less smooth, but it is much better than accidentally sending two emails.
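One possible way to wire this into the UI is a small mapping from the failure situation to a status and message. `viewAfterFailure` and its two input flags are hypothetical, and the final fallback message is a placeholder of my own.

```typescript
type OperationStatus =
  | "queued"
  | "running"
  | "retrying"
  | "waiting_for_confirmation"
  | "completed"
  | "failed";

type FailureView = { status: OperationStatus; message: string };

// Hypothetical helper: decides what the user sees after a failed attempt.
function viewAfterFailure(params: {
  retryable: boolean;
  sideEffectMayHaveStarted: boolean;
}): FailureView {
  // A possible duplicate side effect always wins over automatic retry.
  if (params.sideEffectMayHaveStarted) {
    return {
      status: "waiting_for_confirmation",
      message:
        "The action may have already happened. Do you want to check status before trying again?"
    };
  }

  if (params.retryable) {
    return {
      status: "retrying",
      message: "Still working. Retrying after a temporary model error."
    };
  }

  // Placeholder wording, not from any real product.
  return { status: "failed", message: "The request failed. Please try again." };
}
```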

Log retries as part of one workflow

One of the worst debugging experiences is seeing retries as unrelated logs.

I want logs grouped under one operation.

{
  "operation_id": "op_8f23",
  "user_id": "user_123",
  "operation_kind": "tool_planning",
  "attempt": 2,
  "max_attempts": 3,
  "provider": "openai",
  "model": "gpt-4.1-mini",
  "error_category": "rate_limited",
  "retryable": true,
  "next_retry_ms": 2400,
  "external_side_effect_started": false
}

For side effects, I log the idempotency key too.

{
  "operation_id": "op_8f23",
  "action": "send_email",
  "idempotency_key": "send_email:op_8f23:customer@example.com",
  "status": "completed"
}

This gives me a clear answer to the scary question:

Did the user action happen once, multiple times, or not at all?
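With logs shaped like the ones above, that question can be answered mechanically. The sketch below is one way to do it. `sideEffectOutcome` is a hypothetical helper, and the `"started"` and `"failed"` status values are assumptions, since the example log only shows `"completed"`.

```typescript
// Assumed log shape, based on the side-effect log entry shown above.
type SideEffectLog = {
  operation_id: string;
  action: string;
  idempotency_key: string;
  status: "started" | "completed" | "failed";
};

type Outcome = "not_at_all" | "once" | "multiple_times" | "unknown";

// Hypothetical helper: how many times did this action complete for the operation?
function sideEffectOutcome(
  logs: SideEffectLog[],
  operationId: string,
  action: string
): Outcome {
  const relevant = logs.filter(
    (log) => log.operation_id === operationId && log.action === action
  );
  const completed = relevant.filter((log) => log.status === "completed").length;

  if (completed > 1) return "multiple_times";
  if (completed === 1) return "once";

  // Started but never confirmed: the ambiguous timeout case.
  if (relevant.some((log) => log.status === "started")) return "unknown";

  return "not_at_all";
}

const outcome = sideEffectOutcome(
  [
    {
      operation_id: "op_8f23",
      action: "send_email",
      idempotency_key: "send_email:op_8f23:customer@example.com",
      status: "started"
    },
    {
      operation_id: "op_8f23",
      action: "send_email",
      idempotency_key: "send_email:op_8f23:customer@example.com",
      status: "completed"
    }
  ],
  "op_8f23",
  "send_email"
);
```

The `"unknown"` outcome is the one that should trigger a status check or a confirmation prompt instead of another attempt.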

My retry setup

The setup I use now is basically this:

Create one operation ID per user action.
Classify the LLM operation type.
Classify the error.
Retry only known transient failures.
Do not retry external side effects unless they are idempotent.
Use idempotency keys for tools and external APIs.
Split LLM planning from application execution.
Treat streaming partial output as a special state.
Log all attempts under the same operation.
Ask for confirmation when the system cannot know whether an action already happened.

In code, the retry wrapper looks less like a generic HTTP helper and more like an operation-aware executor.

async function runLlmAttempt<T>(params: {
  operationId: string;
  operationKind: LlmOperationKind;
  maxAttempts: number;
  call: () => Promise<T>;
}): Promise<T> {
  let lastError: unknown;

  for (let attempt = 0; attempt < params.maxAttempts; attempt++) {
    try {
      return await params.call();
    } catch (error) {
      lastError = error;
      const category = classifyLlmError(error);

      // Do not sleep or log "retryable" after the final allowed attempt.
      const attemptsLeft = attempt + 1 < params.maxAttempts;

      const retryable =
        attemptsLeft &&
        shouldRetryLlmCall({
          operationKind: params.operationKind,
          error: category,
          attempt
        });

      logLlmAttempt({
        operationId: params.operationId,
        operationKind: params.operationKind,
        attempt,
        errorCategory: category,
        retryable
      });

      if (!retryable) {
        throw error;
      }

      await sleep(retryDelayMs(attempt));
    }
  }

  // Only reached if maxAttempts is zero or negative.
  throw lastError;
}

This is more code than a simple retry helper.

But it has saved me from treating dangerous operations like harmless API reads.

Final thought

Retries do not automatically make a system more reliable.

For LLM apps, retries can improve reliability when the operation is read-only, transient, and safe to repeat.

They can create serious bugs when the model is part of a workflow that changes the outside world.

The question I ask now is not just:

Did the LLM call fail?

It is:

What has already happened because of this user action?

If nothing happened, retry.

If something may have happened, check, dedupe, or ask for confirmation.

That one distinction makes LLM retry logic a lot less scary.

For multi-model workflows, centralizing retries, fallback, and routing also helps keep this logic consistent. I use TokenBay for that kind of setup because it keeps the API surface familiar while making model/provider choices easier to manage in one place.
