Lars Winstand

Posted on Jun 26 • Originally published at standardcompute.com

My bot kept double-posting and the real bug wasn’t GPT-5

#ai #devops #automation #webdev

If your agent heartbeat looks healthy but your Telegram or Discord bot still double-posts, the usual culprit is not GPT-5 or Claude failing.

It’s usually a boring distributed-systems bug:

request times out at 30s
work actually succeeds at 51.7s
retry fires
same side effect happens twice

I ran into this pattern while reading an r/openclaw thread where someone described the exact failure mode in one line:

every time the timeout happened, the original message did go through after 50s, AND the retry goes through, so I end up w double messages.

That sentence explains a huge percentage of “my AI bot is flaky” bugs.

Not model instability. Not prompt weirdness. Not GPT-5 being moody.

Just unsafe retries.

The bug shape

Here’s the typical flow:

Your agent calls GPT-5, Claude Opus, or Qwen
Inference takes longer than expected
Your workflow sends the result to Telegram or Discord
The client times out before it gets the response
The send actually succeeds anyway
Your retry posts the same message again

From that OpenClaw thread, the numbers were the giveaway:

gateway timeout after 30000ms
message.action 51702ms

That means the caller gave up at 30 seconds, but the action appears to have completed at 51.702 seconds.

So the retry wasn’t crazy. It was doing exactly what the system told it to do.

The problem is that retries are only safe when the operation is idempotent.
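To make the timeline concrete, here is a scaled-down replay of that failure. It is only a sketch: `slowSend`, `withTimeout`, and `naiveSendWithRetry` are made-up names, the channel is a plain array, and 10 ms stands in for 1 s of the real log.

```javascript
// Hypothetical slow send: the side effect lands after `workMs`,
// whether or not the caller is still waiting for the response.
function slowSend(channel, text, workMs) {
  return new Promise((resolve) => {
    setTimeout(() => {
      channel.push(text);
      resolve("ok");
    }, workMs);
  });
}

// Caller-side timeout: stops waiting, but cannot undo or cancel the remote work.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function naiveSendWithRetry(channel, text) {
  try {
    // "30 s" timeout against a send that needs "51.7 s".
    return await withTimeout(slowSend(channel, text, 517), 300);
  } catch (err) {
    // Unsafe: treats "no response" as "nothing happened" and sends again.
    return await withTimeout(slowSend(channel, text, 50), 300);
  }
}

async function demo() {
  const channel = [];
  await naiveSendWithRetry(channel, "hello");
  // Give the abandoned first attempt time to land.
  await new Promise((resolve) => setTimeout(resolve, 600));
  return channel;
}

demo().then((channel) => console.log(channel)); // logs two copies of "hello"
```

Nothing in the retry is wrong on its own. The duplicate comes from the first attempt finishing after the caller stopped listening.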

The rule: retries are fine, side effects are the dangerous part

Retrying compute is usually good.

Retrying outbound side effects without dedup is how you get duplicate Telegram messages, duplicate Discord posts, duplicate emails, duplicate tickets, and eventually duplicate customer pain.

This is the distinction I wish more agent builders made:

Pattern	What actually happens
Retry model call	Usually safe if you can tolerate another inference
Retry webhook or message send	Dangerous if the first request may have already succeeded
Retry side effect with idempotency key	Safe because duplicate attempts resolve to the same operation

A lot of AI reliability bugs are really just distributed systems bugs wearing an LLM costume.

What idempotency actually means

The cleanest explanation still comes from Stripe.

You send a POST request with an Idempotency-Key. Stripe stores the first result for that key and returns the same status code and body on retries.

That means the client no longer has to guess whether the first request succeeded.

Example:

curl https://api.stripe.com/v1/customers \
  -u sk_test_...: \
  -H "Idempotency-Key: KG5LxwFBepaKHyUD" \
  -d description="My First Test Customer"

That pattern should be normal for agent side effects too.
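On the receiving side, the idea can be sketched in a few lines. This is not Stripe's implementation, just an in-memory illustration of "store the first result for a key and replay it"; `handleCreateCustomer` and the response shape are hypothetical.

```javascript
// Hypothetical in-memory idempotency store; a real service would persist this.
const responsesByKey = new Map();
let customersCreated = 0;

// The first call for a key does the work and stores the response.
// Later calls with the same key replay the stored response unchanged.
function handleCreateCustomer(idempotencyKey, body) {
  if (responsesByKey.has(idempotencyKey)) {
    return responsesByKey.get(idempotencyKey);
  }
  customersCreated += 1;
  const response = {
    status: 200,
    body: { id: `cus_${customersCreated}`, description: body.description },
  };
  responsesByKey.set(idempotencyKey, response);
  return response;
}

const payload = { description: "My First Test Customer" };
const first = handleCreateCustomer("KG5LxwFBepaKHyUD", payload);
const retry = handleCreateCustomer("KG5LxwFBepaKHyUD", payload);

console.log(first === retry, customersCreated); // true 1
```

The retry gets the stored response back, and only one customer exists.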

If you’re sending to Telegram Bot API, Discord webhooks, Slack, email, or any external channel, every outbound action should have an operation identity.

If the API doesn’t support native idempotency, build your own dedup ledger.

Why agent frameworks make this worse

Because they’re trying to help.

Temporal retries Activities by default. That’s a good design. But if your Activity includes “post this message to Discord” and that operation isn’t idempotent, retries will happily create duplicates.

n8n has the same trap with friendlier UI.

You can turn on:

Retry On Fail
Wait Between Tries
error workflows
execution.retryOf for debugging

All useful.

None of that makes a Telegram send safe by itself.

Retry features are not dedup features.

The real-world failure mode with Discord

Discord rate limits make this even messier.

Their limits are dynamic, and the docs tell you to read headers like:

X-RateLimit-Limit
X-RateLimit-Remaining
X-RateLimit-Reset
X-RateLimit-Reset-After
X-RateLimit-Bucket
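A minimal sketch of reading two of those headers before the next send. It assumes `X-RateLimit-Reset-After` is expressed in seconds, which is how I read the Discord docs, and that `headers` is anything with a `get(name)` method that takes lowercase names. `waitBeforeNextSendMs` is a made-up helper name.

```javascript
// Returns how long to wait (in ms) before sending again to the same bucket.
// Assumption: reset-after is in seconds and may be fractional.
function waitBeforeNextSendMs(headers) {
  const remaining = headers.get("x-ratelimit-remaining");
  const resetAfter = headers.get("x-ratelimit-reset-after");

  // Headers missing: no information, so do not invent a delay.
  if (remaining == null || resetAfter == null) return 0;

  // Requests still available in this window.
  if (Number(remaining) > 0) return 0;

  return Math.ceil(Number(resetAfter) * 1000);
}

// A Map stands in for response headers here.
const exhausted = new Map([
  ["x-ratelimit-remaining", "0"],
  ["x-ratelimit-reset-after", "1.25"],
]);
console.log(waitBeforeNextSendMs(exhausted)); // 1250
```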

Now combine that with a slow LLM call.

Say GPT-5 takes 40 seconds because your context window is bloated. Your bot finally sends to Discord. Discord responds with a rate limit or the client times out. Your code treats all of these the same:

timeout
429
unknown delivery state

Then it retries immediately.

That’s how you get tickets like:

“Discord is randomly duplicating messages”
“OpenAI must be unstable”
“My bot posts twice when the model is slow”

No. Your system failed to separate compute retries from side-effect retries.
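One way to stop treating those three cases the same is to classify what the client actually observed before deciding anything. A rough sketch, where the `result` shape and the `next` labels are my own invention, and treating 5xx as "unknown" is a deliberately conservative assumption:

```javascript
// `result` is { status, retryAfterMs? } when a response arrived,
// or { error } when none did. Both shapes are assumptions for this sketch.
function classifySendOutcome(result) {
  if (result.error) {
    // No response: the message may or may not have been posted.
    return { outcome: "unknown", next: "check_before_retry" };
  }

  const { status } = result;
  if (status >= 200 && status < 300) {
    return { outcome: "delivered", next: "done" };
  }
  if (status === 429) {
    // The provider answered and rejected the request, so nothing was posted.
    return {
      outcome: "not_delivered",
      next: "retry_after_delay",
      delayMs: result.retryAfterMs ?? 1000,
    };
  }
  if (status >= 500) {
    // Conservative assumption: a server error may hide a completed write.
    return { outcome: "unknown", next: "check_before_retry" };
  }
  // Other 4xx: the request itself is wrong, so retrying it will not help.
  return { outcome: "not_delivered", next: "do_not_retry" };
}

console.log(classifySendOutcome({ error: new Error("timeout") }).next); // check_before_retry
console.log(classifySendOutcome({ status: 429, retryAfterMs: 2500 }).next); // retry_after_delay
```

Only the 429 branch is safe to retry blindly, and only after the delay.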

The practical fix

The best fix I saw in that Reddit discussion was also the least glamorous:

I built a Discord bot that kept double-posting under timeout. Logs were useless until I added a crude dedup key... My timeouts came from the LLM taking 40s+ for long context, so I set a 90s gateway timeout and handled inflight state explicitly.

That’s the playbook.

Pattern I’d use every time

Create an operation ID before sending
Store inflight state
Use a timeout budget that matches reality
On retry, check the ledger first
Treat 429 separately from ambiguous timeout
Record provider response details

A decent operation ID looks like this:

conversation_id + turn_id + channel + message_hash

A decent state model looks like this:

pending
sent
failed_unknown
failed_confirmed

Minimal Node example: dedup around a Discord send

Here’s a stripped-down example in Node.js.

import crypto from "node:crypto";
import fetch from "node-fetch";

const ledger = new Map();

function makeOperationId({ conversationId, turnId, channel, content }) {
  const hash = crypto.createHash("sha256").update(content).digest("hex").slice(0, 12);
  return `${conversationId}:${turnId}:${channel}:${hash}`;
}

async function sendDiscordMessage({ webhookUrl, conversationId, turnId, content }) {
  const opId = makeOperationId({
    conversationId,
    turnId,
    channel: "discord",
    content,
  });

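  const existing = ledger.get(opId);
  if (existing?.status === "sent") {
    return { ok: true, deduped: true, messageId: existing.messageId };
  }
  // An earlier attempt is still inflight or ended ambiguously:
  // do not post again until its state has been resolved.
  if (existing?.status === "pending" || existing?.status === "failed_unknown") {
    return { ok: false, deduped: true, status: existing.status };
  }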

  ledger.set(opId, { status: "pending", updatedAt: Date.now() });

  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 90000);

  try {
    const res = await fetch(webhookUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ content }),
      signal: controller.signal,
    });

    clearTimeout(timeout);

    if (res.status === 429) {
      const retryAfter = res.headers.get("x-ratelimit-reset-after");
      ledger.set(opId, {
        status: "failed_confirmed",
        reason: "rate_limited",
        retryAfter,
        updatedAt: Date.now(),
      });
      throw new Error(`Discord rate limited. retry_after=${retryAfter}`);
    }

    if (!res.ok) {
      ledger.set(opId, {
        status: "failed_confirmed",
        reason: `http_${res.status}`,
        updatedAt: Date.now(),
      });
      throw new Error(`Discord send failed: ${res.status}`);
    }

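    // Placeholder ID: a real implementation should store the ID the provider returns.
    const messageId = `discord:${Date.now()}`;
    ledger.set(opId, {
      status: "sent",
      messageId,
      updatedAt: Date.now(),
    });

    return { ok: true, deduped: false, messageId };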
  } catch (err) {
    clearTimeout(timeout);

    if (err.name === "AbortError") {
      ledger.set(opId, {
        status: "failed_unknown",
        reason: "timeout_ambiguous",
        updatedAt: Date.now(),
      });
    }

    throw err;
  }
}

This example is intentionally simple, but the important behavior is there:

operation ID is created before send
send state is recorded
timeout is explicit
ambiguous timeout is not treated like confirmed failure
retries can consult the ledger before posting again

In production, that ledger should live in Redis, Postgres, DynamoDB, or whatever durable store you already trust.
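Whatever store you pick, the property that matters is an atomic "set if absent", so two workers cannot both claim the same operation. Redis `SET` with the `NX` option and a unique-key insert in Postgres are common ways to get it. The sketch below uses a hypothetical store interface (`setIfAbsent`, `get`, `put`) with an in-memory stand-in. It is not any real client library's API.

```javascript
// In-memory stand-in for a durable store. The interface is hypothetical.
class MemoryStore {
  constructor() {
    this.rows = new Map();
  }
  // Resolves to true only for the first caller that claims `key`.
  async setIfAbsent(key, value) {
    if (this.rows.has(key)) return false;
    this.rows.set(key, value);
    return true;
  }
  async get(key) {
    return this.rows.get(key);
  }
  async put(key, value) {
    this.rows.set(key, value);
  }
}

// Claims the operation ID before sending, so concurrent attempts cannot both post.
async function sendOnce(store, opId, send) {
  const claimed = await store.setIfAbsent(opId, { status: "pending" });
  if (!claimed) {
    // Another attempt owns this operation: report its state instead of posting.
    return { sent: false, existing: await store.get(opId) };
  }
  try {
    const details = await send();
    await store.put(opId, { status: "sent", details });
    return { sent: true, details };
  } catch (err) {
    // Conservative: any failure after the claim is recorded as unknown delivery.
    await store.put(opId, { status: "failed_unknown", reason: String(err) });
    throw err;
  }
}
```

Calling `sendOnce` twice with the same operation ID, even concurrently, runs `send` at most once. The second caller gets the existing record back.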

A better retry decision tree

This is the decision tree I want in every bot codebase:

Did the model call fail?
  -> retry compute if appropriate

Did the outbound send fail with confirmed no-delivery?
  -> retry send

Did the outbound send time out and delivery is unknown?
  -> check ledger / provider state before retrying

Did the outbound send already succeed for this operation ID?
  -> return existing result, do not post again

That one distinction cleans up a lot of chaos.
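The three send-related branches of that tree fit in one small function over the ledger record. A sketch, with action names I made up, using the state names from earlier in the post:

```javascript
// Decides what a retry should do, given the ledger record for this operation ID.
// `record` is undefined when the operation has never been attempted.
function decideSendAction(record) {
  if (!record) return "send";

  switch (record.status) {
    case "sent":
      // Already delivered: hand back the stored result, do not post again.
      return "return_existing";
    case "failed_confirmed":
      // Provider confirmed no delivery: a retry is safe.
      return "retry_send";
    case "pending":
    case "failed_unknown":
      // Inflight or ambiguous: check provider state before posting again.
      return "verify_before_retry";
    default:
      throw new Error(`unknown ledger status: ${record.status}`);
  }
}

console.log(decideSendAction({ status: "failed_unknown" })); // verify_before_retry
```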

How I’d wire this in n8n

If I were fixing this in n8n tomorrow, I’d do three things first:

1. Increase timeout budgets above known long-context inference times.
2. Generate a dedup key for every outbound message action.
3. Log retry lineage with execution.retryOf plus your own operation ID.

A practical n8n pattern:

Use a Code node to generate operationId
Check Redis/Postgres before the Telegram or Discord node
If already sent, short-circuit the workflow
If not sent, mark pending
Send message
Mark sent with provider response details
On timeout or ambiguous error, mark failed_unknown

That’s a lot more useful than staring at a green heartbeat and blaming Claude.

How I’d wire this in Temporal

In Temporal, I’d keep LLM calls and outbound side effects separate.

Put inference in an Activity with retries
Put message delivery in another Activity
Make the delivery Activity idempotent
Use an operation ID as part of the Activity input
Persist send results somewhere durable

The mistake is putting “generate + send” in one retrying Activity and hoping the retries behave nicely.

They won’t.
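Here is the shape of that separation in plain JavaScript. None of this is Temporal SDK code: `generateReply` and `deliverMessage` are ordinary async functions standing in for two Activities, `withRetries` stands in for the retry policy, and the `failAfterSend` flag exists only to simulate a timeout that fires after the message already landed.

```javascript
const deliveries = new Map(); // stand-in for a durable store keyed by operation ID
const channel = []; // stand-in for Discord or Telegram
let inferenceCalls = 0;

// "Activity" 1: compute only, safe to retry.
async function generateReply(prompt) {
  inferenceCalls += 1;
  return `reply to: ${prompt}`;
}

// "Activity" 2: idempotent delivery, keyed by the operation ID in its input.
async function deliverMessage({ operationId, text }, failAfterSend) {
  if (deliveries.has(operationId)) return deliveries.get(operationId);
  channel.push(text);
  const result = { operationId, position: channel.length };
  deliveries.set(operationId, result);
  if (failAfterSend) throw new Error("simulated timeout after send");
  return result;
}

// Stand-in for a retry policy: calls fn(attemptIndex) until it succeeds.
async function withRetries(fn, attempts) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn(i);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

async function handleTurn(prompt, operationId) {
  // Inference runs once; its output is passed into delivery as plain data.
  const text = await withRetries(() => generateReply(prompt), 3);
  // Delivery retries reuse the same text and the same operation ID.
  return withRetries(
    (attempt) => deliverMessage({ operationId, text }, attempt === 0),
    3
  );
}
```

Running `handleTurn("hi", "conv-1:turn-1:discord")` retries the delivery once but leaves a single message in `channel`. Because the text is generated before delivery starts, a delivery retry never triggers a second inference, so it cannot produce different content for the same turn.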

Sometimes the model really is slow

To be fair, sometimes the model is part of the problem.

OpenAI, Anthropic, local Qwen, local Llama: whatever you’re using, any of them can get slow under long context, load, memory pressure, or provider throttling.

Idempotency won’t make inference faster.

What it does do is stop your workflow from turning slow inference into duplicate side effects.

That matters even more when you’re running agents at scale.

If you’re using a setup with predictable flat-rate AI access instead of per-token billing, you’re usually more willing to let agents run, retry, and handle bigger workloads. That’s great for throughput. It also means you need better retry hygiene, because aggressive automation amplifies bad side-effect handling fast.

That’s one reason I like what Standard Compute is doing: it removes the per-token paranoia that makes teams under-build automations, but it also makes the engineering tradeoff more obvious. Once compute is cheap and predictable, workflow correctness becomes the bottleneck.

And workflow correctness starts with not posting the same message twice.

The boring takeaway that actually fixes the bug

If your bot talks to Telegram or Discord, treat every outbound message like a payment:

give it an identity
assume retries will happen
store delivery state
distinguish confirmed failure from unknown outcome
never confuse “I didn’t get a response” with “the action did not happen”

Most of the ugly “AI reliability” bugs I see are still old distributed-systems bugs.

Honestly, that’s good news.

Because you can fix those today.

You do not need GPT-6 to stop your bot from double-posting.
