Few lines of code look more innocent than this:
retry(3)
It feels responsible.
Professional.
Resilient.
After all, networks fail.
Servers become unavailable.
Databases occasionally time out.
Retrying seems like the obvious solution.
And sometimes it is.
But after enough years building production systems, I've become convinced of something:
Retry is one of the most dangerous keywords in software.
Not because retries are bad.
Because retries amplify everything.
Good systems become more reliable.
Bad systems become disasters.
The problem is that many developers treat retries as a reliability feature when they're actually a distributed systems feature.
And distributed systems are where simple ideas go to become complicated.
Why Retries Exist
Imagine:
await fetch("/api/users");
The request fails.
Maybe:
- Network hiccup
- Temporary database issue
- Load balancer restart
- Service deployment
The operation might succeed if attempted again.
So we write:
retry(3)
Seems reasonable.
And in many cases:
It Works
Which is why retries become popular.
The Dangerous Assumption
Most developers unconsciously assume:
Failure
=
Operation Did Not Execute
Unfortunately, that's not always true.
A request can:
Execute Successfully
↓
Response Never Arrives
From the client's perspective:
Failure
From the server's perspective:
Success
Now a retry becomes dangerous.
The Double Payment Problem
Imagine a payment service.
await chargeCard(order);
The card processor successfully charges:
$100
The response is lost due to a network issue.
Client sees:
Request Failed
and retries.
await chargeCard(order);
again.
Now:
Charge #1 = Success
Charge #2 = Success
The customer paid twice.
Nobody wrote bad logic.
The retry created the bug.
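Here is a minimal sketch of that sequence. The `charges` array stands in for the card processor, and the `dropResponse` flag is a made-up switch that simulates the lost response; neither comes from a real payment API.

```javascript
// Hypothetical in-memory "card processor", used only for illustration.
const charges = [];

// Simulates a server that always performs the charge, but whose
// response can be lost on the way back to the client.
async function chargeCard(order, { dropResponse = false } = {}) {
  charges.push({ orderId: order.id, amount: order.amount }); // side effect happens first
  if (dropResponse) {
    throw new Error("network timeout"); // the client never sees the success
  }
  return { status: "charged" };
}

// A naive retry: any error is treated as "the operation did not run".
async function chargeWithNaiveRetry(order) {
  try {
    return await chargeCard(order, { dropResponse: true }); // first response is lost
  } catch {
    return await chargeCard(order); // the retry repeats the side effect
  }
}

// Runs the scenario once and reports how many charges were recorded.
async function demo() {
  await chargeWithNaiveRetry({ id: "order-1", amount: 100 });
  return charges.length;
}
```

Calling `demo()` resolves to `2`: one order, two charges, and no line of code that looks wrong on its own.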
The Email Storm Problem
Consider:
await sendWelcomeEmail(user);
The email provider accepts the message.
The response times out.
The application retries.
await sendWelcomeEmail(user);
again.
Customer receives:
Welcome!
Welcome!
Welcome!
Welcome!
Support ticket created.
Marketing team confused.
The retry succeeded.
Too well.
Retries Amplify Side Effects
This is the core issue.
Pure operations:
2 + 2
can run forever.
Nothing changes.
Side effects are different.
Examples:
Charge Card
Create Order
Send Email
Book Seat
Reserve Inventory
Send SMS
Each execution changes reality.
Retries repeat reality.
And reality doesn't always appreciate repetition.
The Thundering Herd Problem
One failed request isn't scary.
Ten thousand retries are.
Imagine:
Service A
becomes slow.
Clients start retrying.
Traffic doubles.
Service becomes slower.
More retries occur.
Traffic doubles again.
Eventually:
Small Failure
↓
Massive Outage
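This feedback loop is usually called a retry storm.
It is a close relative of:
The Thundering Herd Problem
where many clients hit the same service at the same moment.
And retries are often the cause.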
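The amplification can be estimated with a little arithmetic. The sketch below assumes that `retry(3)` means three retries after the first attempt and that every attempt fails independently with the same probability. Real outages are messier, so treat the numbers as a rough guide.

```javascript
// Expected number of attempts per logical request, assuming every attempt
// fails independently with probability `failureRate` and the client gives
// up after `maxRetries` retries.
function expectedAttempts(failureRate, maxRetries) {
  let attempts = 0;
  let reachProbability = 1; // probability that attempt k is made at all
  for (let k = 0; k <= maxRetries; k++) {
    attempts += reachProbability;
    reachProbability *= failureRate; // attempt k+1 only happens if attempt k failed
  }
  return attempts;
}
```

Under these assumptions, `expectedAttempts(0.5, 3)` is `1.875` and `expectedAttempts(1, 3)` is `4`. A service that is failing completely receives four times its normal traffic at exactly the moment it can least handle it.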
When Retries Attack Databases
Suppose:
Database
is under heavy load.
Queries start timing out.
Application retries automatically.
Now:
More Queries
↓
More Load
↓
More Timeouts
↓
More Retries
You have accidentally built a denial-of-service attack against your own database.
Why Idempotency Matters
In the previous article we discussed:
Idempotency
This is where it becomes critical.
Without idempotency:
Retry
=
Repeat Side Effects
With idempotency:
Retry
=
Same Result
A retry becomes safe.
That's why reliable systems almost always combine:
Retries
+
Idempotency
rather than using retries alone.
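One common way to get there is an idempotency key: the client sends the same key with every attempt, and the server remembers the result of the first one. The sketch below is illustrative only. `chargeCardOnce`, the `processed` map, and the key format are all assumptions, and a real service would keep the keys in durable storage with an atomic insert.

```javascript
// Hypothetical in-memory store; a real service would use a durable database
// with a unique constraint on the key.
const processed = new Map(); // idempotencyKey -> stored result
let chargeCount = 0;

function chargeCardOnce(idempotencyKey, order) {
  if (processed.has(idempotencyKey)) {
    return processed.get(idempotencyKey); // replay: return the first result
  }
  chargeCount += 1; // the real side effect runs only once per key
  const result = { status: "charged", amount: order.amount, chargeId: chargeCount };
  processed.set(idempotencyKey, result);
  return result;
}

// The original attempt and its retry share one key, so only one charge happens.
const firstAttempt = chargeCardOnce("order-1-payment", { amount: 100 });
const retryAttempt = chargeCardOnce("order-1-payment", { amount: 100 });
```

After both calls, `chargeCount` is `1` and `retryAttempt` is the same stored result as `firstAttempt`. The retry got an answer without repeating the charge.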
Not Every Failure Should Be Retried
A common mistake:
retry(3)
for every error.
Consider:
400 Bad Request
Retrying won't help.
The request is invalid.
Or:
401 Unauthorized
Retrying won't magically authenticate the user.
Good retry policies distinguish between:
Transient Failures
and
Permanent Failures
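For HTTP calls, that distinction often starts with the status code. The set below is an assumption made for this sketch, not a universal rule; which codes are worth retrying depends on the API being called.

```javascript
// Status codes treated as transient here are an assumption for this sketch.
// 408 = request timeout, 429 = too many requests, 502/503/504 = upstream trouble.
const TRANSIENT_STATUS = new Set([408, 429, 502, 503, 504]);

// Returns true only for failures that might succeed on a later attempt.
function isRetryable(status) {
  return TRANSIENT_STATUS.has(status);
}
```

With this policy, `isRetryable(503)` is `true`, while `isRetryable(400)` and `isRetryable(401)` are `false`, so an invalid or unauthenticated request fails fast instead of being sent three more times.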
Exponential Backoff Exists For A Reason
Bad:
Retry Immediately
Retry Immediately
Retry Immediately
Better:
1 Second
↓
2 Seconds
↓
4 Seconds
↓
8 Seconds
This is:
Exponential Backoff
and it prevents systems from overwhelming already struggling services.
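A small helper makes the schedule concrete. This is a sketch under stated assumptions: `retryWithBackoff`, its option names, and the 30-second cap are invented for illustration, and the `wait` function is injectable so the delays can be observed without actually sleeping.

```javascript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
// capped at maxDelayMs so the wait never grows without bound.
function backoffDelay(attempt, baseMs = 1000, maxDelayMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxDelayMs);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `operation` until it succeeds or `maxRetries` retries are used up.
async function retryWithBackoff(operation, { maxRetries = 3, baseMs = 1000, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxRetries) throw error; // out of retries: surface the failure
      await wait(backoffDelay(attempt, baseMs)); // 1s, 2s, 4s, ... with the defaults
    }
  }
}
```

With the defaults, `backoffDelay(0)` through `backoffDelay(3)` give 1000, 2000, 4000, and 8000 milliseconds. Many production libraries also add random jitter so that clients do not retry in lockstep; it is left out here to keep the delays predictable.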
Real World Example: Flight Booking
Imagine:
Reserve Seat
times out.
Client retries.
Without protection:
Seat Reserved Twice
or:
Two Different Seats Reserved
Now inventory becomes inconsistent.
Airlines spend enormous effort preventing these scenarios.
Because retries happen constantly.
Real World Example: Webhooks
Webhook providers often retry automatically.
For example:
Payment Completed
may arrive:
1 Time
2 Times
5 Times
depending on delivery conditions.
Systems that assume:
Exactly Once
processing usually fail eventually.
Systems that expect retries survive.
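Expecting retries usually means remembering what has already been handled. The sketch below assumes each webhook event carries a unique `id` field, which many providers include but which should be checked against the provider's documentation. The in-memory `Set` is a stand-in for durable storage.

```javascript
const seenEventIds = new Set(); // hypothetical store; use durable storage in practice
const fulfilledOrders = [];

// Handles a "Payment Completed" event at most once per event ID.
function handleWebhook(event) {
  if (seenEventIds.has(event.id)) {
    return "duplicate-ignored"; // acknowledge, but do not repeat the side effect
  }
  seenEventIds.add(event.id);
  fulfilledOrders.push(event.orderId); // the side effect we must not duplicate
  return "processed";
}
```

Delivering the same event five times returns `"processed"` once and `"duplicate-ignored"` four times, and the order is fulfilled only once.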
Real World Example: Message Queues
Kafka.
RabbitMQ.
SQS.
Azure Service Bus.
By default, all of them assume:
Messages May Be Delivered Again
because losing a message is usually worse than delivering it twice.
Consumers must be designed accordingly.
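One way to design a consumer accordingly is to act only on a state transition, so a redelivered message finds the work already done. The `orders` map and the message shape here are hypothetical and not tied to any particular broker.

```javascript
// Hypothetical order store keyed by order ID.
const orders = new Map([["order-1", { status: "paid", shipmentsCreated: 0 }]]);

// Acts only on the paid -> shipped transition. A redelivered message
// finds the order already shipped and does nothing.
function handleOrderShipped(message) {
  const order = orders.get(message.orderId);
  if (order.status === "shipped") {
    return false; // already handled: safe to acknowledge and move on
  }
  order.status = "shipped";
  order.shipmentsCreated += 1; // stands in for the real side effect
  return true;
}
```

Handling `{ orderId: "order-1" }` twice returns `true` and then `false`, and `shipmentsCreated` stays at `1`.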
Common Retry Mistakes
Retrying Everything
Not every failure is recoverable.
Retrying Immediately
Often makes outages worse.
Ignoring Idempotency
Creates duplicate side effects.
Infinite Retries
Eventually become infinite damage.
Hiding Failures
Retries should not become a substitute for monitoring.
Pros Of Retries
1. Better Reliability
Transient failures disappear.
2. Better User Experience
Temporary outages become invisible.
3. Improved Resilience
Systems tolerate instability.
4. Reduced Manual Intervention
Many failures self-heal.
5. Better Distributed Systems
Network failures become manageable.
Cons Of Retries
1. Duplicate Operations
Without idempotency, every retry can repeat a side effect.
2. Traffic Amplification
Can worsen outages.
3. Cascading Failures
One issue spreads across systems.
4. Increased Complexity
Backoff strategies become necessary.
5. Hidden Production Problems
Retries can mask deeper issues.
The Real Lesson
Most developers think retries exist to make software more reliable.
That's only partially true.
Retries don't eliminate failures.
They change failures.
Sometimes they transform:
Temporary Network Problem
into:
Duplicate Payment
Sometimes they transform:
Slow Database
into:
Full System Outage
That's why experienced engineers don't ask:
Should We Retry?
They ask:
What Happens If This Operation Executes Twice?
Because once retries enter the picture, duplicate execution is no longer an edge case.
At scale, it's a certainty.
And reliable systems are designed with that reality in mind.
What's Next?
In the next article we'll discuss:
The Myth Of Stateless Systems
Because many systems described as "stateless" are actually storing state somewhere else.
And that distinction turns out to be extremely important.
About The Author
Hi, I'm Amrish Khan.
I enjoy building developer tools, exploring software architecture, and writing about the deeper ideas behind everyday programming concepts.
I'm also building Aruvix — a growing ecosystem of local-first developer tools designed to process data directly in the browser without unnecessary uploads.
Here's a detailed blog on Aruvix:
https://dev.to/amrishkhan05/aruvix-the-ultimate-offline-first-developer-toolkit-e0i
You can follow my work and thoughts here:
Portfolio:
https://www.amrishkhan.dev
LinkedIn:
https://www.linkedin.com/in/amrishkhan
GitHub:
https://www.github.com/amrishkhan05
If you enjoyed this article, consider following for more deep dives into JavaScript, architecture, local-first software, and performance engineering.