I used to think retry logic was an implementation detail.
It isn’t.
Retries encode assumptions about failure, time, trust, and responsibility. When those assumptions are wrong, systems don’t crash. They lie quietly.
This post isn’t about elegance. It’s about being explicit.
The mistake people make
Most retry implementations answer the wrong question.
They ask: “How do I try again?”
The real question is: “Under what failures am I allowed to try again?”
Those are not the same thing.
Retries are not resilience by default
Blind retries are comforting. They make engineers feel proactive.
In reality, they often:
- Mask real outages
- Amplify load during partial failures
- Destroy observability
- Delay alerts until damage is done
A retry without a failure model is just noise with a sleep call.
What I learned building a monitoring primitive
While building an async endpoint checker, I was forced to confront a few uncomfortable truths.
- Parameters are contracts
If a function exposes a parameter that is not used, it is lying.
Dead parameters rot APIs. They create false confidence and future bugs. Removing them is not cleanup. It’s honesty.
- Catching Exception inside retries is negligence
Retrying on all exceptions means retrying on:
- Programmer errors
- Logic bugs
- Invalid states
Those failures should terminate execution immediately.
Retries are for expected, transient failures only. Anything else must fail fast.
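A minimal sketch of that rule, assuming the caller can name its transient failures as exception types. `TransientError` and `retry_on` are hypothetical names for illustration, not code from the checker itself.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures the caller expects to be temporary."""


def retry_on(func, retryable, attempts=3, delay=0.0):
    """Call func, retrying only on the exception types listed in `retryable`.

    Assumes attempts >= 1.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retryable:
            if attempt == attempts:
                raise  # budget exhausted: surface the real failure
            time.sleep(delay)
        # Any exception NOT listed in `retryable` (TypeError, KeyError, ...)
        # is never caught here, so it propagates on the first attempt.
```

There is no `except Exception` anywhere. A `TypeError` from a bug leaves the loop on the first attempt instead of being retried.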
- HTTP retries without backoff are hostile behavior
Retrying immediately on 500s or 429s is not resilience.
It’s pressure.
If your system retries aggressively during degradation, it becomes part of the outage. Good retry logic reduces harm. Bad retry logic accelerates it.
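One common way to take the pressure off is capped exponential backoff with jitter. This is a sketch under my own assumptions: `backoff_delays`, `base`, and `cap` are illustrative names and defaults, not values from a real system.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Yield one sleep duration per retry.

    The ceiling doubles each time (base, 2*base, 4*base, ...) up to `cap`.
    Each delay is a random fraction of that ceiling ("full jitter"), so
    many clients retrying at once do not hit the server in lockstep.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling
```

The caller sleeps for each yielded value between attempts. Passing `rng` in keeps the schedule testable without real randomness.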
- Time must have a single owner
If multiple layers measure “total time”, metrics become contradictory.
Time is a resource, not a side effect.
Only one layer should own it. Every other layer should report only what it measured itself, such as the duration of a single attempt, or nothing at all.
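A rough sketch of single ownership, with hypothetical names (`timed_attempt`, `run_check`) and a probe that returns a truthy value on success. The inner layer reports only its own attempt duration. The outer layer is the only place a total is computed.

```python
import time


def timed_attempt(probe):
    """Inner layer: reports its own duration and never claims a total."""
    start = time.monotonic()
    value = probe()
    return {"value": value, "attempt_seconds": time.monotonic() - start}


def run_check(probe, attempts=3):
    """Outer layer: the single owner of total elapsed time."""
    start = time.monotonic()
    history = []
    for _ in range(attempts):
        result = timed_attempt(probe)
        history.append(result)
        if result["value"]:  # assumption: truthy means the probe succeeded
            break
    return {"attempts": history, "total_seconds": time.monotonic() - start}
```

`time.monotonic()` is used rather than `time.time()` so that wall-clock adjustments cannot distort the measurement.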
- Helpers should not know semantics
A retry helper that understands HTTP status codes is doing too much.
Helpers should be stupid and obedient. Policy belongs to the caller.
When helpers start making decisions, architecture leaks.
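Here is roughly what "stupid and obedient" can look like. Everything in this sketch is hypothetical: the helper `retry` takes a predicate from the caller and knows nothing about HTTP. Reading "500s" as the whole 5xx range in the example policy is my assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T],
          should_retry: Callable[[T], bool],
          attempts: int = 3,
          delay: float = 0.0) -> T:
    """Generic helper: repeat `call` while the caller's predicate says so."""
    result = call()
    for _ in range(attempts - 1):
        if not should_retry(result):
            break
        time.sleep(delay)
        result = call()
    return result  # returned exactly as the last call produced it


# Policy lives with the caller, not inside the helper.
def http_status_is_retryable(status: int) -> bool:
    return status == 429 or 500 <= status <= 599
```

Swapping the policy means passing a different predicate. The helper never changes.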
The most dangerous bug
On the final retry, it’s tempting to overwrite the result:
- Force failure
- Normalize fields
- “Clean things up”
That destroys information.
The last attempt is still truth. Corrupting it poisons analytics, alerting, and postmortems. These bugs don’t show up in logs. They show up in lost trust.
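One way to avoid that temptation is to store "we ran out of retries" next to the last attempt instead of on top of it. The types below (`Attempt`, `CheckOutcome`) are hypothetical, a sketch rather than the checker's real data model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Attempt:
    status: Optional[int]
    error: Optional[str]


@dataclass(frozen=True)
class CheckOutcome:
    final: Attempt                 # the last attempt, exactly as observed
    history: Tuple[Attempt, ...]   # every attempt, in order
    exhausted: bool                # retry budget ran out; kept beside the data


def check_with_retries(probe: Callable[[], Attempt],
                       is_success: Callable[[Attempt], bool],
                       attempts: int = 3) -> CheckOutcome:
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    history = []
    for _ in range(attempts):
        attempt = probe()
        history.append(attempt)
        if is_success(attempt):
            return CheckOutcome(attempt, tuple(history), exhausted=False)
    # No overwriting, no normalizing: the final attempt is passed through.
    return CheckOutcome(history[-1], tuple(history), exhausted=True)
```

The frozen dataclasses are a deliberate choice: nothing downstream can rewrite an attempt after the fact.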
Why this matters in monitoring systems
Some failures justify retries.
Others demand immediate alerts.
Some should be recorded but not acted on.
If a monitoring system cannot explain why something failed, it cannot be trusted when it claims something is broken.
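To make those three buckets concrete, here is a small sketch. The mapping from status codes to actions is purely my illustrative assumption, and a real system would choose its own. The part that matters is that every decision comes back with a reason.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    RETRY = "retry"
    ALERT = "alert"
    RECORD = "record"


def classify(status: Optional[int], timed_out: bool = False) -> Tuple[Action, str]:
    """Return what to do with a check result, plus a human-readable reason."""
    if timed_out:
        return Action.RETRY, "timeout: possibly transient"
    if status is None:
        return Action.ALERT, "no response and no timeout: unexplained failure"
    if status == 429 or 500 <= status <= 599:
        return Action.RETRY, f"HTTP {status}: rate limit or server error, may recover"
    if status in (401, 403):
        return Action.ALERT, f"HTTP {status}: access problem, retrying will not help"
    if 400 <= status <= 499:
        return Action.RECORD, f"HTTP {status}: client-side error, noted for later"
    return Action.RECORD, f"HTTP {status}: not treated as a failure"
```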
Closing thought
Retry logic is not a loop. It’s a statement about how you believe the world behaves under stress.
If that statement is vague, your system will be vague when it matters most.
Explicit beats clever. Every time.