SecEngineerX

Posted on Jan 31

Retry Logic Is a Policy Decision, Not a Code Pattern

#python #devops #observability #monitoring

I used to think retry logic was an implementation detail.

It isn’t.

Retries encode assumptions about failure, time, trust, and responsibility. When those assumptions are wrong, systems don’t crash. They lie quietly.

This post isn’t about elegance. It’s about being explicit.

The mistake people make

Most retry implementations answer the wrong question.

They ask: “How do I try again?”

The real question is: “Under what failures am I allowed to try again?”

Those are not the same thing.

Retries are not resilience by default

Blind retries are comforting. They make engineers feel proactive.

In reality, they often:

Mask real outages
Amplify load during partial failures
Destroy observability
Delay alerts until damage is done

A retry without a failure model is just noise with a sleep call.

What I learned building a monitoring primitive

While building an async endpoint checker, I was forced to confront a few uncomfortable truths.

Parameters are contracts

If a function exposes a parameter that is not used, it is lying.

Dead parameters rot APIs. They create false confidence and future bugs. Removing them is not cleanup. It’s honesty.

Catching Exception inside retries is negligence

Retrying on all exceptions means retrying on:

Programmer errors
Logic bugs
Invalid states

Those failures should terminate execution immediately.

Retries are for expected, transient failures only. Anything else must fail fast.

HTTP retries without backoff are hostile behavior

Retrying immediately on 500s or 429s is not resilience.

It’s pressure.

If your system retries aggressively during degradation, it becomes part of the outage. Good retry logic reduces harm. Bad retry logic accelerates it.

Time must have a single owner

If multiple layers measure “total time”, metrics become contradictory.

Time is a resource, not a side effect.

Only one layer should own it. Everything else should report partial truth or nothing at all.

Helpers should not know semantics

A retry helper that understands HTTP status codes is doing too much.

Helpers should be stupid and obedient. Policy belongs to the caller.

When helpers start making decisions, architecture leaks.

The most dangerous bug

On the final retry, it’s tempting to overwrite the result:

Force failure
Normalize fields
“Clean things up”

That destroys information.

The last attempt is still truth. Corrupting it poisons analytics, alerting, and postmortems. These bugs don’t show up in logs. They show up in lost trust.

Why this matters in monitoring systems

Some failures justify retries.
Others demand immediate alerts.
Some should be recorded but not acted on.

If a monitoring system cannot explain why something failed, it cannot be trusted when it claims something is broken.

Closing thought
Retry logic is not a loop. It’s a statement about how you believe the world behaves under stress.

If that statement is vague, your system will be vague when it matters most.

Explicit beats clever. Every time.

DEV Community

Retry Logic Is a Policy Decision, Not a Code Pattern

Top comments (0)