Otto Plane

Posted on Mar 3

Why Retry Logic Is Not Governance

#architecture #python #devops #microservices

Automation systems rarely fail by crashing. They fail by repeating.

Retries, backoff policies, and catch handlers are designed to recover from transient faults. They help when a request fails temporarily. But they do not answer a more fundamental question:

Should this operation run at all?

There is a structural difference between handling failure after execution and deciding before execution whether something is allowed to run.

That difference is governance.

The Structural Problem

In many automation systems, side effects occur immediately. A typical flow looks like this:

Trigger
   ↓
HTTP Request
   ↓
Database Write
   ↓
Retry / Error Handling

If something goes wrong—misconfigured thresholds, a recursive trigger, or a runaway loop—the system may:

repeatedly call external APIs
amplify writes
trigger model invocations
escalate cloud costs

Retry logic does not prevent this.

It reacts after the action has already happened.

Governance introduces a structural boundary before side effects occur.

The Pre-Execution Gate Pattern

Instead of executing first, introduce a deterministic admission step.

Trigger
   ↓
Build Context Payload
   ↓
POST /authorize
   ↓
Decision: ALLOW | DENY
   ↓
Route Execution or Stop

In this structure, every outbound side effect must pass through a policy decision.

If the request is denied, the operation never runs.

This pattern acts as a pre-execution gate.

Example: Threshold Enforcement

Consider a simple policy rule:

Execution is denied if:

request_count > 5

The workflow sends a context payload to an authorization endpoint.

Authorization Request

POST /authorize
Content-Type: application/json

{
  "policy_id": "threshold-policy",
  "input": {
    "context": {
      "request_count": 7
    }
  }
}

Deterministic Deny Response

{
  "decision": "DENY",
  "policy_id": "threshold-policy",
  "rule_id": "max-request-threshold",
  "reason": "request_count exceeds threshold (5)",
  "evaluated_input": {
    "context": {
      "request_count": 7
    }
  }
}

Because the decision is DENY, the workflow halts before any external call occurs.

No retry.
No backoff.
No side effects.

Why This Is Different From Retry Logic

Retry logic answers this question:

What should happen if the operation fails?

Pre-execution governance answers a different one:

Should the operation run in the first place?

Retries improve resilience.

Admission control defines boundaries.

If the rule states:

request_count = 7
threshold = 5

The outcome is always:

DENY

The decision is deterministic and rule-bound.

No number of retries will change it.

Where This Lives in the Architecture

In hexagonal architecture, this decision boundary sits at the inbound port.

The workflow, event trigger, or HTTP request acts as the driving adapter.

Before the request reaches application logic, a policy decision determines whether execution is permitted.

Trigger / Event
       │
       ▼
Policy Gate (Inbound Port)
       │
       ▼
Application Logic
       │
       ▼
External Side Effects

This keeps governance separate from business logic while ensuring every request passes through the same boundary.

Why This Matters in Practice

In distributed systems, unintended repetition is a common failure mode.

Examples include:

recursive workflow triggers
serverless functions calling themselves
runaway automation loops
misconfigured retry policies

In serverless environments this can become a cost amplifier.

A $0.01 API call repeated in a runaway loop can easily turn into a $1,000 mistake.

Retry logic does not prevent this scenario because the operation itself is still considered valid.

Pre-execution governance stops the action before it occurs.

When This Pattern Becomes Necessary

Pre-execution gates become especially important when systems:

trigger external APIs
execute AI or model calls
write to persistent stores
run in serverless environments
orchestrate automated workflows

In these contexts, post-execution detection is often too late.

The system must decide before execution.

Closing Thought

Retry logic improves resilience.

Governance defines boundaries.

If a system cannot decide whether an action is allowed before it runs, it is not governed — it is merely monitored.

Discussion

How are teams preventing runaway automation loops or unintended cost amplification in their workflows today?

DEV Community