Retries are one of those features that almost every distributed system eventually gets.
Downstream timeout?
Retry.
Temporary network issue?
Retry.
Intermittent dependency failure?
Retry.
The logic makes sense.
But here’s a question:
What happens when retries start generating more traffic than your users?
That sounds strange at first.
But in cloud-native payment systems, retries can become one of the fastest ways to amplify degradation.
Let’s walk through a realistic scenario.
⸻
The architecture
Consider a representative payment workflow:
API Gateway
↓
Payment Service
↓
Fraud Service
↓
Ledger Service
↓
Kafka
↓
Notification Service
Typical stack:
- Spring Boot microservices
- Kafka event communication
- Kubernetes
- Redis
- PostgreSQL / Oracle
- Resilience4j
- HikariCP
Looks straightforward.
⸻
The “safe” configuration change
Suppose intermittent downstream failures appear.
Someone increases retries:
resilience4j:
  retry:
    instances:
      fraudService:
        maxRetryAttempts: 10
        waitDuration: 100ms
Originally:
maxRetryAttempts: 3
No redesign.
No architecture changes.
Just more retries.
Seems harmless.
⸻
Now introduce latency
Fraud Service latency increases:
50ms → 4s
Not failure.
Latency.
Pods remain healthy.
Readiness probes pass:
readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
CPU remains normal, because threads blocked on a slow call burn almost no CPU.
The HPA keeps comparing CPU against its target:
averageUtilization: 70
No scaling event.
Everything looks healthy.
⸻
But hidden pressure begins building
Payment Service threads begin waiting:
CompletableFuture<ScoreResponse> score =
fraudClient.getScore(request);
Threads remain occupied longer.
Consumers process records more slowly.
Committed Kafka offsets fall behind.
Retries kick in.
Traffic multiplies.
What started as:
100 requests
can become:
100 requests
+ retries
+ retries of retries
+ downstream calls
No new customers arrived.
The system generated extra load itself.
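To put rough numbers on that, here is a minimal sketch of the worst case, where every attempt at every layer fails and is retried. It treats the configured value of 10 as the total attempts per request, and the second layer (a gateway that also retries, with 3 attempts) is purely a hypothetical to show how stacked retries multiply rather than add.

```java
// Worst-case call amplification when retries are stacked across layers.
// Layer names and attempt counts are illustrative assumptions.
final class RetryAmplification {

    /** Worst case: every attempt at each layer fails, so attempts multiply. */
    static long worstCaseCalls(long userRequests, int... attemptsPerLayer) {
        long calls = userRequests;
        for (int attempts : attemptsPerLayer) {
            calls = Math.multiplyExact(calls, attempts);
        }
        return calls;
    }

    public static void main(String[] args) {
        // One retrying layer: up to 10 attempts per request against Fraud Service.
        System.out.println(worstCaseCalls(100, 10));    // 1000
        // Two stacked layers: a hypothetical gateway with 3 attempts in front.
        System.out.println(worstCaseCalls(100, 3, 10)); // 3000
    }
}
```

Same 100 users. Up to 30x the calls hitting a dependency that was already slow.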
⸻
The propagation chain
Fraud latency
↓
Retry amplification
↓
Thread saturation
↓
Kafka consumer lag
↓
HikariCP exhaustion
↓
Authorization failures
This is why retries can become traffic generators.
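The thread-saturation step in that chain can be estimated with Little's Law: average calls in flight equals arrival rate times average latency. The sketch below assumes a steady 100 requests per second, which is an illustrative figure and not something measured here, and uses the 50ms and 4s latencies from the scenario.

```java
// Little's Law estimate of how many calls are in flight at once.
// The 100 requests/second arrival rate is an assumed, illustrative figure.
final class InFlightEstimate {

    /** Average in-flight calls = arrival rate (per second) x latency (seconds). */
    static double inFlight(long requestsPerSecond, long latencyMillis) {
        return requestsPerSecond * latencyMillis / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(inFlight(100, 50));   // 5.0 at normal latency
        System.out.println(inFlight(100, 4000)); // 400.0 once latency hits 4s
    }
}
```

That is before any retries are counted. Every retry adds to the arrival rate, which pushes the in-flight number higher still.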
⸻
Kafka consumer lag was probably the first warning
Many teams watch:
- CPU
- memory
- pod count
But Kafka consumer lag often moves first.
Example:
records-lag-max
Prometheus alert:
- alert: HighConsumerLag
  expr: kafka_consumergroup_lag > 1000
  for: 2m
Consumer lag frequently appears before users experience failures.
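For readers who haven't looked at what the lag number actually measures: per partition, it is the log end offset minus the consumer group's committed offset. A minimal sketch of that arithmetic, using plain maps with made-up offsets in place of a real Kafka client:

```java
import java.util.Map;

// Consumer lag arithmetic on plain maps (partition -> offset).
// The offsets used in main are made-up values for illustration.
final class ConsumerLag {

    /** Total lag = sum over partitions of (log end offset - committed offset). */
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0L;
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            total += Math.max(0L, entry.getValue() - committed);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = Map.of(0, 1500L, 1, 900L);
        System.out.println(totalLag(end, Map.of(0, 700L, 1, 850L))); // 850
        System.out.println(totalLag(end, Map.of(0, 200L, 1, 850L))); // 1350
    }
}
```

When consumers slow down, the committed offsets stall while producers keep writing, so this number climbs even though nothing has "failed" yet.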
⸻
Add timeout boundaries
Retries without timeout boundaries become dangerous.
Resilience4j:
resilience4j:
  timelimiter:
    instances:
      fraudService:
        timeoutDuration: 500ms
  retry:
    instances:
      fraudService:
        maxRetryAttempts: 3
Retries should stop.
Not multiply indefinitely.
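To see what that combination buys you, here is a plain-Java sketch of the same idea: each attempt gets a hard timeout, and the number of attempts is capped. This is not how Resilience4j implements it; `BoundedRetry` and `callWithRetry` are hypothetical names, and the sketch leaves out the wait between attempts.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper: a capped number of attempts, each with its own timeout.
final class BoundedRetry {

    /** Tries the call up to maxAttempts times; each attempt may take at most timeoutMillis. */
    static <T> T callWithRetry(Supplier<CompletableFuture<T>> call,
                               int maxAttempts,
                               long timeoutMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            CompletableFuture<T> future = call.get();
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException e) {
                future.cancel(true); // stop waiting on the slow or failed attempt
                last = e;
            }
        }
        throw last; // attempt budget spent: fail fast instead of retrying again
    }
}
```

With 3 attempts at 500ms each, the caller's thread is held for at most about 1.5 seconds plus any wait between attempts, instead of 10 attempts against a 4s dependency.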
⸻
Add bulkheads
Separate downstream resource pools:
resilience4j:
  thread-pool-bulkhead:
    instances:
      fraudService:
        coreThreadPoolSize: 5
        maxThreadPoolSize: 10
Now Fraud Service degradation cannot consume all resources.
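Under the hood, a thread-pool bulkhead is a dedicated, bounded pool that rejects work once it is full. A minimal sketch with `java.util.concurrent`, assuming the same 5/10 sizing; `FraudBulkhead`, the queue capacity, and the 30-second keep-alive are illustrative choices, not values from the config above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical dedicated pool for Fraud Service calls only.
final class FraudBulkhead {

    /** 5 core threads, 10 max, bounded queue; extra work is rejected, not queued forever. */
    static ThreadPoolExecutor newFraudPool(int queueCapacity) {
        return new ThreadPoolExecutor(
                5, 10,
                30, TimeUnit.SECONDS,                     // idle non-core threads are released
                new ArrayBlockingQueue<>(queueCapacity),  // bounded, so backlog cannot grow unchecked
                new ThreadPoolExecutor.AbortPolicy());    // throws RejectedExecutionException when full
    }
}
```

With a queue capacity of 2, this pool holds at most 12 Fraud Service calls at once. The 13th is rejected immediately, and threads serving other dependencies are never touched.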
⸻
Add replay-safe idempotency
Retries + Kafka replay can create duplicate transactions.
Redis protection:
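// Set the key only if it does not exist yet; the first writer wins.
String key = "txn:" + event.getTransactionId();
Boolean first = redisTemplate
        .opsForValue()
        .setIfAbsent(key, "1", Duration.ofHours(24));

// Already seen within the last 24 hours: skip the duplicate.
if (Boolean.FALSE.equals(first)) {
    return;
}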
Without idempotency:
duplicate ledger updates become possible.
In payment systems that becomes expensive.
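The same check can be tried without a Redis instance. In this sketch a `ConcurrentHashMap` stands in for Redis, since `putIfAbsent` has the same "first writer wins" behavior as set-if-absent. `IdempotentLedger` is a hypothetical class, and unlike the Redis version it has no 24-hour expiry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory ledger that applies each transaction ID at most once.
final class IdempotentLedger {

    // Stand-in for Redis: putIfAbsent returns null only for the first writer.
    private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();
    private final AtomicLong balanceCents = new AtomicLong();

    /** Applies the amount once per transaction ID; returns false for a replayed event. */
    boolean apply(String transactionId, long amountCents) {
        boolean first = seen.putIfAbsent("txn:" + transactionId, Boolean.TRUE) == null;
        if (!first) {
            return false; // retry or Kafka replay: skip the ledger update
        }
        balanceCents.addAndGet(amountCents);
        return true;
    }

    long balanceCents() {
        return balanceCents.get();
    }
}
```

Deliver the same event twice and the second call returns false, leaving the balance untouched.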
⸻
Final takeaway
Retries still matter.
They’re useful.
But retries are not just recovery mechanisms.
They’re traffic generators.
When systems degrade, retries create additional work.
Additional work creates pressure.
Pressure creates propagation.
And propagation creates transaction failures.
The tricky part?
Kubernetes may never notice.