
jhabindra pandey

Retry logic, Kafka consumer lag, and the hidden failure pattern that Kubernetes won’t catch

Retries are one of those features that almost every distributed system eventually gets.

Downstream timeout?

Retry.

Temporary network issue?

Retry.

Intermittent dependency failure?

Retry.

The logic makes sense.

But here’s a question:

What happens when retries start generating more traffic than your users?

That sounds strange at first.

But in cloud-native payment systems, retries can become one of the fastest ways to amplify degradation.

Let’s walk through a realistic scenario.

The architecture

Consider a representative payment workflow:

API Gateway → Payment Service → Fraud Service → Ledger Service → Kafka → Notification Service

Typical stack:

  • Spring Boot microservices
  • Kafka event communication
  • Kubernetes
  • Redis
  • PostgreSQL / Oracle
  • Resilience4j
  • HikariCP

Looks straightforward.

The “safe” configuration change

Suppose intermittent downstream failures appear.

Someone increases retries:

resilience4j:
  retry:
    instances:
      fraudService:
        maxAttempts: 10
        waitDuration: 100ms

Originally:

maxAttempts: 3

No redesign.

No architecture changes.

Just more retries.

Seems harmless.

Now introduce latency

Fraud Service latency increases:

50ms → 4s

Not failure.

Latency.

Pods remain healthy.

Readiness probes pass:

readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080

CPU remains normal.

HPA sees:

averageUtilization: 70

No scaling event.

Everything looks healthy.

But hidden pressure begins building

Payment Service threads begin waiting:

CompletableFuture<ScoreResponse> score =
    fraudClient.getScore(request);

Threads remain occupied longer.

Consumers process records slower.

Kafka offsets stop advancing.

Retries kick in.

Traffic multiplies.
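To put rough numbers on that pressure, here is a minimal sketch using Little's Law (requests in flight = arrival rate × time each request is held). The 100 requests/second rate is a made-up figure for illustration, and the worst case assumes every attempt runs the full 4s before failing, which the scenario above does not guarantee.

```java
final class ThreadPressure {

    /** Worst case for one caller: every attempt takes the full latency, with a fixed wait between attempts. */
    static long worstCaseHoldMillis(int maxAttempts, long latencyMillis, long waitMillis) {
        return (long) maxAttempts * latencyMillis + (long) (maxAttempts - 1) * waitMillis;
    }

    /** Little's Law: average requests in flight = arrival rate x time each one is held. */
    static double requestsInFlight(double requestsPerSecond, long holdMillis) {
        return requestsPerSecond * holdMillis / 1000.0;
    }

    public static void main(String[] args) {
        double rate = 100.0; // hypothetical arrival rate, requests per second

        System.out.println(requestsInFlight(rate, 50));    // healthy Fraud Service: 5.0
        System.out.println(requestsInFlight(rate, 4_000)); // 4s latency, single attempt: 400.0

        // 10 attempts at 4s each plus nine 100ms waits
        long hold = worstCaseHoldMillis(10, 4_000, 100);
        System.out.println(hold);                          // 40900
        System.out.println(requestsInFlight(rate, hold));  // 4090.0
    }
}
```

The same arrival rate that needed about 5 concurrent slots now wants hundreds or thousands, and none of that shows up as CPU.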

What started as:

100 requests

can become:

100 requests
+ retries
+ retries of those retries
+ downstream calls

No new customers arrived.

The system generated extra load itself.
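A rough way to estimate the ceiling: multiply the attempt counts of every layer that retries. The sketch below assumes the worst case (every attempt at every layer fails) and counts the initial call as one attempt. The outer layer with 3 attempts is a hypothetical addition to show how nested retries compound; the article's setup only specifies the Payment Service retry.

```java
final class RetryAmplification {

    /**
     * Worst-case calls reaching the innermost dependency when every attempt fails.
     * attemptsPerLayer lists max attempts (initial call included), outermost caller first.
     */
    static long worstCaseCalls(long userRequests, int... attemptsPerLayer) {
        long calls = userRequests;
        for (int attempts : attemptsPerLayer) {
            calls = Math.multiplyExact(calls, (long) attempts);
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseCalls(100, 3));     // original setting: 300
        System.out.println(worstCaseCalls(100, 10));    // after the "safe" change: 1000
        System.out.println(worstCaseCalls(100, 3, 10)); // hypothetical retrying caller on top: 3000
    }
}
```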

The propagation chain

Fraud latency → Retry amplification → Thread saturation → Kafka consumer lag → HikariCP exhaustion → Authorization failures

This is why retries can become traffic generators.

Kafka consumer lag was probably the first warning

Many teams watch:

  • CPU
  • memory
  • pod count

But Kafka consumer lag often moves first.

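Example, from the consumer client's own metrics:

records-lag-max

Prometheus alert (the metric name below is the one kafka_exporter exposes; yours may differ depending on the exporter):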

- alert: HighConsumerLag
  expr: kafka_consumergroup_lag > 1000
  for: 2m


Consumer lag frequently appears before users experience failures.
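Whatever tool reports it, lag is just arithmetic: for each partition, the log end offset minus the group's committed offset. Here is a small sketch of that calculation on plain maps. The offsets are invented for illustration, and in a real system they would come from Kafka itself (for example through an admin client or an exporter), which this sketch does not do.

```java
import java.util.Map;

final class ConsumerLag {

    /** Lag for one partition: log end offset minus committed offset, never negative. */
    static long partitionLag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /**
     * Total lag across partitions. A partition with no committed offset is treated
     * as committed at 0, which is a simplifying assumption of this sketch.
     */
    static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            total += partitionLag(entry.getValue(), committed);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = Map.of(0, 5_000L, 1, 4_200L);       // hypothetical offsets
        Map<Integer, Long> committed = Map.of(0, 4_100L, 1, 4_150L);
        System.out.println(totalLag(end, committed)); // 950
    }
}
```

If producers keep writing at the same pace while consumers slow down, this number climbs steadily, which is what makes it a useful early signal.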

Add timeout boundaries

Retries without timeout boundaries become dangerous.

Resilience4j:

resilience4j:
  timelimiter:
    instances:
      fraudService:
        timeoutDuration: 500ms
  retry:
    instances:
      fraudService:
        maxAttempts: 3

Retries should stop.

Not multiply indefinitely.
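To make the boundary concrete, here is a plain-Java sketch of the same idea without Resilience4j: each attempt gets its own timeout, and the number of attempts is capped. The class and method names are hypothetical, there is no wait between attempts, and this is not how Resilience4j implements it internally. It only shows the shape of the guarantee: with 3 attempts at 500ms, a caller is held for roughly 1.5 seconds at most.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class BoundedRetry {

    /**
     * Invokes the call up to maxAttempts times, allowing each attempt at most timeoutMillis.
     * Rethrows the last failure once the attempts are used up.
     */
    static <T> T callWithTimeoutAndRetry(Supplier<CompletableFuture<T>> call,
                                         int maxAttempts,
                                         long timeoutMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        CompletionException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // orTimeout (Java 9+) fails the future with TimeoutException after the deadline
                return call.get()
                           .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                           .join();
            } catch (CompletionException e) {
                last = e; // remember the failure and try again, if attempts remain
            }
        }
        throw last;
    }
}
```

One caveat: a timeout here stops the caller from waiting, but it does not cancel work already running in the downstream service.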

Add bulkheads

Separate downstream resource pools:

resilience4j:
  thread-pool-bulkhead:
    instances:
      fraudService:
        coreThreadPoolSize: 5
        maxThreadPoolSize: 10

Now Fraud Service degradation cannot consume all resources.
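The isolation idea can be sketched in a few lines of plain Java. Note that this version uses a semaphore to cap concurrent calls rather than the dedicated thread pool configured above, so it is a simpler cousin of the same pattern, not an equivalent. The class name and the fallback parameter are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreBulkhead {

    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free; otherwise returns the fallback right away instead of queueing. */
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // bulkhead full: fail fast, keep the caller's thread free
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always hand the permit back, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With a cap of 10, the eleventh concurrent Fraud Service call is rejected immediately, so slow fraud checks can tie up at most 10 callers rather than the whole service.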

Add replay-safe idempotency

Retries + Kafka replay can create duplicate transactions.

Redis protection:

// Claim the transaction ID atomically; the key expires after 24 hours
String key = "txn:" + event.getTransactionId();
Boolean first = redisTemplate
    .opsForValue()
    .setIfAbsent(key, "1", Duration.ofHours(24));

// The key already existed: this event was seen before, so skip it
if (Boolean.FALSE.equals(first)) {
    return;
}

Without idempotency:

duplicate ledger updates become possible.

In payment systems that becomes expensive.
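The same claim-then-process logic can be tried out in memory. The sketch below swaps Redis for a concurrent set, so it has no TTL and only works inside one JVM. It also adds one thing the Redis snippet does not do: it releases the claim if the ledger update throws, so a later replay can try again. That release step is an assumption about what you would want, not something from the setup above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotentHandler {

    // In-memory stand-in for the Redis keys; no expiry in this sketch
    private final Set<String> claimed = ConcurrentHashMap.newKeySet();

    /** Returns true if the ledger update ran, false if this transaction ID was already claimed. */
    boolean handle(String transactionId, Runnable ledgerUpdate) {
        String key = "txn:" + transactionId;
        if (!claimed.add(key)) {
            return false; // duplicate delivery (retry or replay): skip
        }
        try {
            ledgerUpdate.run();
            return true;
        } catch (RuntimeException e) {
            claimed.remove(key); // release the claim so a replay can retry the update
            throw e;
        }
    }
}
```

Calling `handle("42", update)` twice runs `update` once. The second call returns false without touching the ledger.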

Final takeaway

Retries still matter.

They’re useful.

But retries are not just recovery mechanisms.

They’re traffic generators.

When systems degrade, retries create additional work.

Additional work creates pressure.

Pressure creates propagation.

And propagation creates transaction failures.

The tricky part?

Kubernetes may never notice.
