jhabindra pandey

Posted on May 19

Healthy Pods, Broken Transactions: What Kubernetes Doesn’t Catch in Banking Systems

#microservices #fintech #softwareengineering #banking

A production scenario walkthrough with Spring Boot, Resilience4j, and Kafka

Your pods are running. Your readiness probes are green. Your HPA hasn’t moved. And your payment authorizations are silently failing.

This is not a hypothetical. It’s the failure class that cloud-native financial systems encounter repeatedly — and the one that standard Kubernetes observability is structurally unable to catch.

Let’s walk through exactly how this happens and what the engineering response looks like in code.

The Scenario

Standard card authorization pipeline:

API Gateway → Payment Service → Kafka → Fraud Service → Ledger Service → Notification Service

The Fraud Service slows from 50ms to 4 seconds. Nothing crashes. Everything looks healthy. But over the next 40 minutes:

• Kafka consumer lag on the fraud topic grows to 50,000+ messages
• Payment Service thread pool exhausts
• HikariCP connections on the Ledger Service saturate
• Card authorizations fail system-wide
• ACH settlement files are incomplete

Here’s why, and here’s the code that prevents it.

Problem 1: Your Readiness Probe Doesn’t Know Your Threads Are Exhausted

A typical Spring Boot readiness probe points at /actuator/health or /actuator/health/readiness. By default those endpoints report things like database connectivity, disk space, and application lifecycle state, not whether your service can actually process work.

Add a custom health indicator that reflects real processing capacity:

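// Package names below are the Spring Boot 2.x/3.x ones.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import org.springframework.stereotype.Component;

@Component
public class ThreadPoolHealthIndicator implements HealthIndicator {

    private final ThreadPoolTaskExecutor executor;

    public ThreadPoolHealthIndicator(ThreadPoolTaskExecutor executor) {
        this.executor = executor;
    }

    @Override
    public Health health() {
        int active = executor.getActiveCount();
        // Only meaningful when maxPoolSize is set explicitly; the default is Integer.MAX_VALUE.
        int max = executor.getMaxPoolSize();
        double utilization = (double) active / max;

        // Report DOWN once more than 90% of the pool is busy.
        if (utilization > 0.90) {
            return Health.down()
                .withDetail("activeThreads", active)
                .withDetail("maxThreads", max)
                .withDetail("utilization", utilization)
                .withDetail("reason", "Thread pool near exhaustion")
                .build();
        }
        return Health.up()
            .withDetail("activeThreads", active)
            .withDetail("utilization", utilization)
            .build();
    }
}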

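Once this indicator is part of the health endpoint your probe calls, the readiness probe pulls the pod from rotation when the service can’t process requests, giving Kubernetes a real signal to act on. If the probe targets /actuator/health/readiness, add the indicator to that group with management.endpoint.health.group.readiness.include; custom indicators are not part of it by default.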
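
The utilization check is easy to get subtly wrong, so here is a minimal plain-JDK sketch of the same arithmetic. The class name PoolUtilization and its methods are hypothetical and not part of Spring Boot. It reuses the 0.90 threshold from the indicator and guards against a zero maximum. One caveat worth knowing: a ThreadPoolTaskExecutor whose max pool size is left at its default of Integer.MAX_VALUE will always report a utilization near zero, so the check only carries signal when the pool has an explicit bound.

```java
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical helper for illustration; not part of Spring Boot or the article's code.
final class PoolUtilization {
    // Same cut-off the health indicator uses.
    static final double SATURATION_THRESHOLD = 0.90;

    private PoolUtilization() {}

    // Fraction of the pool's maximum size that is currently busy.
    // Returns 0.0 for a non-positive max so the caller never divides by zero.
    static double utilization(int activeThreads, int maxThreads) {
        if (maxThreads <= 0) {
            return 0.0;
        }
        return (double) activeThreads / maxThreads;
    }

    // True when utilization is strictly above the threshold.
    static boolean isSaturated(int activeThreads, int maxThreads) {
        return utilization(activeThreads, maxThreads) > SATURATION_THRESHOLD;
    }

    // Convenience overload for a plain JDK executor.
    static boolean isSaturated(ThreadPoolExecutor executor) {
        return isSaturated(executor.getActiveCount(), executor.getMaximumPoolSize());
    }
}
```

With a pool bounded at 20 threads, 19 active threads count as saturated and 18 do not. With an unbounded maximum, 8 active threads register as essentially idle, which is why the bound matters.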

Problem 2: Your Resilience4j Timeout Is Set for Tolerance, Not Behavior

A 6-second timeout on a service that normally responds in 50ms is not a safety net. It means degradation runs for 6 full seconds before any protective mechanism fires.
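
To put rough numbers on that, Little’s law says the average number of requests in flight equals the arrival rate multiplied by how long each request is held. The sketch below assumes a hypothetical load of 100 requests per second, which is not a figure from the incident; the class and method names are illustrative too.

```java
// Illustrative arithmetic only; the request rate used with it is an assumption.
final class InFlightEstimate {
    private InFlightEstimate() {}

    // Little's law: average requests in flight = arrival rate x time each request is held.
    static double inFlight(double requestsPerSecond, long holdMillis) {
        return requestsPerSecond * holdMillis / 1000.0;
    }

    // A caller never waits longer than its timeout, so hold time is capped by it.
    static long holdMillis(long dependencyLatencyMillis, long timeoutMillis) {
        return Math.min(dependencyLatencyMillis, timeoutMillis);
    }
}
```

At 100 requests per second, a 50ms dependency keeps about 5 calls in flight. When it slows to 4 seconds under a 6-second timeout, that becomes 400. With a 500ms timeout the same slowdown is capped at 50.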

Configure timeouts relative to your operating envelope:

resilience4j:
  timelimiter:
    instances:
      fraudService:
        timeoutDuration: 500ms
  circuitbreaker:
    instances:
      fraudService:
        slidingWindowSize: 20
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 500ms
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 5

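The critical detail: slowCallDurationThreshold should be no higher than your timeoutDuration. If the time limiter cuts every call off at 500ms but the circuit breaker only counts calls over 6 seconds as slow, the breaker never records a single slow call. Degradation then reaches it only as timeout failures, and any retry layer in front keeps resending work to a dependency that is already struggling. That is retry amplification.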
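
The effect of the slow-call threshold is easier to see with a toy model. The following is a simplified count-based window written for illustration; it is not Resilience4j’s implementation, and the class name SlowCallWindow is hypothetical. It treats a call as slow when its duration exceeds the threshold and signals that the breaker should open once the window is full and the slow rate reaches the configured percentage.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative count-based window; not Resilience4j's implementation.
final class SlowCallWindow {
    private final int windowSize;
    private final long slowThresholdMillis;
    private final double slowRateThresholdPercent;
    private final Deque<Boolean> slowFlags = new ArrayDeque<>();
    private int slowCount;

    SlowCallWindow(int windowSize, long slowThresholdMillis, double slowRateThresholdPercent) {
        this.windowSize = windowSize;
        this.slowThresholdMillis = slowThresholdMillis;
        this.slowRateThresholdPercent = slowRateThresholdPercent;
    }

    // Record one call duration, evicting the oldest entry once the window is full.
    void record(long durationMillis) {
        boolean slow = durationMillis > slowThresholdMillis;
        if (slowFlags.size() == windowSize && slowFlags.removeFirst()) {
            slowCount--;
        }
        slowFlags.addLast(slow);
        if (slow) {
            slowCount++;
        }
    }

    // Percentage of calls in the current window that were slow.
    double slowRatePercent() {
        return slowFlags.isEmpty() ? 0.0 : 100.0 * slowCount / slowFlags.size();
    }

    // Trip only once the window is full and the slow rate reaches the threshold.
    boolean shouldOpen() {
        return slowFlags.size() == windowSize && slowRatePercent() >= slowRateThresholdPercent;
    }
}
```

Feed a window of 20 with four 50ms calls and sixteen 4-second calls. With a 500ms threshold the slow rate is 80% and the window says open. With a 6-second threshold the same traffic shows a 0% slow rate and nothing trips.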

Problem 3: No Bulkhead Means One Slow Dependency Owns Your Entire Thread Pool

Without bulkhead isolation, the Fraud Service slowdown consumes threads from the same pool handling Ledger Service calls and API responses. Everything degrades together.

resilience4j:
  thread-pool-bulkhead:
    instances:
      fraudService:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
      ledgerService:
        maxThreadPoolSize: 15
        coreThreadPoolSize: 8
        queueCapacity: 30

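import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;
import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;

@Service
public class FraudScoringClient {

    private final WebClient webClient;

    // The final field needs a constructor; Spring injects the WebClient bean here.
    public FraudScoringClient(WebClient webClient) {
        this.webClient = webClient;
    }

    @Bulkhead(name = "fraudService", type = Bulkhead.Type.THREADPOOL)
    @CircuitBreaker(name = "fraudService", fallbackMethod = "fraudScoringFallback")
    @TimeLimiter(name = "fraudService")
    public CompletableFuture<FraudScore> scoreFraudRisk(PaymentRequest request) {
        return webClient.post()
            .uri("/fraud/score")
            .bodyValue(request)
            .retrieve()
            .bodyToMono(FraudScore.class)
            .toFuture();
    }

    // Invoked when the call fails, times out, or is rejected: route to manual review.
    public CompletableFuture<FraudScore> fraudScoringFallback(
            PaymentRequest request, Exception ex) {
        return CompletableFuture.completedFuture(
            FraudScore.pendingManualReview(request.getTransactionId())
        );
    }
}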

Fraud Service degradation now saturates only its allocated bulkhead. Ledger Service calls keep their own pool.
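
The isolation property itself can be shown without any framework. The sketch below uses one semaphore per dependency, which is a simpler mechanism than the thread-pool bulkhead configured above, and the class name SimpleBulkheads is hypothetical. It rejects a call immediately when the dependency’s compartment is full rather than queueing it, and it assumes every dependency name was registered first.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Semaphore-style bulkhead sketch; the config above uses Resilience4j's thread-pool variant.
final class SimpleBulkheads {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    // Give a dependency its own fixed number of concurrent slots.
    void register(String dependency, int maxConcurrentCalls) {
        permits.put(dependency, new Semaphore(maxConcurrentCalls));
    }

    // Claim a slot without waiting; false means this dependency's compartment is full.
    boolean tryEnter(String dependency) {
        return permits.get(dependency).tryAcquire();
    }

    // Return a slot claimed earlier with tryEnter.
    void leave(String dependency) {
        permits.get(dependency).release();
    }

    // Run the call if a slot is free, otherwise reject immediately instead of queueing.
    <T> Optional<T> call(String dependency, Supplier<T> work) {
        if (!tryEnter(dependency)) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(work.get());
        } finally {
            leave(dependency);
        }
    }
}
```

If every fraudService slot is held by slow calls, further fraud calls come back empty at once, while ledgerService calls still find free slots in their own compartment.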

Problem 4: No Idempotency Means Recovery Creates Duplicate Ledger Entries

When the Fraud Service recovers and the Payment Service replays its Kafka backlog, the Ledger Service receives duplicate authorization events. Without idempotency, it posts duplicate records — creating phantom transactions in the settlement batch.

@KafkaListener(topics = "payment.authorized", groupId = "ledger-service")
public void processAuthorization(AuthorizationEvent event) {
    String idempotencyKey = "ledger:" + event.getTransactionId();

    // Set-if-absent with a 24-hour expiry: only the first delivery claims the key.
    Boolean isNew = redisTemplate.opsForValue()
        .setIfAbsent(idempotencyKey, "processed", Duration.ofHours(24));

    if (Boolean.FALSE.equals(isNew)) {
        log.info("Duplicate authorization event skipped: {}",
            event.getTransactionId());
        return;
    }

    try {
        ledgerRepository.postTransaction(event);
    } catch (Exception e) {
        // Release the claim so a redelivery of this event can still be posted.
        redisTemplate.delete(idempotencyKey);
        throw e;
    }
}

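The delete-on-failure matters: without it, a failed ledger post leaves the key in Redis, and every redelivery of that event is skipped as a duplicate. Block duplicates, not retries. They are different things.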
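
The difference between a duplicate and a retry can be exercised in isolation. This sketch swaps Redis for an in-memory set, so it is a stand-in for the pattern and loses its state on restart; the class name IdempotentProcessor is hypothetical. The key is claimed before the handler runs and released again only if the handler throws.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// In-memory stand-in for the Redis set-if-absent pattern; state is lost on restart.
final class IdempotentProcessor {
    private final Set<String> claimedKeys = ConcurrentHashMap.newKeySet();

    // Returns true if the handler ran, false if the key was already claimed.
    boolean processOnce(String transactionId, Consumer<String> handler) {
        String key = "ledger:" + transactionId;
        if (!claimedKeys.add(key)) {
            return false; // duplicate delivery: skip
        }
        try {
            handler.accept(transactionId);
            return true;
        } catch (RuntimeException e) {
            claimedKeys.remove(key); // release the claim so a retry can succeed
            throw e;
        }
    }
}
```

A second delivery of the same transaction ID returns false without calling the handler. A delivery whose handler throws leaves no claim behind, so the next attempt for that ID runs normally.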

Problem 5: Consumer Lag Isn’t in Your SLO

Lag grew for 15 minutes before the first authorization failed. Had lag been a primary alert, the incident window would have shrunk dramatically.

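// Micrometer gauges hold only a weak reference to the value they sample, so keep
// one AtomicLong per partition and update it in place rather than registering a
// freshly boxed long on every run.
private final Map<TopicPartition, AtomicLong> lagByPartition = new ConcurrentHashMap<>();

@Scheduled(fixedDelay = 30000)
public void recordConsumerLag() throws ExecutionException, InterruptedException {
    adminClient.listConsumerGroupOffsets("payment-service-group")
        .partitionsToOffsetAndMetadata().get()   // blocks until the broker responds
        .forEach((partition, offsetMeta) -> {
            long lag = calculateLag(partition, offsetMeta.offset());
            lagByPartition.computeIfAbsent(partition, tp ->
                meterRegistry.gauge(
                    "kafka.consumer.lag",
                    Tags.of(
                        "topic", tp.topic(),
                        "partition", String.valueOf(tp.partition())
                    ),
                    new AtomicLong()
                )
            ).set(lag);
        });
}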

Alert on lag before CPU, before error rate, before your dashboards turn red.
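
The snippet above calls calculateLag without showing it. One plausible shape for that arithmetic, as a hedged sketch: lag per partition is the log end offset minus the group’s committed offset. The class name ConsumerLag and the partition labels used with it are hypothetical. Partitions are keyed by plain strings here to keep the example free of Kafka dependencies, and a partition with no committed offset is treated as starting from 0, which is a simplification. In a real service the end offsets would have to be fetched from the broker.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: the scheduled method calls calculateLag(...) without showing it.
final class ConsumerLag {
    private ConsumerLag() {}

    // Lag for one partition: messages written but not yet committed by the group.
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    // Per-partition lag; a partition with no committed offset counts from offset 0.
    static Map<String, Long> lagByPartition(Map<String, Long> endOffsets,
                                            Map<String, Long> committedOffsets) {
        Map<String, Long> result = new HashMap<>();
        endOffsets.forEach((partition, end) ->
            result.put(partition, lag(end, committedOffsets.getOrDefault(partition, 0L))));
        return result;
    }

    // Sum across partitions, useful for a single group-level alert.
    static long totalLag(Map<String, Long> lagByPartition) {
        return lagByPartition.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

Two partitions with end offsets 30,000 and 25,000 and committed offsets 2,000 and 3,000 give lags of 28,000 and 22,000, a total of 50,000.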

The Pattern

Every problem here has the same shape:

1.  A localized degradation Kubernetes cannot see
2.  A missing isolation boundary that lets it spread
3.  A recovery path that produces incorrect state without idempotency

Green pods do not mean reliable transactions. In financial systems, that distinction is the difference between a recoverable incident and a two-day reconciliation problem.
