How We Improved Payment System Throughput by 25% Using Apache Kafka at a Fortune 500 FinTech

By Disha Sune — Java Backend Engineer | Spring Boot | Kafka | AWS | Fiserv


The Problem

At Fiserv, our payment processing platform handled millions of financial transactions daily for 600+ enterprise clients including McDonald's, Google, and Domino's.

As transaction volumes grew, our legacy synchronous REST API architecture started showing cracks:

  • API response times were spiking during peak load periods
  • Downstream services were tightly coupled — if one failed, everything failed
  • No fault tolerance — a single service timeout caused transaction failures
  • Scaling was painful — we had to scale everything together even when only one component was under load

We needed a fundamental architectural shift. The answer was Apache Kafka.


Why Kafka?

We evaluated several messaging solutions — RabbitMQ, ActiveMQ, AWS SQS — but chose Kafka for three reasons:

1. High throughput at scale
Kafka can handle millions of messages per second with minimal latency. For a payment system processing millions of daily transactions, this was non-negotiable.

2. Message durability and replay
Unlike traditional message queues, Kafka retains messages on disk. If a consumer fails, it can replay from the last committed offset — critical for financial data where you cannot lose a single transaction.

3. Decoupled architecture
Kafka naturally decouples producers from consumers. Our authorization service could publish events without knowing anything about the settlement or reporting services consuming them.


The Architecture We Built

Before Kafka, our architecture looked like this:

```
Client Request
     ↓
Authorization Service ──→ Settlement Service (synchronous REST)
                      ──→ Reporting Service (synchronous REST)
                      ──→ Chargeback Service (synchronous REST)
```

Every downstream call was synchronous. One slow service meant the entire request chain slowed down.

After Kafka, it looked like this:

```
Client Request
     ↓
Authorization Service ──→ Kafka Topic: payment-events
                                ↓              ↓              ↓
                         Settlement      Reporting      Chargeback
                          Consumer        Consumer       Consumer
```

The authorization service publishes one event. Multiple consumers process it independently and asynchronously.
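This fan-out behavior can be sketched in plain Java. Note this is an in-memory stand-in for a Kafka topic, not real client code: the key idea is that each consumer group keeps its own read position into the same append-only log, so every group sees every event.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified in-memory model of a Kafka topic: one shared log,
// plus one independent read position per consumer group.
class InMemoryTopic {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> groupOffsets = new HashMap<>();

    void publish(String event) {
        log.add(event); // producers append; nothing is removed on read
    }

    // Each group advances only its own offset, so reading by one
    // group never hides messages from another group.
    List<String> poll(String groupId) {
        int from = groupOffsets.getOrDefault(groupId, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        groupOffsets.put(groupId, log.size());
        return batch;
    }
}

public class FanOutDemo {
    public static void main(String[] args) {
        InMemoryTopic paymentEvents = new InMemoryTopic();
        paymentEvents.publish("txn-1:AUTHORIZED");

        // Settlement and reporting each receive the same event, independently.
        System.out.println(paymentEvents.poll("settlement-service"));
        System.out.println(paymentEvents.poll("reporting-service"));
    }
}
```

This is also why adding a new downstream service is cheap: it just subscribes with a new group ID, with no change to the producer.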


Implementation — Key Code Patterns

Producer Configuration (Spring Boot)

```java
@Configuration
public class KafkaProducerConfig {

    @Bean
    public ProducerFactory<String, PaymentEvent> producerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);

        // Critical for financial systems — ensure no message loss
        config.put(ProducerConfig.ACKS_CONFIG, "all");
        config.put(ProducerConfig.RETRIES_CONFIG, 3);
        config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        return new DefaultKafkaProducerFactory<>(config);
    }

    @Bean
    public KafkaTemplate<String, PaymentEvent> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }
}
```

Key settings explained:

  • ACKS_CONFIG = "all" — producer waits for all replicas to acknowledge before considering a message sent. No message loss.
  • ENABLE_IDEMPOTENCE_CONFIG = true — prevents duplicate messages even if the producer retries.
  • RETRIES_CONFIG = 3 — automatic retry on transient failures.

Publishing a Payment Event

```java
@Service
public class PaymentEventPublisher {

    private static final Logger log = LoggerFactory.getLogger(PaymentEventPublisher.class);
    private static final String TOPIC = "payment-events";

    private final KafkaTemplate<String, PaymentEvent> kafkaTemplate;

    public PaymentEventPublisher(KafkaTemplate<String, PaymentEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publishPaymentAuthorized(PaymentEvent event) {
        // Use transactionId as key — ensures ordering per transaction
        CompletableFuture<SendResult<String, PaymentEvent>> future =
            kafkaTemplate.send(TOPIC, event.getTransactionId(), event);

        future.whenComplete((result, ex) -> {
            if (ex != null) {
                log.error("Failed to publish payment event: {}", event.getTransactionId(), ex);
                // Handle failure — fallback or alert
            } else {
                log.info("Published payment event to partition {} offset {}",
                    result.getRecordMetadata().partition(),
                    result.getRecordMetadata().offset());
            }
        });
    }
}
```

Why use transactionId as the message key?

Kafka guarantees ordering within a partition. By using transactionId as the key, all events for the same transaction always go to the same partition — ensuring they are processed in order.
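The mechanics can be sketched in a few lines. The real default partitioner hashes the serialized key with murmur2; this sketch uses `String.hashCode` purely for illustration, but the property is the same: a deterministic hash mod the partition count always maps a given key to the same partition.

```java
public class PartitionSketch {
    // Simplified stand-in for Kafka's default partitioner:
    // hash the key, take it mod the partition count.
    // Same key -> same partition -> per-transaction ordering.
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("txn-12345", 12);
        int p2 = partitionFor("txn-12345", 12);
        // Every event for txn-12345 lands on the same partition.
        System.out.println(p1 == p2); // true
    }
}
```

The flip side: there is no ordering guarantee *across* transactions, which is fine here because each transaction's lifecycle is independent.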


Consumer Configuration

```java
@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, PaymentEvent> consumerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ConsumerConfig.GROUP_ID_CONFIG, "settlement-service");

        // Manual offset commit — we control when a message is "done"
        config.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        // Read from beginning if no committed offset exists
        config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        // Pass deserializer instances so Spring knows the target type
        return new DefaultKafkaConsumerFactory<>(config,
            new StringDeserializer(), new JsonDeserializer<>(PaymentEvent.class));
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, PaymentEvent> kafkaListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, PaymentEvent> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        // MANUAL ack mode is what makes the Acknowledgment parameter
        // available in our @KafkaListener methods
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
        return factory;
    }
}
```

Why disable auto commit?

Auto commit can mark a message as processed before your business logic finishes. In a payment system, this is dangerous — you could lose a transaction. Manual commit ensures you only mark a message as processed after it's fully handled.
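The at-least-once guarantee this gives can be illustrated with a small simulation (again in-memory, not real client code): if processing fails before the offset is committed, the next poll re-delivers the same message instead of silently skipping it.

```java
import java.util.List;

// Minimal illustration of manual offset commits: the consumer only
// advances its committed offset after processing succeeds, so a crash
// mid-processing means the same message comes back (at-least-once).
class ManualCommitConsumer {
    private final List<String> topic;
    private int committedOffset = 0;

    ManualCommitConsumer(List<String> topic) {
        this.topic = topic;
    }

    String poll() {
        // Always resume from the last *committed* offset.
        return committedOffset < topic.size() ? topic.get(committedOffset) : null;
    }

    void commit() {
        committedOffset++;
    }
}

public class ManualCommitDemo {
    public static void main(String[] args) {
        ManualCommitConsumer c = new ManualCommitConsumer(List.of("txn-1", "txn-2"));

        String first = c.poll();  // "txn-1"
        // Suppose processing throws here, so commit() never runs...
        String retry = c.poll();  // ...then the same "txn-1" is delivered again.
        System.out.println(first.equals(retry)); // true

        c.commit();                   // processing succeeded this time
        System.out.println(c.poll()); // "txn-2"
    }
}
```

The cost of at-least-once is possible duplicates, which is exactly why the idempotent-consumer check described later is mandatory.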


Consuming Payment Events

```java
@Service
public class SettlementConsumer {

    private static final Logger log = LoggerFactory.getLogger(SettlementConsumer.class);

    private final SettlementService settlementService;
    private final RetryPublisher retryPublisher;
    private final DeadLetterPublisher deadLetterPublisher;

    public SettlementConsumer(SettlementService settlementService,
                              RetryPublisher retryPublisher,
                              DeadLetterPublisher deadLetterPublisher) {
        this.settlementService = settlementService;
        this.retryPublisher = retryPublisher;
        this.deadLetterPublisher = deadLetterPublisher;
    }

    @KafkaListener(
        topics = "payment-events",
        groupId = "settlement-service",
        containerFactory = "kafkaListenerContainerFactory"
    )
    public void handlePaymentEvent(
            PaymentEvent event,
            Acknowledgment acknowledgment) {
        try {
            log.info("Processing settlement for transaction: {}",
                event.getTransactionId());

            // Process the settlement
            settlementService.processSettlement(event);

            // Only commit offset after successful processing
            acknowledgment.acknowledge();

        } catch (RecoverableException ex) {
            // Retry logic — put back on retry topic
            log.warn("Retryable error, will retry: {}", event.getTransactionId());
            retryPublisher.publishToRetryTopic(event);
            acknowledgment.acknowledge(); // Acknowledge so we don't reprocess

        } catch (Exception ex) {
            // Non-recoverable — send to dead letter topic for manual review
            log.error("Failed to process settlement: {}", event.getTransactionId(), ex);
            deadLetterPublisher.publishToDeadLetterTopic(event);
            acknowledgment.acknowledge();
        }
    }
}
```

The Dead Letter Queue Pattern

One of the most important patterns we implemented was the Dead Letter Queue (DLQ).

In a financial system, you cannot just discard a failed message. Every failed transaction needs to be investigated and retried.

```
payment-events (main topic)
        ↓
   [Consumer fails]
        ↓
payment-events-retry (retry topic — retried after 5 minutes)
        ↓
   [Still failing after 3 retries]
        ↓
payment-events-dlq (dead letter queue — human review)
```

This pattern gave us:

  • Zero message loss — every transaction is accounted for
  • Automatic recovery — transient failures auto-retry
  • Visibility — operations team monitors the DLQ for critical failures
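The routing decision at the heart of this flow is small enough to sketch. This is a simplified model: in practice the attempt count usually travels in a Kafka message header rather than as a parameter, and the topic names below match the diagram above.

```java
// Simplified routing logic for the retry/DLQ pattern: transient
// failures go back to the retry topic until the retry budget is
// exhausted, then the message is parked on the DLQ for human review.
public class DlqRouter {
    static final int MAX_RETRIES = 3;

    // Decide where a failed message goes next.
    static String routeOnFailure(int attemptsSoFar) {
        return attemptsSoFar < MAX_RETRIES
            ? "payment-events-retry"  // transient failure: try again later
            : "payment-events-dlq";   // retries exhausted: manual review
    }

    public static void main(String[] args) {
        System.out.println(routeOnFailure(1)); // payment-events-retry
        System.out.println(routeOnFailure(3)); // payment-events-dlq
    }
}
```

Keeping this decision in one place also makes the retry budget and topic names trivial to change without touching consumer business logic.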

Results We Achieved

After migrating to Kafka-based event-driven architecture:

| Metric | Before Kafka | After Kafka | Improvement |
| --- | --- | --- | --- |
| System throughput | Baseline | +25% | 25% improvement |
| Service coupling | Tight synchronous | Fully decoupled | Independent scaling |
| Failure impact | Cascading failures | Isolated per service | Fault tolerant |
| Peak load handling | Degraded performance | Consistent throughput | Stable under load |
| Transaction loss | Occasional on failures | Zero | 100% durability |

Key Lessons Learned

1. Always use idempotent consumers

Your consumer must be able to process the same message twice without side effects. Network issues can cause duplicate deliveries. We added a processed_transactions table checked before processing:

```java
if (transactionRepository.isAlreadyProcessed(event.getTransactionId())) {
    log.info("Duplicate message, skipping: {}", event.getTransactionId());
    acknowledgment.acknowledge();
    return;
}
```

2. Monitor consumer lag religiously

Consumer lag = how far behind your consumer is from the latest message. In a payment system, high lag means transactions are being delayed. We set CloudWatch alerts for lag > 1000 messages.
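Per partition, lag is just the gap between the log-end offset and the group's committed offset, the same number tools like `kafka-consumer-groups.sh` report. A minimal calculation (the offsets here are made-up example values):

```java
public class ConsumerLag {
    // Lag for one partition: how many messages exist that the
    // consumer group has not yet committed.
    static long lag(long logEndOffset, long committedOffset) {
        return logEndOffset - committedOffset;
    }

    public static void main(String[] args) {
        // e.g. partition log-end offset 10_500, group committed 9_200
        long lag = lag(10_500, 9_200);
        System.out.println(lag);        // 1300
        System.out.println(lag > 1000); // true: this would trip our alert
    }
}
```

Total topic lag is the sum of this value over all partitions; alerting on the per-partition maximum also catches a single stuck partition that a sum would hide.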

3. Partition count matters

More partitions = more parallelism = higher throughput. But you cannot reduce partition count later, and adding partitions changes the key-to-partition mapping, which breaks per-key ordering for messages produced after the change. We started with 12 partitions per topic based on our expected throughput — plan ahead.

4. Schema evolution with Avro or JSON Schema

As your event schema evolves, old consumers must still work with new messages. Use schema registry and design schemas for backward compatibility from day one.
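As a hypothetical example (field names are illustrative, not our actual schema), a backward-compatible Avro change adds a new field with a default, so consumers on the old schema keep working and a schema registry compatibility check accepts the new version:

```json
{
  "type": "record",
  "name": "PaymentEvent",
  "fields": [
    { "name": "transactionId", "type": "string" },
    { "name": "amount", "type": "double" },
    { "name": "status", "type": "string" },
    { "name": "settlementBatchId", "type": ["null", "string"], "default": null }
  ]
}
```

Here `settlementBatchId` is the newly added field; because it is nullable with a default, readers using the previous schema version can still decode every record.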


Conclusion

Migrating from synchronous REST calls to an event-driven Kafka architecture was one of the highest-impact technical decisions on our platform. It gave us throughput improvements, fault tolerance, and the ability to scale individual services independently.

If you're building payment systems or any high-throughput distributed system and still relying entirely on synchronous REST communication — Kafka is worth seriously evaluating.

Happy to answer any questions in the comments. Connect with me on LinkedIn if you want to discuss distributed systems, Java, or FinTech architecture.


Disha Sune — Java Backend Engineer | Spring Boot | Apache Kafka | AWS | Microservices
linkedin.com/in/disha-sune-168b661b8
github.com/dishasune-git


Tags: #Java #Kafka #SpringBoot #Microservices #FinTech #BackendEngineering #DistributedSystems #PaymentSystems #AWS #SoftwareEngineering
