How We Improved Payment System Throughput by 25% Using Apache Kafka at a Fortune 500 FinTech
By Disha Sune — Java Backend Engineer | Spring Boot | Kafka | AWS | Fiserv
The Problem
At Fiserv, our payment processing platform handled millions of financial transactions daily for 600+ enterprise clients including McDonald's, Google, and Domino's.
As transaction volumes grew, our legacy synchronous REST API architecture started showing cracks:
- API response times were spiking during peak load periods
- Downstream services were tightly coupled — if one failed, everything failed
- No fault tolerance — a single service timeout caused transaction failures
- Scaling was painful — we had to scale everything together even when only one component was under load
We needed a fundamental architectural shift. The answer was Apache Kafka.
Why Kafka?
We evaluated several messaging solutions — RabbitMQ, ActiveMQ, AWS SQS — but chose Kafka for three reasons:
1. High throughput at scale
Kafka can handle millions of messages per second with minimal latency. For a payment system processing millions of daily transactions, this was non-negotiable.
2. Message durability and replay
Unlike traditional message queues, Kafka retains messages on disk. If a consumer fails, it can replay from the last committed offset — critical for financial data where you cannot lose a single transaction.
3. Decoupled architecture
Kafka naturally decouples producers from consumers. Our authorization service could publish events without knowing anything about the settlement or reporting services consuming them.
The Architecture We Built
Before Kafka, our architecture looked like this:
Client Request
      ↓
Authorization Service ──→ Settlement Service (synchronous REST)
                      ──→ Reporting Service (synchronous REST)
                      ──→ Chargeback Service (synchronous REST)
Every downstream call was synchronous. One slow service meant the entire request chain slowed down.
After Kafka, it looked like this:
Client Request
      ↓
Authorization Service ──→ Kafka Topic: payment-events
                              ↓            ↓            ↓
                         Settlement    Reporting    Chargeback
                          Consumer      Consumer     Consumer
The authorization service publishes one event. Multiple consumers process it independently and asynchronously.
Implementation — Key Code Patterns
Producer Configuration (Spring Boot)
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.support.serializer.JsonSerializer;

@Configuration
public class KafkaProducerConfig {

    @Bean
    public ProducerFactory<String, PaymentEvent> producerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);
        // Critical for financial systems — ensure no message loss
        config.put(ProducerConfig.ACKS_CONFIG, "all");
        config.put(ProducerConfig.RETRIES_CONFIG, 3);
        config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        return new DefaultKafkaProducerFactory<>(config);
    }

    @Bean
    public KafkaTemplate<String, PaymentEvent> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }
}
Key settings explained:
- ACKS_CONFIG = "all" — the producer waits for all in-sync replicas to acknowledge before considering a message sent. No message loss.
- ENABLE_IDEMPOTENCE_CONFIG = true — prevents duplicate messages even if the producer retries.
- RETRIES_CONFIG = 3 — automatic retry on transient failures.
Publishing a Payment Event
import java.util.concurrent.CompletableFuture;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.stereotype.Service;

@Service
public class PaymentEventPublisher {

    private static final Logger log = LoggerFactory.getLogger(PaymentEventPublisher.class);
    private static final String TOPIC = "payment-events";

    private final KafkaTemplate<String, PaymentEvent> kafkaTemplate;

    public PaymentEventPublisher(KafkaTemplate<String, PaymentEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publishPaymentAuthorized(PaymentEvent event) {
        // Use transactionId as key — ensures ordering per transaction
        CompletableFuture<SendResult<String, PaymentEvent>> future =
                kafkaTemplate.send(TOPIC, event.getTransactionId(), event);

        future.whenComplete((result, ex) -> {
            if (ex != null) {
                log.error("Failed to publish payment event: {}", event.getTransactionId(), ex);
                // Handle failure — fallback or alert
            } else {
                log.info("Published payment event to partition {} offset {}",
                        result.getRecordMetadata().partition(),
                        result.getRecordMetadata().offset());
            }
        });
    }
}
Why use transactionId as the message key?
Kafka guarantees ordering within a partition. By using transactionId as the key, all events for the same transaction always go to the same partition — ensuring they are processed in order.
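If you're curious how that mapping works, here's roughly what Kafka's default partitioner does for a keyed record. This is a simplified sketch using Kafka's own hash utilities; the real DefaultPartitioner also handles null keys and sticky partitioning:

import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public class PartitionDemo {

    // Simplified version of Kafka's keyed partitioning: the same key always
    // hashes to the same partition, so every event for one transactionId
    // lands on one partition and is consumed in order.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("txn-12345", 12)); // deterministic
        System.out.println(partitionFor("txn-12345", 12)); // same partition every time
    }
}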
Consumer Configuration
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.support.serializer.JsonDeserializer;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, PaymentEvent> consumerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ConsumerConfig.GROUP_ID_CONFIG, "settlement-service");
        config.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        config.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);
        // Tell the JSON deserializer what type to produce and which packages to trust
        config.put(JsonDeserializer.VALUE_DEFAULT_TYPE, PaymentEvent.class);
        config.put(JsonDeserializer.TRUSTED_PACKAGES, "com.example.payments"); // adjust to your event's package
        // Manual offset commit — we control when a message is "done"
        config.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        // Read from beginning if no committed offset exists
        config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        return new DefaultKafkaConsumerFactory<>(config);
    }
}
Why disable auto commit?
Auto commit can mark a message as processed before your business logic finishes. In a payment system, this is dangerous — you could lose a transaction. Manual commit ensures you only mark a message as processed after it's fully handled.
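One wiring detail worth calling out: for a @KafkaListener method to receive an Acknowledgment at all, the listener container must be switched to manual ack mode. A minimal sketch of the container factory the consumer below references (the config class name is illustrative):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Configuration
public class KafkaListenerConfig {

    // The containerFactory referenced by @KafkaListener below.
    // AckMode.MANUAL hands an Acknowledgment to the listener, so offsets
    // are committed only when we explicitly call acknowledge().
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, PaymentEvent> kafkaListenerContainerFactory(
            ConsumerFactory<String, PaymentEvent> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, PaymentEvent> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
        return factory;
    }
}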
Consuming Payment Events
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Service;

@Service
public class SettlementConsumer {

    private static final Logger log = LoggerFactory.getLogger(SettlementConsumer.class);

    private final SettlementService settlementService;
    private final RetryPublisher retryPublisher;
    private final DeadLetterPublisher deadLetterPublisher;

    public SettlementConsumer(SettlementService settlementService,
                              RetryPublisher retryPublisher,
                              DeadLetterPublisher deadLetterPublisher) {
        this.settlementService = settlementService;
        this.retryPublisher = retryPublisher;
        this.deadLetterPublisher = deadLetterPublisher;
    }

    @KafkaListener(
            topics = "payment-events",
            groupId = "settlement-service",
            containerFactory = "kafkaListenerContainerFactory"
    )
    public void handlePaymentEvent(PaymentEvent event, Acknowledgment acknowledgment) {
        try {
            log.info("Processing settlement for transaction: {}", event.getTransactionId());
            // Process the settlement
            settlementService.processSettlement(event);
            // Only commit offset after successful processing
            acknowledgment.acknowledge();
        } catch (RecoverableException ex) {
            // Retry logic — put back on retry topic
            log.warn("Retryable error, will retry: {}", event.getTransactionId());
            retryPublisher.publishToRetryTopic(event);
            acknowledgment.acknowledge(); // Acknowledge so we don't reprocess from the main topic
        } catch (Exception ex) {
            // Non-recoverable — send to dead letter topic for manual review
            log.error("Failed to process settlement: {}", event.getTransactionId(), ex);
            deadLetterPublisher.publishToDeadLetterTopic(event);
            acknowledgment.acknowledge();
        }
    }
}
The Dead Letter Queue Pattern
One of the most important patterns we implemented was the Dead Letter Queue (DLQ).
In a financial system, you cannot just discard a failed message. Every failed transaction needs to be investigated and retried.
payment-events (main topic)
↓
[Consumer fails]
↓
payment-events-retry (retry topic — retried after 5 minutes)
↓
[Still failing after 3 retries]
↓
payment-events-dlq (dead letter queue — human review)
This pattern gave us:
- Zero message loss — every transaction is accounted for
- Automatic recovery — transient failures auto-retry
- Visibility — operations team monitors the DLQ for critical failures
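Here's a minimal sketch of what a retry publisher along these lines can look like. The topic names match the diagram above; the attempts header, the max-attempts constant, and the two-argument signature are illustrative assumptions rather than our exact implementation:

import java.nio.charset.StandardCharsets;

import org.apache.kafka.clients.producer.ProducerRecord;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class RetryPublisher {

    private static final String RETRY_TOPIC = "payment-events-retry";
    private static final String DLQ_TOPIC = "payment-events-dlq";
    private static final String ATTEMPTS_HEADER = "x-retry-attempts"; // illustrative header name
    private static final int MAX_ATTEMPTS = 3;

    private final KafkaTemplate<String, PaymentEvent> kafkaTemplate;

    public RetryPublisher(KafkaTemplate<String, PaymentEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publishToRetryTopic(PaymentEvent event, int attemptsSoFar) {
        // Once the retry budget is exhausted, escalate to the DLQ for human review
        String topic = attemptsSoFar >= MAX_ATTEMPTS ? DLQ_TOPIC : RETRY_TOPIC;
        ProducerRecord<String, PaymentEvent> record =
                new ProducerRecord<>(topic, event.getTransactionId(), event);
        record.headers().add(ATTEMPTS_HEADER,
                String.valueOf(attemptsSoFar + 1).getBytes(StandardCharsets.UTF_8));
        kafkaTemplate.send(record);
    }
}

The 5-minute delay itself isn't shown here; a common approach is a dedicated retry consumer that pauses before reprocessing, or a scheduled poller on the retry topic.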
Results We Achieved
After migrating to Kafka-based event-driven architecture:
| Metric | Before Kafka | After Kafka | Improvement |
|---|---|---|---|
| System throughput | Baseline | +25% | 25% improvement |
| Service coupling | Tight synchronous | Fully decoupled | Independent scaling |
| Failure impact | Cascading failures | Isolated per service | Fault tolerant |
| Peak load handling | Degraded performance | Consistent throughput | Stable under load |
| Transaction loss | Occasional on failures | Zero | 100% durability |
Key Lessons Learned
1. Always use idempotent consumers
Your consumer must be able to process the same message twice without side effects. Network issues can cause duplicate deliveries. We added a processed_transactions table and check it before processing each event:

// At the top of handlePaymentEvent, before any business logic:
if (transactionRepository.isAlreadyProcessed(event.getTransactionId())) {
    log.info("Duplicate message, skipping: {}", event.getTransactionId());
    acknowledgment.acknowledge();
    return;
}
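The other half of the pattern is recording each transaction once it has been handled. A sketch of the duplicate check, assuming a Spring Data JPA repository over the processed_transactions table (entity and method names are illustrative):

import org.springframework.data.jpa.repository.JpaRepository;

public interface TransactionRepository extends JpaRepository<ProcessedTransaction, String> {

    // Derived query: true if a row with this transactionId already exists
    boolean existsByTransactionId(String transactionId);

    // Matches the call site in the listener above
    default boolean isAlreadyProcessed(String transactionId) {
        return existsByTransactionId(transactionId);
    }
}

Insert the processed row in the same database transaction as the settlement write, so the check and the write stay atomic.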
2. Monitor consumer lag religiously
Consumer lag = how far behind your consumer is from the latest message. In a payment system, high lag means transactions are being delayed. We set CloudWatch alerts for lag > 1000 messages.
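Dashboards aside, you can compute lag yourself with Kafka's AdminClient by comparing the group's committed offsets against the latest broker offsets. A rough sketch (group id and bootstrap servers are placeholders):

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the settlement-service group has committed so far
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("settlement-service")
                         .partitionsToOffsetAndMetadata().get();
            // Latest offsets on the broker for the same partitions
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();
            committed.forEach((tp, meta) -> {
                if (meta == null) return; // no committed offset yet for this partition
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}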
3. Partition count matters
More partitions = more parallelism = higher throughput. But you cannot reduce partition count later. We started with 12 partitions per topic based on our expected throughput — plan ahead.
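With Spring Kafka you can pin the partition count down in code, so it's versioned alongside the service. A small sketch (the replication factor is an assumption, not our production value):

import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.TopicBuilder;

@Configuration
public class TopicConfig {

    // Declaring the topic as a bean lets Spring's KafkaAdmin create it
    // on startup with the partition count we planned for.
    @Bean
    public NewTopic paymentEvents() {
        return TopicBuilder.name("payment-events")
                .partitions(12)
                .replicas(3) // assumption: pick your replication factor
                .build();
    }
}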
4. Schema evolution with Avro or JSON Schema
As your event schema evolves, old consumers must still work with new messages. Use schema registry and design schemas for backward compatibility from day one.
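For example, in Avro a backward-compatible change means any newly added field must carry a default, so consumers on the new schema can still read data written with the old one. A minimal illustration (the added field is hypothetical):

{
  "type": "record",
  "name": "PaymentEvent",
  "fields": [
    {"name": "transactionId", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}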
Conclusion
Migrating from synchronous REST calls to an event-driven Kafka architecture was one of the highest-impact technical decisions on our platform. It gave us throughput improvements, fault tolerance, and the ability to scale individual services independently.
If you're building payment systems or any high-throughput distributed system and still relying entirely on synchronous REST communication — Kafka is worth seriously evaluating.
Happy to answer any questions in the comments. Connect with me on LinkedIn if you want to discuss distributed systems, Java, or FinTech architecture.
Disha Sune — Java Backend Engineer | Spring Boot | Apache Kafka | AWS | Microservices
linkedin.com/in/disha-sune-168b661b8
github.com/dishasune-git
Tags: #Java #Kafka #SpringBoot #Microservices #FinTech #BackendEngineering #DistributedSystems #PaymentSystems #AWS #SoftwareEngineering