How to guarantee no transaction event is ever silently lost, even when Kafka goes down.
Every payment system eventually faces the same problem. A transaction completes. Downstream systems need to know. An SMS must be sent. An audit log must be written. A fraud analysis must run. A compliance record must be created.
The obvious solution is to publish an event to Kafka after the payment is processed and let downstream consumers handle each task asynchronously. It works perfectly in development. In production, Kafka goes down during peak processing, and you discover your architecture has a gap that no amount of retry logic can fix.
This article explains that gap precisely, why it exists, and how the outbox pattern closes it permanently. It covers the database schema, the Spring Boot implementation, the failure scenarios the pattern handles, and the exactly-once delivery nuance that most engineers miss the first time they implement this.
The problem: direct Kafka publishing
The naive implementation of async event publishing looks like this:
@Service
public class PaymentService {

    @Autowired
    private PaymentRepository paymentRepository;

    @Autowired
    private KafkaTemplate<String, PaymentEvent> kafkaTemplate;

    @Transactional
    public void processPayment(Payment payment) {
        // Step 1: Save payment to database
        paymentRepository.save(payment);

        // Step 2: Publish event to Kafka
        kafkaTemplate.send("payment.completed", new PaymentEvent(payment));

        // Step 3: Return success to user
    }
}
This works until Kafka is unavailable. When that happens, you have two bad options:
Fail the entire payment because Kafka is unreachable. The user gets an error for a problem that has nothing to do with their payment.
Complete the payment but swallow the Kafka exception. The payment is recorded but the event is silently lost. No SMS sent, no audit log written, no compliance record created.
Neither is acceptable. The first creates a false dependency between payment processing and Kafka availability. The second creates silent data loss in a regulated financial system.
There is also a third failure mode that is easy to miss. kafkaTemplate.send is asynchronous: it queues the message in the producer's in-memory buffer and returns before the broker has received it. If the application crashes after the database transaction commits but before that buffer is flushed to the broker, the payment is recorded but the event is never produced. You have no way to know this happened, and no automatic recovery mechanism.
The solution: the outbox pattern
The outbox pattern eliminates these failure modes by removing Kafka from the critical path of payment processing entirely. The payment service never publishes to Kafka directly. Instead it writes to a dedicated outbox table in PostgreSQL, and a separate background processor handles the Kafka publishing independently.
The core insight
Writing to the outbox table and writing the payment record happen in the same database transaction. Both succeed together or both fail together. PostgreSQL's ACID guarantees make this atomic. There is no window where the payment is recorded but the event is not, because they are committed as a single operation.
Kafka publishing moves outside the transaction entirely. It becomes a best-effort background operation that can fail, retry, and eventually succeed without ever affecting the user-facing payment flow.
The database schema
First, create the outbox table:
CREATE TABLE outbox_events (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_id   VARCHAR(100) NOT NULL,  -- e.g. transaction ID
    aggregate_type VARCHAR(100) NOT NULL,  -- e.g. 'PAYMENT'
    event_type     VARCHAR(100) NOT NULL,  -- e.g. 'PAYMENT_COMPLETED'
    topic          VARCHAR(100) NOT NULL,  -- Kafka topic to publish to
    payload        JSONB NOT NULL,         -- message content
    sent           BOOLEAN NOT NULL DEFAULT FALSE,
    created_at     TIMESTAMP NOT NULL DEFAULT NOW(),
    sent_at        TIMESTAMP
);

-- Index for the processor to efficiently find unsent events
CREATE INDEX idx_outbox_unsent ON outbox_events (sent, created_at)
    WHERE sent = FALSE;
The index on sent and created_at is important. At scale this table will have millions of rows. Without the index, the processor scans the entire table on every run. With the partial index filtering on sent = FALSE, the database only scans the small subset of rows that actually need processing.
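To confirm the partial index is actually being used, you can inspect the query plan for the processor's read (a sketch; exact plan output varies by PostgreSQL version and table statistics):

```sql
EXPLAIN
SELECT * FROM outbox_events
WHERE sent = FALSE
ORDER BY created_at ASC
LIMIT 100;
-- The plan should show an index scan on idx_outbox_unsent
-- rather than a sequential scan over the whole table.
```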
The payment service implementation
The payment service writes both the transaction record and the outbox event in a single atomic transaction:
@Service
public class PaymentService {

    @Autowired private PaymentRepository paymentRepository;
    @Autowired private OutboxRepository outboxRepository;

    @Transactional
    public PaymentResponse processPayment(PaymentRequest request) {
        // Step 1: Validate and build payment
        Payment payment = buildPayment(request);

        // Step 2: Write payment record
        paymentRepository.save(payment);

        // Step 3: Write outbox event IN THE SAME TRANSACTION
        OutboxEvent event = OutboxEvent.builder()
                .aggregateId(payment.getId())
                .aggregateType("PAYMENT")
                .eventType("PAYMENT_COMPLETED")
                .topic("payment.completed")
                .payload(buildPayload(payment))
                .sent(false)
                .build();
        outboxRepository.save(event);

        // Step 4: Commit transaction (on method return)
        // Both payment and outbox event committed atomically

        // Step 5: Return success
        // Kafka not mentioned anywhere in this method
        return PaymentResponse.success(payment.getId());
    }
}
Notice that Kafka does not appear anywhere in this class. The payment service has no Kafka dependency. It only knows about PostgreSQL. This is the decoupling that makes the pattern resilient.
The outbox processor
A separate Spring component runs on a schedule and handles all Kafka publishing:
@Component
@Slf4j // Lombok logger; the code below uses `log`
public class OutboxProcessor {

    @Autowired private OutboxRepository outboxRepository;
    @Autowired private KafkaTemplate<String, String> kafkaTemplate;

    @Scheduled(fixedDelay = 5000) // runs every 5 seconds
    public void processOutbox() {
        // Read unsent events, ordered by creation time
        List<OutboxEvent> events = outboxRepository
                .findBySentFalseOrderByCreatedAtAsc();

        for (OutboxEvent event : events) {
            try {
                // Publish to Kafka
                kafkaTemplate.send(
                        event.getTopic(),
                        event.getAggregateId(),
                        event.getPayload()
                ).get(); // wait for broker acknowledgement

                // Mark as sent only after Kafka confirms receipt
                event.setSent(true);
                event.setSentAt(LocalDateTime.now());
                outboxRepository.save(event);
            } catch (Exception e) {
                // Kafka unavailable or publish failed.
                // Log and continue to the next event; it will be
                // retried on the next run.
                log.error("Failed to publish event {}: {}",
                        event.getId(), e.getMessage());
            }
        }
    }
}
The processor runs every 5 seconds. In normal operation events are published within seconds of being written. When Kafka is unavailable, events accumulate in the outbox table with sent=false and are published in order once Kafka recovers. Nothing is lost.
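The query the processor relies on is a standard Spring Data JPA derived method. A minimal repository sketch (the interface is not shown in the original; names are assumed to match the entity above):

```java
public interface OutboxRepository extends JpaRepository<OutboxEvent, UUID> {

    // Spring Data derives the query from the method name:
    // WHERE sent = false ORDER BY created_at ASC
    List<OutboxEvent> findBySentFalseOrderByCreatedAtAsc();
}
```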
Failure scenarios and how the pattern handles each
Kafka is down when payment is processed
Payment service:
Write payment record to transactions table [success]
Write event to outbox table, sent=false [success]
Return success to user [success]
Outbox processor runs:
Read event from outbox table [success]
Attempt to publish to Kafka [fails - Kafka down]
Log error, event stays sent=false [retries next run]
Kafka recovers:
Outbox processor reads same event, sent=false
Publishes to Kafka [success]
Marks sent=true
Downstream consumers receive event
Payment service crashes before returning success
Payment service:
Write payment record [success]
Write outbox event, sent=false [success]
CRASH before returning response to user
User retries payment:
Idempotency key check: has this been processed?
Yes - same payment record exists in database
Return success, do not create duplicate payment
Outbox processor:
Finds original event, sent=false
Publishes to Kafka
Marks sent=true
Single event published, single outcome
Processor crashes after publishing but before marking sent
Outbox processor:
Read event, sent=false
Publish to Kafka [success]
CRASH before marking sent=true
Processor restarts:
Reads same event, still sent=false
Publishes to Kafka again - DUPLICATE
Marks sent=true
Kafka now has this event twice.
This is why consumers must be idempotent.
This is the most important failure scenario to understand. The outbox pattern guarantees at-least-once delivery, not exactly-once delivery. Duplicate messages are possible in crash scenarios. The solution is idempotent consumers, not attempting to prevent duplicates at the producer level.
Idempotent consumers: the final safety net
Every consumer that receives events from Kafka must handle duplicates gracefully. The pattern is consistent across all consumers: check whether this event ID has already been processed before doing any work.
@Service
@Slf4j // Lombok logger; the code below uses `log`
public class SmsNotificationConsumer {

    @Autowired private SmsService smsService;
    @Autowired private ProcessedEventRepository processedEvents;

    @KafkaListener(topics = "payment.completed")
    @Transactional
    public void handlePaymentCompleted(PaymentEvent event) {
        // Check if already processed
        if (processedEvents.existsByEventId(event.getId())) {
            log.info("Duplicate event ignored: {}", event.getId());
            return;
        }

        // Process the event
        smsService.send(
                event.getUserPhone(),
                "Your payment of " + event.getAmount() + " was successful"
        );

        // Record that we have processed this event.
        // This insert and the SMS send commit in one transaction.
        processedEvents.save(new ProcessedEvent(event.getId()));
    }
}
Each consumer maintains its own processed events table. The check and the processing happen in one database transaction, making the consumer idempotent even in the face of duplicates.
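A minimal schema for that per-consumer table might look like this (table and column names are assumptions, not from the original service). Making the event ID the primary key means that if two concurrent deliveries race past the existence check, the second insert fails and rolls back its transaction instead of producing a second SMS:

```sql
CREATE TABLE processed_events (
    event_id     UUID PRIMARY KEY,  -- outbox event ID; a duplicate insert violates the PK
    processed_at TIMESTAMP NOT NULL DEFAULT NOW()
);
```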
Exactly-once delivery: the honest picture
The outbox pattern is frequently described as providing exactly-once delivery. This is imprecise and worth correcting.
What the outbox pattern actually guarantees
Every event will be published to Kafka at least once. No events are silently lost.
Duplicate events are possible in crash scenarios, specifically when the processor crashes after publishing but before marking sent=true.
The frequency of duplicates is very low in practice but cannot be eliminated entirely without coordination overhead that makes the system impractical.
What Kafka's enable.idempotence covers
Enabling idempotence on the Kafka producer eliminates duplicates caused by network retries within a single producer session. Kafka assigns each producer a unique ID and tracks sequence numbers. If the same message arrives twice from the same producer session, Kafka deduplicates it automatically.
props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("retries", Integer.MAX_VALUE);
However, enable.idempotence does not cover the crash scenario described above. When the processor restarts, it receives a new producer ID. Kafka treats it as a completely different producer and has no way to know that the message it is receiving is a duplicate of one sent in the previous session.
The practical solution
The industry standard is to combine at-least-once delivery from the outbox pattern with idempotent consumers. This gives you the correct outcome in every scenario:
Outbox pattern: no events lost regardless of Kafka availability
enable.idempotence: no duplicates from network retries within a session
Idempotent consumers: no duplicate outcomes even when duplicates arrive
The goal is not exactly-once publishing. The goal is exactly-once outcomes. These are different things, and the combination above achieves the second without the impractical overhead of attempting the first.
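The distinction can be shown in a few lines of plain Java. This is an illustrative sketch, not the article's Spring code: a HashSet stands in for the consumer's processed-events table, and a counter stands in for the SMS send.

```java
import java.util.HashSet;
import java.util.Set;

// At-least-once delivery plus an idempotent consumer still yields
// exactly-once outcomes: duplicate deliveries do not repeat the effect.
public class IdempotentOutcomeDemo {

    static final Set<String> processedEventIds = new HashSet<>();
    static int smsSentCount = 0;

    // Idempotent handler: skip events whose ID was already recorded.
    static void handle(String eventId) {
        if (!processedEventIds.add(eventId)) {
            return; // duplicate delivery, no second outcome
        }
        smsSentCount++; // stand-in for sending the SMS
    }

    public static void main(String[] args) {
        // The outbox processor crashed after publishing but before
        // marking sent=true, so the same event is delivered twice.
        handle("evt-42");
        handle("evt-42");

        // Two deliveries, one outcome.
        System.out.println(smsSentCount); // prints 1
    }
}
```

Swap the HashSet for a database table in the same transaction as the side effect, and this is exactly the SmsNotificationConsumer pattern shown earlier.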
Operational considerations
Outbox table growth
The outbox table grows continuously. Sent events should be cleaned up periodically to prevent it from becoming a performance problem:
-- Run on a schedule, e.g. daily
DELETE FROM outbox_events
WHERE sent = TRUE
AND sent_at < NOW() - INTERVAL '7 days';
Keep sent events for a few days before deleting them. This gives you a window to investigate any consumer issues and replay events if needed.
Multiple processor instances
If you run multiple instances of your application, multiple outbox processors will run simultaneously. Without coordination, they will all try to publish the same events at the same time.
The solution is to use SELECT FOR UPDATE SKIP LOCKED when reading from the outbox table. This is a PostgreSQL feature that allows concurrent processors to claim events exclusively without blocking each other:
@Query(value = "SELECT * FROM outbox_events " +
               "WHERE sent = FALSE " +
               "ORDER BY created_at ASC " +
               "LIMIT 100 " +
               "FOR UPDATE SKIP LOCKED",
       nativeQuery = true)
List<OutboxEvent> findUnsentEventsWithLock();
Each processor instance claims a batch of events exclusively. Other instances skip the locked rows and claim their own batches, so multiple instances process events in parallel without conflicts. Note that row locks are only held for the duration of the database transaction, so the method that claims, publishes, and marks events must run inside a single transaction (for example, annotated with @Transactional) for the locking to have any effect.
Monitoring
Monitor these two metrics to catch outbox problems early:
Outbox table depth: the number of rows where sent=false. Under normal conditions this should be near zero. A growing backlog indicates either Kafka is unavailable or the processor has stopped running.
Outbox event age: the oldest created_at timestamp among unsent events. Events more than a few minutes old indicate a problem that needs investigation.
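Both metrics can be read with simple queries against the outbox table (a sketch; how you export them to your monitoring system is up to you):

```sql
-- Outbox table depth: unsent events waiting to be published
SELECT COUNT(*) FROM outbox_events WHERE sent = FALSE;

-- Outbox event age: how long the oldest unsent event has been waiting
SELECT NOW() - MIN(created_at) AS oldest_unsent_age
FROM outbox_events
WHERE sent = FALSE;
```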
Summary
Direct Kafka publishing from the payment service creates a hidden dependency on Kafka availability and a data loss window on application crashes.
The outbox pattern removes Kafka from the critical payment path by writing events to a PostgreSQL table atomically with the payment record.
A separate background processor reads the outbox table and publishes to Kafka independently of payment processing.
Kafka downtime delays downstream processing but never prevents payment confirmation or causes event loss.
The outbox pattern guarantees at-least-once delivery, not exactly-once. Duplicate messages are possible.
Idempotent consumers handle duplicates gracefully, making the end-to-end outcome correct even when duplicates arrive.
Use SELECT FOR UPDATE SKIP LOCKED when running multiple application instances to prevent concurrent processors from duplicating work.
Monitor outbox table depth and event age to catch problems before they affect users.