A few weeks ago I published an article about an event-driven order pipeline I built in .NET with Kafka and Azure Service Bus. Someone left a comment that stopped me in my tracks.
Andrew Tan wrote:
"One thing I'd watch: you now have two sources of truth in flight, PostgreSQL and Kafka. If the API crashes after writing to Postgres but before publishing, you've got an order that never gets processed. Have you considered using an outbox pattern or transactional writes to close that gap?"
He was right. I had not thought about it properly.
The gap he spotted
Here is what the original code did when a new order came in:
- Save the order to PostgreSQL
- Publish an event to Kafka
Two separate operations. No transaction between them. If the API crashed, ran out of memory, got killed by Kubernetes, or just had a bad moment between step 1 and step 2, the order would exist in the database with a Pending status and never move forward.
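Sketched with in-memory stand-ins for the database and the broker (hypothetical code, not the project's actual pipeline), the failure looks like this:

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-ins for PostgreSQL and Kafka, just to make the gap visible.
var database = new List<string>();
var broker = new List<string>();

void CreateOrder(bool crashBetweenSteps)
{
    database.Add("order-1:Pending");        // step 1: row committed
    if (crashBetweenSteps) return;          // process dies here
    broker.Add("OrderCreated:order-1");     // step 2: never runs
}

CreateOrder(crashBetweenSteps: true);
Console.WriteLine($"rows={database.Count}, events={broker.Count}"); // rows=1, events=0
```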
Nobody would know. No error. No alert. The order would just sit there.
At low volume this probably never causes a visible problem. At scale, or in production with real money on the line, it is a serious reliability issue.
The outbox pattern
The fix is called the outbox pattern. The idea is simple.
Instead of writing to the database and then publishing to Kafka as two separate operations, you write the order and an outbox record in the same database transaction. The outbox record is just a row in a table that says "this event needs to be published."
A separate background service then reads unprocessed outbox records, publishes them to Kafka, and marks them as processed. If publishing fails, the record stays unprocessed and gets retried. If the background service crashes mid-publish, it picks up the same record on restart.
The database transaction is the source of truth. Either both the order and the outbox record are committed together, or neither is. There is no window where one exists without the other.
What I built
First I added an OutboxMessage model:
public class OutboxMessage
{
    public Guid Id { get; set; } = Guid.NewGuid();
    public Guid OrderId { get; set; }
    public string EventType { get; set; } = string.Empty;
    public string Payload { get; set; } = string.Empty;
    public DateTime CreatedAt { get; set; } = DateTime.UtcNow;
    public DateTime? ProcessedAt { get; set; }
    public bool Processed { get; set; } = false;
    public int RetryCount { get; set; } = 0;
    public string? Error { get; set; }
}
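The Payload column carries the full serialized event, so the poller can publish without re-reading the Orders table. A quick round-trip sketch with System.Text.Json (the anonymous order object here is a stand-in, not the real model):

```csharp
using System;
using System.Text.Json;

// Stand-in order; field names are illustrative, not the article's exact model.
var order = new { Id = Guid.NewGuid(), Status = "Pending" };

var message = new
{
    OrderId = order.Id,
    EventType = "OrderCreated",
    Payload = JsonSerializer.Serialize(order), // full event body stored in the row
    Processed = false,
    RetryCount = 0
};

// The poller only ever needs this row to publish the event.
Console.WriteLine(message.Payload);
```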
Then I updated the controller to write both in the same transaction:
public async Task<IActionResult> CreateOrder([FromBody] CreateOrderRequest request)
{
    var order = new Order { ... };

    var outboxMessage = new OutboxMessage
    {
        OrderId = order.Id,
        EventType = nameof(OrderEventType.OrderCreated),
        Payload = JsonSerializer.Serialize(order)
    };

    db.Orders.Add(order);
    db.OutboxMessages.Add(outboxMessage);
    await db.SaveChangesAsync(); // one transaction, both or neither

    return CreatedAtAction(nameof(GetOrder), new { id = order.Id }, order);
}
No Kafka publish in the controller anymore. The controller just writes to the database and returns.
Then I built the OutboxProcessorService as a BackgroundService that polls every 5 seconds:
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        await ProcessOutboxMessagesAsync();
        await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
    }
}
private async Task ProcessOutboxMessagesAsync()
{
    var messages = await db.OutboxMessages
        .Where(m => !m.Processed && m.RetryCount < 3)
        .OrderBy(m => m.CreatedAt)
        .ToListAsync();

    foreach (var message in messages)
    {
        try
        {
            await PublishToKafkaAsync(message);
            message.Processed = true;
            message.ProcessedAt = DateTime.UtcNow;
        }
        catch (Exception ex)
        {
            message.RetryCount++;
            message.Error = ex.Message;
        }

        await db.SaveChangesAsync();
    }
}
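The retry semantics can be exercised without a broker by swapping the Kafka call for a delegate. This is a trimmed-down sketch of the per-message logic above, not the real service (which calls PublishToKafkaAsync and persists these fields through EF Core):

```csharp
using System;

// Flattened message state, standing in for one OutboxMessage row.
bool processed = false;
int retryCount = 0;
string? error = null;

void Handle(Action publish)
{
    try
    {
        publish();        // stands in for PublishToKafkaAsync
        processed = true;
    }
    catch (Exception ex)
    {
        retryCount++;     // row stays unprocessed, retried on the next poll
        error = ex.Message;
    }
}

Handle(() => throw new InvalidOperationException("broker unavailable")); // poll 1: Kafka down
Handle(() => { });                                                       // poll 2: recovered
Console.WriteLine($"processed={processed}, retries={retryCount}");       // processed=True, retries=1
```

Note that, as in the service above, a success does not clear the Error field; it records the last failure seen.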
What the logs look like now
When an order comes in:
INSERT INTO "Orders" ...
INSERT INTO "OutboxMessages" ...
Order f21613da created and outbox message queued
Five seconds later:
Processing 1 unprocessed outbox messages
Order f21613da published to Kafka topic orders at offset 5
Outbox message published successfully
UPDATE "OutboxMessages" SET "Processed" = true ...
The gap is closed. The order and the outbox record live or die together in the same transaction, and Kafka gets the event eventually. Strictly speaking, the guarantee is at-least-once: if the poller crashes after publishing but before marking the row processed, the same event is published again on restart, so consumers need to handle duplicates.
What Andrew also flagged
After I posted the fix, Andrew came back with more good points. With a 5-second polling interval you are trading event latency against database load. Fine at low volume. At scale you want FOR UPDATE SKIP LOCKED so multiple poller instances do not grab the same rows.
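In Postgres, claiming a batch with SKIP LOCKED looks roughly like this (a sketch; the quoted identifiers assume EF Core's default mapping, which matches the table and column names in the logs above, and the LIMIT is an arbitrary batch size):

```sql
-- Rows already locked by another poller are skipped instead of blocking.
SELECT * FROM "OutboxMessages"
WHERE "Processed" = false AND "RetryCount" < 3
ORDER BY "CreatedAt"
LIMIT 50
FOR UPDATE SKIP LOCKED;
```

The SELECT has to run inside the same transaction that later marks the claimed rows processed; the row locks are released on commit.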
He also asked what happens if Kafka is down. Currently unprocessed records pile up and the poller keeps retrying every 5 seconds with no backoff. That is worth fixing. A dead letter path and an alert on outbox message age would make this production-ready.
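One way to add backoff without changing the 5-second loop is to skip a failed message until a computed delay has elapsed since its last attempt. A sketch of the delay function (the 5-second base matches the poller above; the doubling and the 5-minute cap are my assumptions, and this would need a last-attempt timestamp column the model does not have yet):

```csharp
using System;

// Exponential backoff per retry: 5s, 10s, 20s, 40s, ... capped at 5 minutes.
// The poller would only reselect a failed row once
// (time of last attempt) + BackoffDelay(RetryCount) has passed.
TimeSpan BackoffDelay(int retryCount)
{
    var delay = TimeSpan.FromSeconds(5 * Math.Pow(2, retryCount));
    var cap = TimeSpan.FromMinutes(5);
    return delay < cap ? delay : cap;
}

Console.WriteLine(BackoffDelay(0)); // 00:00:05
Console.WriteLine(BackoffDelay(4)); // 00:01:20
```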
Both are on the backlog. The current implementation is correct for a single poller. Horizontal scaling and dead letters are the next iteration.
The bigger point
I almost shipped this without the outbox pattern. The original code worked perfectly in testing. Kafka and PostgreSQL both got their data. No errors. No warnings.
The failure mode only shows up when something crashes between two operations that look like one. That is exactly the kind of bug that stays invisible until it costs someone something real.
Public code review from people who know what they are looking at is genuinely valuable. Andrew's comment was worth more than any linter or test suite would have caught here.
Source code: github.com/aftabkh4n/order-pipeline
If you are building event-driven systems and not using the outbox pattern, it is worth understanding. The implementation is not complicated. The reliability guarantee it gives you is significant.