I kept improving my .NET order pipeline after a CTO left feedback. Here is where it ended up.

#dotnet #kafka #microservices #architecture

A few weeks ago I published an article about an event-driven order pipeline I built in .NET. A CTO named Andrew Tan left a comment pointing out that my outbox pattern had a gap - the polling interval was trading latency for database load, and I had no protection against multiple poller instances stepping on each other.

I fixed the outbox gap in a follow-up post. But Andrew also flagged two more things worth addressing. This is where the pipeline stands now after working through all of them.

Where we started

The original pipeline had a working outbox pattern. Orders and outbox records written in the same PostgreSQL transaction. A background service polling every 5 seconds and publishing to Kafka. Messages marked as processed after a successful publish.

It worked. But it had three gaps Andrew spotted:

No protection for horizontal scaling - two poller instances would grab the same message
No backoff when Kafka was down - just constant retrying every 5 seconds
No dead letter path - messages that failed repeatedly just sat there forever

Fix 1 - FOR UPDATE SKIP LOCKED

The original query just fetched unprocessed messages. If you ran two instances of the service, both would grab the same messages and try to publish them twice.

The fix is a raw SQL query with FOR UPDATE SKIP LOCKED:

SELECT * FROM "OutboxMessages"
WHERE "Processed" = false AND "RetryCount" < 3
ORDER BY "CreatedAt"
FOR UPDATE SKIP LOCKED

FOR UPDATE locks the rows for the duration of the transaction. SKIP LOCKED means any other poller instance skips rows that are already locked rather than waiting. Two instances running in parallel will never claim the same message. Scale horizontally as much as you want.

The transaction stays open across the entire fetch, process, and save cycle. Only when the message is marked as processed and the transaction commits do the locks release.

Fix 2 - Exponential backoff for Kafka failures

The original service retried every 5 seconds regardless of what was happening. If Kafka was down, it would hammer the broker with connection attempts at a fixed rate.

The updated service tracks whether any publish failed and adjusts the wait interval accordingly:

if (anyKafkaFailure)
{
    _logger.LogWarning("Kafka publish failed. Backing off for {Seconds}s", _currentBackoff);
    await Task.Delay(TimeSpan.FromSeconds(_currentBackoff), stoppingToken);
    _currentBackoff = Math.Min(_currentBackoff * 2, MaxBackoffSeconds);
}
else
{
    _currentBackoff = BaseBackoffSeconds;
    await Task.Delay(TimeSpan.FromSeconds(BaseBackoffSeconds), stoppingToken);
}

First failure waits 5 seconds, then 10, then 20, then caps at 60. Any successful publish resets to 5 seconds. The service backs off gracefully when Kafka is struggling instead of making things worse.

Bad payloads that fail to deserialize increment the retry count but do not trigger backoff - only Kafka connection failures do. That distinction matters because you do not want a single corrupt message to slow down processing of everything else.

Fix 3 - Dead letter path

The original implementation stopped retrying after 3 attempts and left the message sitting in the outbox with RetryCount = 3. It was effectively dead but invisible.

Now when a message hits the retry limit, it moves to a DeadLetterMessages table:

if (message.RetryCount >= MaxRetries)
{
    var deadLetter = new DeadLetterMessage
    {
        OrderId = message.OrderId,
        EventType = message.EventType,
        Payload = message.Payload,
        OriginalCreatedAt = message.CreatedAt,
        DeadLetteredAt = DateTime.UtcNow,
        FailureReason = message.Error ?? "Max retries exceeded",
        RetryCount = message.RetryCount
    };

    context.DeadLetterMessages.Add(deadLetter);
    context.OutboxMessages.Remove(message);

    _logger.LogWarning(
        "Message {MessageId} for order {OrderId} moved to dead letter after {RetryCount} retries",
        message.Id, message.OrderId, message.RetryCount);
}

The outbox stays clean. Dead messages go somewhere visible. There is also a GET /api/deadletters endpoint so operators can inspect what failed and why without touching the database directly.

What the full picture looks like now

The outbox processor now handles four scenarios cleanly:

Happy path - message fetched, published to Kafka, marked as processed. Next poll in 5 seconds.

Kafka is down - publish fails, retry count increments, backoff doubles. Service waits progressively longer and tries again when Kafka recovers.

Multiple instances - FOR UPDATE SKIP LOCKED ensures each message is claimed by exactly one instance. No duplicate publishes.

Persistent failure - after 3 retries, message moves to dead letters. Outbox stays clean. Operator can inspect and replay manually.

The honest reflection

None of these improvements would have happened without Andrew's comment. The original implementation worked in testing. All three gaps only show up under specific production conditions - horizontal scaling, broker failures, persistent bad messages.

This is why public code review matters. A fresh pair of eyes from someone who has hit these problems before is worth more than any amount of solo review.

Source code: github.com/aftabkh4n/order-pipeline

If you are building event-driven systems this is worth reading alongside the original article. The outbox pattern is the foundation. These three additions are what make it production-ready.