We Replaced REST with Kafka and Cut API Failures 90% — .NET 9 Event-Driven Architecture

#microservices #kafka #dotnet #architecture

TL;DR — We swapped Mattrx's inter-service REST calls for Kafka topics (multi-tenant marketing analytics SaaS, .NET 9 / ASP.NET Core, Azure SQL, ~3,200 req/sec peak). Over 8 weeks: end-to-end ingestion failures 1.9% → 0.18% (−90%), ingestion p95 180 ms → 8 ms (−96%), events lost on downstream outage tens of thousands → 0, cascading-failure incidents ~3/month → 0, time to add a new consumer ~1 sprint → ~1 day.

🎥 3-minute video walkthrough: https://youtu.be/O21rbuQdM1Y
👉 Full deep-dive (architecture, code, pre-adoption checklist, when NOT to reach for Kafka): https://prepstack.co.in/blog/replaced-rest-with-kafka-cut-failures-90-percent

The one mental shift

The reflex in a .NET shop is: service A needs something to happen in service B, so A calls B's REST endpoint and waits. That's correct for a query ("give me this tenant's plan") — A genuinely needs the answer. It's wrong for an event ("this campaign event occurred") — A doesn't need anything back, yet it's now blocked on B, C, and D being healthy.

Distinguish commands/queries (need an answer → REST) from events (fire-and-forget → log). A synchronous call makes the caller's uptime a function of the callee's uptime. An event published to a durable log decouples them in time: the producer succeeds the instant the event is durably written, and consumers process it whenever they can.

Once you stop treating "something happened" as a function call and start treating it as a fact appended to a log, cascading failures, retry storms, and burst overload mostly disappear — because nothing downstream is on the request's critical path anymore.

Before — synchronous REST chain (the chain that breaks)

customer site ──► Collector ──► Enrichment ──► Analytics ──► Persister
                  (waits)       (waits)       (waits)      (waits)

// BEFORE — the collector waits on the whole chain. Any failure loses the event.
public async Task<IResult> Collect(CampaignEvent ev, CancellationToken ct)
{
    var enriched = await _enrichClient.EnrichAsync(ev, ct);  // can fail
    await _analyticsClient.RollupAsync(enriched, ct);        // can fail
    await _persistClient.SaveAsync(enriched, ct);            // can fail
    return Results.Ok();                                      // only if ALL three succeed
}

Availability multiplies. Four 99.9% services called in series give 0.999^4 ≈ 99.6% uptime, assuming they fail independently. Analytics slow at month-end → Collector times out → customer's event is gone → customer retries → MORE load on the struggling service → retry storm.
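
The multiplication above is easy to check. A minimal sketch: `SeriesAvailability` is an illustrative name, and the numbers assume each hop fails independently and a 30-day month.

```csharp
using System;
using System.Linq;

public static class SeriesAvailability
{
    // Availability of a synchronous chain: every hop must be up at the same
    // time, so the individual availabilities multiply (independence assumed).
    public static double Of(params double[] hops) =>
        hops.Aggregate(1.0, (acc, a) => acc * a);

    // Expected downtime in minutes over a 30-day month.
    public static double DowntimeMinutesPerMonth(double availability) =>
        (1.0 - availability) * 30 * 24 * 60;
}
```

`SeriesAvailability.Of(0.999, 0.999, 0.999, 0.999)` evaluates to roughly 0.9960. That is about 173 minutes of expected downtime per month for the chain, versus about 43 minutes for a single 99.9% service.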

After — Kafka log decouples the pipeline

customer site ──► Collector ──► [Kafka: events.raw]
                  returns 202        │
                  in ~8ms            ├──► [analytics group]   (independent)
                                     ├──► [persister group]   (independent)
                                     └──► [enrichment group]  (independent)

// AFTER — produce ONE event to Kafka and return. No downstream on the critical path.
public async Task<IResult> Collect(CampaignEvent ev, CancellationToken ct)
{
    var msg = new Message<string, byte[]> {
        Key   = ev.TenantId.ToString(),                  // per-tenant ordering
        Value = JsonSerializer.SerializeToUtf8Bytes(ev),
    };
    await producer.ProduceAsync("events.raw", msg, ct);  // durable, ~ms
    return Results.Accepted();                            // 202 — collector p95: 8 ms
}

The collector succeeds the instant the event is in the log. The customer NEVER waits on a consumer. Analytics down? Its consumer group lags and catches up later, as long as it recovers within the topic's retention window. No events lost, no customer impact.
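
How durable "in the log" really is depends on the producer's configuration, which the post doesn't show. A minimal sketch, assuming the Confluent.Kafka client (the `Message<string, byte[]>` type and `ProduceAsync` call above suggest it). The broker address and the chosen settings are placeholders, not the team's actual values.

```csharp
using Confluent.Kafka;

// Assumed settings for illustration; tune for your own cluster.
var config = new ProducerConfig
{
    BootstrapServers = "localhost:9092",  // placeholder address
    Acks = Acks.All,                      // complete only after all in-sync replicas have the write
    EnableIdempotence = true,             // broker discards duplicates caused by producer retries
};

using var producer = new ProducerBuilder<string, byte[]>(config).Build();
```

With these settings, a completed `ProduceAsync` means the write was acknowledged by the in-sync replicas. If it throws `ProduceException<string, byte[]>` instead, the event was not confirmed, so the collector should return an error rather than a 202.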

The consumer — groups, manual commit, retry, DLQ

Three things matter: a consumer group (so partitions parallelize), manual offset commit (commit only after successful processing — at-least-once delivery), and a retry + dead-letter path (so a poison message can't block the partition forever).

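protected override async Task ExecuteAsync(CancellationToken ct)
{
    await Task.Yield();                        // Consume() blocks; yield so host startup isn't stalled
    consumer.Subscribe("events.raw");          // consumer must be built with EnableAutoCommit = false
    while (!ct.IsCancellationRequested)
    {
        var result = consumer.Consume(ct);
        try {
            // Deserialize inside the try so a malformed payload goes to the DLQ instead of crashing the loop.
            var ev = JsonSerializer.Deserialize<CampaignEvent>(result.Message.Value)
                     ?? throw new JsonException("Payload deserialized to null");
            await rollup.ApplyAsync(ev, ct);                              // do the work
            consumer.Commit(result);                                       // ✅ commit AFTER success
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested) {
            break;                                                         // shutdown: leave uncommitted, redelivered on restart
        }
        catch (TransientException) when (Attempts(result) < 5) {
            // The retried copy must carry an incremented attempt count, or Attempts() never reaches 5.
            await producer.ProduceAsync("events.retry", result.Message, ct);
            consumer.Commit(result);
        }
        catch (Exception ex) {
            log.LogError(ex, "Poison → DLQ at offset {Offset}", result.Offset);
            await producer.ProduceAsync("events.dlq", result.Message, ct);
            consumer.Commit(result);                                       // don't block the partition
        }
    }
}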
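
The `Attempts(result)` helper is not shown in the post. One common way to track attempts is a message header that is incremented each time a message is re-published to the retry topic. A minimal sketch of the encoding, using only the base class library: the `RetryAttempts` class and the `x-attempts` header name are hypothetical.

```csharp
using System;
using System.Buffers.Binary;

public static class RetryAttempts
{
    // Hypothetical header name; any name unique to the pipeline works.
    public const string HeaderName = "x-attempts";

    // Decode the attempt count; a missing or malformed header counts as 0.
    public static int Read(byte[]? headerValue) =>
        headerValue is { Length: 4 }
            ? BinaryPrimitives.ReadInt32BigEndian(headerValue)
            : 0;

    // Encode the incremented count to stamp on the copy sent to the retry topic.
    public static byte[] Next(byte[]? headerValue)
    {
        var bytes = new byte[4];
        BinaryPrimitives.WriteInt32BigEndian(bytes, Read(headerValue) + 1);
        return bytes;
    }
}
```

With Confluent.Kafka, the raw value would likely come from `result.Message.Headers.TryGetLastBytes(RetryAttempts.HeaderName, out var raw)`, and the value from `Next` would be added to the outgoing message's `Headers` before producing to `events.retry`.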

Consumer lag is the one health metric. If lag climbs, the consumer can't keep up. Scale it out, up to one consumer instance per partition; beyond that, extra instances in the group sit idle. The producer is untouched.
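
For reference, lag is the gap between the end of the log and the group's committed position, summed over partitions. A minimal sketch of that arithmetic: the `PartitionOffsets` and `ConsumerLag` names are illustrative, and in practice the numbers usually come from broker-side tooling (for example the LAG column of `kafka-consumer-groups --describe`) rather than from hand-rolled code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public readonly record struct PartitionOffsets(int Partition, long LogEndOffset, long CommittedOffset);

public static class ConsumerLag
{
    // Lag for one partition: messages written but not yet committed by the group.
    public static long Of(PartitionOffsets p) =>
        Math.Max(0, p.LogEndOffset - p.CommittedOffset);

    // Total lag for the group across all partitions of the topic.
    public static long Total(IEnumerable<PartitionOffsets> partitions) =>
        partitions.Sum(p => Of(p));
}
```

A total that keeps growing is the signal to alert on; a brief spike that drains afterwards is the pipeline absorbing a burst.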

The outbox — atomic produce with a DB write

There's one trap. If the same handler writes to Azure SQL and produces to Kafka, those are two systems. A crash between them loses or duplicates the event. For anything tied to a committed DB change, use the outbox pattern: write the event to an outbox table in the same SQL transaction as the business data; a background relay reads unsent rows and publishes them.

// Atomic: the campaign row AND the event commit in ONE SQL transaction.
await using var tx = await db.Database.BeginTransactionAsync(ct);
db.Campaigns.Add(campaign);
db.Outbox.Add(new OutboxMessage(
    Topic: "events.raw",
    Payload: JsonSerializer.Serialize(ev)));
await db.SaveChangesAsync(ct);
await tx.CommitAsync(ct);
// Background relay reads unsent outbox rows, produces to Kafka, marks them sent.

Dual-write inconsistencies (the "DB says published, no event fired" bug class) → 0.
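
The background relay itself is not shown in the post. A minimal sketch of one polling pass, under stated assumptions: `OutboxRow`, `IOutboxStore`, `OutboxRelay`, and the in-memory store are hypothetical names introduced here, and the publish step is passed in as a delegate so the sketch runs without a broker.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed class OutboxRow
{
    public long Id { get; init; }
    public string Topic { get; init; } = "";
    public string Payload { get; init; } = "";
    public DateTimeOffset? SentAt { get; set; }
}

// Hypothetical storage abstraction; in production this would sit over the outbox table.
public interface IOutboxStore
{
    Task<IReadOnlyList<OutboxRow>> ReadUnsentAsync(int max, CancellationToken ct);
    Task MarkSentAsync(long id, CancellationToken ct);
}

public sealed class OutboxRelay
{
    private readonly IOutboxStore _store;
    private readonly Func<string, string, CancellationToken, Task> _publish; // (topic, payload)

    public OutboxRelay(IOutboxStore store, Func<string, string, CancellationToken, Task> publish)
    {
        _store = store;
        _publish = publish;
    }

    // One pass: publish oldest-first and mark a row sent only after its publish succeeds.
    // A crash between the two steps re-publishes that row on the next pass (at-least-once).
    public async Task<int> RunOnceAsync(int batchSize, CancellationToken ct)
    {
        var rows = await _store.ReadUnsentAsync(batchSize, ct);
        var sent = 0;
        foreach (var row in rows)
        {
            await _publish(row.Topic, row.Payload, ct);
            await _store.MarkSentAsync(row.Id, ct);
            sent++;
        }
        return sent;
    }
}

// In-memory store, only here to make the sketch runnable.
public sealed class InMemoryOutboxStore : IOutboxStore
{
    private readonly List<OutboxRow> _rows = new();
    public void Add(OutboxRow row) => _rows.Add(row);

    public Task<IReadOnlyList<OutboxRow>> ReadUnsentAsync(int max, CancellationToken ct) =>
        Task.FromResult<IReadOnlyList<OutboxRow>>(
            _rows.Where(r => r.SentAt is null).OrderBy(r => r.Id).Take(max).ToList());

    public Task MarkSentAsync(long id, CancellationToken ct)
    {
        _rows.Single(r => r.Id == id).SentAt = DateTimeOffset.UtcNow;
        return Task.CompletedTask;
    }
}
```

Marking a row sent only after the publish succeeds means the relay can duplicate an event but never drop one, which is the trade the next section relies on.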

Idempotent consumers — because delivery is at-least-once

With commit-after-processing, Kafka gives you at-least-once delivery. A consumer can crash after processing but before committing, and will then reprocess the same message on restart. So consumers must be idempotent. The simplest tool is a dedup key with a unique constraint.

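public async Task ApplyAsync(CampaignEvent ev, CancellationToken ct)
{
    // The dedup insert and the increment must commit together. Otherwise a crash
    // between them marks the event as processed without ever applying it.
    await using var tx = await _db.Database.BeginTransactionAsync(ct);

    // ExecuteInsertIfAbsentAsync and IncrementAsync are app-level helpers, not EF Core APIs.
    var firstTime = await _db.ProcessedEvents
        .ExecuteInsertIfAbsentAsync(ev.Id, ct);                 // unique index on EventId
    if (!firstTime) return;                                     // duplicate → no-op, safe (tx rolls back on dispose)

    await _db.Aggregates.IncrementAsync(ev.CampaignId, ev.Type, ct);
    await tx.CommitAsync(ct);
}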

At-least-once delivery + idempotent processing = effectively-once, and it's far simpler than chasing true exactly-once.
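
The equation above can be seen in miniature without a database. A minimal sketch, assuming event IDs are `Guid`s (the post doesn't state the type): `DedupingCounter` is an illustrative name, and the `HashSet` stands in for the unique index on the processed-events table.

```csharp
using System;
using System.Collections.Generic;

public sealed class DedupingCounter
{
    private readonly HashSet<Guid> _processed = new();   // stands in for the unique index
    public int Count { get; private set; }

    // Returns true if the event was applied, false if it was a duplicate delivery.
    public bool Apply(Guid eventId)
    {
        if (!_processed.Add(eventId)) return false;      // Add returns false when the ID is already present
        Count++;
        return true;
    }
}
```

Delivering the same event twice leaves `Count` at 1, so redelivery after a crash does not inflate the aggregate.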

The two superpowers REST never gave us

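Replay. A bug corrupted yesterday's analytics rollup? Stop that one consumer group (the tool only resets offsets for an inactive group), reset its offset, and reprocess. The other groups are untouched. The raw events are still in the log, as long as they are within the topic's retention, so the log is the source of truth. If the consumer deduplicates by event ID as in the previous section, clear the affected dedup rows and aggregates first, or the replayed events will be skipped as duplicates.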

kafka-consumer-groups --bootstrap-server $BROKERS --group analytics \
  --reset-offsets --to-datetime 2026-06-10T00:00:00.000 --topic events.raw --execute

Adding a consumer for free. When we shipped fraud scoring, we created a new consumer group on the same events.raw topic. The producer didn't change. ~1 day instead of ~1 sprint.

Aggregate metrics

Metric	Before (REST)	After (Kafka)	Delta
Ingestion failures (incident windows)	1.9%	0.18%	−90%
Events lost on downstream outage	tens of thousands	0	eliminated
Ingestion p95	180 ms	8 ms	−96%
Cascading-failure incidents / mo	~3	0	eliminated
Retry-storm load during incidents	severe	none (202 + walk away)	eliminated
Time to add a new consumer	~1 sprint	~1 day	−80%+
Replay a bad day of processing	impossible	one command	new
Dual-write inconsistencies (outbox)	recurring	0	eliminated

Most of the win isn't throughput — it's failure isolation. The producer succeeds against a log that's almost always up, and every downstream problem became "a consumer is lagging" instead of "the pipeline is down and we're losing data."

When NOT to reach for Kafka (honest section)

Don't replace request/response with Kafka. If the caller needs the answer, that's a query — REST/gRPC is correct.
Kafka is operational weight. Brokers, partitions, schema registry, lag monitoring. For modest volume, Azure Service Bus or a DB-backed queue gives most of the decoupling with far less to operate. We used managed Kafka (Confluent Cloud) precisely because a 5-person team doesn't run brokers.
"Exactly-once" is a trap. Plan for at-least-once + idempotent consumers.
Ordering is per-partition, not global. Design your key around the ordering you actually need.
You traded synchronous errors for asynchronous lag. Watch lag, or you've hidden the problem.
On Azure, weigh Service Bus / Event Hubs first. Kafka was a deliberate choice (replay + ecosystem), not a default.
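
The per-partition ordering point is easier to reason about with the key-to-partition mapping in view. A minimal sketch of the principle only: `PartitionSketch` is an illustrative name, and FNV-1a is used here for brevity and is not the hash that real Kafka clients use.

```csharp
using System;
using System.Text;

public static class PartitionSketch
{
    // Illustrative only: real clients use their own hash functions, but the
    // principle is the same: partition = hash(key) mod partitionCount.
    public static int For(string key, int partitionCount)
    {
        if (partitionCount <= 0) throw new ArgumentOutOfRangeException(nameof(partitionCount));
        unchecked
        {
            uint hash = 2166136261;
            foreach (var b in Encoding.UTF8.GetBytes(key))
            {
                hash ^= b;
                hash *= 16777619;
            }
            return (int)(hash % (uint)partitionCount);
        }
    }
}
```

Keying by tenant ID, as the collector does, sends every event for a tenant to the same partition, so that tenant's events stay in order. Events for different tenants have no guaranteed order relative to each other, one very large tenant can make its partition hot, and changing the partition count changes the mapping for new events.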

The mental model in one line

A synchronous call couples you to the callee being up; an event in a log couples you to the log being up. For anything that's "this happened" rather than "tell me this," publishing to a durable log decouples producers from consumers in time — and most of resilience, burst absorption, and extensibility falls out of that one change.

If this saved you from a "the pipeline went down at 3 AM" page, a ❤️ or 🦄 helps it reach more backend engineers.

What's the worst cascading-failure incident you've inherited because someone Kafka-ed a query — or REST-ed an event?