DEV Community

kirandeepjassal-crypto
kirandeepjassal-crypto

Posted on • Originally published at prepstack.co.in

We Replaced REST with Kafka and Cut API Failures 90% — .NET 9 Event-Driven Architecture

TL;DR — We swapped Mattrx's inter-service REST calls for Kafka topics (multi-tenant marketing analytics SaaS, .NET 9 / ASP.NET Core, Azure SQL, ~3,200 req/sec peak). Over 8 weeks: end-to-end ingestion failures 1.9% → 0.18% (−90%), ingestion p95 180 ms → 8 ms (−96%), events lost on downstream outage tens of thousands → 0, cascading-failure incidents ~3/month → 0, time to add a new consumer ~1 sprint → ~1 day.

🎥 3-minute video walkthrough: https://youtu.be/O21rbuQdM1Y
👉 Full deep-dive (architecture, code, pre-adoption checklist, when NOT to reach for Kafka): https://prepstack.co.in/blog/replaced-rest-with-kafka-cut-failures-90-percent

The one mental shift

The reflex in a .NET shop is: service A needs something to happen in service B, so A calls B's REST endpoint and waits. That's correct for a query ("give me this tenant's plan") — A genuinely needs the answer. It's wrong for an event ("this campaign event occurred") — A doesn't need anything back, yet it's now blocked on B, C, and D being healthy.

Distinguish commands/queries (need an answer → REST) from events (fire-and-forget → log). A synchronous call makes the caller's uptime a function of the callee's uptime. An event published to a durable log decouples them in time — the producer succeeds the instant the event is durably written, consumers process whenever they can.

Once you stop treating "something happened" as a function call and start treating it as a fact appended to a log, cascading failures, retry storms, and burst overload mostly disappear — because nothing downstream is on the request's critical path anymore.

Before — synchronous REST chain (the chain that breaks)

customer site ──► Collector ──► Enrichment ──► Analytics ──► Persister
                  (waits)       (waits)       (waits)      (waits)
Enter fullscreen mode Exit fullscreen mode
// BEFORE — the collector waits on the whole chain. Any failure loses the event.
public async Task<IResult> Collect(CampaignEvent ev, CancellationToken ct)
{
    var enriched = await _enrichClient.EnrichAsync(ev, ct);  // can fail
    await _analyticsClient.RollupAsync(enriched, ct);        // can fail
    await _persistClient.SaveAsync(enriched, ct);            // can fail
    return Results.Ok();                                      // only if ALL three succeed
}
Enter fullscreen mode Exit fullscreen mode

Availability multiplies. Four 99.9%s in series = ~99.6% uptime. Analytics slow at month-end → Collector times out → customer's event is gone → customer retries → MORE load on the struggling service → retry storm.

After — Kafka log decouples the pipeline

customer site ──► Collector ──► [Kafka: events.raw]
                  returns 202        │
                  in ~8ms            ├──► [analytics group]   (independent)
                                     ├──► [persister group]   (independent)
                                     └──► [enrichment group]  (independent)
Enter fullscreen mode Exit fullscreen mode
// AFTER — produce ONE event to Kafka and return. No downstream on the critical path.
public async Task<IResult> Collect(CampaignEvent ev, CancellationToken ct)
{
    var msg = new Message<string, byte[]> {
        Key   = ev.TenantId.ToString(),                  // per-tenant ordering
        Value = JsonSerializer.SerializeToUtf8Bytes(ev),
    };
    await producer.ProduceAsync("events.raw", msg, ct);  // durable, ~ms
    return Results.Accepted();                            // 202 — collector p95: 8 ms
}
Enter fullscreen mode Exit fullscreen mode

The collector succeeds the instant the event is in the log. The customer NEVER waits on a consumer. Analytics down? Its consumer group lags and catches up later. Zero loss, zero customer impact.

The consumer — groups, manual commit, retry, DLQ

Three things matter: a consumer group (so partitions parallelize), manual offset commit (commit only after successful processing — at-least-once delivery), and a retry + dead-letter path (so a poison message can't block the partition forever).

protected override async Task ExecuteAsync(CancellationToken ct)
{
    consumer.Subscribe("events.raw");
    while (!ct.IsCancellationRequested)
    {
        var result = consumer.Consume(ct);
        var ev = JsonSerializer.Deserialize<CampaignEvent>(result.Message.Value)!;
        try {
            await rollup.ApplyAsync(ev, ct);                              // do the work
            consumer.Commit(result);                                       // ✅ commit AFTER success
        }
        catch (TransientException) when (Attempts(result) < 5) {
            await producer.ProduceAsync("events.retry", result.Message, ct);
            consumer.Commit(result);
        }
        catch (Exception ex) {
            log.LogError(ex, "Poison → DLQ at offset {Offset}", result.Offset);
            await producer.ProduceAsync("events.dlq", result.Message, ct);
            consumer.Commit(result);                                       // don't block the partition
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

Consumer lag is the one health metric. If lag climbs, the consumer can't keep up. Scale it. The producer is untouched.

The outbox — atomic produce with a DB write

There's one trap. If the same handler writes to Azure SQL and produces to Kafka, those are two systems. A crash between them loses or duplicates the event. For anything tied to a committed DB change, use the outbox pattern: write the event to an outbox table in the same SQL transaction as the business data; a background relay reads unsent rows and publishes them.

// Atomic: the campaign row AND the event commit in ONE SQL transaction.
await using var tx = await db.Database.BeginTransactionAsync(ct);
db.Campaigns.Add(campaign);
db.Outbox.Add(new OutboxMessage(
    Topic: "events.raw",
    Payload: JsonSerializer.Serialize(ev)));
await db.SaveChangesAsync(ct);
await tx.CommitAsync(ct);
// Background relay reads unsent outbox rows, produces to Kafka, marks them sent.
Enter fullscreen mode Exit fullscreen mode

Dual-write inconsistencies (the "DB says published, no event fired" bug class) → 0.

Idempotent consumers — because delivery is at-least-once

Kafka gives you at-least-once delivery. A consumer can crash after processing but before committing and reprocess on restart. So consumers must be idempotent. The simplest tool is a dedup key with a unique constraint.

public async Task ApplyAsync(CampaignEvent ev, CancellationToken ct)
{
    var firstTime = await _db.ProcessedEvents
        .ExecuteInsertIfAbsentAsync(ev.Id, ct);                 // unique index on EventId
    if (!firstTime) return;                                     // duplicate → no-op, safe

    await _db.Aggregates.IncrementAsync(ev.CampaignId, ev.Type, ct);
}
Enter fullscreen mode Exit fullscreen mode

At-least-once delivery + idempotent processing = effectively-once, and it's far simpler than chasing true exactly-once.

The two superpowers REST never gave us

Replay. A bug corrupted yesterday's analytics rollup? Reset that one consumer group's offset and reprocess. The other groups are untouched. The raw events are still in the log — the log is the source of truth.

kafka-consumer-groups --bootstrap-server $BROKERS --group analytics \
  --reset-offsets --to-datetime 2026-06-10T00:00:00.000 --topic events.raw --execute
Enter fullscreen mode Exit fullscreen mode

Adding a consumer for free. When we shipped fraud scoring, we created a new consumer group on the same events.raw topic. The producer didn't change. ~1 day instead of ~1 sprint.

Aggregate metrics

Metric Before (REST) After (Kafka) Delta
Ingestion failures (incident windows) 1.9% 0.18% −90%
Events lost on downstream outage tens of thousands 0 eliminated
Ingestion p95 180 ms 8 ms −96%
Cascading-failure incidents / mo ~3 0 eliminated
Retry-storm load during incidents severe none (202 + walk away) eliminated
Time to add a new consumer ~1 sprint ~1 day −80%+
Replay a bad day of processing impossible one command new
Dual-write inconsistencies (outbox) recurring 0 eliminated

Most of the win isn't throughput — it's failure isolation. The producer succeeds against a log that's almost always up, and every downstream problem became "a consumer is lagging" instead of "the pipeline is down and we're losing data."

When NOT to reach for Kafka (honest section)

  1. Don't replace request/response with Kafka. If the caller needs the answer, that's a query — REST/gRPC is correct.
  2. Kafka is operational weight. Brokers, partitions, schema registry, lag monitoring. For modest volume, Azure Service Bus or a DB-backed queue gives most of the decoupling with far less to operate. We used managed Kafka (Confluent Cloud) precisely because a 5-person team doesn't run brokers.
  3. "Exactly-once" is a trap. Plan for at-least-once + idempotent consumers.
  4. Ordering is per-partition, not global. Design your key around the ordering you actually need.
  5. You traded synchronous errors for asynchronous lag. Watch lag, or you've hidden the problem.
  6. On Azure, weigh Service Bus / Event Hubs first. Kafka was a deliberate choice (replay + ecosystem), not a default.

The mental model in one line

A synchronous call couples you to the callee being up; an event in a log couples you to the log being up. For anything that's "this happened" rather than "tell me this," publishing to a durable log decouples producers from consumers in time — and most of resilience, burst absorption, and extensibility falls out of that one change.


3-minute video walkthrough on YouTube: https://youtu.be/O21rbuQdM1Y

Full deep-dive with architecture diagrams, the complete outbox teardown, idempotency patterns, replay commands, and the honest list of when NOT to reach for Kafka:

👉 https://prepstack.co.in/blog/replaced-rest-with-kafka-cut-failures-90-percent


If this saved you from a "the pipeline went down at 3 AM" page, a ❤️ or 🦄 helps it reach more backend engineers.

What's the worst cascading-failure incident you've inherited because someone Kafka-ed a query — or REST-ed an event?

Top comments (0)