Why I Stopped Using SQS and Built a Kafka System From Scratch
TL;DR: I'd been using SQS and Sidekiq at work for a year. They worked fine — until I needed five services to independently react to the same event without knowing each other existed. This is the story of what broke, what I built, and what Kafka taught me that queues fundamentally cannot.
The Setup: What I Already Knew
I'm a backend engineer at a PropTech startup. Over the past year I've shipped a lot of infrastructure — event-driven pipelines using AWS SQS, background job queues with Sidekiq and Resque, dead-letter queues, circuit breakers, the works.
I thought I understood event-driven architecture.
I didn't. I understood queues. Those are different things.
The difference didn't click until I tried to build a system where one payment event needed to trigger five completely independent downstream effects — and I realised that with SQS, the architecture I had in my head was simply not possible.
That realisation sent me down a two-week rabbit hole building TxFlow — a payment event orchestrator built entirely from scratch with Kafka (specifically Redpanda), FastAPI, PostgreSQL, Redis, and a Next.js observability dashboard.
This post is what I learned.
The Problem: One Event, Five Reactions
Here's the scenario that broke my mental model.
A payment comes in. You need five things to happen:
- Fraud scoring — check if the amount is suspicious
- Wallet update — debit the sender's balance
- Notification — email the user
- Audit log — write an immutable record for compliance
- Analytics — increment payment counters
With SQS, my instinct was: fire five separate messages, one per job type. But that immediately creates problems:
- What if the wallet update succeeds but the audit log message never gets enqueued?
- What if you add a sixth service next month — you have to go back and modify the producer code
- If the analytics service is down, does it block everything? Or does that message just disappear?
- You want to replay three days of events through a new fraud model — how do you do that if the messages were deleted after consumption?
The SQS mental model is a pipe. One message goes in, one consumer picks it up, it's gone.
Kafka's mental model is a log. The message is written and stays. Any number of independent consumers can read it at their own offset, at their own pace, completely unaware of each other.
SQS:

Producer → [queue] → Consumer A
                     (message deleted)

Kafka:

Producer → [topic: payments.initiated]
               ↓           ↓           ↓           ↓           ↓
          Consumer A  Consumer B  Consumer C  Consumer D  Consumer E
           (fraud)     (wallet)    (notify)     (audit)   (analytics)
          offset: 42  offset: 38  offset: 42  offset: 41  offset: 40
Each consumer has its own offset — its own "bookmark" in the log. They move independently. Consumer B being slow doesn't affect Consumer A. Consumer D being down for an hour doesn't lose any messages — it just falls behind, and catches up when it restarts.
This is the thing that took me a week to fully internalise.
Building TxFlow: The Architecture
The system is simple by design. It runs entirely with docker compose up — no cloud, no real money, no external services. The point is to understand the concepts, not manage infrastructure.
curl → FastAPI (POST /payment)
            │
            ▼
      PostgreSQL ← outbox_events table (more on this shortly)
            │
            ▼
      Redpanda (Kafka-compatible broker)
      Topic: payments.initiated (3 partitions, 7-day retention)
            │
   ┌────────┼───────────┬────────────┬─────────────┐
   │        │           │            │             │
 fraud    wallet     notifier      audit       analytics
consumer  consumer   consumer     consumer     consumer
group:fraud group:wallet group:notify group:audit group:analytics
   │        │
   ▼        ▼
PostgreSQL  PostgreSQL
fraud_      wallets
assessments table

failures → payments.dlq topic → dlq_handler consumer
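For reference, a topic with that shape can be provisioned programmatically. Here's a sketch using confluent-kafka's admin client — the broker address is assumed, and the repo may well create the topic via docker-compose or rpk instead:

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
futures = admin.create_topics([
    NewTopic(
        "payments.initiated",
        num_partitions=3,
        replication_factor=1,   # fine for a single-broker dev setup
        config={"retention.ms": str(7 * 24 * 3600 * 1000)},   # 7-day retention
    )
])
futures["payments.initiated"].result()   # raises if creation failed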
Stack:
- Redpanda (Kafka-compatible, runs as a single Docker container — no Zookeeper)
- FastAPI + Python (producer + DLQ handler API)
- PostgreSQL (state: wallets, fraud assessments, audit log, outbox)
- Redis (deduplication keys + analytics counters)
- Next.js + TypeScript (observability dashboard)
Let me walk through the decisions that taught me the most.
Lesson 1: The Dual-Write Problem (and why the Outbox pattern exists)
When I first wrote the payment producer, I did what felt natural:
# The naive approach — DO NOT DO THIS
@app.post("/payment")
def create_payment(req: PaymentRequest):
    payment = save_to_database(req)   # Step 1: write to DB
    publish_to_kafka(payment)         # Step 2: publish to Kafka
    return {"status": "accepted"}
This looks fine. It has a critical flaw.
These are two separate systems. They are not in the same transaction. If the app crashes between step 1 and step 2 — power cut, OOM kill, deployment — the DB has the payment record, but Kafka never received the event. Five consumers are waiting. None of them will ever process this payment. Silently. No error. The data just never flows.
This is called the dual-write problem. You cannot atomically write to two systems that don't share a transaction boundary.
The solution is the Outbox pattern:
@app.post("/payment")
def create_payment(req: PaymentRequest):
    with db.transaction():
        save_to_database(req)
        # Write to outbox IN THE SAME TRANSACTION
        event = insert_outbox_event(event_id=..., payload=req, published=False)

    # Transaction committed — the outbox record is the source of truth
    # NOW publish to Kafka (outside the transaction)
    try:
        publish_to_kafka(event)
        mark_outbox_published(event.event_id)
    except KafkaException:
        pass  # Background poller will retry this

    return {"status": "accepted"}
And a background poller runs every 30 seconds:
async def poll_outbox():
    while True:
        unpublished = get_unpublished_events()   # WHERE published = FALSE
        for event in unpublished:
            publish_to_kafka(event)
            mark_outbox_published(event.event_id)
        await asyncio.sleep(30)
Now if the app crashes between the Kafka publish and the mark_outbox_published call, the poller picks it up on the next cycle. The event might be published twice — but that's intentional. "At-least-once delivery" is the guarantee. The consumers handle idempotency (more on that next).
The insight: the DB is your single source of truth. Kafka is your delivery mechanism. Never treat them as equals — make one subordinate to the other.
Lesson 2: Manual Offset Commits (the thing that actually guarantees delivery)
This was the hardest concept to get right, and it's the one most tutorials gloss over.
Kafka tracks where each consumer group is in the log via an offset — a simple incrementing integer. If consumer group "wallet" is at offset 42, it has processed messages 0 through 41 and is waiting for message 42.
By default, Kafka auto-commits offsets on a timer (every 5 seconds). This creates a dangerous window:
T=0s — message 42 polled from Kafka
T=1s — processing begins
T=3s — AUTO-COMMIT fires — offset 42 committed as "done"
T=4s — processing throws an exception
T=4s — consumer restarts
T=4s — consumer reads from offset 43
message 42 is GONE. Permanently skipped.
A payment was silently dropped because the offset was committed before the side effect completed.
With manual offset commit:
T=0s — message 42 polled
T=1s — processing begins
T=4s — processing throws an exception
T=4s — consumer restarts (offset NOT committed — still at 41)
T=4s — consumer reads message 42 AGAIN and retries it
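Turning auto-commit off is a single consumer setting. A sketch with confluent-kafka, which the repo's consumers appear to use (broker address assumed):

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed local Redpanda
    "group.id": "wallet",
    "enable.auto.commit": False,             # we commit explicitly, after processing
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments.initiated"])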
In code, the base consumer pattern looks like this:
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    event = json.loads(msg.value())

    # Check Redis dedup first
    dedup_key = f"wallet:processed:{event['event_id']}"
    if redis.exists(dedup_key):
        consumer.commit(msg)   # already processed — commit and skip
        continue

    # Try to process, with exponential backoff between attempts
    success = False
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            process_payment(event)   # DB write happens here
            success = True
            break
        except Exception as e:
            last_error = str(e)   # capture now: `e` is unbound after the except block
            wait = BASE_DELAY_MS * (2 ** attempt) / 1000
            time.sleep(wait)

    if success:
        redis.setex(dedup_key, DEDUP_TTL, "1")
    else:
        publish_to_dlq(event, consumer_group="wallet", error=last_error)

    # ALWAYS commit offset — even on failure (DLQ handles the event now)
    consumer.commit(msg)
Two things to notice:
1. Commit happens last, always. Even on failure — once we've published to DLQ, we don't want to keep reprocessing a poison pill event that will never succeed. Commit it, move on.
2. Redis deduplication is required. Delivery is at-least-once: the outbox poller can publish the same event twice, and a consumer can process an event and then crash before committing the offset. Either way, the same event can arrive at a consumer multiple times. Without Redis dedup, a payment could debit a wallet twice.
The contract is: Kafka guarantees at-least-once. Your consumer guarantees exactly-once side effects via idempotency.
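Belt and braces: you can also make the DB write itself idempotent, so even a Redis outage can't double-debit. A sketch in the same pseudocode style as above, using a hypothetical processed_events table with a unique constraint on event_id:

def process_payment(event):
    with db.transaction():
        # Unique constraint on event_id makes this a no-op for duplicates
        result = db.execute(
            "INSERT INTO processed_events (event_id) VALUES (%s) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event["event_id"],),
        )
        if result.rowcount == 0:
            return  # duplicate delivery: already applied, skip silently
        db.execute(
            "UPDATE wallets SET balance = balance - %s WHERE user_id = %s",
            (event["amount"], event["user_id"]),
        )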
Lesson 3: The Fan-Out Test (the moment Kafka clicked)
This is the test I ran that made everything concrete.
I fired a single payment event:
curl -X POST http://localhost:8000/payment \
  -H "Content-Type: application/json" \
  -d '{"user_id":"user_001","amount":500,"currency":"USD","idempotency_key":"test-fanout-001"}'
Then I checked every downstream system:
-- Wallet debited
SELECT balance FROM wallets WHERE user_id = 'user_001';
-- 9500.00 ✓
-- Fraud assessed
SELECT risk_level FROM fraud_assessments WHERE event_id = '...';
-- CLEAR ✓
-- Audit record written (immutable, append-only)
SELECT * FROM audit_log WHERE event_id = '...';
-- row present ✓
# Notification logged
docker compose logs consumer-notifier | grep test-fanout-001
# {"event":"notification_sent","user_id":"user_001","amount":500.0,...} ✓
# Analytics incremented
docker exec txflow-redis redis-cli GET analytics:total_payments
# 1 ✓
One API call. Five side effects. Five completely independent services. Zero coupling — none of the consumers know the others exist. The fraud scorer doesn't call the wallet updater. The audit logger doesn't call the notifier. They all just read from the same Kafka topic, each at their own pace.
Then I ran the replayability test.
# Stop the audit consumer
docker compose stop consumer-audit
# Fire 10 more events
./scripts/fire_bulk.sh
# Redpanda Console shows: audit consumer group lag = 10
# Every other consumer: lag = 0 (they processed normally)
# Restart audit consumer
docker compose start consumer-audit
# Watch it replay all 10 missed events from its last committed offset
docker compose logs -f consumer-audit
Every single missed event was processed. The audit log caught up completely.
This is impossible with SQS. When a message is consumed from SQS, it's deleted. There's no log to replay. There's no offset to reset. If the audit service was down while the other workers consumed the messages, those events are gone from the queue's perspective. You'd have to implement your own replay mechanism from scratch.
With Kafka, replay is not a feature you implement. It's the default behaviour. The log is the database.
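Pointing a brand-new service at history is just a new consumer group. A sketch (the group name is hypothetical):

from confluent_kafka import Consumer

# A group id Kafka has never seen starts from wherever auto.offset.reset
# points: with "earliest", it replays the whole retained log (7 days here).
replayer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed local Redpanda
    "group.id": "fraud-model-v2",            # hypothetical new group
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
replayer.subscribe(["payments.initiated"])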
Lesson 4: Partitions Are the Scaling Unit
The payments.initiated topic has 3 partitions. Every payment event is published with user_id as the partition key.
This means:
- All events for user_001 always land on partition 0 (hash of "user_001" % 3 = 0)
- All events for user_002 always land on partition 2
- Within a partition, events are strictly ordered
Why does this matter? Because the wallet consumer needs to process user_001's events in order. If two payments arrive simultaneously and get processed in the wrong order, the balance calculations could be wrong. Partitioning by user_id gives you a per-user ordering guarantee for free.
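In producer terms, the partition key is just the message key. A sketch with confluent-kafka (event values and serialisation details assumed):

from confluent_kafka import Producer
import json

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed address
event = {"event_id": "evt-123", "user_id": "user_001", "amount": 500}

# The key decides the partition: hash(key) % partition_count for this topic.
# Same user_id, same partition, so per-user ordering comes for free.
producer.produce(
    topic="payments.initiated",
    key=event["user_id"],
    value=json.dumps(event),
)
producer.flush()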
The other thing partitions determine: how many parallel consumers you can have in a group.
# Scale wallet consumer to 4 instances
docker compose up --scale consumer-wallet=4
What happens: 3 instances each own 1 partition. The 4th sits idle. You can never have more active consumers in a group than you have partitions. This is Kafka's fundamental scaling model — you scale by adding partitions, not just by adding consumers.
I ran this test with 4 instances, watched the Redpanda Console, and saw exactly one instance sitting there doing nothing. 10 minutes of reading about this is worth less than watching it happen once.
The Observability Dashboard
One thing that separates "I ran Kafka" from "I understand Kafka operations" is knowing what to watch.
The most important metric in Kafka is consumer lag — the difference between the latest message in a topic and the consumer group's current position. Lag = 0 means the consumer is keeping up. Lag = 500 means the consumer is 500 messages behind the producer.
Consumer lag is Kafka's equivalent of Sidekiq queue depth. If it's rising, something is wrong — either your consumer is too slow, there's an exception in the processing logic, or the consumer is down entirely.
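If you want the number without a dashboard, it's two client calls. A sketch checking one partition (a real check would loop over all three; broker address assumed):

from confluent_kafka import Consumer, TopicPartition

# lag = high watermark (next offset to be written) minus committed offset
c = Consumer({"bootstrap.servers": "localhost:9092", "group.id": "audit"})
tp = TopicPartition("payments.initiated", 0)

low, high = c.get_watermark_offsets(tp, timeout=5.0)
committed = c.committed([tp], timeout=5.0)[0].offset   # negative if never committed
lag = high - committed if committed >= 0 else high - low
print(f"audit group, partition 0: lag = {lag}")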
The dashboard polls the Redpanda Admin API every 5 seconds and colour-codes lag per consumer group:
- Green — lag = 0, all caught up
- Amber — lag 1–20, slight backlog
- Red — lag > 20, needs attention
I also wired in the DLQ event table and analytics counters. Running fire_bulk.sh (20 events in rapid succession) and watching the lag spike and drain in real time made the whole system feel alive in a way that just reading logs never does.
What I'd Do Differently
Use a schema registry from the start. I serialised events as plain JSON. That's fine for a POC, but in production it's a trap — change the event schema and you silently break consumers. Redpanda includes a schema registry compatible with Avro. I plan to add this as a stretch goal, but I wish I'd built it in from the start.
Add consumer lag alerting earlier. Rising lag is the first signal of a problem. I'd add a simple threshold alert (lag > 50 for > 60 seconds = log a structured alert) as part of the base consumer, not as an afterthought.
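A minimal version of that alert, as a sketch that could sit in whatever loop already polls the lag (threshold values from above; everything else assumed):

import json
import time

LAG_THRESHOLD = 50
SUSTAINED_SECS = 60
breach_started: dict[str, float] = {}   # group -> when lag first crossed the line

def check_lag(group: str, lag: int) -> None:
    # Only alert when lag stays high for a sustained window, so a burst
    # of events followed by a quick drain doesn't fire a false alarm.
    if lag <= LAG_THRESHOLD:
        breach_started.pop(group, None)
        return
    started = breach_started.setdefault(group, time.time())
    if time.time() - started >= SUSTAINED_SECS:
        print(json.dumps({"alert": "consumer_lag_high", "group": group, "lag": lag}))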
Test the poison pill scenario deliberately. A poison pill is a message that will never succeed — malformed data, a downstream service that's permanently broken for that record type. Without handling it explicitly, a poison pill will cause your consumer to retry forever, never advancing its offset, completely blocking that partition. I added DLQ handling but I should've stress-tested it earlier.
SQS vs Kafka — When to Use Which
I want to be clear: this isn't "Kafka is better than SQS." They solve different problems.
Use SQS when:
- You have one consumer per job type
- Messages can be deleted after processing
- You don't need replayability
- You're already on AWS and want managed infrastructure
- The scale is modest and operational simplicity matters
Use Kafka when:
- Multiple independent consumers need to react to the same event
- You need replayability (new service, bug fix, backfill)
- Event ordering within a key matters
- You want to decouple producers from consumers at an architectural level
- You're building something that will grow to high throughput
The payment system I described — where fraud scoring, wallet updates, notifications, audit logging, and analytics all react to the same event independently — is a textbook Kafka use case. SQS would work, but you'd be fighting the tool.
What Building This Taught Me About Backend Engineering
The deeper lesson isn't about Kafka specifically. It's about what "event-driven architecture" actually means.
Before this project, I used the phrase "event-driven" to mean "I use a message queue." That's not wrong, but it's incomplete. True event-driven architecture means the event is the first-class citizen. Services don't call each other — they react to facts that have already been recorded. The payment happened. The event is a statement of that fact. Any service that cares about it can react. Any service that doesn't, ignores it. Adding a new service next month requires touching zero existing code.
The log is the system. Everything else is a view.
That's the mental model shift that Kafka forces, and it's worth building a project from scratch just to internalise it.
The Code
Everything is on GitHub: github.com/Deonkar/txFlow_build_tasks
It runs with a single docker compose up. Fire a payment with ./scripts/fire_event.sh. Watch five things happen simultaneously.
If you're coming from SQS or RabbitMQ or Sidekiq, the thing to run first is the replayability test — stop one consumer, fire 10 events, restart it, watch it catch up. That 30-second test will reframe how you think about message processing.
If you found this useful or have questions about any of the implementation decisions, drop a comment. I'm particularly interested in hearing from people who've run Kafka at production scale — there's definitely more to learn about partition rebalancing, exactly-once semantics, and schema evolution.
Tags: kafka backend python systemdesign distributedsystems