TL;DR: We migrated 10+ microservices from direct HTTP calls to Kafka event-driven communication. Reliability improved massively but the migration was harder than expected. Here are the real lessons including the mistakes.
Our system started as a monolith. Then we split it into microservices. The services talked to each other using direct HTTP calls: Service A would POST to Service B, which would POST to Service C. It worked fine when we had 3 services.
Then we had 10.
The Day Everything Cascaded
One Tuesday morning our notification service crashed because of a memory leak. No big deal, right? Restart it and move on.
But the order service was calling the notification service directly during checkout. While the notification service was down, the order endpoint timed out, and users could not place orders. The billing service was also calling the notification service to confirm payment receipts, so billing started failing too.
One crashed service took down three other services because they were all directly dependent on it.
That was the day we decided to move to event-driven architecture.
How We Set It Up
The concept is simple. Instead of Service A calling Service B directly, Service A publishes an event to Kafka. Service B listens for that event and processes it on its own schedule.
```php
// Before: direct coupling. Every downstream call is synchronous,
// so any one of these services being down blocks checkout.
class OrderService {
    public function complete(Order $order): void
    {
        $order->markComplete();

        Http::post('billing-service/invoice', $order->toArray());
        Http::post('notification-service/email', $order->toArray());
        Http::post('analytics-service/track', $order->toArray());
    }
}
```
```php
// After: event-driven. The order service publishes one event;
// consumers react independently and asynchronously.
class OrderService {
    public function complete(Order $order): void
    {
        $order->markComplete();

        KafkaProducer::publish('order.completed', [
            'order_id'     => $order->id,
            'tenant_id'    => $order->tenant_id,
            'total'        => $order->total,
            'completed_at' => now()->toIso8601String(),
        ]);
    }
}
```
The order service does not know or care who listens to that event. Billing creates an invoice. Notifications send an email. Analytics tracks a metric. Each service subscribes to the event independently.
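On the consumer side, each service registers its own handler for the topic. A minimal sketch of the billing service's subscription, assuming a thin `KafkaConsumer::subscribe` wrapper around the underlying Kafka client (the wrapper name is illustrative, not our exact code):

```php
// Billing service: reacts to order.completed on its own schedule.
// If billing is down, the event simply waits in the topic until billing recovers.
KafkaConsumer::subscribe('order.completed', function (array $event) {
    Invoice::createFromOrder($event['order_id']);
});
```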
What Broke During Migration
I wish I could say the migration was smooth. It was not.
Problem 1: Event ordering. We assumed events would arrive in the order they were published. They mostly did, but Kafka only guarantees ordering within a single partition; across partitions, or with multiple consumer instances, there is no such guarantee. Under high throughput, some consumers processed events out of order: an "order.updated" event arrived before "order.created", and the consumer crashed because the order did not exist yet.
The fix was adding an event version number and having consumers check if they had already processed a newer version before applying changes.
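A sketch of that version check, assuming each event carries a monotonically increasing `version` field and we record the last applied version per order (`OrderVersion` here is an illustrative model, not our exact schema):

```php
class OrderEventConsumer {
    public function handle(array $event): void
    {
        // Last version we applied for this order; 0 if we have not seen it yet.
        $applied = OrderVersion::where('order_id', $event['order_id'])
            ->value('version') ?? 0;

        // Stale or duplicate event: a same-or-newer version was already applied.
        if ($event['version'] <= $applied) {
            return;
        }

        $this->apply($event);

        OrderVersion::updateOrCreate(
            ['order_id' => $event['order_id']],
            ['version'  => $event['version']]
        );
    }
}
```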
Problem 2: Duplicate events. Kafka guarantees at-least-once delivery. That means consumers can receive the same event twice. We had a bug where a payment was processed twice because the consumer was not idempotent.
The fix was adding a unique event ID and checking if we had already processed that ID before taking action.
```php
class InvoiceConsumer {
    public function handle(array $event): void
    {
        // Run the check and the insert in one transaction, with a unique
        // index on event_id, so a concurrent duplicate cannot slip between them.
        DB::transaction(function () use ($event) {
            if (ProcessedEvent::where('event_id', $event['id'])->exists()) {
                return; // Already handled: at-least-once delivery means retries happen.
            }

            Invoice::createFromOrder($event['order_id']);
            ProcessedEvent::create(['event_id' => $event['id']]);
        });
    }
}
```
Problem 3: Debugging was harder. With direct API calls you could trace a request from start to finish in one log. With events the flow is split across multiple services and multiple time periods. Finding out why an invoice was not created required checking logs in three different services.
We solved this by adding a correlation ID to every event. When the order service publishes an event it includes a unique request ID. Every downstream consumer includes that same ID in their logs. Now you can search for one ID and see the entire flow across all services.
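The pattern is simple on both sides. In this sketch, the `X-Request-Id` header and the `correlation_id` field name are our illustrative choices, and `Str` is Laravel's `Illuminate\Support\Str` helper:

```php
// Producer: reuse the incoming request's ID, or mint one at the edge.
KafkaProducer::publish('order.completed', [
    'order_id'       => $order->id,
    'correlation_id' => request()->header('X-Request-Id')
        ?? Str::uuid()->toString(),
]);

// Consumer: tag every log line with the same ID, so a single search
// reconstructs the whole cross-service flow.
Log::info('invoice.created', [
    'correlation_id' => $event['correlation_id'],
    'order_id'       => $event['order_id'],
]);
```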
The Patterns That Saved Us
Dead letter queue. When a consumer fails to process an event after 3 retries, the event goes to a dead-letter topic. We have a dashboard that shows failed events and lets us replay them after fixing the bug.
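A sketch of the retry-then-park logic, assuming we carry an `attempts` counter on the event itself and re-publish on failure (real setups often use separate retry topics with backoff; this only shows the shape):

```php
class ResilientConsumer {
    private const MAX_ATTEMPTS = 3;

    public function handle(array $event): void
    {
        try {
            $this->process($event);
        } catch (\Throwable $e) {
            $attempts = ($event['attempts'] ?? 0) + 1;

            if ($attempts >= self::MAX_ATTEMPTS) {
                // Out of retries: park it on the dead-letter topic
                // with the error attached, for the replay dashboard.
                KafkaProducer::publish('order.completed.dlq', array_merge($event, [
                    'error' => $e->getMessage(),
                ]));
                return;
            }

            // Re-publish with the incremented counter for another try.
            KafkaProducer::publish('order.completed', array_merge($event, [
                'attempts' => $attempts,
            ]));
        }
    }
}
```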
Schema registry. We define the structure of every event in a shared schema. If a producer tries to publish an event that does not match the schema, it fails at publish time, not at consume time. This prevented so many bugs.
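The principle fits in a few lines: validate at the producer. A stripped-down sketch (real registries such as Confluent's enforce full Avro or JSON schemas and manage versioned evolution; this only checks for required fields):

```php
class EventPublisher {
    // One entry per topic: the fields every event on it must carry.
    private const SCHEMAS = [
        'order.completed' => ['order_id', 'tenant_id', 'total', 'completed_at'],
    ];

    public static function publish(string $topic, array $payload): void
    {
        $missing = array_diff(self::SCHEMAS[$topic] ?? [], array_keys($payload));

        if ($missing !== []) {
            // Fail fast at publish time, before a bad event reaches any consumer.
            throw new \InvalidArgumentException(
                "Event on {$topic} is missing fields: " . implode(', ', $missing)
            );
        }

        KafkaProducer::publish($topic, $payload);
    }
}
```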
Consumer lag monitoring. We track how far behind each consumer is. If the notification consumer falls 10,000 events behind, we get an alert. This caught performance issues before users noticed them.
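Kafka ships a CLI that reports lag per consumer group out of the box; monitoring agents scrape the same numbers. For an ad-hoc check (broker address and group name here are examples):

```
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group notification-service
```

The LAG column in the output shows, per partition, how many events the group still has to catch up on.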
The Results
After 3 months on event-driven architecture:
- Zero cascading failures. One service going down does not affect any other service.
- We can deploy services independently without coordinating with other teams.
- Adding a new consumer takes 30 minutes, instead of requiring changes to 5 different services.
- Event replay lets us reprocess historical data when we add new features.
What I Would Do Differently
I would have implemented idempotency from day one, not after the duplicate payment bug. Every consumer should be idempotent by default.
I would have invested in better tooling earlier. A good event viewer that shows the flow of events across services would have saved weeks of debugging time.
And I would not have migrated everything at once. We tried to move all 10 services in one sprint. It should have been gradual. Start with the least critical services and work toward the most critical.
Event-driven architecture is powerful, but it adds complexity. If you have 3 services that rarely fail, direct API calls are probably fine. If you have 10+ services and reliability matters, events are worth the investment.
Have you migrated from direct API calls to event-driven architecture? What surprised you the most?
