DEV Community

refaat Al Ktifan

Posted on • Originally published at oronts.com

Event-Driven Architecture in Practice: What Actually Goes Wrong

Oronts Event-Driven Architecture

The Event Storm Nobody Expects

Event-driven architecture looks clean on a whiteboard. Service A emits an event. Service B consumes it. Loose coupling. Independent scaling. Beautiful diagrams.

Then you deploy to production. One product save triggers 8 event subscribers. Each subscriber dispatches 3 async messages. Each message handler loads the product, modifies a field, and saves. Each save triggers 8 subscribers again. Within seconds, your message queue has thousands of messages for a single product update, your workers are saturated, and your database is under lock contention.

We've operated four different event-driven systems in production. Each used a different message broker. Each had different failure patterns. This article covers what actually goes wrong and how to fix it. For broader context on how we approach system design, our architecture guide covers the methodology behind these decisions.

The Four Failure Patterns

Every event-driven system eventually encounters these:

| Pattern | What Happens | Severity |
| --- | --- | --- |
| Event storm | One action cascades into thousands of messages | Critical (can saturate infrastructure) |
| Bidirectional sync loop | System A syncs to B, B syncs back to A, infinite loop | Critical (infinite message growth) |
| Duplicate processing | Same message processed twice, creates duplicate data | High (data integrity) |
| Dead letter accumulation | Failed messages pile up with no resolution strategy | Medium (operational debt) |

1. Event Storms

The most common and most dangerous pattern. It happens when event handlers trigger new events, and those events trigger more handlers.

Product save
  -> EventSubscriber A dispatches MessageA
  -> EventSubscriber B dispatches MessageB
  -> EventSubscriber C dispatches MessageC
  -> ...8 subscribers total

WorkerA processes MessageA
  -> Modifies product, calls save()
  -> Product save triggers 8 subscribers again
  -> Each dispatches more messages

WorkerB processes MessageB (same pattern)
  -> 8 more messages

Result: 1 save -> 8 messages -> 8 saves -> 64 messages -> ...

The solution: EventSubscriberSupervisor. A process-scoped control that disables event subscribers during worker saves.

// Worker handler: disable subscribers before saving
async function handleGenerateThumbnail(message: ThumbnailMessage) {
    const product = await productService.findById(message.productId);

    subscriberSupervisor.disableAll();
    try {
        product.thumbnail = await generateThumbnail(product);
        await product.save(); // No subscribers fire, no new messages dispatched
    } finally {
        subscriberSupervisor.enableAll();
    }
}

The order matters. The save must happen inside the disabled scope. If you re-enable subscribers before saving, the save triggers the cascade.
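The article doesn't show the supervisor itself, so here is a minimal sketch of what such a process-scoped control could look like. The class shape and the `onProductSave` subscriber are illustrative assumptions, not the production implementation:

```typescript
// Hypothetical minimal supervisor: a process-scoped on/off switch for subscribers.
class EventSubscriberSupervisor {
    private disabled = false;

    disableAll(): void { this.disabled = true; }
    enableAll(): void { this.disabled = false; }
    isEnabled(): boolean { return !this.disabled; }
}

const supervisor = new EventSubscriberSupervisor();

// Every subscriber consults the supervisor before dispatching async work.
// During a worker-initiated save it stays silent, breaking the cascade.
function onProductSave(dispatch: (msg: string) => void): void {
    if (!supervisor.isEnabled()) return; // worker save: no new messages
    dispatch('generate-thumbnail');
}
```

Because the flag is process-scoped, a worker disabling subscribers never affects saves made by API processes, which still fan out normally.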

For a deeper look at how we apply this pattern specifically in Pimcore, see our Pimcore workflow design guide.

2. Bidirectional Sync Loops

When two systems sync data bidirectionally, each update from A triggers a sync to B, which triggers a sync back to A. Without loop prevention, this runs forever.

System A (Commerce)         System B (Operations)
     │                           │
     │  Customer updated         │
     │ ────────────────────────▶ │
     │                           │  Sync received, save customer
     │                           │  Customer updated event fires
     │  Sync received            │
     │ ◀──────────────────────── │
     │  Save customer            │
     │  Customer updated event   │
     │ ────────────────────────▶ │
     │           ...forever...   │

The solution: source tracking. Every message carries an x-source header that identifies the originating system. When a system receives a message from itself (via the other system), it ignores it.

// When publishing a sync message
async function publishCustomerSync(customer: Customer, source: string) {
    await messageQueue.publish('customer.updated', {
        customerId: customer.id,
        data: customer,
        source: source,  // "commerce" or "operations"
    });
}

// When consuming a sync message
async function handleCustomerSync(message: CustomerSyncMessage) {
    // Ignore messages that originated from this system
    if (message.source === THIS_SYSTEM_ID) {
        logger.debug('Ignoring self-originated sync', { customerId: message.customerId });
        return; // ACK the message, don't process
    }

    const customer = await customerService.update(message.customerId, message.data);

    // When saving, mark the source so downstream events carry it
    await publishCustomerSync(customer, THIS_SYSTEM_ID);
}

An alternative approach: message deduplication by content hash. Hash the message payload and check if the same hash was processed recently. If the data hasn't changed, the sync is a no-op.
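A content-hash check could be sketched as follows. The in-memory TTL cache and function name are assumptions for illustration; a real system would back this with Redis or a database so the cache survives restarts:

```typescript
import { createHash } from 'crypto';

// Hypothetical sketch: skip a sync when an identical payload was seen recently.
const recentHashes = new Map<string, number>(); // payload hash -> last-seen timestamp
const TTL_MS = 60_000;

function isDuplicateSync(payload: unknown, now: number = Date.now()): boolean {
    const hash = createHash('sha256')
        .update(JSON.stringify(payload))
        .digest('hex');
    const seen = recentHashes.get(hash);
    if (seen !== undefined && now - seen < TTL_MS) {
        return true; // identical payload within the TTL: the sync is a no-op
    }
    recentHashes.set(hash, now);
    return false;
}
```

Note that `JSON.stringify` is key-order sensitive, so payloads should be built consistently (or canonicalized) before hashing.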

3. Duplicate Processing

Network failures, worker restarts, and at-least-once delivery guarantees mean the same message can be delivered more than once. Without idempotency, you get duplicate orders, duplicate emails, duplicate database records.

The solution: idempotency stores with business-key deduplication.

// Idempotency store with business key
class IdempotencyStore {
    async checkAndAcquire(key: string, scope: string): Promise<boolean> {
        try {
            await this.db.insert('idempotency_keys', {
                key,
                scope,
                status: 'PROCESSING',
                created_at: new Date(),
                expires_at: new Date(Date.now() + 24 * 60 * 60 * 1000), // 24h TTL
            });
            return true; // Acquired, proceed with processing
        } catch (error) {
            if (isDuplicateKeyError(error)) {
                return false; // Already processed, skip
            }
            throw error;
        }
    }

    async complete(key: string, scope: string): Promise<void> {
        await this.db.update('idempotency_keys',
            { status: 'COMPLETED', completed_at: new Date() },
            { key, scope }
        );
    }

    async fail(key: string, scope: string): Promise<void> {
        // Release the key so a retried delivery can acquire it again
        await this.db.delete('idempotency_keys', { key, scope });
    }
}

// Usage in a worker
async function handleNotification(message: NotificationMessage) {
    const dedupeKey = `${message.recipientEmail}:${message.category}:${message.entityRef}:${todayBucket()}`;

    const acquired = await idempotencyStore.checkAndAcquire(dedupeKey, 'notification');
    if (!acquired) {
        logger.info('Duplicate notification skipped', { dedupeKey });
        return; // ACK the message
    }

    try {
        await emailService.send(message);
        await idempotencyStore.complete(dedupeKey, 'notification');
    } catch (error) {
        await idempotencyStore.fail(dedupeKey, 'notification');
        throw error; // NACK, let the queue retry
    }
}

The dedupe key must be business-meaningful. For notifications: recipient + category + entity + day bucket (prevents sending the same notification twice in one day but allows it the next day). For orders: tenant + product + date. For imports: source record ID + import batch ID.
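The key-building helpers could be as simple as the sketch below. The function names (`dayBucket`, `notificationKey`, `orderKey`) are illustrative, not part of the article's codebase:

```typescript
// Hypothetical business-key builders for the dedupe examples above.
function dayBucket(d: Date = new Date()): string {
    return d.toISOString().slice(0, 10); // UTC calendar day, e.g. "2025-01-31"
}

// Notifications: recipient + category + entity + day bucket
function notificationKey(recipient: string, category: string, entityRef: string, d?: Date): string {
    return `${recipient}:${category}:${entityRef}:${dayBucket(d)}`;
}

// Orders: tenant + product + date
function orderKey(tenantId: string, productId: string, d?: Date): string {
    return `${tenantId}:${productId}:${dayBucket(d)}`;
}
```

The day bucket is what makes the key time-scoped: the same notification is blocked within a day but allowed again the next.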

4. Dead Letter Handling

Messages that fail repeatedly end up in a dead letter queue. Most teams set up the DLQ and then ignore it. Dead letters accumulate. Months later, someone discovers thousands of unprocessed messages.

// Dead letter handler with classification
async function processDeadLetter(message: DeadLetterMessage) {
    const failureType = classifyFailure(message.error);

    switch (failureType) {
        case 'TRANSIENT':
            // Network timeout, service temporarily down
            // Requeue with exponential backoff
            await requeue(message, { delay: calculateBackoff(message.retryCount) });
            break;

        case 'PERMANENT':
            // Invalid data, schema mismatch, business rule violation
            // Log, alert, archive. Do not retry.
            await archiveDeadLetter(message);
            await alertOps('Permanent failure in dead letter', message);
            break;

        case 'POISON':
            // Message itself causes crashes (malformed payload)
            // Archive immediately, never retry
            await archiveDeadLetter(message);
            await alertOps('Poison message detected', message);
            break;
    }
}

function classifyFailure(error: string): 'TRANSIENT' | 'PERMANENT' | 'POISON' {
    if (error.includes('ECONNREFUSED') || error.includes('timeout')) return 'TRANSIENT';
    if (error.includes('SyntaxError') || error.includes('Cannot parse')) return 'POISON';
    return 'PERMANENT';
}

The key insight: not all dead letters are the same. Transient failures should be retried. Permanent failures should be archived and investigated. Poison messages should be quarantined immediately.
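The `calculateBackoff` helper referenced in the transient branch could be a capped exponential. The base delay and cap below are illustrative assumptions:

```typescript
// Hypothetical capped exponential backoff: 1min, 2min, 4min, ... up to 1 hour.
function calculateBackoff(
    retryCount: number,
    baseMs: number = 60_000,       // first retry after 1 minute (assumed)
    maxMs: number = 3_600_000      // never wait longer than 1 hour (assumed)
): number {
    return Math.min(baseMs * 2 ** retryCount, maxMs);
}
```

The cap matters: without it, a message that has failed a dozen times would be scheduled days into the future, which is effectively the same as losing it.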

For how we handle failure patterns in AI systems specifically, see our AI failure modes guide.

Choosing a Message Broker

We've used four different brokers in production. Each has a distinct sweet spot.

| Broker | Best For | Not Great For | Deployment Complexity |
| --- | --- | --- | --- |
| Kafka | High-throughput event streaming, event sourcing, replay | Simple job queues, low-volume systems | High (ZooKeeper/KRaft, partitions, consumer groups) |
| RabbitMQ | Routing, dead letters, priority queues, complex topologies | Very high throughput (millions/sec), event replay | Medium (clustering, HA policies) |
| BullMQ (Redis) | Job queues, delayed jobs, rate limiting, simple setups | Complex routing, message replay, multi-consumer patterns | Low (just Redis) |
| Symfony Messenger | PHP/Symfony apps, Pimcore, simple async dispatch | Non-PHP ecosystems, complex broker features | Low (uses Doctrine, AMQP, or Redis as transport) |

When to Use Kafka

  • You need event replay (consumers can re-read from any offset)
  • You have multiple consumer groups processing the same events differently
  • Throughput exceeds 100K messages per second
  • You're doing event sourcing or building a change data capture pipeline
  • You need guaranteed ordering within a partition

When to Use RabbitMQ

  • You need complex routing (topic exchanges, headers-based routing)
  • Dead letter queues with automatic retry policies are important
  • Message priority matters (urgent messages processed first)
  • You need request-reply patterns (RPC over messages)
  • Your system has multiple services in different languages

When to Use BullMQ

  • You're in a Node.js/TypeScript ecosystem (Vendure, NestJS)
  • Job scheduling with delays and cron patterns
  • Rate limiting per queue or per job type
  • You already have Redis for caching and sessions
  • Your system is small to medium scale (under 10K messages/sec)

When to Use Symfony Messenger

  • You're in a PHP/Symfony/Pimcore ecosystem
  • You need simple async dispatch for background tasks
  • Worker infrastructure follows Symfony conventions (supervisord)
  • Transport flexibility (switch between Doctrine, AMQP, Redis without code changes)

For our Pimcore projects, we use Symfony Messenger with RabbitMQ transport. For Vendure projects, we use BullMQ. For high-throughput data ingestion platforms, we use Kafka. The broker choice follows the ecosystem, not the other way around.

Message Ordering

Most teams assume message ordering matters. Usually it doesn't.

| Scenario | Ordering Needed? | Why |
| --- | --- | --- |
| Send notification email | No | Emails are inherently unordered |
| Generate thumbnail | No | Latest version wins |
| Update search index | No | Latest state wins (eventual consistency) |
| Process payment steps | Yes | Charge must happen before capture |
| Order state transitions | Yes | Can't ship before payment confirmed |
| Event sourcing replay | Yes | Events must replay in causal order |

When ordering matters, use FIFO queues (SQS FIFO, Kafka partitions keyed by entity ID, RabbitMQ with single consumer). When it doesn't, parallel processing is faster and simpler.

// Kafka: ordering within a partition (keyed by product ID)
await producer.send({
    topic: 'product-updates',
    messages: [{
        key: productId,  // All updates for same product go to same partition
        value: JSON.stringify(event),
    }],
});

// BullMQ: no ordering needed for thumbnails
await thumbnailQueue.add('generate', { productId }, {
    removeOnComplete: true,
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
});

Retry Strategies

Not all retries are equal. The strategy depends on the failure type.

| Strategy | When to Use | Example |
| --- | --- | --- |
| Immediate retry | Transient glitch (race condition) | Optimistic lock failure |
| Exponential backoff | Service temporarily down | External API timeout |
| Fixed delay | Rate limiting | API returns 429 |
| No retry | Permanent failure | Invalid payload, business rule violation |

// BullMQ retry configuration
const queue = new Queue('notifications', {
    defaultJobOptions: {
        attempts: 5,
        backoff: {
            type: 'exponential',
            delay: 60000, // 1min, 2min, 4min, 8min, 16min
        },
        removeOnComplete: { count: 1000 },
        removeOnFail: false, // Keep for dead letter analysis
    },
});

After all retries are exhausted, the message moves to a dead letter queue. Never retry infinitely. Set a maximum attempt count and handle the failure explicitly.

Monitoring Event-Driven Systems

The hardest part of event-driven systems is debugging. When something fails, the error might be 5 services and 12 messages away from the original cause.

Correlation IDs

Every message carries a correlation ID that links it to the original action:

// Original API request generates correlation ID
const correlationId = generateUUID();

// Every message in the chain carries it
await queue.add('process-order', {
    orderId: order.id,
    correlationId,
});

// Workers pass it to downstream messages
async function handleProcessOrder(job) {
    const { orderId, correlationId } = job.data;

    // ... process order ...

    // Downstream messages carry the same correlation ID
    await notificationQueue.add('send-confirmation', {
        orderId,
        correlationId, // Same ID, traceable back to the original request
    });
}

When debugging, query all logs and events by correlation ID to see the full chain. For more on distributed tracing patterns, see our AI observability guide.

Queue Health Metrics

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Queue depth | < 100 | 100-1000 | > 1000 |
| Processing time (p95) | < 5s | 5-30s | > 30s |
| Failure rate | < 1% | 1-5% | > 5% |
| Dead letter count | 0 | 1-10 | > 10 |
| Consumer lag (Kafka) | < 1000 | 1K-10K | > 10K |

Alert on queue depth trends, not absolute values. A queue depth of 500 that's been stable for hours is fine. A queue depth of 50 that was 0 an hour ago and is growing is a problem.
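A trend check over recent depth samples could be sketched like this. The function name, sample window, and threshold are assumptions for illustration:

```typescript
// Hypothetical trend check: alert on sustained growth, not on absolute depth.
// depthSamples holds queue depth readings at a fixed interval, oldest first.
function isGrowingSteadily(depthSamples: number[], minGrowthPerSample: number = 1): boolean {
    if (depthSamples.length < 3) return false; // not enough data to call it a trend
    for (let i = 1; i < depthSamples.length; i++) {
        // Any interval without growth breaks the trend
        if (depthSamples[i] - depthSamples[i - 1] < minGrowthPerSample) return false;
    }
    return true;
}
```

A stable depth of 500 never fires; a depth climbing from 0 through 10, 25, 50 does.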

Common Pitfalls

  1. No subscriber control during worker saves. The single biggest source of event storms. Worker saves must not trigger the same event subscribers that dispatch work to workers.

  2. No source tracking on bidirectional sync. Without x-source headers, bidirectional sync between two systems loops forever.

  3. Using message ID for idempotency. Message IDs change on redelivery in some brokers. Use business keys (entity ID + action + time bucket) instead.

  4. Infinite retries. Set a maximum retry count. After exhaustion, move to dead letter queue. Never retry forever.

  5. Ignoring dead letter queues. Dead letters are operational debt. Classify them (transient, permanent, poison) and handle each type differently.

  6. Assuming message ordering. Most workloads don't need ordering. Parallel processing is faster. Only use FIFO when state transitions or causal ordering require it.

  7. Same retry strategy for all failures. A timeout needs exponential backoff. An invalid payload needs zero retries. A rate limit needs fixed delay.

  8. No correlation IDs. Without them, debugging a failure chain across 5 services is impossible.

Key Takeaways

  • Event storms are the most dangerous failure mode. One uncontrolled save cascade can saturate your entire infrastructure. The EventSubscriberSupervisor is not optional.

  • Bidirectional sync needs source tracking. Every message carries the originating system ID. When you receive a message from yourself (via the other system), ignore it.

  • Idempotency uses business keys, not message IDs. Recipient + category + entity + day bucket for notifications. Tenant + product + date for orders. The key must be business-meaningful.

  • Dead letters need classification. Transient failures get retried. Permanent failures get archived. Poison messages get quarantined. Never ignore the dead letter queue.

  • The broker follows the ecosystem. Kafka for high-throughput streaming, RabbitMQ for complex routing, BullMQ for Node.js job queues, Symfony Messenger for PHP. Don't pick a broker and then build around it.

  • Most workloads don't need ordering. Parallel processing is simpler and faster. Use FIFO only when state transitions demand it.
