DEV Community

refaat Al Ktifan

Posted on • Originally published at oronts.com

Event-Driven Architecture in Practice: What Actually Goes Wrong

Oronts Event-Driven Architecture

The Event Storm Nobody Expects

Event-driven architecture looks clean on a whiteboard. Service A emits an event. Service B consumes it. Loose coupling. Independent scaling. Beautiful diagrams.

Then you deploy to production. One product save triggers 8 event subscribers. Each subscriber dispatches 3 async messages. Each message handler loads the product, modifies a field, and saves. Each save triggers 8 subscribers again. Within seconds, your message queue has thousands of messages for a single product update, your workers are saturated, and your database is under lock contention.

We've operated four different event-driven systems in production. Each used a different message broker. Each had different failure patterns. This article covers what actually goes wrong and how to fix it. For broader context on how we approach system design, our architecture guide covers the methodology behind these decisions.

The Four Failure Patterns

Every event-driven system eventually encounters these:

| Pattern | What Happens | Severity |
| --- | --- | --- |
| Event storm | One action cascades into thousands of messages | Critical (can saturate infrastructure) |
| Bidirectional sync loop | System A syncs to B, B syncs back to A, infinite loop | Critical (infinite message growth) |
| Duplicate processing | Same message processed twice, creates duplicate data | High (data integrity) |
| Dead letter accumulation | Failed messages pile up with no resolution strategy | Medium (operational debt) |

1. Event Storms

The most common and most dangerous pattern. It happens when event handlers trigger new events, and those events trigger more handlers.

Product save
  -> EventSubscriber A dispatches MessageA
  -> EventSubscriber B dispatches MessageB
  -> EventSubscriber C dispatches MessageC
  -> ...8 subscribers total

WorkerA processes MessageA
  -> Modifies product, calls save()
  -> Product save triggers 8 subscribers again
  -> Each dispatches more messages

WorkerB processes MessageB (same pattern)
  -> 8 more messages

Result: 1 save -> 8 messages -> 8 saves -> 64 messages -> ...

The solution: EventSubscriberSupervisor. A process-scoped control that disables event subscribers during worker saves.

// Worker handler: disable subscribers before saving
async function handleGenerateThumbnail(message: ThumbnailMessage) {
    const product = await productService.findById(message.productId);

    subscriberSupervisor.disableAll();
    try {
        product.thumbnail = await generateThumbnail(product);
        await product.save(); // No subscribers fire, no new messages dispatched
    } finally {
        subscriberSupervisor.enableAll();
    }
}

The order matters. The save must happen inside the disabled scope. If you re-enable subscribers before saving, the save triggers the cascade.
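The article doesn't show the supervisor itself, so here is a minimal sketch of what such a process-scoped control could look like. The class shape and the `onProductSave` subscriber are illustrative assumptions, not the production implementation:

```typescript
// Hypothetical minimal supervisor: a process-scoped on/off switch for subscribers.
class EventSubscriberSupervisor {
    private disabled = false;

    disableAll(): void { this.disabled = true; }
    enableAll(): void { this.disabled = false; }
    isEnabled(): boolean { return !this.disabled; }
}

const supervisor = new EventSubscriberSupervisor();

// Every subscriber consults the supervisor before dispatching async work.
// During a worker-initiated save it stays silent, breaking the cascade.
function onProductSave(dispatch: (msg: string) => void): void {
    if (!supervisor.isEnabled()) return; // worker save: no new messages
    dispatch('generate-thumbnail');
}
```

Because the flag is process-scoped, a worker disabling subscribers never affects saves made by API processes, which still fan out normally.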

For a deeper look at how we apply this pattern specifically in Pimcore, see our Pimcore workflow design guide.

2. Bidirectional Sync Loops

When two systems sync data bidirectionally, each update from A triggers a sync to B, which triggers a sync back to A. Without loop prevention, this runs forever.

System A (Commerce)         System B (Operations)
     │                           │
     │  Customer updated         │
     │ ────────────────────────▶ │
     │                           │  Sync received, save customer
     │                           │  Customer updated event fires
     │  Sync received            │
     │ ◀──────────────────────── │
     │  Save customer            │
     │  Customer updated event   │
     │ ────────────────────────▶ │
     │           ...forever...   │

The solution: source tracking. Every message carries an x-source header that identifies the originating system. When a system receives a message from itself (via the other system), it ignores it.

// When publishing a sync message
async function publishCustomerSync(customer: Customer, source: string) {
    await messageQueue.publish('customer.updated', {
        customerId: customer.id,
        data: customer,
        source: source,  // "commerce" or "operations"
    });
}

// When consuming a sync message
async function handleCustomerSync(message: CustomerSyncMessage) {
    // Ignore messages that originated from this system
    if (message.source === THIS_SYSTEM_ID) {
        logger.debug('Ignoring self-originated sync', { customerId: message.customerId });
        return; // ACK the message, don't process
    }

    const customer = await customerService.update(message.customerId, message.data);

    // When saving, mark the source so downstream events carry it
    await publishCustomerSync(customer, THIS_SYSTEM_ID);
}

An alternative approach: message deduplication by content hash. Hash the message payload and check if the same hash was processed recently. If the data hasn't changed, the sync is a no-op.
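A content-hash check could be sketched as follows. The in-memory TTL cache and function name are assumptions for illustration; a real system would back this with Redis or a database so the cache survives restarts:

```typescript
import { createHash } from 'crypto';

// Hypothetical sketch: skip a sync when an identical payload was seen recently.
const recentHashes = new Map<string, number>(); // payload hash -> last-seen timestamp
const TTL_MS = 60_000;

function isDuplicateSync(payload: unknown, now: number = Date.now()): boolean {
    const hash = createHash('sha256')
        .update(JSON.stringify(payload))
        .digest('hex');
    const seen = recentHashes.get(hash);
    if (seen !== undefined && now - seen < TTL_MS) {
        return true; // identical payload within the TTL: the sync is a no-op
    }
    recentHashes.set(hash, now);
    return false;
}
```

Note that `JSON.stringify` is key-order sensitive, so payloads should be built consistently (or canonicalized) before hashing.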

3. Duplicate Processing

Network failures, worker restarts, and at-least-once delivery guarantees mean the same message can be delivered more than once. Without idempotency, you get duplicate orders, duplicate emails, duplicate database records.

The solution: idempotency stores with business-key deduplication.

// Idempotency store with business key
class IdempotencyStore {
    async checkAndAcquire(key: string, scope: string): Promise<boolean> {
        try {
            await this.db.insert('idempotency_keys', {
                key,
                scope,
                status: 'PROCESSING',
                created_at: new Date(),
                expires_at: new Date(Date.now() + 24 * 60 * 60 * 1000), // 24h TTL
            });
            return true; // Acquired, proceed with processing
        } catch (error) {
            if (isDuplicateKeyError(error)) {
                return false; // Already processed, skip
            }
            throw error;
        }
    }

    async complete(key: string, scope: string): Promise<void> {
        await this.db.update('idempotency_keys',
            { status: 'COMPLETED', completed_at: new Date() },
            { key, scope }
        );
    }

    async fail(key: string, scope: string): Promise<void> {
        // Release the key so a retried delivery can acquire it again
        await this.db.delete('idempotency_keys', { key, scope });
    }
}

// Usage in a worker
async function handleNotification(message: NotificationMessage) {
    const dedupeKey = `${message.recipientEmail}:${message.category}:${message.entityRef}:${todayBucket()}`;

    const acquired = await idempotencyStore.checkAndAcquire(dedupeKey, 'notification');
    if (!acquired) {
        logger.info('Duplicate notification skipped', { dedupeKey });
        return; // ACK the message
    }

    try {
        await emailService.send(message);
        await idempotencyStore.complete(dedupeKey, 'notification');
    } catch (error) {
        await idempotencyStore.fail(dedupeKey, 'notification');
        throw error; // NACK, let the queue retry
    }
}

The dedupe key must be business-meaningful. For notifications: recipient + category + entity + day bucket (prevents sending the same notification twice in one day but allows it the next day). For orders: tenant + product + date. For imports: source record ID + import batch ID.
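The key-building helpers could be as simple as the sketch below. The function names (`dayBucket`, `notificationKey`, `orderKey`) are illustrative, not part of the article's codebase:

```typescript
// Hypothetical business-key builders for the dedupe examples above.
function dayBucket(d: Date = new Date()): string {
    return d.toISOString().slice(0, 10); // UTC calendar day, e.g. "2025-01-31"
}

// Notifications: recipient + category + entity + day bucket
function notificationKey(recipient: string, category: string, entityRef: string, d?: Date): string {
    return `${recipient}:${category}:${entityRef}:${dayBucket(d)}`;
}

// Orders: tenant + product + date
function orderKey(tenantId: string, productId: string, d?: Date): string {
    return `${tenantId}:${productId}:${dayBucket(d)}`;
}
```

The day bucket is what makes the key time-scoped: the same notification is blocked within a day but allowed again the next.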

4. Dead Letter Handling

Messages that fail repeatedly end up in a dead letter queue. Most teams set up the DLQ and then ignore it. Dead letters accumulate. Months later, someone discovers thousands of unprocessed messages.

// Dead letter handler with classification
async function processDeadLetter(message: DeadLetterMessage) {
    const failureType = classifyFailure(message.error);

    switch (failureType) {
        case 'TRANSIENT':
            // Network timeout, service temporarily down
            // Requeue with exponential backoff
            await requeue(message, { delay: calculateBackoff(message.retryCount) });
            break;

        case 'PERMANENT':
            // Invalid data, schema mismatch, business rule violation
            // Log, alert, archive. Do not retry.
            await archiveDeadLetter(message);
            await alertOps('Permanent failure in dead letter', message);
            break;

        case 'POISON':
            // Message itself causes crashes (malformed payload)
            // Archive immediately, never retry
            await archiveDeadLetter(message);
            await alertOps('Poison message detected', message);
            break;
    }
}

function classifyFailure(error: string): 'TRANSIENT' | 'PERMANENT' | 'POISON' {
    if (error.includes('ECONNREFUSED') || error.includes('timeout')) return 'TRANSIENT';
    if (error.includes('SyntaxError') || error.includes('Cannot parse')) return 'POISON';
    return 'PERMANENT';
}

The key insight: not all dead letters are the same. Transient failures should be retried. Permanent failures should be archived and investigated. Poison messages should be quarantined immediately.
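The `calculateBackoff` helper referenced in the transient branch could be a capped exponential. The base delay and cap below are illustrative assumptions:

```typescript
// Hypothetical capped exponential backoff: 1min, 2min, 4min, ... up to 1 hour.
function calculateBackoff(
    retryCount: number,
    baseMs: number = 60_000,       // first retry after 1 minute (assumed)
    maxMs: number = 3_600_000      // never wait longer than 1 hour (assumed)
): number {
    return Math.min(baseMs * 2 ** retryCount, maxMs);
}
```

The cap matters: without it, a message that has failed a dozen times would be scheduled days into the future, which is effectively the same as losing it.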

For how we handle failure patterns in AI systems specifically, see our AI failure modes guide.

Choosing a Message Broker

We've used four different brokers in production. Each has a distinct sweet spot.

| Broker | Best For | Not Great For | Deployment Complexity |
| --- | --- | --- | --- |
| Kafka | High-throughput event streaming, event sourcing, replay | Simple job queues, low-volume systems | High (ZooKeeper/KRaft, partitions, consumer groups) |
| RabbitMQ | Routing, dead letters, priority queues, complex topologies | Very high throughput (millions/sec), event replay | Medium (clustering, HA policies) |
| BullMQ (Redis) | Job queues, delayed jobs, rate limiting, simple setups | Complex routing, message replay, multi-consumer patterns | Low (just Redis) |
| Symfony Messenger | PHP/Symfony apps, Pimcore, simple async dispatch | Non-PHP ecosystems, complex broker features | Low (uses Doctrine, AMQP, or Redis as transport) |

When to Use Kafka

  • You need event replay (consumers can re-read from any offset)
  • You have multiple consumer groups processing the same events differently
  • Throughput exceeds 100K messages per second
  • You're doing event sourcing or building a change data capture pipeline
  • You need guaranteed ordering within a partition

When to Use RabbitMQ

  • You need complex routing (topic exchanges, headers-based routing)
  • Dead letter queues with automatic retry policies are important
  • Message priority matters (urgent messages processed first)
  • You need request-reply patterns (RPC over messages)
  • Your system has multiple services in different languages

When to Use BullMQ

  • You're in a Node.js/TypeScript ecosystem (Vendure, NestJS)
  • Job scheduling with delays and cron patterns
  • Rate limiting per queue or per job type
  • You already have Redis for caching and sessions
  • Your system is small to medium scale (under 10K messages/sec)

When to Use Symfony Messenger

  • You're in a PHP/Symfony/Pimcore ecosystem
  • You need simple async dispatch for background tasks
  • Worker infrastructure follows Symfony conventions (supervisord)
  • Transport flexibility (switch between Doctrine, AMQP, Redis without code changes)

For our Pimcore projects, we use Symfony Messenger with RabbitMQ transport. For Vendure projects, we use BullMQ. For high-throughput data ingestion platforms, we use Kafka. The broker choice follows the ecosystem, not the other way around.

Message Ordering

Most teams assume message ordering matters. Usually it doesn't.

| Scenario | Ordering Needed? | Why |
| --- | --- | --- |
| Send notification email | No | Emails are inherently unordered |
| Generate thumbnail | No | Latest version wins |
| Update search index | No | Latest state wins (eventual consistency) |
| Process payment steps | Yes | Charge must happen before capture |
| Order state transitions | Yes | Can't ship before payment confirmed |
| Event sourcing replay | Yes | Events must replay in causal order |

When ordering matters, use FIFO queues (SQS FIFO, Kafka partitions keyed by entity ID, RabbitMQ with single consumer). When it doesn't, parallel processing is faster and simpler.

// Kafka: ordering within a partition (keyed by product ID)
await producer.send({
    topic: 'product-updates',
    messages: [{
        key: productId,  // All updates for same product go to same partition
        value: JSON.stringify(event),
    }],
});

// BullMQ: no ordering needed for thumbnails
await thumbnailQueue.add('generate', { productId }, {
    removeOnComplete: true,
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
});

Retry Strategies

Not all retries are equal. The strategy depends on the failure type.

| Strategy | When to Use | Example |
| --- | --- | --- |
| Immediate retry | Transient glitch (race condition) | Optimistic lock failure |
| Exponential backoff | Service temporarily down | External API timeout |
| Fixed delay | Rate limiting | API returns 429 |
| No retry | Permanent failure | Invalid payload, business rule violation |

// BullMQ retry configuration
const queue = new Queue('notifications', {
    defaultJobOptions: {
        attempts: 5,
        backoff: {
            type: 'exponential',
            delay: 60000, // 1min, 2min, 4min, 8min, 16min
        },
        removeOnComplete: { count: 1000 },
        removeOnFail: false, // Keep for dead letter analysis
    },
});

After all retries are exhausted, the message moves to a dead letter queue. Never retry infinitely. Set a maximum attempt count and handle the failure explicitly.

Monitoring Event-Driven Systems

The hardest part of event-driven systems is debugging. When something fails, the error might be 5 services and 12 messages away from the original cause.

Correlation IDs

Every message carries a correlation ID that links it to the original action:

// Original API request generates correlation ID
const correlationId = generateUUID();

// Every message in the chain carries it
await queue.add('process-order', {
    orderId: order.id,
    correlationId,
});

// Workers pass it to downstream messages
async function handleProcessOrder(job) {
    const { orderId, correlationId } = job.data;

    // ... process order ...

    // Downstream messages carry the same correlation ID
    await notificationQueue.add('send-confirmation', {
        orderId,
        correlationId, // Same ID, traceable back to the original request
    });
}

When debugging, query all logs and events by correlation ID to see the full chain. For more on distributed tracing patterns, see our AI observability guide.

Queue Health Metrics

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Queue depth | < 100 | 100-1000 | > 1000 |
| Processing time (p95) | < 5s | 5-30s | > 30s |
| Failure rate | < 1% | 1-5% | > 5% |
| Dead letter count | 0 | 1-10 | > 10 |
| Consumer lag (Kafka) | < 1000 | 1K-10K | > 10K |

Alert on queue depth trends, not absolute values. A queue depth of 500 that's been stable for hours is fine. A queue depth of 50 that was 0 an hour ago and is growing is a problem.
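A trend check over recent depth samples could be sketched like this. The function name, sample window, and threshold are assumptions for illustration:

```typescript
// Hypothetical trend check: alert on sustained growth, not on absolute depth.
// depthSamples holds queue depth readings at a fixed interval, oldest first.
function isGrowingSteadily(depthSamples: number[], minGrowthPerSample: number = 1): boolean {
    if (depthSamples.length < 3) return false; // not enough data to call it a trend
    for (let i = 1; i < depthSamples.length; i++) {
        // Any interval without growth breaks the trend
        if (depthSamples[i] - depthSamples[i - 1] < minGrowthPerSample) return false;
    }
    return true;
}
```

A stable depth of 500 never fires; a depth climbing from 0 through 10, 25, 50 does.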

Common Pitfalls

  1. No subscriber control during worker saves. The single biggest source of event storms. Worker saves must not trigger the same event subscribers that dispatch work to workers.

  2. No source tracking on bidirectional sync. Without x-source headers, bidirectional sync between two systems loops forever.

  3. Using message ID for idempotency. Message IDs change on redelivery in some brokers. Use business keys (entity ID + action + time bucket) instead.

  4. Infinite retries. Set a maximum retry count. After exhaustion, move to dead letter queue. Never retry forever.

  5. Ignoring dead letter queues. Dead letters are operational debt. Classify them (transient, permanent, poison) and handle each type differently.

  6. Assuming message ordering. Most workloads don't need ordering. Parallel processing is faster. Only use FIFO when state transitions or causal ordering require it.

  7. Same retry strategy for all failures. A timeout needs exponential backoff. An invalid payload needs zero retries. A rate limit needs fixed delay.

  8. No correlation IDs. Without them, debugging a failure chain across 5 services is impossible.

Key Takeaways

  • Event storms are the most dangerous failure mode. One uncontrolled save cascade can saturate your entire infrastructure. The EventSubscriberSupervisor is not optional.

  • Bidirectional sync needs source tracking. Every message carries the originating system ID. When you receive a message from yourself (via the other system), ignore it.

  • Idempotency uses business keys, not message IDs. Recipient + category + entity + day bucket for notifications. Tenant + product + date for orders. The key must be business-meaningful.

  • Dead letters need classification. Transient failures get retried. Permanent failures get archived. Poison messages get quarantined. Never ignore the dead letter queue.

  • The broker follows the ecosystem. Kafka for high-throughput streaming, RabbitMQ for complex routing, BullMQ for Node.js job queues, Symfony Messenger for PHP. Don't pick a broker and then build around it.

  • Most workloads don't need ordering. Parallel processing is simpler and faster. Use FIFO only when state transitions demand it.
