Benard Otieno

Posted on May 24

Event-Driven Architecture: An Honest Assessment

#devops #tooling #architecture #eventdriven

Event-driven systems are elegant in talks and brutal in production. After building and operating them across multiple companies, here is what nobody tells you before you commit to the pattern.

Every few years the industry rediscovers event-driven architecture and
decides it is the answer. The talks are compelling. Services decoupled
from each other. No direct dependencies. Producers that emit events
and never think about who consumes them. Consumers that react to what
happened and never worry about who caused it. The system as a whole
becomes a collection of independent actors responding to a shared stream
of facts about the world.

In the talk, this is clean. In production, it is one of the most
operationally demanding patterns in software engineering, and the
gap between how it is pitched and what it costs to run well is
wider than for almost any other architectural pattern I can think of.

I have built event-driven systems that worked well and event-driven
systems that were disasters. The difference was not the technology
and it was not the team's capability. It was whether the team went
in with an accurate picture of what they were buying. Most teams
do not get that picture before they commit. This article is an
attempt to provide it.

What you actually get

Start with what is genuinely good, because there is genuine good here.

Decoupling that is real. When the order service publishes an
OrderPlaced event and knows nothing about who consumes it, and when
the inventory service consumes OrderPlaced and knows nothing about
who published it, you have achieved something meaningful. Either
service can be redeployed without the other. Either can evolve
its internal implementation without negotiating with the other.
A new service can start consuming OrderPlaced tomorrow without
touching the order service at all.

This decoupling is the thing that makes large organisations with
many teams possible. The team that owns the order service does not
need to be in a meeting with every team that cares about orders.
They publish the event. Every consumer team builds and maintains
their own reaction to it. The coordination that would have been
synchronous and blocking becomes asynchronous and independent.

Audit trails that emerge naturally. If your events are your
source of truth, you have a complete record of everything that
happened in your system, in the order it happened (at least within
each stream or partition, which is typically the scope a broker
orders). Not just the current state, but the history of how you got
there. This is genuinely useful for debugging, for compliance, and
for the class of bugs that are almost impossible to diagnose without
knowing what sequence of events preceded them.

Load handling that is structural rather than bolted on. A
consumer that reads from a queue processes work at the rate it
can handle, regardless of the rate at which work arrives. The
queue absorbs the spike. The consumer processes the backlog when
capacity is available. This is structurally different from a
synchronous system where a traffic spike hits the service directly
and the service either handles it or falls over.
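
A minimal in-process sketch of that buffering, using asyncio.Queue
as a stand-in for a real broker. The function names, the unbounded
queue, and the simulated processing delay are all assumptions made
for illustration, not a recommendation for production sizing.

```python
import asyncio


async def produce_burst(queue: asyncio.Queue, count: int) -> None:
    # The producer enqueues the whole spike and never waits on the consumer.
    for i in range(count):
        await queue.put(f"order-{i}")


async def consume(queue: asyncio.Queue, processed: list, delay: float) -> None:
    # The consumer drains the backlog at its own fixed pace.
    while True:
        item = await queue.get()
        await asyncio.sleep(delay)  # simulated processing time
        processed.append(item)
        queue.task_done()


async def run_demo(count: int = 20, delay: float = 0.001) -> tuple[int, list]:
    # Unbounded queue: a hypothetical broker with ample capacity.
    queue: asyncio.Queue = asyncio.Queue()
    processed: list = []
    worker = asyncio.create_task(consume(queue, processed, delay))

    await produce_burst(queue, count)
    backlog_after_burst = queue.qsize()  # work waiting in the queue

    await queue.join()  # wait until the consumer has cleared the backlog
    worker.cancel()
    return backlog_after_burst, processed


if __name__ == "__main__":
    backlog, done = asyncio.run(run_demo())
    print(f"backlog after burst: {backlog}, processed: {len(done)}")
```

The burst lands in the queue, not on the consumer. A real broker adds
limits this sketch ignores, such as retention windows and maximum
queue size, so the buffer is large but not infinite.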

These are real benefits. They are worth having. They are also
not free.

The cost nobody quotes you upfront

Eventual consistency is not a configuration option, it is a
commitment.

When you move from synchronous calls to events, you give up the
ability to know, at any given moment, that all parts of your system
agree on the current state. The order was placed. The OrderPlaced
event was published. The inventory service will consume it and
reserve the stock. When? Soon. How soon? Depends on the consumer's
lag. What if the user queries their order status right now, before
the inventory service has processed the event? The order exists
in the order service's view. The inventory has not yet been
reserved. Each service is consistent with itself, but the system as
a whole is in an intermediate state that is not yet globally
consistent.

For many use cases this is acceptable. For some it is not. The
teams that adopt event-driven architecture without thinking carefully
about which of their use cases fall into which category discover
the hard way that "eventually" can mean milliseconds, seconds,
or minutes depending on what else is happening, and that users
do not have the same patience for eventual consistency that
architects do.

# The problem that bites teams who haven't thought this through:

# User places an order. Order service publishes event.
async def place_order(user_id: str, items: list) -> Order:
    order = Order.create(user_id=user_id, items=items)
    await db.save(order)
    await event_bus.publish(OrderPlaced(order_id=order.id, items=items))
    return order

# User immediately queries order status.
# Order service returns order with status "pending".
# Inventory service has not yet processed the event.
# Frontend shows "pending" with a spinner.
# User refreshes. Still pending. Refreshes again.
# Inventory event processes. Status updates to "confirmed".
# User has refreshed four times and is on the phone with support.

# The system was correct the entire time.
# The user experience was broken the entire time.
# These are not contradictory.

# What teams often do to address this:
# Read-your-own-writes consistency for the immediate response,
# combined with a clear UI state that communicates processing is happening.

async def place_order_with_ux_in_mind(user_id: str, items: list) -> dict:
    order = Order.create(user_id=user_id, items=items)
    await db.save(order)
    await event_bus.publish(OrderPlaced(order_id=order.id, items=items))

    return {
        "order_id": order.id,
        "status": "processing",
        "message": "Your order is being confirmed. This usually takes a few seconds.",
        "poll_url": f"/orders/{order.id}/status",
        # Give the client a signal about what to do next,
        # rather than leaving them in ambiguous pending state
    }
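
On the client side, the poll_url only helps if something polls it
sensibly. The sketch below is one way that loop might look. The
fetch_status callable is hypothetical (in practice it would issue a
GET against poll_url), and the delays and timeout are placeholder
values.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_confirmation(
    fetch_status: Callable[[], Awaitable[str]],
    *,
    initial_delay: float = 0.5,
    max_delay: float = 4.0,
    timeout: float = 30.0,
) -> str:
    """Poll until the status leaves "processing" or the timeout elapses."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    delay = initial_delay

    while True:
        status = await fetch_status()
        if status != "processing":
            return status
        if loop.time() + delay > deadline:
            # Stop polling; the UI can show a "still working" state
            # instead of spinning forever.
            return "processing"
        await asyncio.sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

Backing off keeps a slow consumer from being hammered by impatient
clients, and the timeout gives the UI a defined moment to switch from
a spinner to an explanation.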

Debugging across services is a different discipline entirely.

In a synchronous system, a bug has a call stack. You look at the
stack trace and you see exactly what called what in what order.
The sequence is right there.

In an event-driven system, the equivalent of a call stack is a
trace across multiple services, potentially across multiple events,
potentially hours or days apart. OrderPlaced fires. InventoryReserved
fires. PaymentProcessed fires. FulfillmentCreated fires. ShipmentCreated
fires. The user reports that their order is stuck. You need to find
where in this sequence something went wrong, knowing that each step
is in a different service with different logs, possibly with events
that were consumed out of order, possibly with a consumer that failed
silently and moved on.

Without distributed tracing that propagates correlation IDs across
every event, this debugging is archaeology. You are sifting through
log files from multiple services trying to reconstruct what happened
to a specific order that a specific user placed at a specific time.

# Event envelope that makes debugging survivable.
# Every event carries a correlation ID from the originating request.
# Every subsequent event in the chain inherits it.
# Every service logs it with every operation related to the event.
# When debugging, filter all service logs by correlation ID
# and you reconstruct the full sequence.

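from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any
import uuid

# Assumption: the keyword-argument log calls below match structlog's API.
import structlog

log = structlog.get_logger()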


@dataclass
class Event:
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    event_type: str = ""
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    causation_id: str = ""    # ID of the event that caused this one
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    schema_version: str = "1.0"
    payload: dict[str, Any] = field(default_factory=dict)

    def caused_by(self, parent_event: "Event") -> "Event":
        self.correlation_id = parent_event.correlation_id
        self.causation_id = parent_event.event_id
        return self


# When the inventory service handles OrderPlaced and emits InventoryReserved:
async def handle_order_placed(event: Event) -> None:
    order_id = event.payload["order_id"]

    log.info(
        "inventory.reservation.started",
        correlation_id=event.correlation_id,
        order_id=order_id,
    )

    await reserve_inventory(order_id)

    reserved_event = Event(
        event_type="InventoryReserved",
        payload={"order_id": order_id},
    ).caused_by(event)   # Inherits correlation_id, sets causation_id

    await event_bus.publish(reserved_event)

    log.info(
        "inventory.reservation.completed",
        correlation_id=event.correlation_id,
        order_id=order_id,
        caused_event_id=reserved_event.event_id,
    )

With this envelope, every event in a chain shares a correlation ID.
Every log line from every service that handled any event in the chain
includes that correlation ID. Debugging a stuck order is a single
log query: show me every log line with this correlation ID, across
all services, sorted by time.

Without this, you do not have an event-driven system you can operate.
You have a system that works until it breaks and then you cannot
find out why.
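
As a rough illustration of that single query, here is the same
filter-and-sort done in plain Python over already-parsed log records.
The record shape (service, event, correlation_id, timestamp) and the
sample values are assumptions. A real setup would run the equivalent
query in whatever log aggregation tool you use.

```python
from typing import Iterable


def trace_for(correlation_id: str, log_records: Iterable[dict]) -> list[dict]:
    """Return every record for one correlation ID, oldest first."""
    matching = [r for r in log_records if r.get("correlation_id") == correlation_id]
    # ISO 8601 timestamps in one format and one offset sort correctly as strings.
    return sorted(matching, key=lambda r: r["timestamp"])


# Hypothetical log lines collected from three services.
records = [
    {"service": "inventory", "event": "inventory.reservation.completed",
     "correlation_id": "c-1", "timestamp": "2024-05-24T10:00:02+00:00"},
    {"service": "orders", "event": "order.placed",
     "correlation_id": "c-1", "timestamp": "2024-05-24T10:00:00+00:00"},
    {"service": "orders", "event": "order.placed",
     "correlation_id": "c-2", "timestamp": "2024-05-24T10:00:01+00:00"},
    {"service": "payments", "event": "payment.failed",
     "correlation_id": "c-1", "timestamp": "2024-05-24T10:00:05+00:00"},
]

for record in trace_for("c-1", records):
    print(record["timestamp"], record["service"], record["event"])
```

Timestamps from different hosts can be skewed, so when two lines are
close together the causation_id chain is the more reliable guide to
what actually caused what.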

Consumer failures are invisible by default.

In a synchronous system, a failure is loud. The call throws an
exception. The caller gets an error. Someone notices.

In an event-driven system, a consumer failure can be completely
silent. The consumer reads the event, fails to process it, and
depending on how it is configured, either requeues the event, moves
it to a dead letter queue, or discards it and moves on. The producer
never knows. The other consumers never know. The user whose order
triggered the event never knows until they notice that something
downstream has not happened.
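
To make those outcomes concrete, here is a sketch of a consumer
wrapper that retries a bounded number of times and then parks the
event instead of dropping it. The in-memory dead_letters list stands
in for a real dead letter queue, and consume_with_dlq is a
hypothetical helper. A real broker usually handles this through its
own redelivery and dead-letter configuration.

```python
import asyncio
from typing import Awaitable, Callable


async def consume_with_dlq(
    event: dict,
    handler: Callable[[dict], Awaitable[None]],
    dead_letters: list,
    max_attempts: int = 3,
) -> bool:
    """Try the handler up to max_attempts times; park the event on failure.

    Returns True if the event was processed, False if it was dead-lettered.
    """
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            await handler(event)
            return True
        except Exception as exc:  # broad on purpose: any failure counts
            last_error = exc

    # Never discard silently: keep the event and the reason it failed.
    dead_letters.append(
        {"event": event, "error": repr(last_error), "attempts": max_attempts}
    )
    return False
```

The important property is the last branch. The event and the error
that killed it both survive, so someone can find them later.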

Dead letter queues are the standard answer to this and they are
the right answer, but they are only useful if someone is watching
them. A dead letter queue that nobody monitors is not a safety net.
It is a place where failed events go to be forgotten.

# Dead letter queue monitoring that actually alerts

import boto3
from prometheus_client import Gauge

sqs = boto3.client("sqs")

DLQ_DEPTH = Gauge(
    "sqs_dlq_message_count",
    "Number of messages in dead letter queue",
    ["queue_name"]
)


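# boto3 is a blocking client, so this is a plain function. Run it on a
# schedule or in a worker thread rather than on an asyncio event loop.
def check_dlq_depths() -> None: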
    queues = [
        ("order-processing-dlq", "https://sqs.region.amazonaws.com/account/order-processing-dlq"),
        ("inventory-dlq", "https://sqs.region.amazonaws.com/account/inventory-dlq"),
        ("payment-dlq", "https://sqs.region.amazonaws.com/account/payment-dlq"),
        ("fulfillment-dlq", "https://sqs.region.amazonaws.com/account/fulfillment-dlq"),
    ]

    for queue_name, queue_url in queues:
        response = sqs.get_queue_attributes(
            QueueUrl=queue_url,
            AttributeNames=["ApproximateNumberOfMessages"]
        )
        depth = int(response["Attributes"]["ApproximateNumberOfMessages"])
        DLQ_DEPTH.labels(queue_name=queue_name).set(depth)


# Alert rule: any DLQ with more than 0 messages is an incident.
# Not a warning. An incident.
# A message in the DLQ means an event failed to process.
# That means something the user expected to happen did not happen.
# That is always worth investigating immediately.

Any message in a dead letter queue is a symptom of a real problem.
Not a might-be-a-problem. A real problem. Alerting on any DLQ depth
above zero sets the expectation that failures are real and visible,
rather than background noise to be managed.

The schema problem that grows until it bites you

Events are a public interface. Once a consumer is reading your events,
the schema of those events is a contract. Changing the schema breaks
the consumer.

In a synchronous API, schema evolution is managed through versioning.
The producer runs V1 and V2 of the endpoint simultaneously. Consumers
migrate at their own pace. When all consumers have migrated, V1 is
retired.

In an event-driven system, the equivalent is possible but operationally
harder. If you change the OrderPlaced event schema, either every
consumer of OrderPlaced has to be updated before the change ships,
or every consumer has to handle both old and new schemas at once,
or you have to maintain two event types in parallel during the
migration. None of these options is cheap, but all of them are
cheaper if you planned for them than if you did not.

The teams that handle this well establish schema governance before
they have a schema problem. Not after.

# Schema versioning that makes evolution manageable.
# Every event schema has an explicit version.
# Consumers declare which versions they can handle.
# The event bus routes accordingly.

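from datetime import datetime
from typing import Literal

from pydantic import BaseModel


class UnknownSchemaVersionError(Exception):
    """Raised when a consumer receives a schema version it cannot handle."""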


# Version 1 of OrderPlaced
class OrderPlacedV1(BaseModel):
    schema_version: Literal["1.0"] = "1.0"
    order_id: str
    user_id: str
    total: float
    occurred_at: datetime


# Version 2 adds line items and changes total to be in cents
# to avoid floating point issues that V1 had
class OrderPlacedV2(BaseModel):
    schema_version: Literal["2.0"] = "2.0"
    order_id: str
    user_id: str
    total_cents: int          # Changed: was "total: float"
    currency: str             # Added
    items: list[dict]         # Added
    occurred_at: datetime


# Consumer that handles both versions during migration period
class InventoryConsumer:
    async def handle(self, raw_event: dict) -> None:
        version = raw_event.get("schema_version", "1.0")

        if version == "1.0":
            event = OrderPlacedV1(**raw_event)
            order_id = event.order_id
            # V1 doesn't have items, so we have to fetch them
            items = await self.order_service.get_items(order_id)
        elif version == "2.0":
            event = OrderPlacedV2(**raw_event)
            order_id = event.order_id
            items = event.items
        else:
            log.error(
                "inventory.consumer.unknown_schema_version",
                version=version,
                event_id=raw_event.get("event_id"),
            )
            raise UnknownSchemaVersionError(version)

        await self.reserve_inventory(order_id=order_id, items=items)

Schema versioning adds boilerplate. The alternative is finding out
during an incident that a schema change broke a consumer that nobody
knew was depending on the old format.
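
The consumer above dispatches on the version itself. The other
arrangement, where consumers declare the versions they accept and the
bus routes on that declaration, might look roughly like the sketch
below. VersionedBus is a hypothetical in-memory stand-in, not the API
of any particular broker.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class VersionedBus:
    """In-memory stand-in for a bus that routes by event type and version."""

    def __init__(self) -> None:
        self._handlers: dict[tuple[str, str], list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, versions: list[str], handler: Handler) -> None:
        # A consumer declares every schema version it can handle.
        for version in versions:
            self._handlers[(event_type, version)].append(handler)

    def publish(self, event: dict) -> int:
        """Deliver to matching handlers and return how many received it."""
        key = (event["event_type"], event["schema_version"])
        handlers = self._handlers.get(key, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = VersionedBus()
legacy_seen: list[dict] = []
migrated_seen: list[dict] = []
bus.subscribe("OrderPlaced", ["1.0"], legacy_seen.append)
bus.subscribe("OrderPlaced", ["1.0", "2.0"], migrated_seen.append)

delivered = bus.publish(
    {"event_type": "OrderPlaced", "schema_version": "2.0", "order_id": "o-1"}
)
```

A delivery count of zero is worth alerting on: it means a version was
published that no consumer has declared it can read.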

When event-driven architecture is the wrong answer

The pattern is not universally appropriate. Teams adopt it when
they should not, attracted by the elegance and the conference talks,
and then spend years paying costs that were not necessary.

When you have one team and one service. Events are an
organisational boundary mechanism. If there is no organisational
boundary, you are paying the operational cost of distributed
messaging for no architectural benefit. A function call is faster,
simpler, and easier to debug. A modular monolith with internal
domain events gives you the architectural thinking without the
operational overhead.
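
For a sense of how little machinery internal domain events need,
here is a minimal synchronous dispatcher. DomainEvents and this
OrderPlaced dataclass are illustrative names, not taken from any
framework.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    items: tuple[str, ...]


class DomainEvents:
    """Synchronous, in-process dispatcher: no broker, no network."""

    def __init__(self) -> None:
        self._subscribers: dict[type, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # Handlers run in the caller's stack, so a failure surfaces as
        # an ordinary exception with an ordinary stack trace.
        for handler in self._subscribers[type(event)]:
            handler(event)


events = DomainEvents()
reserved: list[str] = []
events.subscribe(OrderPlaced, lambda e: reserved.append(e.order_id))
events.publish(OrderPlaced(order_id="o-1", items=("sku-1",)))
```

The modules still talk through events, so the boundaries are in place
if you ever need to split them. Until then you keep transactions,
stack traces, and immediate consistency.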

When your operations require immediate consistency. Financial
transactions. Inventory deduction that must be accurate at the
moment of purchase. Medical record updates. Any situation where
the user or the business cannot tolerate the state being temporarily
inconsistent. In these cases eventual consistency is not a property
you can engineer around. It makes the pattern fundamentally unsuitable.

When your team does not have the operational maturity for it.
Event-driven systems require distributed tracing. They require DLQ
monitoring. They require schema governance. They require expertise
in at least one message broker technology. They require runbooks
for consumer failure scenarios. Teams that are still establishing
basic engineering practices should not add this operational surface
area. Stabilise first. Adopt the pattern when you have the capacity
to operate it correctly.

When the communication pattern is inherently synchronous.
A user submits a form and expects a result. An API client makes
a request and needs a response before it can proceed. A batch job
reads data and produces a report. These patterns do not become
better by adding events between the steps. They become more complex
with no benefit. Forcing an asynchronous pattern onto an inherently
synchronous workflow is an architecture astronaut move, not an
engineering decision.

The systems that work

The event-driven systems that work well in production share a set
of properties that are not negotiable.

Every event carries enough context to be processed without fetching
additional data. A consumer that has to call back to the producer
synchronously to process an event has kept the very dependency that
the event pattern was supposed to eliminate.

Every consumer is idempotent. Most brokers deliver at least once,
so the same event can arrive more than once. A consumer that is not
idempotent will produce duplicate
effects when this happens. Designing for idempotency upfront is
straightforward. Retrofitting it after duplicate processing has
caused data integrity issues is expensive.

# Idempotent consumer using a processed events log
async def handle_order_placed(event: Event) -> None:
    # Check if we have already processed this event
    already_processed = await processed_events.exists(event.event_id)
    if already_processed:
        log.info(
            "inventory.consumer.duplicate_event_skipped",
            event_id=event.event_id,
            correlation_id=event.correlation_id,
        )
        return

    # Process the event
    await reserve_inventory(event.payload["order_id"])

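    # Record that we have processed it. In production, write this in the
    # same transaction as the reservation, so a crash in between cannot
    # leave the reservation made but unrecorded.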
    await processed_events.record(
        event_id=event.event_id,
        processed_at=datetime.now(timezone.utc),
        consumer="inventory-service",
    )

Every schema change goes through a review process before it is
deployed. The review asks: which consumers will be affected? Have
they been updated? Can the change be made backwards-compatible?
If not, what is the migration plan?
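
Part of that review can be automated. The sketch below compares two
versions of a schema, described as plain name-to-type mappings, and
lists the changes that would break an existing consumer. It assumes
consumers ignore fields they do not recognise, so additions are
treated as safe. breaking_changes is a hypothetical helper, not part
of any schema registry.

```python
def breaking_changes(old_fields: dict[str, str], new_fields: dict[str, str]) -> list[str]:
    """List changes that would break a consumer written against old_fields.

    Removing a field or changing its type is breaking; adding one is not.
    """
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"removed field: {name}")
        elif new_fields[name] != old_type:
            problems.append(
                f"type changed: {name} ({old_type} -> {new_fields[name]})"
            )
    return problems


order_placed_v1 = {
    "order_id": "str", "user_id": "str", "total": "float", "occurred_at": "datetime",
}
order_placed_v2 = {
    "order_id": "str", "user_id": "str", "total_cents": "int",
    "currency": "str", "items": "list", "occurred_at": "datetime",
}

print(breaking_changes(order_placed_v1, order_placed_v2))
```

Run against the two OrderPlaced versions from earlier, it flags the
removal of total, which is exactly the change that needs a migration
plan.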

Every dead letter queue has an alert and an owner. DLQ messages
are investigated the same day they appear. Not triaged. Not
backlogged. Investigated.

The teams that run event-driven systems well have built this
infrastructure. It is not glamorous work. It does not appear
in the conference talk about the elegant decoupling. It is the
thing that makes the decoupling survivable in practice.

The honest version of the pitch

Event-driven architecture is worth adopting when you have multiple
teams that need genuine autonomy, when your use cases can tolerate
eventual consistency, when you have the operational maturity to run
distributed messaging in production, and when the benefits of
decoupling outweigh the costs of asynchrony and distribution.

In those circumstances it is genuinely powerful. The teams that
use it well will tell you they cannot imagine going back. The
operational complexity feels like a fair trade for the organisational
flexibility.

In other circumstances it is an expensive mismatch between pattern
and problem. The complexity of the pattern does not disappear because
you chose it for the wrong reasons. It stays. You pay it.

The evaluation should be honest about both sides. Not "events are
the modern way to build systems", which is a fashion statement.
Not "events are always too complex", which ignores where they
genuinely excel. The honest version: this pattern solves specific
organisational and scalability problems at a specific operational
cost. Do the problems apply to us? Can we afford the cost? Those
are the questions worth asking before you commit.

The answer is sometimes yes. It is not always yes.
And pretending otherwise is how teams end up with event-driven
systems they cannot operate and cannot easily escape.