DEV Community: Benard Otieno

Event-Driven Architecture: An Honest Assessment

Benard Otieno — Sun, 24 May 2026 08:15:29 +0000

Event-driven systems are elegant in talks and brutal in production. After building and operating them across multiple companies, here is what nobody tells you before you commit to the pattern.

Every few years the industry rediscovers event-driven architecture and
decides it is the answer. The talks are compelling. Services decoupled
from each other. No direct dependencies. Producers that emit events
and never think about who consumes them. Consumers that react to what
happened and never worry about who caused it. The system as a whole
becomes a collection of independent actors responding to a shared stream
of facts about the world.

In the talk, this is clean. In production, it is one of the most
operationally demanding patterns in software engineering, and the
gap between how it is pitched and what it costs to run well is
wider than for almost any other architectural pattern I can think of.

I have built event-driven systems that worked well and event-driven
systems that were disasters. The difference was not the technology
and it was not the team's capability. It was whether the team went
in with an accurate picture of what they were buying. Most teams
do not get that picture before they commit. This article is an
attempt to provide it.

What you actually get

Start with what is genuinely good, because there is genuine good here.

Decoupling that is real. When the order service publishes an
OrderPlaced event and knows nothing about who consumes it, and when
the inventory service consumes OrderPlaced and knows nothing about
who published it, you have achieved something meaningful. Either
service can be redeployed without the other. Either can evolve
its internal implementation without negotiating with the other.
A new service can start consuming OrderPlaced tomorrow without
touching the order service at all.

This decoupling is the thing that makes large organisations with
many teams possible. The team that owns the order service does not
need to be in a meeting with every team that cares about orders.
They publish the event. Every consumer team builds and maintains
their own reaction to it. The coordination that would have been
synchronous and blocking becomes asynchronous and independent.

Audit trails that emerge naturally. If your events are your
source of truth, you have a complete record of everything that
happened in your system, in the order it happened within each
stream. Not just the
current state, but the history of how you got there. This is
genuinely useful for debugging, for compliance, and for the class
of bugs that are almost impossible to diagnose without knowing
what sequence of events preceded them.

Load handling that is structural rather than bolted on. A
consumer that reads from a queue processes work at the rate it
can handle, regardless of the rate at which work arrives. The
queue absorbs the spike. The consumer processes the backlog when
capacity is available. This is structurally different from a
synchronous system where a traffic spike hits the service directly
and the service either handles it or falls over.
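
The load-levelling effect can be shown in a few lines. This is a minimal sketch that uses an in-process asyncio.Queue as a stand-in for a real broker; the names run_burst, producer and consumer are illustrative and not from any particular system.

```python
import asyncio


async def producer(queue: asyncio.Queue, burst_size: int) -> None:
    # A traffic spike: all the work arrives at once.
    for i in range(burst_size):
        await queue.put(i)


async def consumer(queue: asyncio.Queue, processed: list, delay: float) -> None:
    # The consumer drains the backlog at its own pace.
    while True:
        item = await queue.get()
        await asyncio.sleep(delay)  # stand-in for real work
        processed.append(item)
        queue.task_done()


async def run_burst(burst_size: int = 50, delay: float = 0.001) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    processed: list = []
    worker = asyncio.create_task(consumer(queue, processed, delay))

    await producer(queue, burst_size)
    backlog_after_spike = queue.qsize()  # the queue is holding the spike

    await queue.join()  # wait until the backlog has been drained
    worker.cancel()
    return {"backlog_after_spike": backlog_after_spike, "processed": len(processed)}
```

Running asyncio.run(run_burst()) shows a backlog right after the burst and every item processed by the end. A real broker adds durability and delivery guarantees that this sketch does not attempt to model.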

These are real benefits. They are worth having. They are also
not free.

The cost nobody quotes you upfront

Eventual consistency is not a configuration option, it is a
commitment.

When you move from synchronous calls to events, you give up the
ability to know, at any given moment, that all parts of your system
agree on the current state. The order was placed. The OrderPlaced
event was published. The inventory service will consume it and
reserve the stock. When? Soon. How soon? Depends on the consumer's
lag. What if the user queries their order status right now, before
the inventory service has processed the event? The order exists
in the order service's view. The inventory has not yet been
reserved. The system is in an intermediate state that is internally
consistent but not yet globally consistent.

For many use cases this is acceptable. For some it is not. The
teams that adopt event-driven architecture without thinking carefully
about which of their use cases fall into which category discover
the hard way that "eventually" can mean milliseconds, seconds,
or minutes depending on what else is happening, and that users
do not have the same patience for eventual consistency that
architects do.

# The problem that bites teams who haven't thought this through:

# User places an order. Order service publishes event.
async def place_order(user_id: str, items: list) -> Order:
    order = Order.create(user_id=user_id, items=items)
    await db.save(order)
    await event_bus.publish(OrderPlaced(order_id=order.id, items=items))
    return order

# User immediately queries order status.
# Order service returns order with status "pending".
# Inventory service has not yet processed the event.
# Frontend shows "pending" with a spinner.
# User refreshes. Still pending. Refreshes again.
# Inventory event processes. Status updates to "confirmed".
# User has refreshed four times and is on the phone with support.

# The system was correct the entire time.
# The user experience was broken the entire time.
# These are not contradictory.

# What teams often do to address this:
# Read-your-own-writes consistency for the immediate response,
# combined with a clear UI state that communicates processing is happening.

async def place_order_with_ux_in_mind(user_id: str, items: list) -> dict:
    order = Order.create(user_id=user_id, items=items)
    await db.save(order)
    await event_bus.publish(OrderPlaced(order_id=order.id, items=items))

    return {
        "order_id": order.id,
        "status": "processing",
        "message": "Your order is being confirmed. This usually takes a few seconds.",
        "poll_url": f"/orders/{order.id}/status",
        # Give the client a signal about what to do next,
        # rather than leaving them in ambiguous pending state
    }

Debugging across services is a different discipline entirely.

In a synchronous system, a bug has a call stack. You look at the
stack trace and you see exactly what called what in what order.
The sequence is right there.

In an event-driven system, the equivalent of a call stack is a
trace across multiple services, potentially across multiple events,
potentially hours or days apart. OrderPlaced fires. InventoryReserved
fires. PaymentProcessed fires. FulfillmentCreated fires. ShipmentCreated
fires. The user reports that their order is stuck. You need to find
where in this sequence something went wrong, knowing that each step
is in a different service with different logs, possibly with events
that were consumed out of order, possibly with a consumer that failed
silently and moved on.

Without distributed tracing that propagates correlation IDs across
every event, this debugging is archaeology. You are sifting through
log files from multiple services trying to reconstruct what happened
to a specific order that a specific user placed at a specific time.

# Event envelope that makes debugging survivable.
# Every event carries a correlation ID from the originating request.
# Every subsequent event in the chain inherits it.
# Every service logs it with every operation related to the event.
# When debugging, filter all service logs by correlation ID
# and you reconstruct the full sequence.

from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid

import structlog  # Assumption: any logger that accepts key-value pairs works here

log = structlog.get_logger()


@dataclass
class Event:
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    event_type: str = ""
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    causation_id: str = ""    # ID of the event that caused this one
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    schema_version: str = "1.0"
    payload: dict = field(default_factory=dict)

    def caused_by(self, parent_event: "Event") -> "Event":
        self.correlation_id = parent_event.correlation_id
        self.causation_id = parent_event.event_id
        return self


# When the inventory service handles OrderPlaced and emits InventoryReserved:
async def handle_order_placed(event: Event) -> None:
    order_id = event.payload["order_id"]

    log.info(
        "inventory.reservation.started",
        correlation_id=event.correlation_id,
        order_id=order_id,
    )

    await reserve_inventory(order_id)

    reserved_event = Event(
        event_type="InventoryReserved",
        payload={"order_id": order_id},
    ).caused_by(event)   # Inherits correlation_id, sets causation_id

    await event_bus.publish(reserved_event)

    log.info(
        "inventory.reservation.completed",
        correlation_id=event.correlation_id,
        order_id=order_id,
        caused_event_id=reserved_event.event_id,
    )

With this envelope, every event in a chain shares a correlation ID.
Every log line from every service that handled any event in the chain
includes that correlation ID. Debugging a stuck order is a single
log query: show me every log line with this correlation ID, across
all services, sorted by time.

Without this, you do not have an event-driven system you can operate.
You have a system that works until it breaks and then you cannot
find out why.
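
The single log query described above can be sketched as a filter and a sort. This assumes structured log lines that have already been collected from every service as dictionaries with a correlation_id and an ISO 8601 timestamp; the field names, the sample lines and the trace_for helper are hypothetical.

```python
from datetime import datetime


def trace_for(correlation_id: str, log_lines: list[dict]) -> list[dict]:
    """Return every log line for one correlation ID, oldest first."""
    matching = [
        line for line in log_lines
        if line.get("correlation_id") == correlation_id
    ]
    return sorted(matching, key=lambda line: datetime.fromisoformat(line["timestamp"]))


# Hypothetical log lines gathered from three services.
logs = [
    {"service": "payment", "timestamp": "2026-05-24T08:00:03+00:00",
     "event": "payment.charge.started", "correlation_id": "abc"},
    {"service": "order", "timestamp": "2026-05-24T08:00:01+00:00",
     "event": "order.placed", "correlation_id": "abc"},
    {"service": "inventory", "timestamp": "2026-05-24T08:00:02+00:00",
     "event": "inventory.reservation.completed", "correlation_id": "abc"},
    {"service": "order", "timestamp": "2026-05-24T08:00:02+00:00",
     "event": "order.placed", "correlation_id": "xyz"},
]

timeline = [line["event"] for line in trace_for("abc", logs)]
```

The last entry in the timeline is where the chain stopped, which is usually the first place to look. In practice the same filter would run inside whatever log aggregation tool the team already uses.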

Consumer failures are invisible by default.

In a synchronous system, a failure is loud. The call throws an
exception. The caller gets an error. Someone notices.

In an event-driven system, a consumer failure can be completely
silent. The consumer reads the event, fails to process it, and
depending on how it is configured, either requeues the event, moves
it to a dead letter queue, or discards it and moves on. The producer
never knows. The other consumers never know. The user whose order
triggered the event never knows until they notice that something
downstream has not happened.

Dead letter queues are the standard answer to this and they are
the right answer, but they are only useful if someone is watching
them. A dead letter queue that nobody monitors is not a safety net.
It is a place where failed events go to be forgotten.

# Dead letter queue monitoring that actually alerts

import boto3
from prometheus_client import Gauge

sqs = boto3.client("sqs")

DLQ_DEPTH = Gauge(
    "sqs_dlq_message_count",
    "Number of messages in dead letter queue",
    ["queue_name"]
)


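# boto3 calls block, so this is a plain function. From async code,
# run it on a schedule with asyncio.to_thread(check_dlq_depths).
def check_dlq_depths() -> None: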
    queues = [
        ("order-processing-dlq", "https://sqs.region.amazonaws.com/account/order-processing-dlq"),
        ("inventory-dlq", "https://sqs.region.amazonaws.com/account/inventory-dlq"),
        ("payment-dlq", "https://sqs.region.amazonaws.com/account/payment-dlq"),
        ("fulfillment-dlq", "https://sqs.region.amazonaws.com/account/fulfillment-dlq"),
    ]

    for queue_name, queue_url in queues:
        response = sqs.get_queue_attributes(
            QueueUrl=queue_url,
            AttributeNames=["ApproximateNumberOfMessages"]
        )
        depth = int(response["Attributes"]["ApproximateNumberOfMessages"])
        DLQ_DEPTH.labels(queue_name=queue_name).set(depth)


# Alert rule: any DLQ with more than 0 messages is an incident.
# Not a warning. An incident.
# A message in the DLQ means an event failed to process.
# That means something the user expected to happen did not happen.
# That is always worth investigating immediately.

Any message in a dead letter queue is a symptom of a real problem.
Not a might-be-a-problem. A real problem. Treating DLQ depth as
a metric that alerts on anything above zero sets the expectation that
failures are real and visible, rather than the expectation that
failures are background noise to be managed.
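
The alert rule itself is small. This is a sketch of the decision only, assuming the depths have already been collected into a dictionary; DlqIncident, dlq_incidents and the owners mapping are illustrative names, not part of any monitoring library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DlqIncident:
    queue_name: str
    depth: int
    owner: str


def dlq_incidents(depths: dict[str, int], owners: dict[str, str]) -> list[DlqIncident]:
    """Turn DLQ depths into incidents. The threshold is zero: any message counts."""
    incidents = []
    for queue_name, depth in sorted(depths.items()):
        if depth > 0:
            # A queue with no registered owner is itself a finding worth surfacing.
            owner = owners.get(queue_name, "unowned")
            incidents.append(DlqIncident(queue_name, depth, owner))
    return incidents
```

Most teams would express the same rule in their alerting system rather than in application code. The point of the sketch is the threshold and the owner lookup, not where the rule lives.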

The schema problem that grows until it bites you

Events are a public interface. Once a consumer is reading your events,
the schema of those events is a contract. Changing the schema breaks
the consumer.

In a synchronous API, schema evolution is managed through versioning.
The producer runs V1 and V2 of the endpoint simultaneously. Consumers
migrate at their own pace. When all consumers have migrated, V1 is
retired.

In an event-driven system, the equivalent is possible but operationally
harder. If you change the OrderPlaced event schema, you need every
consumer of OrderPlaced to be updated before you change the schema, or
the consumer needs to handle both old and new schemas simultaneously,
or you need to maintain two event types in parallel during migration.
None of these options is cheap, but all of them are cheaper if you
planned for them than if you did not.

The teams that handle this well establish schema governance before
they have a schema problem. Not after.

# Schema versioning that makes evolution manageable.
# Every event schema has an explicit version.
# Consumers dispatch on the version field
# and reject any version they do not know.

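from pydantic import BaseModel
from typing import Literal
from datetime import datetime


class UnknownSchemaVersionError(Exception):
    """Raised when a consumer receives a schema version it cannot handle."""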


# Version 1 of OrderPlaced
class OrderPlacedV1(BaseModel):
    schema_version: Literal["1.0"] = "1.0"
    order_id: str
    user_id: str
    total: float
    occurred_at: datetime


# Version 2 adds line items and changes total to be in cents
# to avoid floating point issues that V1 had
class OrderPlacedV2(BaseModel):
    schema_version: Literal["2.0"] = "2.0"
    order_id: str
    user_id: str
    total_cents: int          # Changed: was "total: float"
    currency: str             # Added
    items: list[dict]         # Added
    occurred_at: datetime


# Consumer that handles both versions during migration period
class InventoryConsumer:
    async def handle(self, raw_event: dict) -> None:
        version = raw_event.get("schema_version", "1.0")

        if version == "1.0":
            event = OrderPlacedV1(**raw_event)
            order_id = event.order_id
            # V1 doesn't have items, so we have to fetch them
            items = await self.order_service.get_items(order_id)
        elif version == "2.0":
            event = OrderPlacedV2(**raw_event)
            order_id = event.order_id
            items = event.items
        else:
            log.error(
                "inventory.consumer.unknown_schema_version",
                version=version,
                event_id=raw_event.get("event_id"),
            )
            raise UnknownSchemaVersionError(version)

        await self.reserve_inventory(order_id=order_id, items=items)

Schema versioning adds boilerplate. The alternative is finding out
during an incident that a schema change broke a consumer that nobody
knew was depending on the old format.
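
One way to keep the branching out of the business logic is to convert old payloads to the newest shape at the edge of the consumer, sometimes called upcasting. This is a sketch under assumptions: upcast_order_placed is a hypothetical helper, and because V1 carried neither currency nor items, the caller has to supply both from somewhere else.

```python
from decimal import Decimal, ROUND_HALF_UP


def upcast_order_placed(raw_event: dict, items: list[dict], currency: str) -> dict:
    """Convert a V1 OrderPlaced payload into the V2 shape; pass V2 through unchanged."""
    version = raw_event.get("schema_version", "1.0")
    if version == "2.0":
        return raw_event
    if version != "1.0":
        raise ValueError(f"unknown schema version: {version}")

    # Go through str() so 19.99 becomes Decimal("19.99")
    # instead of the float's full binary expansion.
    cents = (Decimal(str(raw_event["total"])) * 100).quantize(
        Decimal("1"), rounding=ROUND_HALF_UP
    )
    return {
        "schema_version": "2.0",
        "order_id": raw_event["order_id"],
        "user_id": raw_event["user_id"],
        "total_cents": int(cents),
        "currency": currency,  # V1 carried no currency; supplied by the caller
        "items": items,        # V1 carried no items; supplied by the caller
        "occurred_at": raw_event["occurred_at"],
    }
```

With this in place the rest of the consumer only ever sees the V2 shape, and the V1 branch can be deleted in one spot once the last V1 event has aged out.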

When event-driven architecture is the wrong answer

The pattern is not universally appropriate. Teams adopt it when
they should not, attracted by the elegance and the conference talks,
and then spend years paying costs that were not necessary.

When you have one team and one service. Events are an
organisational boundary mechanism. If there is no organisational
boundary, you are paying the operational cost of distributed
messaging for no architectural benefit. A function call is faster,
simpler, and easier to debug. A modular monolith with internal
domain events gives you the architectural thinking without the
operational overhead.
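
Internal domain events need very little machinery. This is a minimal sketch of an in-process dispatcher; InProcessEventBus is an illustrative name, not a library class.

```python
from collections import defaultdict
from typing import Callable


class InProcessEventBus:
    """Dispatches domain events to handlers with plain function calls."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Handlers run synchronously, in registration order, on the caller's stack.
        # An exception in a handler propagates to the publisher, so failures stay loud.
        for handler in self._handlers[event_type]:
            handler(payload)


bus = InProcessEventBus()
reserved: list[str] = []
bus.subscribe("OrderPlaced", lambda payload: reserved.append(payload["order_id"]))
bus.publish("OrderPlaced", {"order_id": "o-1"})
```

The order module still knows nothing about inventory, but a failure shows up as an ordinary stack trace and there is no broker to run. If a real organisational boundary appears later, the publish call is the seam where a message broker can be introduced.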

When your operations require immediate consistency. Financial
transactions. Inventory deduction that must be accurate at the
moment of purchase. Medical record updates. Any situation where
the user or the business cannot tolerate the state being temporarily
inconsistent. In these cases eventual consistency is not a technical
detail to be engineered around. It makes the pattern fundamentally
unsuitable.

When your team does not have the operational maturity for it.
Event-driven systems require distributed tracing. They require DLQ
monitoring. They require schema governance. They require expertise
in at least one message broker technology. They require runbooks
for consumer failure scenarios. Teams that are still establishing
basic engineering practices should not add this operational surface
area. Stabilise first. Adopt the pattern when you have the capacity
to operate it correctly.

When the communication pattern is inherently synchronous.
A user submits a form and expects a result. An API client makes
a request and needs a response before it can proceed. A batch job
reads data and produces a report. These patterns do not become
better by adding events between the steps. They become more complex
with no benefit. Forcing an asynchronous pattern onto an inherently
synchronous workflow is an architecture astronaut move, not an
engineering decision.

The systems that work

The event-driven systems that work well in production share a set
of properties that are not negotiable.

Every event carries enough context to be processed without fetching
additional data. A consumer that needs to make a synchronous call
to process an event has a dependency on the producer that the event
pattern was supposed to eliminate.

Every consumer is idempotent. Events can be delivered more than
once. A consumer that is not idempotent will produce duplicate
effects when this happens. Designing for idempotency upfront is
straightforward. Retrofitting it after duplicate processing has
caused data integrity issues is expensive.

# Idempotent consumer using a processed events log
async def handle_order_placed(event: Event) -> None:
    # Check if we have already processed this event
    already_processed = await processed_events.exists(event.event_id)
    if already_processed:
        log.info(
            "inventory.consumer.duplicate_event_skipped",
            event_id=event.event_id,
            correlation_id=event.correlation_id,
        )
        return

    # Process the event
    await reserve_inventory(event.payload["order_id"])

    # Record that we have processed it
    await processed_events.record(
        event_id=event.event_id,
        processed_at=datetime.now(timezone.utc),
        consumer="inventory-service",
    )

Every schema change goes through a review process before it is
deployed. The review asks: which consumers will be affected? Have
they been updated? Can the change be made backwards-compatible?
If not, what is the migration plan?
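
Part of that review can be automated. This is a rough sketch, assuming each schema version can be described as a map of field name to type name and that consumers ignore fields they do not recognise; breaking_changes is a hypothetical helper, not a feature of any schema registry.

```python
def breaking_changes(old_fields: dict[str, str], new_fields: dict[str, str]) -> list[str]:
    """List changes between two {field: type} maps that could break existing consumers."""
    problems = []
    for name, old_type in sorted(old_fields.items()):
        if name not in new_fields:
            problems.append(f"removed field: {name}")
        elif new_fields[name] != old_type:
            problems.append(f"type changed: {name} ({old_type} -> {new_fields[name]})")
    # Added fields are treated as compatible under the stated assumption.
    return problems


order_placed_v1 = {"order_id": "str", "user_id": "str", "total": "float",
                   "occurred_at": "datetime"}
order_placed_v2 = {"order_id": "str", "user_id": "str", "total_cents": "int",
                   "currency": "str", "items": "list", "occurred_at": "datetime"}

report = breaking_changes(order_placed_v1, order_placed_v2)
```

Here the report flags the removal of total, which is exactly the change a V1 consumer would trip over. A non-empty report is the prompt to answer the migration questions above, not a replacement for them.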

Every dead letter queue has an alert and an owner. DLQ messages
are investigated the same day they appear. Not triaged. Not
backlogged. Investigated.

The teams that run event-driven systems well have built this
infrastructure. It is not glamorous work. It does not appear
in the conference talk about the elegant decoupling. It is the
thing that makes the decoupling survivable in practice.

The honest version of the pitch

Event-driven architecture is worth adopting when you have multiple
teams that need genuine autonomy, when your use cases can tolerate
eventual consistency, when you have the operational maturity to run
distributed messaging in production, and when the benefits of
decoupling outweigh the costs of asynchrony and distribution.

In those circumstances it is genuinely powerful. The teams that
use it well will tell you they cannot imagine going back. The
operational complexity feels like a fair trade for the organisational
flexibility.

In other circumstances it is an expensive mismatch between pattern
and problem. The complexity of the pattern does not disappear because
you chose it for the wrong reasons. It stays. You pay it.

The evaluation should be honest about both sides. Not "events are
the modern way to build systems" which is a fashion statement.
Not "events are always too complex" which ignores where they
genuinely excel. The honest version: this pattern solves specific
organisational and scalability problems at a specific operational
cost. Do the problems apply to us? Can we afford the cost? Those
are the questions worth asking before you commit.

The answer is sometimes yes. It is not always yes.
And pretending otherwise is how teams end up with event-driven
systems they cannot operate and cannot easily escape.

The Senior Engineer Who Stopped Coding

Benard Otieno — Sun, 24 May 2026 07:56:23 +0000

At some point, many senior engineers quietly transition from building things to managing the building of things. This transition is often presented as growth. Sometimes it is. Often it is the beginning of a slow professional collapse.

There is a career transition that happens quietly, usually between
year five and year ten of an engineering career, that nobody talks
about honestly.

The engineer starts getting pulled into more meetings. Their opinions
are sought on architecture decisions. They review more code than they
write. They spend increasing amounts of time on documents, on planning,
on the work of coordination rather than the work of building. Their
title changes. Their calendar fills up. Their pull request count drops
toward zero.

Everyone calls this growth. The engineer accepts that framing because
it comes with a raise and an increased sense of importance. The team
benefits because the engineer's experience is now multiplicative rather
than additive. The organisation is satisfied because the senior
headcount is being leveraged correctly.

And then, two or three years later, the engineer sits down to build
something and discovers that they cannot. The tools have moved on.
The frameworks they knew have been superseded. The muscle memory
of debugging and building and shipping is gone. They know what good
looks like but they have lost the ability to produce it directly.

They have become, without intending to, a person who talks about
engineering rather than one who does it.

Why it happens

The pull away from coding is not malicious. It is structural, and
it happens because the incentives all point in one direction.

Senior engineers are expensive. Organisations naturally want to
extract maximum value from expensive headcount. The most visible
form of leverage is influence: one senior engineer who reviews
code, answers questions, and makes architectural decisions affects
the output of five junior engineers. One senior engineer who
writes code produces the output of one engineer, perhaps with
higher quality. The leverage calculation is obvious and it is wrong.

It is wrong because it treats engineering skill as static. It
assumes the senior engineer who stops building remains as capable
as the one who keeps building. This is not how skills work. Engineering
is a practice. It requires continuous exercise. A surgeon who stops
performing operations and moves into medical administration is
not available to the hospital as a surgeon five years later. They
have different skills now. Valuable skills. Not surgical ones.

The same is true for engineers. An engineer who has not written
production code in three years, who has not debugged a live incident
at the code level, who has not sat with a new framework and built
something real with it, has a different skill set than they did
three years ago. They have institutional knowledge. They have judgment.
They can read a system design and identify its weaknesses. They cannot
build the system.

The organisation pays for what it believes is a force multiplier and
gets something valuable but different from what it expected. The
engineer pays with capability they did not intend to give up.

The judgment without craft problem

There is a specific failure mode that emerges from this transition
that is hard to see from the inside.

An engineer who has stopped building retains their judgment about
what good looks like, formed from their experience of building.
This judgment is real and useful. The problem is that it begins to
degrade in specific ways that are not immediately obvious.

Good engineering judgment is not abstract. It is grounded in the
current reality of what is possible, what is fast, what is painful,
and what breaks. This reality changes constantly. The abstractions
shift. The tooling improves. The performance characteristics of
systems change as the underlying platforms evolve. The things that
were hard five years ago are sometimes easy now. The things that
seem easy from a distance are sometimes newly hard in ways that
only become apparent when you try to build them.

An engineer who is actively building updates their judgment
continuously through contact with this reality. They try something,
it does not work the way they expected, they update their model.
They read the error message. They hit the limitation. They find
the workaround. This continuous updating is not dramatic. It is
the normal texture of building software. But it keeps the judgment
calibrated.

An engineer who is not actively building is running on cached
judgment. Their model of what is hard and what is easy, what is
fast and what is slow, is a snapshot from whenever they last
built things seriously. This snapshot becomes less accurate over
time, but the degradation is invisible to them because they have
no direct contact with the current reality that would reveal the
gap.

The result is opinions that are confidently wrong in ways that
only become apparent when someone tries to implement them.

# The kind of thing a cached-judgment engineer might propose:
# "Just add a caching layer in front of the database,
# it's straightforward, shouldn't take more than a day."

# What the engineer building it discovers:
# Cache invalidation for this data model requires
# tracking relationships across four entity types.
# The cache stampede problem on cold starts needs handling.
# The serialisation format needs versioning.
# The TTL strategy needs to account for consistency requirements.
# Testing the cached paths needs a different test infrastructure.

# What was "a day" is three weeks of careful engineering.
# The estimate came from judgment that had not been updated
# by the experience of actually doing it recently.

This gap between proposed and actual is one of the most common
sources of friction between senior engineers who have stepped back
from coding and the engineers who implement their proposals. The
implementers know the gap exists. The senior engineer often does not.

What gets lost that nobody mentions

The obvious loss when an engineer stops coding is technical currency.
The tools move. The frameworks change. This is real but it is
also recoverable with deliberate effort.

The less obvious losses are harder to recover.

The ability to estimate honestly. Estimation requires recent
experience of how long things actually take. An engineer who has
not built anything in two years cannot accurately estimate how long
it will take to build something now. They can produce a number.
The number will be wrong in ways that are hard to detect until the
project is underway. The engineer will believe the number because
it matches their cached sense of how long things take.

The ability to detect complexity from code. A senior engineer
reading code knows when something is more complex than it should
be. This knowledge comes from writing code and feeling the
difference between something that reads cleanly because it is
clean and something that reads cleanly because the complexity
is hidden. After years without writing code, this sense dulls.
The code review becomes shallower. The questions become more
high-level. The subtle architectural problems that would have
jumped off the page become invisible.

The credibility to push back effectively. An engineer who
is actively building earns credibility with their team through
demonstrated competence. When they say a design is wrong, the
team believes them because they have seen the engineer be right
before in ways that were verifiable. An engineer who has not
built anything in years makes the same claim but with a different
credibility structure. They may be right. The team has fewer
ways to verify it.

The feel for what is genuinely hard. Some things are hard
in ways that are not obvious from a design document. Race
conditions that only appear under specific timing conditions.
Data corruption that emerges from an edge case in a serialisation
format. Performance problems that only manifest at production
data volumes. An engineer who is building regularly encounters
these and updates their internal map of where the hidden complexity
lives. An engineer who is not building loses this map gradually,
replaced by a more abstract and less accurate version.

The engineers who avoid this

The senior engineers who do not fall into this pattern share
one thing: they have made an explicit commitment to
staying in contact with building, and they have made it in a
way that is visible enough to be protected from schedule pressure.

This takes different forms for different people.

Some reserve a portion of every week for writing production code.
Not prototypes. Not experiments. Production code that ships and
runs and breaks and gets debugged. The amount varies but the
commitment is treated as non-negotiable in the same way that
a weekly architecture review is non-negotiable. It goes on the
calendar. It survives meeting requests.

Some stay in the on-call rotation. Being on call forces
contact with the operational reality of the system. You are
not reading about incidents. You are debugging them. The debugger
does not care about your title. It shows you the stack trace.
You have to read it. This contact with reality is valuable
precisely because it bypasses the abstraction layer that
seniority tends to insert between an engineer and the system.

Some maintain personal projects or contribute to open source
with genuine commitment, not for portfolio reasons, but as
a way of staying in contact with the experience of building
without the organisational pressures that cause the drift in
the first place.

The common thread is deliberate, protected time for building
that is treated as a first-class responsibility rather than
something that happens when there is space in the schedule.
There is never space in the schedule. Space has to be created
and defended.

# An approach some senior engineers use to stay current:
# Build the tooling you wish existed for your team.
# Not a proof of concept. Not a demo.
# A tool that your team uses in production.

# This has several properties that make it effective:
# 1. It forces contact with the current state of the tools
# 2. The output has real users who will tell you when it breaks
# 3. The scope is usually small enough to complete
# 4. The domain is something you understand deeply
# 5. It produces something valuable rather than just exercises

# Examples that fit this pattern:
# - Internal CLI tools for deployment workflows
# - Monitoring dashboards the team actually uses
# - Code generation tools for boilerplate reduction
# - Test helpers that reduce the friction of writing tests
# - Performance profiling scripts for the specific system

# The anti-pattern is the proof of concept that lives in a branch
# and is shown at a demo and never runs in production.
# That touches code without the accountability that makes it useful.

The honest conversation about titles

There is a version of the senior engineer transition that is
genuinely right. Some engineers find that the leverage work,
the architecture work, the mentoring and the writing and the
planning, is where they create the most value and where they find
the most satisfaction. They grow into staff or principal roles
and they are genuinely excellent at them. They maintain enough
technical currency to keep their judgment calibrated, even if
they are not writing production code every week.

There is another version that is happening to engineers who
would rather be building but are being pulled away from it by
a combination of organisational incentives and social expectations
about what seniority looks like. These engineers are losing
something they valued, substituting something they value less,
and calling it career growth because that is the available
vocabulary.

The engineering industry does not have good vocabulary for
the engineer who wants to be deeply senior and deeply technical
simultaneously. Individual contributor tracks exist at many
companies but they are often treated as consolation prizes
for engineers who were not good enough for management, rather
than as the legitimate and valuable career paths they should be.

The engineers who want to stay close to building need to name
this explicitly, to themselves and to the organisations they
work for. Not "I am not interested in management" which sounds
like a limitation. But "I am most valuable and most effective
when I am close to the code and I am going to protect that."
This is a statement of professional judgment, not of career
limitation.

Organisations that understand this will make space for it.
Organisations that do not will slowly pull the engineer away
from the thing that made them valuable in the first place,
and be confused about why the force multiplication stopped
working.

The recovery

For engineers who have drifted and want to return, the recovery
is not complicated. It is uncomfortable, and it takes time, and
that is all.

The first step is accepting that capability has decayed. This
is the hardest part. An engineer who has been treated as an
authority for several years has to sit down with a codebase
or a tool they do not fully understand and feel incompetent
for a while. The internal experience of this is uncomfortable
regardless of how much intellectual acceptance there is.

The second step is building something small and finishing it.
Not learning a new framework theoretically. Building something
that ships and runs. Small enough that the finish line is
visible. Real enough that it has users and breaks and requires
debugging.

The third step is doing this consistently enough that the
muscle memory returns. This takes months, not weeks. The
calibration of judgment to current reality takes longer than
the recapture of technical mechanics, because it requires
enough accumulated experience with the current state of things
to have a reliable sense of where the hard parts are.

None of this is dramatic. Engineering is a practice and like
all practices it responds to consistent effort over time.
The engineer who drifted and then returned, who has both the
accumulated judgment of years of experience and the current
calibration of active practice, is genuinely rare and genuinely
valuable.

Most do not make the return. The drift feels like growth.
The calendar stays full. The meetings require the same preparation
as building used to require. The work feels substantial because
it is substantial. The specific thing that was lost does not
announce its absence loudly.

It is just gone, quietly, while nobody was looking.

The best time to notice this is before it is very far along.
The second best time is now.

Microservices Were Never About Technology

Benard Otieno — Thu, 21 May 2026 14:20:37 +0000

Every failed microservices adoption I have seen made the same mistake: treating microservices as an infrastructure pattern instead of an organisational one. The technology is the easy part. The hard part is everything else.

The microservices conversation in most engineering teams goes something
like this. The monolith is getting unwieldy. Deployments are slow.
The codebase is hard to navigate. A senior engineer proposes breaking
things apart into services. The team agrees. They spend six months
doing it. Things get worse.

The services are too small or too large. Nobody agrees on where the
boundaries should be. Simple features now require coordinating changes
across three repositories. A bug that used to take twenty minutes to
debug now takes two hours because it crosses service boundaries.
Deployments are more frequent but individually more fragile. The team
is working harder than before and moving slower than before.

They blame the technology. The technology is not the problem.

Microservices failed for them for the same reason they fail for most
teams that adopt them: the team treated decomposition as a technical
decision and ignored the organisational reality that microservices
are actually designed to solve. You cannot separate the architecture
from the team structure. Conway's Law is not optional.

What microservices were actually invented for

Amazon is the origin story most people know. The mandate from Jeff
Bezos in the early 2000s: all teams will expose their data and
functionality through service interfaces. All teams will communicate
through these interfaces. No other form of interprocess communication
is allowed. Anyone who doesn't do this will be fired. He was serious.

The problem Bezos was solving was not technical. Amazon's engineering
organisation had grown to a size where teams were deeply entangled with
each other. Team A could not deploy without coordinating with Team B,
which needed sign-off from Team C, which had a dependency on Team D.
Every change required a synchronisation meeting. Every release was a
negotiation. The coupling between teams was strangling the organisation's
ability to move.

Services were the solution because they forced a contract between teams.
If Team A owns Service A, and Team B owns Service B, and they communicate
only through a defined API, then Team A can change anything inside
Service A without asking Team B's permission. Team B can deploy Service B
on its own schedule. The organisational autonomy is enforced by the
technical boundary.

This is the thing that most microservices adoptions miss entirely.
Services are not primarily a way to scale technology. They are a way
to scale teams. The technical properties of services (independent
deployment, technology flexibility, fault isolation) are valuable
side effects of the organisational property (team autonomy).

If you adopt microservices without the organisational changes that
make them valuable, you get all the costs of distributed systems
and none of the benefits.

The cost of distribution

A monolith, for all its problems, has properties that distributed
systems do not have and cannot have.

A function call inside a monolith is reliable. Either it works or
it throws an exception. It is fast. It completes in microseconds.
It participates in the same database transaction as the code that
called it. If the whole operation needs to be rolled back, it is.

A network call between services is unreliable. It might succeed.
It might fail. It might succeed on the server side and fail on the
network before the response reaches the caller. It might time out,
leaving you with no information about whether the remote operation
completed. It is slow relative to a function call. It crosses
a transaction boundary, which means if something fails after the
call succeeded, you have a consistency problem that cannot be
resolved by a rollback.
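
The ambiguous case, where the call succeeded remotely but the response never arrived, is commonly handled by making the remote operation idempotent so the caller can retry safely. Below is a minimal in-process sketch of that idea. `PaymentServer`, `LossyNetwork`, and `call_with_retry` are hypothetical names, and the network is simulated rather than real.

```python
import uuid


class PaymentServer:
    """Hypothetical remote service, simulated in-process for illustration."""

    def __init__(self) -> None:
        self.processed: dict[str, str] = {}  # idempotency key -> receipt
        self.charges_applied = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the stored receipt instead of charging again.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        self.charges_applied += 1
        receipt = f"receipt-{self.charges_applied}"
        self.processed[idempotency_key] = receipt
        return receipt


class LossyNetwork:
    """Drops the response of the first N calls, after the server did the work."""

    def __init__(self, server: PaymentServer, drop_first_responses: int) -> None:
        self.server = server
        self.drops_left = drop_first_responses

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        receipt = self.server.charge(idempotency_key, amount_cents)
        if self.drops_left > 0:
            self.drops_left -= 1
            raise TimeoutError("response lost; caller cannot tell if the charge happened")
        return receipt


def call_with_retry(network: LossyNetwork, amount_cents: int, attempts: int = 3) -> str:
    key = str(uuid.uuid4())  # one key per logical operation, reused on every retry
    for attempt in range(attempts):
        try:
            return network.charge(key, amount_cents)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
    raise RuntimeError("no attempts were made")


server = PaymentServer()
receipt = call_with_retry(LossyNetwork(server, drop_first_responses=2), amount_cents=1999)
```

Two responses are lost, the caller retries with the same key, and the customer is still charged once. The cost is that the server now has to store and look up keys, which is exactly the kind of work a function call inside a monolith never needed.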

This is not an implementation detail to be engineered around. It is
a fundamental property of distributed systems, described precisely
in the fallacies of distributed computing that L Peter Deutsch and
colleagues at Sun Microsystems set down in the 1990s and that the
industry has been rediscovering ever since.

The network is not reliable. Latency is not zero. Bandwidth is not
infinite. The network is not secure. Topology changes. There is not
one administrator. Transport cost is not zero. The network is not
homogeneous.

Every service boundary you add to your system is a place where these
fallacies apply. Every service call is an opportunity for latency,
failure, and consistency problems that simply do not exist inside
a monolith. The question is whether the organisational benefits of
the boundary justify the distributed systems cost of maintaining it.

For a team of eight people working on one product, they almost never do.

Where the boundaries actually belong

The most common failure mode in microservices adoption is drawing
service boundaries around technical concerns rather than business
ones. Teams create an "auth service," a "notification service,"
a "user service," a "payment service." These feel like natural
decompositions because they map to recognisable technical concepts.

They are terrible service boundaries.

An auth service that every other service must call to validate a
token is not a service. It is a shared library that has been deployed
as infrastructure, adding network latency and a failure mode to every
authenticated request in the system. If the auth service is slow,
everything is slow. If the auth service is down, everything is down.
You have taken a piece of logic that could live as a function call
and made it a distributed systems problem.

A notification service is not a service. It is a collection of side
effects that have been externalized, creating a situation where the
service that wants to send an email must make a network call, handle
the failure case, and figure out what to do if the notification
service is unavailable at the moment the email needs to be sent.

The boundaries that work are the ones that map to bounded contexts
in the business domain. Not "the thing that handles auth" but "the
thing that owns everything about how customers interact with our
platform." Not "the thing that sends notifications" but "the thing
that owns the customer communication history and all the rules about
when and how to communicate."

These boundaries are harder to identify. They require understanding
the business deeply enough to know where the real seams are. They
require conversations with product managers and domain experts, not
just with engineers. They change as the business evolves. But they
are the boundaries that, when you respect them, produce services
that teams can own autonomously and evolve independently.

Domain-Driven Design's concept of bounded contexts is the clearest
framework for finding these boundaries. The bounded context defines
the scope within which a particular domain model applies. At the
edge of the bounded context, the model changes. That is where the
service boundary belongs.

# A service boundary drawn around a technical concern.
# Every other service calls this. Auth is now a distributed dependency.
#
# Bad:
class AuthService:
    async def validate_token(self, token: str) -> User:
        ...
    async def create_token(self, user_id: str) -> str:
        ...
    async def revoke_token(self, token: str) -> None:
        ...


# A service boundary drawn around a business capability.
# This service owns everything about an order, including its auth context.
# Other services don't call into it for auth. They communicate
# through events when they need to know something happened.
#
# Better:
class OrderService:
    async def place_order(self, customer_id: str, items: list) -> Order:
        # Auth context is resolved here, not farmed out to a network call
        customer = await self.customer_repository.get(customer_id)
        if not customer.can_place_orders():
            raise InsufficientPermissionsError()
        ...

    async def cancel_order(self, order_id: str, requesting_customer_id: str) -> None:
        order = await self.order_repository.get(order_id)
        if order.customer_id != requesting_customer_id:
            raise InsufficientPermissionsError()
        ...

Conway's Law is a constraint, not a suggestion

Mel Conway observed in 1967 that organisations produce systems that
mirror their communication structures. A team with three groups will
produce a system with three components. This is not because they
planned to. It is because the system reflects who talks to whom.

The implication that most teams don't fully absorb: if you want a
particular system architecture, you need the corresponding
organisational structure. You cannot have a microservices architecture
with a team structure designed for a monolith. The organisation will
fight the architecture until one of them wins, and the organisation
usually wins because it existed first.

This is why Amazon's microservices worked. The service boundaries
and the team boundaries were the same boundaries. Team A owns Service A.
Not "Team A and Team B both contribute to Service A." Not "Service A
is maintained by whoever has time." One team, one service, full ownership.
The organisational autonomy and the technical autonomy were the same thing.

Most microservices adoptions separate these. The same team that used
to work on the monolith now works on six services. They have all the
coordination overhead of distributed systems and none of the team
autonomy that makes it worth it. They still talk to each other constantly
because they're the same people. The service boundaries don't reflect
team boundaries because there are no team boundaries. There is one team
doing distributed systems for no organisational reason.

The inverse Conway maneuver, a term coined by Thoughtworks, is the
deliberate version: you design the team structure you want, then
let the architecture follow from it. If you want a payments service
that can be developed and deployed independently, you need a payments
team that can make decisions and ship code independently. If you do
not have or cannot create that team, you do not have the prerequisite
for the payments service.

The prerequisite check before splitting a service:

1. Who will own this service?
   "The backend team" is not an answer.
   A named, stable, small team is an answer.
2. Can that team deploy the service without coordinating with
   other teams?
   If not, the boundary is wrong or the ownership is wrong.
3. Can that team change the service's internal implementation
   without changing any other service?
   If not, the boundary is wrong.
4. Is there a defined contract (API, event schema) between this
   service and its consumers?
   If not, you don't have a service. You have a distributed module.
5. Does the team have enough context about the business domain
   this service represents to make good decisions autonomously?
   If not, the team needs to exist and stabilise before the service
   should be extracted.

If any of these answers is no, the split is premature.

The operational surface nobody accounts for

When a team decides to split their monolith into ten services, they
usually have a plan for the technical decomposition. They rarely have
a plan for what they are about to own operationally.

A monolith has one deployment pipeline. One set of infrastructure
to configure. One place to look at logs. One set of metrics. One
runbook for when things go wrong. The operational complexity is low.

Ten services have ten deployment pipelines. Ten infrastructure
configurations. Log aggregation that spans services. Distributed
tracing to follow a request through multiple services. Ten runbooks,
except the incidents that matter will involve multiple services and
none of the runbooks will cover that. Service discovery. Health
checking at the inter-service level. Circuit breakers for when
downstream services are degraded.

None of this complexity is impossible to manage. It is all solvable.
But it requires a team that has the capacity to manage it, tools
that have been set up before the split happens, and expertise that
takes time to develop.
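
As one concrete piece of that operational surface, the circuit breaker mentioned above can be sketched in a few lines. This is a simplified illustration under assumed defaults (three consecutive failures, thirty-second cool-down), not a production implementation. `CircuitBreaker` and `CircuitOpenError` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected without reaching the downstream service."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: float | None = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                raise CircuitOpenError("downstream marked unhealthy; failing fast")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = fn()
        except Exception:
            self.consecutive_failures += 1
            trial_failed = self.opened_at is not None
            if trial_failed or self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

Every caller of every downstream service needs something like this, configured and monitored, which is part of why the operational cost scales with the number of boundaries.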

Most teams split their services and then build the operational
infrastructure retroactively, while also trying to deliver product
work, while also debugging the new distributed systems problems they
did not have before. This is where the eighteen months of slowdown
comes from.

The teams that do this well build the operational infrastructure
first. They get distributed tracing working in the monolith before
they split it. They standardise their deployment pipeline before
they have ten of them. They establish logging conventions before
they have ten services emitting logs in subtly different formats.

# The operational baseline that must exist before splitting services.
# This is not optional infrastructure to add later.

# Centralised structured logging
logging:
  format: json
  fields:
    service: ${SERVICE_NAME}
    version: ${SERVICE_VERSION}
    environment: ${ENVIRONMENT}
    trace_id: ${TRACE_ID}    # Must be propagated across service calls
    span_id: ${SPAN_ID}

# Every service exposes these endpoints. No exceptions.
health_endpoints:
  liveness: /healthz      # Is the process running?
  readiness: /ready       # Is it ready to serve traffic?
  metrics: /metrics       # Prometheus metrics

# Every inter-service call propagates these headers
trace_propagation:
  headers:
    - traceparent           # W3C Trace Context
    - tracestate

# Every service has these alerts configured before it handles traffic
minimum_alerts:
  - error_rate_above_1_percent
  - p99_latency_above_1_second
  - service_unavailable

The monolith that should stay a monolith

Not every system should be microservices. This is easy to say and
hard to accept in an industry where microservices became the mark
of a serious engineering organisation.

The monolith that should stay a monolith is the one where:

The team is small enough that coordination overhead is low. Five
to eight engineers can coordinate in a daily standup without the
synchronisation cost becoming significant. For a team this size,
the organisational problem that microservices solve does not exist.

The domain is not yet well understood. Early-stage products have
unstable domain models. The concepts that seem fundamental change
as you learn what you're actually building. Service boundaries drawn
around an unstable domain model have to be redrawn as the domain
stabilises, which is expensive and demoralising. The monolith lets
the domain model evolve cheaply. Split when the domain is understood.

The operational team does not exist. If nobody owns the infrastructure
that a distributed system requires, the system will be operated badly.
A well-operated monolith beats a poorly-operated distributed system
every time.

The internal structure can be improved without splitting. A modular
monolith with clear internal boundaries, enforced through package
structure and dependency rules, provides most of the cognitive benefits
of microservices (clear ownership, bounded contexts, interface discipline)
without the distributed systems cost. It is not a compromise. For the
right team and domain, it is the correct architecture.

# A modular monolith with enforced boundaries.
# orders/ cannot import directly from payments/.
# They communicate through defined interfaces.
# This is achievable without distributed systems.

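# src/orders/service.py
from orders.events import OrderPlaced        # Orders emits events
from orders.models import Order              # Hypothetical module path
from orders.ports import PaymentGateway      # Hypothetical: interface owned by orders
from orders.repository import OrderRepository
from shared.event_bus import EventBus        # Hypothetical shared infrastructure module
# from payments.service import PaymentService  # This import is forbidden,
#                                              # enforced by linting rules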

class OrderService:
    def __init__(
        self,
        repository: OrderRepository,
        event_bus: EventBus,
        payment_gateway: PaymentGateway,  # Interface, not concrete payments module
    ):
        self.repository = repository
        self.event_bus = event_bus
        self.payment_gateway = payment_gateway

    async def place_order(self, customer_id: str, items: list) -> Order:
        order = Order.create(customer_id=customer_id, items=items)
        await self.repository.save(order)
        await self.event_bus.publish(OrderPlaced(order_id=order.id))
        return order

# The payments module listens for OrderPlaced events.
# It never gets called directly by orders.
# The boundary is real. It is enforced by design, not by a network.

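# src/payments/handlers.py
from orders.events import OrderPlaced  # Reading event schema is allowed
from payments.service import PaymentService

class PaymentEventHandler:
    def __init__(self, payment_service: PaymentService):
        self.payment_service = payment_service

    async def on_order_placed(self, event: OrderPlaced) -> None:
        await self.payment_service.initiate_payment(order_id=event.order_id)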

This is a real architecture that scales further than most teams
think before the overhead of splitting services becomes worth paying.
Shopify ran a version of this for years. Stack Overflow still does.
They are not unsophisticated organisations.
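
The "enforced by linting rules" comment in the example above can be sketched with the standard library's `ast` module. This is a minimal illustration under an assumed rule table. Dedicated tools such as import-linter cover the same ground more thoroughly. `FORBIDDEN`, `ALLOWED_SUFFIXES`, and `boundary_violations` are hypothetical names.

```python
import ast

# Assumption: each package lists the packages it must not import from.
FORBIDDEN = {
    "orders": {"payments"},
    "payments": {"orders"},
}
# Assumption: event schema modules are the one permitted cross-package import.
ALLOWED_SUFFIXES = (".events",)


def boundary_violations(package: str, source: str) -> list[str]:
    """Return forbidden module names imported by `source`, a file in `package`."""
    banned = FORBIDDEN.get(package, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if top_level in banned and not module.endswith(ALLOWED_SUFFIXES):
                violations.append(module)
    return violations


orders_source = (
    "from orders.repository import OrderRepository\n"
    "from payments.service import PaymentService\n"
)
found = boundary_violations("orders", orders_source)
```

Run in CI over every file, a check like this fails the build on the forbidden import, so the boundary holds without a network hop.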

What the good teams understand

The teams that have figured out distributed systems share a
perspective that took most of them several years and at least one
failed microservices adoption to arrive at.

Services are not about code organisation. They are about team
organisation. A service boundary that does not correspond to a team
boundary is overhead without benefit.

The overhead of distributed systems is real, permanent, and
compounding. You pay it forever. It needs to buy something worth
having. For a team that is too large to coordinate, team autonomy
is worth having. For a team that is not yet at that size, it is
not.

The correct direction of reasoning is: we have an organisational
problem, what architecture solves it? Not: we have an architecture
trend, what organisation do we need to adopt it?

Microservices adopted as a technical decision produce the costs
of distribution and the politics of boundary negotiation without
the autonomy that makes them valuable. Microservices adopted as
an organisational decision, by teams that have done the work of
defining ownership and building operational foundations, produce
systems that actually deliver what the pattern promises.

The technology has never been the hard part.
The hard part is everything the technology forces you to sort out first.
Most teams skip that part and wonder why the technology failed them.

The GPU Is the New Database

Benard Otieno — Wed, 20 May 2026 14:22:32 +0000

Twenty years ago, teams had no idea how to run databases at scale. They made every mistake possible before the patterns solidified. We are now in the same position with GPU infrastructure, making the same mistakes, faster.

In 2004, if you were running a web application at any meaningful scale,
your biggest infrastructure problem was the database. Not the application
servers: those were stateless, and you could add more. The database was the
single stateful thing everything depended on. It didn't scale horizontally,
it was expensive to run, and almost nobody knew how to operate it well.

Teams made every mistake. They put too much logic in the application and
not enough in the database. They put too much in the database and not
enough in the application. They didn't index correctly. They didn't cache
correctly. They scaled vertically until they couldn't, then scrambled to
shard. They had no idea what their query plans looked like. They treated
the database as a black box until it stopped working, then learned the
hard way that it wasn't.

Over the following decade, the patterns solidified. Connection pooling.
Read replicas. Query analysis. Proper indexing strategy. Cache layers.
The knowledge became common. The tools improved. Managed database services
abstracted most of the complexity. Today a competent team can run a
database at significant scale without extraordinary expertise.

We are now, in 2026, in the same position with GPU infrastructure. The
GPU is the new database: the expensive, stateful, poorly-understood
bottleneck that everything AI depends on, that doesn't scale the way
people expect, that is being operated badly by the majority of teams
running it, and for which the patterns have not yet solidified.

The teams that figure this out first will have an infrastructure advantage
that is very difficult to close. The teams that don't will spend the
next five years making the same mistakes everyone made with databases
in 2004, just faster and more expensively.

Why the GPU is not just a fast CPU

The first mistake most teams make with GPU infrastructure is treating
GPUs as very fast CPUs. They're not. They're a fundamentally different
computational model, and the mismatch between that model and how most
people use them is where most of the waste comes from.

A CPU is optimised for latency: completing a single complex task as
quickly as possible. It has a small number of powerful cores, large
caches, sophisticated branch prediction, and out-of-order execution.
It's good at sequential logic, conditional branching, and tasks where
each step depends on the result of the previous one.

A GPU is optimised for throughput, completing an enormous number of
simple tasks simultaneously. It has thousands of smaller, simpler cores.
It's good at the same operation applied in parallel to a large amount
of data. It's bad at anything sequential, anything with complex
branching, and anything where you need to move data back to the CPU
in the middle of computation.

The practical consequence: a GPU that is not batching work is a GPU
that is mostly idle. The most common pattern for teams deploying AI
inference in production (one request comes in, run the model, return
the result, wait for the next request) uses a small fraction of the
GPU's actual capacity. The GPU's utilisation number looks reasonable.
The GPU's actual computational throughput is terrible.

This is the equivalent of a database that opens a new connection for
every query, executes it, and closes the connection. Technically
functional. Completely missing how the system should be used.

import asyncio

# What most teams do: one request, one inference
# GPU utilisation looks like 20-40%, but throughput is poor

async def handle_inference_request(prompt: str) -> str:
    # One prompt per GPU call, and the blocking call stalls the event loop
    result = model.generate(prompt)
    return result


# What should be happening: dynamic batching
# Multiple requests grouped and processed together

class InferenceBatcher:
    def __init__(self, model, max_batch_size: int = 32, max_wait_ms: int = 50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: asyncio.Queue = asyncio.Queue()

    def start(self) -> asyncio.Task:
        # Call once from inside a running event loop
        return asyncio.create_task(self._batch_worker())

    async def infer(self, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def _batch_worker(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then open the batching window
            batch = [await self.queue.get()]
            deadline = loop.time() + (self.max_wait_ms / 1000)

            # Collect requests until batch is full or deadline passes
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), timeout=timeout)
                    batch.append(item)
                except asyncio.TimeoutError:
                    break

            prompts = [item[0] for item in batch]
            futures = [item[1] for item in batch]

            try:
                # Single GPU call processes all requests together. It runs in a
                # worker thread so the event loop keeps accepting requests.
                results = await asyncio.to_thread(self.model.generate_batch, prompts)
            except Exception as exc:
                # Propagate the failure to every waiting caller
                for future in futures:
                    if not future.done():
                        future.set_exception(exc)
                continue

            for future, result in zip(futures, results):
                if not future.done():
                    future.set_result(result)

Dynamic batching is the connection pooling of GPU inference. It is
not optional if you care about cost or throughput. It is also not
implemented by default in most hand-rolled inference deployments,
for the same reason that early web applications didn't implement
connection pooling: teams didn't know they needed it until they
hit the wall.

The memory hierarchy nobody teaches you

GPU memory is not like CPU memory. Understanding the difference is
the difference between a system that works and one that doesn't, and
between inference costs that are manageable and ones that are not.

A GPU has its own on-device memory: VRAM. VRAM is fast, finite,
and expensive. A GPU with 80GB of VRAM is a very expensive GPU.
The model you're running must fit in VRAM. If it doesn't fit, you
can use techniques like quantization to make it smaller, or you can
distribute it across multiple GPUs, but you cannot simply overflow
to system RAM without taking a catastrophic performance hit. The
link between CPU RAM and GPU VRAM is far slower than VRAM itself,
typically by one to two orders of magnitude. When you hear about
models being "quantized to 4-bit," this is why: 4-bit quantization
cuts the weight footprint to roughly a quarter of 16-bit precision
(half of 8-bit), which can be the difference between fitting on one
GPU and not fitting on one GPU.
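
The arithmetic behind that claim is short enough to write down. The sketch below counts weights only and ignores the KV cache, activations, and runtime overhead, so treat the numbers as a lower bound. `weight_footprint_gb` is a hypothetical helper.

```python
def weight_footprint_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return parameters * bits_per_weight / 8 / 1e9


# A 70B-parameter model at three common precisions.
footprints = {bits: weight_footprint_gb(70e9, bits) for bits in (16, 8, 4)}

for bits, gigabytes in footprints.items():
    fits = "fits" if gigabytes < 80 else "does not fit"
    print(f"{bits}-bit: {gigabytes:.0f} GB, {fits} in 80 GB of VRAM")
```

At 16-bit the weights alone need about 140 GB, which means at least two 80 GB GPUs. At 4-bit they need about 35 GB, which leaves room on a single device for the KV cache.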

Within the GPU itself, there is a memory hierarchy that determines
how fast computation runs. The KV cache, the cached attention keys
and values for the tokens already processed in a conversation,
lives in VRAM and grows with sequence length. Managing KV cache
is one of the most consequential performance decisions in LLM serving,
and most teams don't think about it at all until they start hitting
out-of-memory errors on long contexts.
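
How fast the KV cache grows can be estimated from the model's shape. The figures below (32 layers, 8 key/value heads, head dimension 128, 16-bit values) are assumptions chosen to resemble an 8B-class model with grouped-query attention. Substitute the values from your own model's configuration.

```python
def kv_cache_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
per_sequence_gib = per_token * 8192 / 2**30      # one 8192-token sequence
thirty_two_sequences_gib = per_sequence_gib * 32  # 32 concurrent full-length sequences

print(f"{per_token // 1024} KiB per token")
print(f"{per_sequence_gib:.1f} GiB per 8192-token sequence")
print(f"{thirty_two_sequences_gib:.0f} GiB for 32 concurrent sequences")
```

Under these assumptions a single full-length conversation holds about 1 GiB of cache, so concurrency, not model size, is often what exhausts VRAM.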

# KV cache: what happens without it
# Each new token recomputes attention keys and values for the entire context
# Attention cost per new token is O(n²) in sequence length instead of O(n)

# What vLLM and similar systems do differently:
# PagedAttention manages KV cache in fixed-size blocks
# like virtual memory paging in an OS

# This allows:
# 1. Sharing KV cache between requests with the same prefix
# 2. Better memory utilisation (waste is limited to the last,
#    partially filled block of each sequence)
# 3. Handling variable-length sequences without pre-allocating
#    worst-case memory

from vllm import LLM, SamplingParams

# vLLM handles KV cache management automatically
# This is not a minor optimisation: the vLLM authors report a 2-4x
# throughput improvement over earlier serving systems

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.90,    # Leave 10% headroom
    max_model_len=8192,             # Maximum sequence length
    enable_prefix_caching=True,     # Cache common prefixes (system prompts)
    tensor_parallel_size=1,         # Number of GPUs for this model
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
)

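# Illustrative input; in a server these come from incoming requests
prompts = ["Summarise the fallacies of distributed computing."]

# Prefix caching means a shared system prompt is computed once
# and reused by subsequent requests, which is significant for
# long system prompts sent with every inference
outputs = llm.generate(prompts, sampling_params)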

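Most teams serving LLMs in production are not using PagedAttention.
They're using naive inference implementations that lose a large
share of their KV cache memory to fragmentation and over-reservation,
and that repeat computation a cache would have saved. The vLLM
authors reported earlier serving systems wasting well over half of
their KV cache memory this way. The cost difference is not marginal.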

The scaling question everyone asks wrong

When a team's AI infrastructure starts struggling under load, the
first question is almost always: "should we add more GPUs?"

This is the wrong question asked at the wrong time, for the same
reason that "should we add more database servers" was the wrong
first question when a database was struggling in 2008. The right
question is: "why are we using our current GPUs so inefficiently?"

GPU utilisation that is below sixty percent is almost always a
batching problem. Requests are not being grouped efficiently before
hitting the GPU. You can add more GPUs and halve your utilisation
number, which means you now have twice the infrastructure running at
thirty percent capacity instead of one set running at sixty. You've
doubled your cost and solved nothing.

When GPU utilisation is high but latency is still bad, it is almost
always a model sizing problem. The model is too large for the request
volume being served. A smaller quantized model, or a different
architecture, may meet your latency requirements at a fraction of
the compute cost.

# Measuring what actually matters before deciding to scale

import time
from prometheus_client import Histogram, Gauge

# These metrics tell you where the problem actually is

GPU_UTILISATION = Gauge(
    'gpu_utilisation_percent',
    'GPU compute utilisation',
    ['device_id']
)

GPU_MEMORY_USED = Gauge(
    'gpu_memory_used_bytes',
    'GPU VRAM in use',
    ['device_id']
)

BATCH_SIZE = Histogram(
    'inference_batch_size',
    'Number of requests processed per batch',
    buckets=[1, 2, 4, 8, 16, 32, 64]
)

TOKENS_PER_SECOND = Histogram(
    'inference_tokens_per_second',
    'Throughput of inference in tokens per second',
    buckets=[10, 25, 50, 100, 200, 400, 800]
)

TIME_TO_FIRST_TOKEN = Histogram(
    'inference_ttft_seconds',
    'Time from request to first token generated',
    buckets=[.05, .1, .25, .5, 1, 2, 5]
)

REQUEST_QUEUE_DEPTH = Gauge(
    'inference_queue_depth',
    'Number of requests waiting for GPU'
)


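class InstrumentedInferenceServer:
    def __init__(self, queue):
        # Pending-request queue (for example an asyncio.Queue) owned by the caller
        self.queue = queue

    async def _run_inference(self, prompts: list[str]) -> list[str]:
        # Backend-specific: call your model runtime here
        raise NotImplementedError

    async def infer(self, prompts: list[str]) -> list[str]:
        BATCH_SIZE.observe(len(prompts))
        REQUEST_QUEUE_DEPTH.set(self.queue.qsize())

        start = time.perf_counter()
        results = await self._run_inference(prompts)
        duration = time.perf_counter() - start

        # Whitespace split is a rough proxy; prefer the tokenizer's own count
        total_tokens = sum(len(r.split()) for r in results)
        if duration > 0:
            TOKENS_PER_SECOND.observe(total_tokens / duration)

        # The GPU gauges and TIME_TO_FIRST_TOKEN are set by the runtime, not here
        return results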

When you can see batch sizes, queue depth, tokens per second, and
time-to-first-token alongside GPU utilisation and VRAM usage, the
question of "do we need more GPUs" almost answers itself. Usually
the answer is "no, we need to batch better" or "no, we need to use
a smaller model" and scaling turns out to be unnecessary.
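The triage described in this section can be sketched as a small function. This is a minimal sketch, not a rule any tool prescribes: the `Snapshot` fields, the `diagnose` name, and the 60% threshold are assumptions chosen to mirror the rules of thumb above.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical summary of the metrics above over one time window."""
    gpu_utilisation_pct: float   # average compute utilisation, 0-100
    p95_latency_s: float         # observed p95 request latency
    latency_target_s: float      # the latency you have promised


def diagnose(s: Snapshot, low_utilisation_pct: float = 60.0) -> str:
    """Apply the rules of thumb in order: batching first, then model size."""
    if s.gpu_utilisation_pct < low_utilisation_pct:
        return "batching: group requests better before adding GPUs"
    if s.p95_latency_s > s.latency_target_s:
        return "model sizing: try a smaller or quantized model"
    return "efficient: add capacity only if demand keeps growing"
```

Only the last branch is a case where adding GPUs is a reasonable next step, which is the point: scaling is the answer you reach after ruling out the other two.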

The cold start problem nobody planned for

Databases take seconds to start. GPU inference servers take minutes.

A database that restarts unexpectedly is back within thirty seconds
in most cases. An LLM inference server that restarts needs to load
model weights from storage into VRAM before it can serve any requests.
A 70B parameter model stored in 4-bit quantization is roughly 35GB.
Loading 35GB from network storage into VRAM, at typical cloud storage
bandwidth, takes several minutes under good conditions.
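The arithmetic behind those numbers is easy to check. The sketch below assumes weights dominate the file size and that load time is bounded by storage bandwidth alone; the bandwidth figures are hypothetical, so measure your own.

```python
def weights_size_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring file-format overhead."""
    return params_billions * bits_per_param / 8


def load_time_seconds(size_gb: float, bandwidth_gbit_per_s: float) -> float:
    """Time to move size_gb over a link measured in gigabits per second."""
    return size_gb * 8 / bandwidth_gbit_per_s


size = weights_size_gb(70, 4)  # 35.0 GB for a 70B model at 4 bits per parameter
for gbps in (1, 2, 10):        # hypothetical effective bandwidths
    print(f"{gbps} Gbit/s: {load_time_seconds(size, gbps) / 60:.1f} min")
# 1 Gbit/s: 4.7 min
# 2 Gbit/s: 2.3 min
# 10 Gbit/s: 0.5 min
```

This covers only the transfer. Process start-up, copying into VRAM, and any warm-up requests add to it.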

This changes incident dynamics entirely. A database blip is a brief
interruption. A GPU server blip is a several-minute outage for every
affected instance. Autoscaling, which works well for stateless
application servers and adequately for databases, works badly for
GPU inference because new instances take so long to become ready.

The teams that have worked this out run warm pools: GPU instances
with models already loaded, sitting idle, waiting for traffic that
hasn't arrived yet. This feels wasteful. It's the only way to handle
traffic spikes without minutes-long latency blowouts.

# Kubernetes deployment with warm pool strategy
# Minimum replicas keep instances warm even at low traffic
# This costs money. The alternative is cold start latency.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
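spec:
  replicas: 3  # Never scale below this. These are your warm pool.
  selector:
    matchLabels:
      app: llm-inference   # Required by apps/v1; must match the pod labels
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers: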
      - name: inference-server
        image: myteam/inference:latest
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          # Model loading takes time. This probe must not pass
          # until the model is fully loaded in VRAM.
          initialDelaySeconds: 180   # 3 minutes minimum
          periodSeconds: 10
          failureThreshold: 30       # 5 more minutes of retries
        lifecycle:
          preStop:
            exec:
              # Drain in-flight requests before shutdown
              command: ["/bin/sh", "-c", "sleep 30"]

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 3    # Warm pool floor
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"  # Scale when queue exceeds 5 requests per replica
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120   # Add 2 pods max every 2 minutes
                             # Fast enough to respond, slow enough
                             # to not over-provision during spikes
    scaleDown:
      stabilizationWindowSeconds: 600  # Wait 10 minutes before scaling down
                                       # Cold start cost makes yo-yo scaling
                                       # extremely expensive

The scaleDown stabilization window is long deliberately. The cold start
cost is so high that scaling down and back up in response to a brief
traffic dip is more expensive than just keeping the instances running.
This is counterintuitive if you're coming from stateless web services.
It's the operational reality of GPU infrastructure.

The cost model is upside down

Database costs scale with data volume and query complexity. You pay
more as your data grows and your queries get more complex.

GPU costs scale with time. You pay for every second a GPU exists,
whether it's serving requests or not. An idle GPU costs the same as
a busy GPU.

This inverts the normal infrastructure economics. With stateless
application servers, idle capacity is cheap: you can scale to zero
when traffic drops and pay nothing. With GPU inference, scaling to
zero means cold starts when traffic returns. The minimum viable
capacity for a production inference service is not zero. It's
whatever your warm pool needs to be, which is determined by your
acceptable cold start latency and your traffic spike patterns.

The teams that have made peace with this have stopped thinking about
GPU cost as a variable cost that tracks usage and started thinking
about it as a fixed cost that buys capacity. The question is not
"how do we pay less for GPU when traffic is low?" The question is
"what is the right amount of always-on capacity, and how do we make
sure we use it efficiently?"

Efficient use means high batch fill rates, high token throughput per
GPU-hour, low idle time. The metrics above are the inputs to this
calculation. Without them you're guessing about whether your
infrastructure is sized correctly.
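One way to turn those metrics into a single number is cost per million tokens. The sketch below is an assumption-laden illustration: the function name, the hourly price, and the throughput are hypothetical, and `busy_fraction` stands in for the share of each hour the GPU spends serving requests.

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float,
    tokens_per_second: float,
    busy_fraction: float = 1.0,
) -> float:
    """Cost of one million generated tokens on a GPU billed by the hour.

    An idle GPU still costs money, so idle time raises the cost per token.
    """
    if not 0 < busy_fraction <= 1:
        raise ValueError("busy_fraction must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * busy_fraction
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical figures: a $2/hour GPU sustaining 1,000 tokens/second when busy
fully_used = cost_per_million_tokens(2.0, 1000)
mostly_idle = cost_per_million_tokens(2.0, 1000, busy_fraction=0.25)
print(f"busy 100%: ${fully_used:.2f}  busy 25%: ${mostly_idle:.2f}")
```

The same hardware at a quarter of the busy time costs four times as much per token, which is why idle time and batch fill rate belong on the same dashboard as the bill.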

The pattern that's emerging

The teams operating GPU infrastructure well in 2026 look, in their
operational discipline, a lot like the teams that operated databases
well in 2012 after enough people had been burned that the patterns
were starting to solidify.

They treat GPU utilisation as a lagging indicator and token throughput
as the leading one. They instrument everything: batch sizes, queue
depth, time-to-first-token, VRAM usage, KV cache hit rates. They
size their warm pools based on measured traffic patterns rather
than intuition. They run the smallest model that meets their quality
bar, not the largest model they can afford, because smaller models
batched efficiently outperform larger models batched poorly on
almost every practical metric.

They've also accepted something that takes a while to accept: that
the right abstraction for GPU infrastructure is not "fast compute"
but "throughput capacity." The question is not "how fast can this
machine process one request?" GPUs are fast at that regardless.
The question is "how many requests per dollar can this infrastructure
handle at acceptable latency?" That question requires different
metrics, different architecture, and a different mental model than
the one most teams bring from their experience with CPU infrastructure.

The database analogy runs deeper than it looks. In 2004, the teams
that treated the database as a black box (put data in, get data
out, add more RAM when it's slow) eventually hit walls that their
architecture couldn't get past. The teams that understood what was
happening inside the database (query plans, index usage, lock
contention, buffer pool behaviour) built things that scaled.

The GPU is not a black box. It has a memory hierarchy, a batching
model, a cost structure, and performance characteristics that reward
understanding and punish ignorance in the same way the database did.

The patterns are forming. The teams learning them now will have the
same advantage in five years that database-literate engineers had in 2015.

The mistakes are happening right now, at scale, expensively.
Most of them are the same mistakes. Most of them are avoidable.

Unit Tests Are Overrated and You Know It

Benard Otieno — Tue, 19 May 2026 10:55:03 +0000

We test the wrong things obsessively and the right things barely at all. The unit test orthodoxy has produced codebases with 90% coverage that break constantly in production. It's time to say this out loud.

I'm going to say something that will make some people close this tab
immediately: most unit tests are not worth the time it takes to write
and maintain them, and the culture around unit testing has caused more
harm to software quality than it has prevented.

Not all unit tests. Not testing in general. Specifically the orthodoxy
that says you should test every function, mock every dependency, aim for
maximum coverage, and measure quality by how many green checkmarks your
test runner produces.

That orthodoxy is producing codebases that are simultaneously over-tested
and under-validated. Teams that spend enormous engineering hours maintaining
test suites that don't catch the bugs that actually affect users. Developers
who spend more time making tests pass than making software work. Coverage
reports that read ninety percent and services that break every other
deployment.

If this makes you uncomfortable, good. Stay with the discomfort for a
minute, because the alternative is continuing to do something that doesn't
work while calling it best practice.

What unit tests actually test

A unit test tests a unit of code in isolation. The unit is typically a
function or a class. The dependencies of that unit — other functions,
databases, external services — are replaced with mocks or fakes that
return controlled responses.

This is valuable for exactly one category of problem: logic that lives
in pure functions, isolated from external state, where the relationship
between input and output is the entire thing being tested.

# This is worth unit testing. The logic is the point.
from decimal import Decimal

def calculate_discount(
    base_price: Decimal,
    customer_tier: str,
    order_quantity: int,
) -> Decimal:
    if customer_tier == "enterprise":
        tier_discount = Decimal("0.20")
    elif customer_tier == "pro":
        tier_discount = Decimal("0.10")
    else:
        tier_discount = Decimal("0.00")

    quantity_discount = Decimal("0.05") if order_quantity >= 100 else Decimal("0.00")
    total_discount = min(tier_discount + quantity_discount, Decimal("0.25"))

    return base_price * (1 - total_discount)

A unit test for this function is testing the right thing. The function
is pure. Its behavior is entirely determined by its inputs. There are
no external dependencies to mock. The test directly validates the
business logic.
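A minimal sketch of what such tests might look like, written for pytest. The function is repeated in compact form so the example runs on its own; the test names and the chosen inputs are illustrative, not taken from any real suite.

```python
from decimal import Decimal


def calculate_discount(base_price: Decimal, customer_tier: str, order_quantity: int) -> Decimal:
    # Same logic as the function above, repeated so this sketch is self-contained.
    tier_discount = {"enterprise": Decimal("0.20"), "pro": Decimal("0.10")}.get(
        customer_tier, Decimal("0.00")
    )
    quantity_discount = Decimal("0.05") if order_quantity >= 100 else Decimal("0.00")
    total_discount = min(tier_discount + quantity_discount, Decimal("0.25"))
    return base_price * (1 - total_discount)


def test_enterprise_bulk_order_gets_combined_discount():
    assert calculate_discount(Decimal("100"), "enterprise", 100) == Decimal("75")


def test_pro_order_below_bulk_threshold_gets_tier_discount_only():
    assert calculate_discount(Decimal("100"), "pro", 99) == Decimal("90")


def test_unknown_tier_gets_only_the_quantity_discount():
    assert calculate_discount(Decimal("100"), "free", 100) == Decimal("95")


def test_bulk_threshold_is_inclusive_at_100():
    below = calculate_discount(Decimal("100"), "free", 99)
    at = calculate_discount(Decimal("100"), "free", 100)
    assert (below, at) == (Decimal("100"), Decimal("95"))
```

No mocks, no fixtures, and each failure points at one rule. Writing the first test also shows that the 25% cap is reached but never exceeded with the current tiers, which is the kind of fact this style of test surfaces.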

Now look at what most unit tests actually test:

# This is what unit tests look like in most codebases.
from unittest.mock import Mock, patch

@patch('app.services.payment.stripe_client')
@patch('app.services.payment.db')
@patch('app.services.payment.email_service')
@patch('app.services.payment.inventory_service')
async def test_process_payment(
    mock_inventory,
    mock_email,
    mock_db,
    mock_stripe,
):
    mock_stripe.create_payment_intent.return_value = Mock(id="pi_123", status="succeeded")
    mock_db.get_order.return_value = Mock(id="order_1", total=49.99, user_id="user_1")
    mock_inventory.reserve.return_value = True
    mock_email.send.return_value = None

    result = await process_payment("order_1")

    assert result.status == "completed"
    mock_stripe.create_payment_intent.assert_called_once()
    mock_inventory.reserve.assert_called_once()
    mock_email.send.assert_called_once()

What is this test actually testing? It is testing that when everything
works exactly as mocked, the function calls each mocked dependency
once and returns the expected result.

It is not testing what happens when Stripe returns an error. It is not
testing what happens when the database is unavailable. It is not testing
what happens when inventory reservation fails after payment succeeds,
leaving a paid order in a broken state. It is not testing the actual
integration between these components.

It is testing that the code is wired together the way it was wired
together when the test was written. It is a snapshot of the implementation
masquerading as a validation of the behavior.

And it will pass green on every run until the day something real breaks
in production, at which point it will still pass green because the mocks
are still returning what you told them to return.

The mock problem

Mocks are the original sin of unit testing culture. They were created
to solve a real problem — tests that depend on external services are
slow, unreliable, and hard to set up — and they solved that problem
by replacing the external service with a fake version that does whatever
the test needs it to do.

The consequence is that your test suite no longer tests your software.
It tests your software's interaction with your software's assumptions
about how its dependencies behave. When those assumptions are wrong —
when the real Stripe API returns a response shape that's slightly
different from what you mocked, when the real database has a different
transaction isolation level than your mock assumes, when the real email
service deduplicates in a way your mock doesn't — your tests pass and
your production breaks.

I have debugged more production incidents that were caused by the gap
between mocked behavior and real behavior than I can count. The test
said it worked. The mock said the API returned this. The real API
does not return this. The test was wrong about the contract, and because
the test was wrong, the code was deployed with a broken assumption that
nobody caught.

The more you mock, the less your tests tell you about whether the
software works. This is not a design smell to be managed — it's a
fundamental property of mocking. Every mock is a place where reality
has been replaced with assumption.
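The gap is easy to reproduce in a few lines. In this sketch, `RealGateway` and `pay` are hypothetical stand-ins, not any real payment library: the code assumes the response carries a `status` key, the mock repeats that assumption, and the "real" client disagrees.

```python
from unittest.mock import Mock


class RealGateway:
    """Hypothetical stand-in for a real payment client."""

    def charge(self, amount_cents: int) -> dict:
        # The "real" API reports the outcome under 'state', not 'status'.
        return {"state": "succeeded"}


def pay(gateway, amount_cents: int) -> bool:
    response = gateway.charge(amount_cents)
    return response["status"] == "succeeded"  # assumption baked into the code


def test_pay_with_mock():
    gateway = Mock()
    # The mock restates the same assumption the code makes.
    gateway.charge.return_value = {"status": "succeeded"}
    assert pay(gateway, 1000) is True
    gateway.charge.assert_called_once_with(1000)


def pay_against_real_gateway() -> bool:
    # Raises KeyError: the real response has no 'status' key.
    return pay(RealGateway(), 1000)
```

The mocked test passes. The call against the real client fails on the first request. Nothing in the test suite can reveal the difference, because the mock and the code share the same wrong belief.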

The coverage lie

Coverage is the most destructive metric in software engineering.

Not because high coverage is bad. Because coverage as a target produces
the wrong behavior. When coverage is a goal, developers write tests to
cover code rather than to validate behavior. These are different
activities that produce very different tests.

A test written to cover code asks: how do I execute this line?
A test written to validate behavior asks: what should this system do,
and how do I know it's doing it?

Tests written to cover code tend to be thin — they call the function
with happy-path inputs and assert that it doesn't throw. They increase
coverage. They do not increase confidence.

# Written to cover code. Gets you to 100% on this function.
def test_create_user():
    user = create_user(email="test@example.com", password="password123")
    assert user is not None

# Written to validate behavior. Tests what actually matters.
def test_create_user_hashes_password():
    user = create_user(email="test@example.com", password="password123")
    assert user.password_hash != "password123"
    assert verify_password("password123", user.password_hash)

def test_create_user_rejects_duplicate_email():
    create_user(email="test@example.com", password="password123")
    with pytest.raises(DuplicateEmailError):
        create_user(email="test@example.com", password="different")

def test_create_user_sends_verification_email(fake_email_sender):
    create_user(email="test@example.com", password="password123")
    assert any(
        email.to == "test@example.com" and "verify" in email.subject.lower()
        for email in fake_email_sender.sent
    )

def test_create_user_with_invalid_email():
    with pytest.raises(ValidationError, match="invalid email"):
        create_user(email="not-an-email", password="password123")

The second set has the same line coverage as the first if the function
is simple. It tests fundamentally different things. A system with the
first kind of tests has coverage. A system with the second kind has
confidence.

Coverage rewards quantity. Confidence comes from quality. These are
not correlated, and treating them as if they are has produced an
industry-wide habit of writing many low-value tests instead of fewer
high-value ones.

What actually breaks in production

Here is a list of things that unit tests, as typically practiced,
will never catch:

The query that works correctly against your test database with twenty rows and times out against production with two million rows
The race condition that only manifests when two requests hit the same endpoint within fifty milliseconds of each other
The API response from your payment provider that changed shape slightly in a minor version update
The session expiry behavior that's different in the production Redis configuration than in the in-memory fake you test against
The cascade delete behavior that your ORM handles differently than the raw SQL you use in the migration script
The encoding issue that only appears when a user's name contains a character outside the ASCII range
The timeout that is set correctly in the service but not propagated to the client that calls it

Every item on this list is a production incident I have personally
been part of. None of them was caught by unit tests. Most of them
would have been caught by integration tests that weren't written
because the team was busy maintaining the unit test suite.

This is the trade you make when you prioritize unit testing: you get
fast, reliable tests that validate your assumptions, and you skip the
slower, harder tests that would challenge them.
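The race condition from the list above is a good illustration of why isolated tests miss these. The sketch below uses a hypothetical in-memory `Inventory` with an `asyncio.sleep(0)` standing in for a database round trip; a test that calls `reserve` once, or twice in sequence, passes every time.

```python
import asyncio


class Inventory:
    """Hypothetical inventory with a read-modify-write reservation."""

    def __init__(self, stock: int) -> None:
        self.stock = stock

    async def reserve(self) -> bool:
        current = self.stock            # read
        await asyncio.sleep(0)          # yield, standing in for a DB round trip
        if current > 0:
            self.stock = current - 1    # write based on a possibly stale read
            return True
        return False


async def run_sequential() -> tuple:
    inv = Inventory(stock=1)
    first = await inv.reserve()
    second = await inv.reserve()
    return first, second, inv.stock


async def run_concurrent() -> tuple:
    inv = Inventory(stock=1)
    first, second = await asyncio.gather(inv.reserve(), inv.reserve())
    return first, second, inv.stock


if __name__ == "__main__":
    print(asyncio.run(run_sequential()))   # (True, False, 0)
    print(asyncio.run(run_concurrent()))   # (True, True, 0): one unit sold twice
```

Sequentially, the second reservation is refused. Concurrently, both requests read the stock before either writes, and both succeed. Only a test that exercises the concurrent path against the real storage layer will see it.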

What to do instead

I am not arguing for no tests. I'm arguing for tests calibrated to
where the real risk is.

Integration tests over unit tests for anything with dependencies.

If a function touches a database, a cache, a message queue, or an
external service — test it against the real thing, or as close to
the real thing as you can get. Not a mock. Not an in-memory fake
that you wrote. A real database with a real schema and real data
volumes. A real Redis instance. A real queue.

Yes, these tests are slower. Run them in CI, not on every save.
They are dramatically more valuable than unit tests that mock the
same dependencies because they test what the code actually does,
not what you assumed the code would do.

# This is worth the setup cost. It catches real problems.
@pytest.mark.integration
async def test_process_payment_handles_stripe_card_declined(
    test_db,       # Real PostgreSQL, real schema
    stripe_mock,   # Stripe's own test environment, not our mock
):
    order = await create_test_order(test_db, total=Decimal("49.99"))

    # Stripe's test mode provides test card numbers and tokens that trigger specific behaviors
    result = await process_payment(
        order_id=order.id,
        card_token="tok_chargeDeclined",  # Stripe test token for declines
    )

    assert result.status == "failed"
    assert result.failure_code == "card_declined"

    # Verify the order status was updated correctly in the real database
    updated_order = await test_db.fetch_one(
        "SELECT status FROM orders WHERE id = $1",
        order.id
    )
    assert updated_order["status"] == "payment_failed"

    # Verify no inventory was reserved for a failed payment
    reservation = await test_db.fetch_one(
        "SELECT id FROM inventory_reservations WHERE order_id = $1",
        order.id
    )
    assert reservation is None

This test uses a real database and Stripe's test environment. It is
slower than a mocked unit test. It tests whether the actual system
behaves correctly when a real dependency does something unexpected.
It is the test you actually need.

Test behavior at the system boundary, not implementation in the middle.

The most valuable tests are the ones that call your API, your
message handler, your batch job — the public interface of your
system — and assert on the observable output. Not which functions
were called, not which mocks were invoked. What came out.

@pytest.mark.integration
async def test_order_api_returns_correct_status_after_payment(
    client,
    test_db,
):
    # Create an order through the API
    create_response = await client.post("/orders", json={
        "items": [{"product_id": "prod_1", "quantity": 2}]
    })
    assert create_response.status_code == 201
    order_id = create_response.json()["id"]

    # Process payment through the API
    payment_response = await client.post(f"/orders/{order_id}/pay", json={
        "card_token": "tok_visa"
    })
    assert payment_response.status_code == 200

    # Verify the order status reflects the payment
    order_response = await client.get(f"/orders/{order_id}")
    assert order_response.json()["status"] == "confirmed"
    assert order_response.json()["payment"]["status"] == "succeeded"

This test goes through the API, through the service layer, through
the database, and back. It validates the entire vertical slice. It
would catch a bug in the API handler, a bug in the service logic,
a bug in the database query, or a bug in the response serializer.
A unit test that mocked all the layers would catch none of these
except the one in the specific layer being tested.

Reserve unit tests for pure logic.

Unit tests are excellent for exactly what they're suited for:
pure functions with complex branching logic where the relationship
between input and output is the whole point. Discount calculations.
Validation rules. Data transformations. Parsing logic. Algorithms.

These are worth unit testing because the test is actually testing
the logic. There's nothing to mock. The test runs in microseconds.
Failures tell you exactly what's wrong.

For everything else — anything that touches infrastructure, anything
that coordinates between components, anything that talks to external
systems — integration tests are not just better, they're the only
tests that tell you anything true.

The heresy in full

Here is the position I'm staking out, clearly, so it can be clearly
disagreed with:

A codebase with forty percent coverage from integration tests that
test real behavior against real dependencies is more reliable than
a codebase with ninety percent coverage from unit tests that mock
every external interaction.

Coverage is not quality. Mocks are not validation. A green test suite
is not a guarantee that the software works — it's a guarantee that
the software works according to the assumptions baked into the tests,
which may or may not match reality.

The software quality crisis is not a testing crisis. We test more
than we ever have. The crisis is a misalignment between what we test
and what breaks. We test pure logic obsessively and integration
boundaries barely. The bugs live at the integration boundaries. They
always have.

The counterargument I hear most often: integration tests are slow.
Yes. They are. They are slow because they do real things. Real things
take time. The alternative is fast tests that don't do real things
and therefore don't tell you whether the real things work.

Speed is not a virtue in a test suite. Accuracy is.

I expect this to generate disagreement. That's fine. The developers
most likely to disagree are the ones who have invested the most in
unit testing culture, which makes their disagreement somewhat
self-referential. The developers most likely to agree quietly are the
ones who have been paged at 3am because a perfectly unit-tested
function didn't work the way its mocks said it would.

Those developers know. They've always known. This is just someone
finally saying it.

You Are Building for the Wrong User

Benard Otieno — Sun, 17 May 2026 10:58:59 +0000

The user in your head when you make product decisions is not your actual user. The gap between those two people is where most product failures live.

Every product decision gets made with a user in mind. When you design
an API, you're imagining how it will be called. When you write error
messages, you're imagining who will read them. When you decide how a
feature should work, you're imagining someone using it. When you choose
what to build next, you're imagining who will need it.

The problem is that this user — the one in your head — is almost always
wrong. Not slightly wrong. Systematically, structurally wrong in ways that
compound with every decision.

And the engineers and teams who understand this build qualitatively
different things from the ones who don't.

Who the imagined user actually is

The user most engineers imagine when building is a technically sophisticated,
highly motivated person who wants to use the product correctly. They read
the documentation before starting. They understand the conceptual model
the product is built around. They know what they want to achieve and they're
trying to figure out how the product helps them achieve it.

This person does not exist in your user base in large numbers. They exist
among early adopters, among the colleagues who gave you feedback when you
were building, among the developers on your own team who use the product
to test it. They are massively overrepresented in the feedback you receive
because they're the ones who care enough to give feedback. They are
massively underrepresented in your actual user base.

Your actual users are not unsophisticated. They're busy. They have
twenty-three other things demanding their attention. They encounter your
product in the middle of trying to accomplish something else. They have not
read the documentation. They will not read the documentation. They are
trying to figure out if your product can solve their problem in the next
ninety seconds, and if it's not obvious that it can, they will stop trying.

The imagined user and the actual user are often the same people. The difference is not intelligence or capability.
It's context. The imagined user has context — they know what the product
does, they're focused on it, they're motivated to learn it. The actual user
has none of that. They showed up to solve a problem. Whether they stay
depends entirely on how quickly they can see that the product helps.

Every design decision made for the imagined user makes the product slightly
worse for the actual user. A feature that's powerful but requires
configuration: the imagined user configures it. The actual user sees a blank
state and leaves. An error message that's technically accurate: the imagined
user understands it. The actual user doesn't know what to do next.
Documentation that's comprehensive and well-organized: the imagined user
reads it. The actual user never opens it.

The specific ways this goes wrong

I want to be concrete, because "build for actual users" is advice that
sounds obvious and is almost universally ignored, and the reason it's
ignored is that the failure modes are invisible until you're looking for them.

Onboarding built for people who already understand the product.

The most expensive minute in any product's relationship with a user is
the first one. Not expensive in compute cost — expensive in the sense
that the user is forming the impression that determines whether they
ever come back. The product gets roughly sixty seconds to communicate:
what this does, whether it's for you, and what to do first.

Most onboarding fails this because it's designed by people who deeply
understand the product, which makes it impossible for them to accurately
simulate not understanding it. The team knows what the product does,
so the product's value feels obvious. The team knows the mental model
the product is built around, so the conceptual framework feels natural.
The actual new user has none of this scaffolding, and the onboarding
that feels clear to the team is opaque to them.

The fix is not better copy. It's building onboarding by watching people
who have never seen the product try to use it. Not asking them what
they think. Watching what they do. Where do they click first? Where do
they stop? Where do they look confused? Where do they give up?

Five sessions of this will surface more real problems than a month of
internal review, and most teams have never done it.

Error messages written for developers.

Error messages are the product's voice in the moment the user is most
frustrated. They are almost universally written by the developer who
implemented the feature, for an audience of developers who understand
the system.

Error: Invalid parameter 'start_date'. Expected ISO 8601 format.

This message is technically accurate. The developer reading it knows
immediately what to fix. The non-developer user — or even the developer
who is not familiar with ISO 8601 — reads this and has several questions:
what's a parameter? What's ISO 8601? What did I type that was wrong?
What should I type instead?

The message answered none of these questions. It described the problem
in terms that require prior knowledge to decode. It provided no path
forward.

The start date you entered isn't in the right format.
Try: 2026-05-16 (year-month-day)
You entered: 05/16/2026

Same information. Different audience. The second version costs nothing
extra to implement and turns a moment of frustration into a moment of
clarity. Most error messages in most products are written like the first.
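As a rough sketch of how little code the friendlier version needs, here is one way to produce it in Python. The function name `parse_start_date` is hypothetical, and using today's date as the example is a design choice, not a requirement.

```python
from datetime import date, datetime


def parse_start_date(raw: str) -> date:
    """Parse a YYYY-MM-DD date, raising an error a non-developer can act on."""
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except ValueError:
        example = date.today().isoformat()
        # Say what was wrong, show a valid example, and echo the input back.
        raise ValueError(
            "The start date you entered isn't in the right format.\n"
            f"Try: {example} (year-month-day)\n"
            f"You entered: {raw}"
        ) from None
```

The validation logic is identical to what would sit behind the terse message. The only extra work is echoing the input and showing a valid example, both of which are already in scope when the error is raised.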

The reason is that the developer writes the error message while
implementing the validation logic, in the mental context of the
implementation, for an imagined user who shares that context. The actual
user is never imagined at all.

Features complete enough to ship but not complete enough to use.

There is a version of shipping fast that is genuinely good: getting a
working feature in front of users quickly, learning from real usage,
iterating. There is a version that is genuinely bad: shipping something
that is technically functional but missing the parts that make it
actually usable, because those parts didn't make it into the sprint.

The difference between these two versions is whether the shipped thing
works for actual users or only for imagined ones.

An API endpoint that returns data but has no pagination is complete
enough for the imagined user, who is building a demo with twenty records.
It is not complete enough for the actual user, who is trying to process
a real dataset. A form that collects information but has no confirmation
state works for the imagined user, who is testing the happy path. It
doesn't work for the actual user, who isn't sure if their submission went
through and submits again, creating duplicates.
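For the pagination case, the missing piece is usually small. The sketch below shows cursor-style pagination over an in-memory list; the `paginate` name, the response shape, and the assumption that records are sorted by an integer `id` are all illustrative.

```python
from typing import Optional, Sequence


def paginate(
    records: Sequence[dict],
    cursor: Optional[int] = None,
    limit: int = 50,
) -> dict:
    """Return one page of records (sorted by 'id') and a cursor for the next page."""
    remaining = [r for r in records if cursor is None or r["id"] > cursor]
    page = remaining[:limit]
    has_more = len(remaining) > limit
    return {
        "items": page,
        # The cursor is the last id returned; None tells the client to stop.
        "next_cursor": page[-1]["id"] if has_more else None,
    }


records = [{"id": i} for i in range(1, 6)]
first = paginate(records, limit=2)                               # ids 1, 2
second = paginate(records, cursor=first["next_cursor"], limit=2)  # ids 3, 4
```

With twenty demo records nobody notices the difference. With a real dataset, this is the difference between an endpoint that works and one that times out.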

The imagined user will find a way to make incomplete features work.
The actual user will encounter the gap between the feature and their
reality and leave. The sprint velocity that comes from shipping incomplete
features is borrowed against the retention you lose when actual users
encounter them.

Documentation written for the moment of maximum understanding.

Documentation is written by someone who understands the product deeply,
at the moment when they understand it most deeply, and then it is
assumed to be complete.

The person reading the documentation is someone who understands the
product least, at the moment when they need help most. These are
maximally mismatched participants. The writer has internalized everything
that the reader doesn't yet know. What feels like a clear explanation
to the writer is often a chain of assumptions that the reader can't
follow.

The specific failure: concepts used before they're explained. A getting
started guide that says "first, configure your workspace" where "workspace"
is a domain concept that the reader doesn't yet understand. A reference
document that uses the product's internal terminology throughout, assuming
the reader has already acquired that vocabulary. A tutorial that assumes
the reader knows why they'd want to do what they're being shown, rather
than establishing the motivation first.

The person who wrote this documentation was not explaining from first
principles. They were documenting from expertise. Those are different
cognitive activities and they produce different artifacts.

The data you're probably not looking at

Most teams have more data about their actual users than they use. The
data is uncomfortable, so it gets looked at less than it should.

Session recordings of real users encountering real problems. The vast
majority of teams with access to session recording tools use them
reactively — to investigate a specific reported problem — rather than
proactively to understand where users generally struggle. A few hours
of watching session recordings from new users will show you more about
where your product fails actual users than any amount of internal review.

Activation funnel drop-off. Where in the onboarding flow do users stop?
Most teams know this number but don't sit with what it implies. A 60%
drop-off at step three of onboarding means six in ten users who reached
step three never got to step four. What is step three? What does
it ask the user to do? Is it actually necessary at that point, or is it
there because it was the logical next step from an implementation
perspective?
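
To make that arithmetic concrete, here is a minimal sketch of step-to-step drop-off. It assumes you can export a count of users who reached each onboarding step; the step names and counts are invented for illustration.

```python
def step_dropoff(reached: list[tuple[str, int]]) -> list[tuple[str, float]]:
    """For each step, the fraction of users who reached it but not the next one."""
    result = []
    for (name, count), (_, next_count) in zip(reached, reached[1:]):
        rate = 0.0 if count == 0 else 1 - next_count / count
        result.append((name, round(rate, 3)))
    return result

# Hypothetical counts of users who reached each step
funnel = [
    ("signup", 1000),
    ("verify_email", 800),
    ("create_workspace", 500),
    ("invite_team", 200),
]
dropoff = step_dropoff(funnel)
worst_step = max(dropoff, key=lambda item: item[1])
```

In this made-up funnel the worst step is the third one, which loses six in ten of the users who reach it. That is the step to go and watch in session recordings.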

Support tickets, literally read. Not summarised. Not categorised. Read,
one by one, by the people who made the design decisions that generated
them. The support ticket is the user telling you, in their own words,
what your product did that didn't match their expectation. It is
unmediated feedback from actual users about actual failure modes. Most
teams process support tickets through a support function and those
learnings never reach the people making the product decisions.

Search queries within the product, if you have a search function or a
help center with search. What are users typing? The search query is
the user telling you what they're looking for that they couldn't find
on their own. A user who searches "how do I delete my account" is a
user who couldn't find the account deletion flow. A user who searches
"why is my data wrong" is a user encountering a data integrity problem
they don't understand. The aggregate of these queries is a map of where
your product is failing actual users.
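
Building that map can start very small. A rough sketch, assuming you can export raw query strings from your search logs (the sample queries here are invented):

```python
from collections import Counter

def top_queries(queries: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Normalise case and whitespace, then count identical queries."""
    normalised = (" ".join(q.lower().split()) for q in queries)
    return Counter(q for q in normalised if q).most_common(n)

# Hypothetical export from a help-center search log
log = [
    "How do I delete my account",
    "how do i delete my account ",
    "why is my data wrong",
    "export csv",
    "Why is my data wrong",
    "how do I delete my account",
]
top = top_queries(log, 2)
```

Exact-match counting misses paraphrases, so treat the output as a starting list to read through, not a finished analysis.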

The proximity problem

The reason teams build for the imagined user rather than the actual user
is structural, not intentional. It's a proximity problem.

The people making product decisions are close to the product and far
from the users. They understand how the product works, why it works that
way, what the tradeoffs were. They use the product themselves, but they
use it with expert knowledge that insulates them from the actual experience
of a new user. When they imagine a user, they imagine someone like
themselves with the same context they have.

The actual users are far from the people making decisions and leave
signals that are filtered, delayed, and translated before they reach
anyone who can act on them. A user who struggles and leaves doesn't file
a bug report. They just don't come back. A user who figures something
out eventually doesn't report that it was hard. They just move on. The
signals that make it back are the ones from the vocal minority who cared
enough to write something down, which is not a representative sample.

Closing this gap requires deliberate effort because it doesn't close
on its own. The product gets more complex and the team's expertise
increases over time, which means the gap between their mental model
and the new user's experience widens if nothing is done to counter it.

The specific practices that work:

Regular sessions watching new users. Not asking for opinions.
Watching where they click, where they pause, where they read, where
they give up. Monthly, with the whole team watching, not just the
designer or the PM. Watching is a different cognitive activity from
asking. Watching gives you behavior. Asking gives you rationalizations.

Someone on the team responsible for carrying the user's perspective.
Not a UX researcher who files reports that get read and filed. Someone
with standing in product discussions who can say "a user who doesn't
know what a workspace is would not understand this" and have that
land as a real input to the decision. The imagined user has many
advocates on the team — everyone building the product is effectively
advocating for the imagined user's needs. The actual user needs
an explicit advocate because their perspective is not naturally
represented in the room.

Requiring a first-use test. Before any feature ships, someone
who didn't build it has to be able to use it with no guidance. Not
as a QA pass. As a design gate. If the person who didn't build it
needs explanation to use the feature, the feature is not ready for
actual users who also won't receive explanation.

Reading your own error messages as a user. Take the last five error
messages that appeared in your logs or support tickets. Read them as
someone who doesn't know your system. What do they tell you to do?
If the answer is "nothing concrete," the error messages are for your
debugger, not for your user.

The version of this that compounds

The teams that build for actual users from the beginning develop a
capability that's hard to acquire later: accurate intuition about where
their product fails people who are not already experts in it.

This intuition is worth more than it looks like on paper. It's the
thing that means you don't have to watch session recordings before every
release, because the person who would have struggled with this is already
present in the designer's mind during design. It's the thing that means
your error messages are clear because clarity for actual users is already
the default, not an afterthought. It's the thing that means your
onboarding works for the user who's distracted and skeptical, not just
for the one who's engaged and motivated.

Building this intuition requires sustained exposure to actual users
struggling with the actual product. There's no shortcut. The teams that
have it got it by regularly watching what actual users experience when
they encounter the product, without the filter of what the product was
supposed to do.

The imagined user is comfortable. They use the product well, they
appreciate the features, they understand the mental model. Building for
them is building for the team's own reflection.

The actual user is uncomfortable to watch. They do unexpected things.
They miss obvious affordances. They read things wrong. They give up
at moments that feel, to the team, like they should be easy. Watching
this is genuinely difficult when the product is something you made.

It's also the only accurate feedback you have on whether what you made
works.

Build for the person in the session recording, not the person in your
head. They are not the same person. One of them is your actual user.

Pick Boring Technology. Yes, Especially for AI.

Benard Otieno — Fri, 15 May 2026 11:50:52 +0000

What "Boring" Actually Means

Boring technology does not mean old technology. It does not mean slow, limited, or low-quality. It means technology that has been in production long enough that its failure modes are documented, its operational characteristics are well understood, and the person debugging it at 2am has a reasonable chance of finding a Stack Overflow answer rather than a single beta forum post from 2024.

Postgres is boring. Redis is boring. S3 is boring. A plain HTTP API with JSON is boring. SQLite, for things that fit in SQLite, is boring. None of these things are slow, limited, or embarrassing to use. They are boring because they have been deployed by enough people, at enough scale, for long enough that the surprises have mostly been found. The surface area of "things that can go wrong that nobody has written about" is small.

When Dan McKinley wrote the Choose Boring Technology essay in 2015, he framed it as a budget: you get a limited number of new technologies per project, and you should spend that budget intentionally. That framing is still correct. What's changed is that AI products have a non-negotiable budget item now: the model and the scaffolding around it. That item is expensive. It is genuinely new. It has failure modes that nobody fully understands yet. That is the place you are choosing to spend your novelty budget. Everywhere else, the argument for boring is stronger than it has ever been.

The Vector Database Problem

The most common place I see this play out is in the retrieval layer. A team is building RAG — retrieval-augmented generation, some form of semantic search over a corpus. They need to store embeddings and query them by similarity. There are purpose-built vector databases for this: Pinecone, Weaviate, Qdrant, Chroma. They have impressive benchmarks, polished SDKs, and marketing copy that makes Postgres look like a horse and buggy.

So teams reach for them. Then six months later they are managing two separate databases — Postgres for everything else, Pinecone for vectors — running two separate migration workflows, debugging sync issues between them, and paying for an additional managed service. The team that wanted to move fast has added an operational surface area that requires dedicated attention.

pgvector exists. It is a Postgres extension. It is boring. It stores vectors in Postgres, queries them in Postgres, and updates them in the same Postgres transactions as the rest of your data. You run one database. You use the migration tooling you already have. You query it with SQL you already know. The performance ceiling is lower than a dedicated system optimised for nothing but ANN search, but the teams I've talked to who hit that ceiling with pgvector are building at a scale where infrastructure complexity is genuinely their problem to manage. Most teams are not those teams.

The right question is not "what is the best vector database." It is "what is the simplest thing that handles my actual query volume, that I can operate with my existing knowledge, that does not require me to manage data consistency across two systems." The answer to that question, for most products, is Postgres.

-- pgvector: you already know how to do this
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  content    text NOT NULL,
  embedding  vector(1536),
  metadata   jsonb,
  created_at timestamptz DEFAULT now()
);

-- IVFFlat picks its clusters from existing rows, so build this after loading data
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

-- Retrieval: pure SQL, same connection pool as everything else
SELECT
  id,
  content,
  metadata,
  1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;

That is the retrieval layer for a production RAG system. It is a Postgres query. You already know how to read it, index it, back it up, and monitor it.
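
On the application side, the only new thing is getting an embedding into the `$1` parameter. Because the query casts `$1::vector`, a text value in pgvector's bracketed form is enough. The helper below is a sketch of that formatting step only; the function name is mine, and how you pass the string depends on your Postgres driver.

```python
import math

def to_pgvector(embedding: list[float], dims: int = 1536) -> str:
    """Format a Python list as a pgvector text literal such as '[0.1,1.0,-2.5]'."""
    if len(embedding) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(embedding)}")
    if not all(math.isfinite(x) for x in embedding):
        raise ValueError("embedding contains NaN or infinity")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Tiny 3-dimensional example; the table above uses 1536
literal = to_pgvector([0.1, 1, -2.5], dims=3)
```

Checking the dimension count in the application gives a clearer error than waiting for Postgres to reject the insert.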

Agent Frameworks Are the Same Problem, Bigger

The vector database situation is a contained example. Agent frameworks are the same problem, scaled up.

There are now a meaningful number of agent frameworks in active development: LangChain, LangGraph, AutoGen, CrewAI, Pydantic AI, and several more depending on when you are reading this. They differ in their abstractions for tool calling, memory management, multi-agent coordination, and state persistence. Some of them are good. Some of them are in the process of becoming good. All of them are new enough that you are, to some degree, a beta tester.

The alternative is to not use a framework for the parts that don't require one. The model's tool-calling API is not complicated. You define tools as JSON schemas. The model returns a function call. You route it and return the result. That is the loop. You can implement the core of it in a hundred lines of Python that you wrote, that you understand completely, that has no transitive dependencies you didn't choose.

import anthropic
import json

client = anthropic.Anthropic()

def run_agent(system: str, user_message: str, tools: list, tool_handlers: dict) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages,
        )

        # Anything other than a tool request (end_turn, max_tokens, ...) ends the loop
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            handler = tool_handlers.get(block.name)
            if not handler:
                raise ValueError(f"No handler for tool: {block.name}")
            result = handler(**block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result),
            })

        # Extend conversation with model turn and tool results
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

That is a complete agentic loop. No framework. No magic. Every line of it is readable by someone who has never seen it before. When it breaks, you know where to look. When you need to add a checkpoint before a destructive operation, you know exactly where to put it. When a framework update ships a breaking change to how tool results are structured, you are unaffected because you wrote the tool result handling yourself.
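
As an illustration of the checkpoint point, here is one way it could look: a wrapper around a tool handler that asks an approval function before running. This is a sketch under my own assumptions. `with_confirmation`, `approve`, and the `delete_file` tool are hypothetical names, and the approval policy is a stand-in for whatever human or rule-based check you actually need.

```python
from typing import Any, Callable

Handler = Callable[..., Any]

def with_confirmation(
    name: str, handler: Handler, confirm: Callable[[str, dict], bool]
) -> Handler:
    """Wrap a tool handler so it only runs when confirm(name, kwargs) is True."""
    def guarded(**kwargs: Any) -> Any:
        if not confirm(name, kwargs):
            # Sent back to the model as the tool result so it can change plan
            return {"status": "denied", "reason": f"{name} was not approved"}
        return handler(**kwargs)
    return guarded

deleted: list[str] = []

def delete_file(path: str) -> dict:
    """Hypothetical destructive tool; records the path instead of deleting."""
    deleted.append(path)
    return {"status": "deleted", "path": path}

def approve(name: str, kwargs: dict) -> bool:
    """Hypothetical policy: only allow paths inside a sandbox directory."""
    return str(kwargs.get("path", "")).startswith("/tmp/sandbox/")

tool_handlers = {
    "delete_file": with_confirmation("delete_file", delete_file, approve),
}

ok = tool_handlers["delete_file"](path="/tmp/sandbox/a.txt")
denied = tool_handlers["delete_file"](path="/etc/passwd")
```

Because the loop calls `handler(**block.input)`, the wrapped handler drops in without touching the loop itself.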

Frameworks earn their keep when they solve problems you genuinely have: complex multi-agent coordination, built-in state persistence, graph-based execution flows where you need cycle detection and conditional edges. If you have those problems, use a framework. But reach for your own loop first, and upgrade to a framework when you have a reason, not because the README has a compelling architecture diagram.

The Counterargument Is Real

I want to be honest about the case against this position, because it is not trivial.

Boring technology is not always available in the form you need. pgvector has a performance ceiling. If you are running similarity search across a hundred million vectors with sub-10ms latency requirements, you need a dedicated ANN index and the purpose-built databases are probably worth their operational cost. If your agent coordination is genuinely complex — multiple agents with heterogeneous capabilities, conditional routing based on intermediate state, nested tool calls — a framework that has solved those problems is better than reinventing it.

The real trap is not "using new technology." It is using new technology as the default rather than as the exception. When you reach for Pinecone before asking whether pgvector handles your actual query volume, you have made a choice you probably did not mean to make. The question is whether you made it consciously.

What Changes When AI Is Involved

The argument for boring technology is not new. What AI changes is the urgency of it, for a specific reason: the model is already the source of novel, hard-to-predict behavior in your system. The model hallucinates. The model handles edge cases in ways you did not anticipate. The model's output quality varies with context length, with phrasing, with temperature settings you forgot you changed. The model is a continuous source of surprises, and managing those surprises is the actual engineering work.

When the model is already the unpredictable component, adding unpredictable infrastructure around it is compounding risk. A flaky external API call in your tool chain plus a model that sometimes decides to call that tool three times in a row plus a vector database that occasionally returns inconsistent results under concurrent load is not three small problems. It is three small problems that interact in ways you cannot enumerate in advance.

Boring infrastructure shrinks the problem space. When the retrieval layer is Postgres and the queue is Redis and the API is plain HTTP, the list of things that can behave unexpectedly in hard-to-reproduce ways is shorter. You are not eliminating surprises — the model will still surprise you — but you are constraining where they can come from.

The system that is easiest to debug is not the one with the fewest components. It is the one where the largest number of components have predictable, documented behavior. Build toward that, and let the model be the interesting part.

The Heuristic I Actually Use

When evaluating a new technology for an AI product, I ask three questions before I let it into the stack:

1. What happens when this fails? Not "can it fail" — everything can fail. What does failure look like? Is it a clean error or silent corruption? Is it recoverable without data loss? Is there a runbook for it, or will I be writing one?

2. Can I replace it in a weekend? This is not about whether I will replace it. It is about whether the abstraction is thin enough that swapping the implementation does not require a rewrite. If replacing the vector store requires touching thirty files, the abstraction is wrong.

3. Does my boring alternative exist and have I ruled it out? Postgres, Redis, S3, plain HTTP. If one of these handles the problem, I need a specific reason not to use it — not just a feeling that the new thing is more purpose-built.

If a technology passes all three, it can earn its place in the stack. If it fails any of them, the burden of proof is high.
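
For the second question, a thin abstraction can be as small as a two-method interface. The sketch below is my own illustration, not a prescribed design: `VectorStore`, `InMemoryStore`, and `retrieve` are hypothetical names, and the in-memory version uses brute-force cosine similarity so the example runs without a database.

```python
import math
from typing import Protocol

class VectorStore(Protocol):
    def add(self, doc_id: str, embedding: list[float]) -> None: ...
    def search(self, query: list[float], k: int) -> list[str]: ...

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemoryStore:
    """Brute-force implementation, useful in tests or as a first version."""
    def __init__(self) -> None:
        self._rows: dict[str, list[float]] = {}

    def add(self, doc_id: str, embedding: list[float]) -> None:
        self._rows[doc_id] = embedding

    def search(self, query: list[float], k: int) -> list[str]:
        ranked = sorted(
            self._rows, key=lambda d: cosine(query, self._rows[d]), reverse=True
        )
        return ranked[:k]

def retrieve(store: VectorStore, query: list[float], k: int = 2) -> list[str]:
    """Application code depends on the interface, never on a vendor SDK."""
    return store.search(query, k)

store = InMemoryStore()
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
store.add("c", [0.9, 0.1])
hits = retrieve(store, [1.0, 0.0])
```

If every caller goes through `retrieve`, swapping the store means writing one new class with the same two methods, which is roughly a weekend.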

The teams that ship boring AI products are not the teams that lack ambition. They are the teams that understand where the ambition should go. The model is where the novel bets live. The model is where you spend the engineering attention on failure modes you have never seen before, on evaluation strategies that do not exist in textbooks yet, on product decisions that require genuine taste about AI behavior. That is the hard, interesting work.

Letting the infrastructure be interesting too is not ambitious. It is just expensive.

Make the retrieval layer boring. Make the queue boring. Make the API boring. Let Postgres handle the things Postgres is good at, which turn out to be most things. And spend the attention you just freed up on the part of the system that actually requires it.

Your Observability Is Looking at the Wrong Things

Benard Otieno — Thu, 14 May 2026 16:01:35 +0000

I've been in incident calls where every dashboard was green. Latency nominal. Error rate under 0.1%. CPU humming along at a comfortable 40%. And somewhere downstream, a critical workflow had been silently producing wrong results for six hours.

Nobody had an alert for "the thing is doing something, just not the right thing."

This is the gap most observability setups never close: they're watching the infrastructure, not the behavior. They'll tell you the system is alive. They won't tell you it's lying.

The Three Dials Everyone Watches

The default observability stack for most teams converges on the same three signals: uptime, latency, and error rate. These show up in every runbook, every SLA, every on-call rotation. They're not useless — a spike in error rate is real signal, a latency cliff is real signal — but they share a critical property: they're all lagging indicators of failure that's already happened.

More importantly, they only fire when the system is explicitly misbehaving. They say nothing about a system that's doing exactly what you told it to do, but where what you told it to do was wrong.

I had a recommendation service that returned results within 50ms, with a 0.02% error rate, and near-perfect uptime. It was also returning the same stale set of recommendations to every user because a cache invalidation job had silently stopped running four days earlier. The system was technically flawless. It had completely stopped serving its purpose.

The dashboard gave it a clean bill of health.

Logs Are Not a Narrative

The second failure mode is subtler. Most teams log well, in the sense that they log a lot. Request in. Response out. Exceptions caught and written somewhere. Database queries above a threshold. Auth events.

What they don't have is a narrative — a way to reconstruct what actually happened during a user's session, a job's execution, a transaction's lifecycle. Individual log lines are breadcrumbs. What you need is the trail.

The difference shows up immediately when something goes wrong. With breadcrumbs, you spend the first hour of an incident correlating timestamps across three different log streams, mentally assembling a sequence of events that should have been assembled for you. With a trail — structured traces with a shared correlation ID flowing through every service that touched a request — you open one query and see the story.

import uuid
import logging
import functools
from contextvars import ContextVar

logger = logging.getLogger(__name__)

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        cid = correlation_id.get()
        logger.info(
            "enter",
            extra={"fn": fn.__name__, "correlation_id": cid}
        )
        try:
            result = fn(*args, **kwargs)
            logger.info(
                "exit",
                extra={"fn": fn.__name__, "correlation_id": cid, "status": "ok"}
            )
            return result
        except Exception as e:
            logger.error(
                "error",
                extra={"fn": fn.__name__, "correlation_id": cid, "error": str(e)}
            )
            raise
    return wrapper

# At the edge: set once, propagate everywhere.
# process() stands in for the application's own request handler.
def handle_request(request):
    correlation_id.set(request.headers.get("X-Correlation-ID") or str(uuid.uuid4()))
    return process(request)

This is not complicated. It's not expensive. The reason most teams don't have it is that they added logging incrementally — one print statement at a time — and never stepped back to ask whether the sum of those statements could tell a story.
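
One practical detail: fields passed through `extra` only show up in output if the formatter emits them. A minimal sketch of a JSON formatter plus a filter that stamps the correlation ID onto every record, using only the standard library. The class names and the `traced_demo` logger name are mine, and the in-memory stream is there just to make the example checkable.

```python
import io
import json
import logging
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to records that don't carry one."""
    def filter(self, record: logging.LogRecord) -> bool:
        if not hasattr(record, "correlation_id"):
            record.correlation_id = correlation_id.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the fields are queryable."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", ""),
            "fn": getattr(record, "fn", None),
        })

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("traced_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

correlation_id.set("req-123")
logger.info("charge card", extra={"fn": "charge"})
line = json.loads(stream.getvalue())
```

With the filter in place, even log calls that forget to pass the ID still carry it, which is what makes "one query, one story" possible.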

Metrics Without a Baseline Are Just Numbers

Here's a metric: your API is returning responses in 340ms.

Is that good? Bad? Degraded from yesterday? Normal for this time of week? You cannot answer without a baseline, and most teams don't have one that's precise enough to be useful.

What typically exists is a static threshold: alert if latency exceeds 500ms. That threshold was set during initial deployment, when load was a tenth of what it is now, and hasn't been revisited since. It's not a baseline — it's a guess that calcified into a rule.

A real baseline is dynamic. It accounts for time of day, day of week, and recent trend. It flags when you're 30% above your own normal, not when you cross an arbitrary line someone set two years ago.

from collections import deque
from statistics import mean, stdev
from datetime import datetime, timezone

class AdaptiveBaseline:
    def __init__(self, window_size=1440):  # 24h of per-minute samples
        self.samples = deque(maxlen=window_size)

    def record(self, value: float):
        # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
        self.samples.append((datetime.now(timezone.utc), value))

    def is_anomalous(self, value: float, threshold_stdev: float = 2.5) -> bool:
        if len(self.samples) < 60:
            return False  # not enough data to have an opinion
        recent = [v for _, v in self.samples]
        m = mean(recent)
        s = stdev(recent)
        if s == 0:
            return False
        return abs(value - m) > threshold_stdev * s

    def summary(self) -> dict:
        if len(self.samples) < 2:
            return {}  # stdev() needs at least two samples
        values = [v for _, v in self.samples]
        return {"mean": mean(values), "stdev": stdev(values), "n": len(values)}

Static thresholds are a lazy stand-in for understanding your system's normal. They exist because setting them takes five minutes, and building real baselines takes an afternoon. That tradeoff looks different at 2am when an alert fires on a load pattern that's been there for three weeks.
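
A single rolling window treats 3am and 9am as the same population. To account for time of day and day of week, one option is to keep a separate history per (weekday, hour) bucket. The class below is a sketch of that idea with invented latency numbers; the name `SeasonalBaseline`, the bucket size, and the minimum sample count are all assumptions to tune.

```python
from collections import defaultdict, deque
from datetime import datetime, timezone
from statistics import mean, stdev

class SeasonalBaseline:
    """Compare each value only against history from the same weekday and hour."""
    def __init__(self, per_bucket: int = 240):
        self.buckets = defaultdict(lambda: deque(maxlen=per_bucket))

    @staticmethod
    def _key(ts: datetime) -> tuple[int, int]:
        return (ts.weekday(), ts.hour)

    def record(self, ts: datetime, value: float) -> None:
        self.buckets[self._key(ts)].append(value)

    def is_anomalous(self, ts: datetime, value: float,
                     threshold_stdev: float = 2.5, min_samples: int = 30) -> bool:
        history = self.buckets.get(self._key(ts))
        if history is None or len(history) < min_samples:
            return False  # not enough data for this time slot yet
        m, s = mean(history), stdev(history)
        return s > 0 and abs(value - m) > threshold_stdev * s

baseline = SeasonalBaseline()
monday_9am = datetime(2026, 5, 11, 9, 0, tzinfo=timezone.utc)
sunday_3am = datetime(2026, 5, 10, 3, 0, tzinfo=timezone.utc)
for i in range(40):
    # Invented latencies in ms: busy Monday mornings, quiet Sunday nights
    baseline.record(monday_9am.replace(minute=i), 300 + 40 * (i % 2))
    baseline.record(sunday_3am.replace(minute=i), 80 + 20 * (i % 2))

busy_hour_alert = baseline.is_anomalous(monday_9am, 320)
quiet_hour_alert = baseline.is_anomalous(sunday_3am, 320)
```

The same 320ms reading is normal on Monday morning and anomalous on Sunday night, which a single static threshold cannot express.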

What Actually Belongs in Your Dashboards

The signals that matter fall into a different category than infrastructure health. They're about whether the system is doing its job, measured in terms the business cares about.

Throughput on the critical path. Not "requests per second" in aggregate — the specific count of the transactions that matter. Orders placed. Reports generated. Messages delivered. If that number is lower than expected, something is wrong, even if all your infra metrics are green.

Queue depth and processing age. If you have async workers, the age of the oldest unprocessed item is a more honest health signal than worker CPU. A queue that's growing is a system falling behind, regardless of what the workers themselves are reporting.

Business-level error rates, not HTTP error rates. A 200 response that returns an empty result set is not a success. A job that completes without exception but produces zero output has failed. You need to define success in terms of what the system was supposed to produce, then measure whether it produced it.

Derivative metrics. If your checkout conversion rate drops from 68% to 51%, that's a signal — even if no individual service is throwing errors. Tracking rates and ratios, not just raw counts, catches the class of failures where something is working but working worse.

# Prometheus recording rules — compute these, don't query them live
groups:
  - name: business_health
    interval: 60s
    rules:
      # sum() first so each rule yields one series regardless of instance labels
      - record: job:orders_per_minute:rate
        expr: sum(rate(orders_completed_total[5m])) * 60

      - record: job:checkout_conversion:ratio
        expr: |
          sum(rate(checkouts_completed_total[10m]))
          / sum(rate(checkout_initiated_total[10m]))

      - record: job:queue_age_seconds:max
        expr: time() - min(job_enqueued_timestamp_seconds)

  - name: alerts
    rules:
      - alert: ConversionRateDrop
        expr: job:checkout_conversion:ratio < 0.55
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Checkout conversion below 55% for 5+ minutes"

      - alert: QueueProcessingStalled
        expr: job:queue_age_seconds:max > 300
        for: 2m
        labels:
          severity: warning
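
The rules above assume the application already exports business-level counters. Deciding what counts as success happens in code first. Here is a small sketch of that classification, assuming a job reports its HTTP status and how many rows it produced. `JobResult` and the `min_rows` cutoff are hypothetical, and an empty result is only a failure where the job is expected to produce output.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class JobResult:
    http_status: int
    rows_produced: int

def business_outcome(result: JobResult, min_rows: int = 1) -> str:
    """Classify by what the job was supposed to produce, not by status code."""
    if result.http_status >= 400:
        return "transport_error"
    if result.rows_produced < min_rows:
        return "empty_success"  # a 200 that did no useful work
    return "ok"

def business_error_rate(results: list[JobResult]) -> float:
    outcomes = Counter(business_outcome(r) for r in results)
    total = sum(outcomes.values())
    return (total - outcomes["ok"]) / total if total else 0.0

def http_error_rate(results: list[JobResult]) -> float:
    if not results:
        return 0.0
    return sum(r.http_status >= 400 for r in results) / len(results)

results = [JobResult(200, 12), JobResult(200, 0), JobResult(500, 0), JobResult(200, 3)]
business_rate = business_error_rate(results)
http_rate = http_error_rate(results)
```

On this invented sample the HTTP error rate is 25% while the business error rate is 50%. The gap between the two numbers is the part your infrastructure dashboard cannot see.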

Alerts Should Be Harder to Silence Than to Fix

The last thing most teams get wrong is the incentive structure around noise. When alerts fire too often on non-issues, engineers start ignoring them — or worse, start routing around them. The standard fix is to raise thresholds and add retry logic so the alert doesn't fire. This is treating the symptom. The alert was lying because the metric was wrong, and the right fix is to measure something that's actually meaningful.

There's a useful rule here: if an alert fired and the on-call engineer's first instinct was to check whether it was a false positive, the alert is already broken. A good alert should produce a specific, directed response — not a "let me see if this is real" investigation. If you find yourself constantly confirming that real alerts are real, your signal-to-noise ratio is telling you something.

Flaky alerts are the observability equivalent of flaky tests. You know you have them. You've learned to distrust them. And every week they stay in the rotation makes you slightly less responsive to the ones that actually matter.

Track your alert false-positive rate like you track your error rate. Alert on your alerts. Set a rule that any alert firing more than twice without a corresponding incident review gets flagged for audit. This sounds bureaucratic until the first time you catch that a critical alert has been misfiring for three weeks and nobody noticed because the team had learned to dismiss it.
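
The audit rule is simple enough to script. A sketch, assuming you can export alert firings with two flags: whether an incident review was linked and whether the firing needed action. The `Firing` record and the `DiskLatencyHigh` alert name are invented for the example.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Firing:
    alert: str
    reviewed: bool    # an incident review is linked to this firing
    actionable: bool  # the on-call engineer actually had to do something

def alerts_to_audit(firings: list[Firing], max_unreviewed: int = 2) -> list[str]:
    """Alerts that fired more than max_unreviewed times with no review."""
    unreviewed = Counter(f.alert for f in firings if not f.reviewed)
    return sorted(name for name, n in unreviewed.items() if n > max_unreviewed)

def false_positive_rate(firings: list[Firing], alert: str) -> float:
    relevant = [f for f in firings if f.alert == alert]
    if not relevant:
        return 0.0
    return sum(not f.actionable for f in relevant) / len(relevant)

history = (
    [Firing("DiskLatencyHigh", reviewed=False, actionable=False)] * 3
    + [Firing("QueueProcessingStalled", reviewed=False, actionable=True)] * 2
    + [Firing("ConversionRateDrop", reviewed=True, actionable=True)]
)
flagged = alerts_to_audit(history)
noisy_rate = false_positive_rate(history, "DiskLatencyHigh")
```

Running something like this weekly turns "we know that alert is flaky" into a list with names on it.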

What You're Actually Missing

Most observability stacks are built to answer one question: is the system up? That's a fine question. It's just not the most important one.

The more useful questions are: is the system doing what users need? Is it doing it as well as it was yesterday? Is anything changing that I should know about before it becomes a problem?

Those questions require measuring at the level of behavior and outcome, not infrastructure and response codes. They require traces that tell a story instead of logs that record events. They require baselines instead of thresholds, and business metrics instead of system metrics.

None of this is exotic. The tooling exists — OpenTelemetry, Prometheus recording rules, structured logging with correlation IDs. The gap isn't tooling. It's the habit of reaching for the infrastructure dashboard first and calling it observability.

Start with one question: if your system silently started doing the wrong thing at 3am, how long would it take you to find out? If the answer is "until a user complained," your dashboards are watching the machine, not the work.

That's the thing worth fixing.

blog.bennerdo.org

Benard Otieno — Fri, 08 May 2026 09:09:21 +0000