Benard Otieno

Posted on May 19

Unit Tests Are Overrated and You Know It

#ai #architecture #devops #productivity

We test the wrong things obsessively and the right things barely at all. The unit test orthodoxy has produced codebases with 90% coverage that break constantly in production. It's time to say this out loud.

I'm going to say something that will make some people close this tab
immediately: most unit tests are not worth the time it takes to write
and maintain them, and the culture around unit testing has caused more
harm to software quality than it has prevented.

Not all unit tests. Not testing in general. Specifically the orthodoxy
that says you should test every function, mock every dependency, aim for
maximum coverage, and measure quality by how many green checkmarks your
test runner produces.

That orthodoxy is producing codebases that are simultaneously over-tested
and under-validated. Teams spend enormous engineering hours maintaining
test suites that don't catch the bugs that actually affect users. Developers
spend more time making tests pass than making software work. Coverage
reports read ninety percent while services break every other deployment.

If this makes you uncomfortable, good. Stay with the discomfort for a
minute, because the alternative is continuing to do something that doesn't
work while calling it best practice.

What unit tests actually test

A unit test tests a unit of code in isolation. The unit is typically a
function or a class. The dependencies of that unit — other functions,
databases, external services — are replaced with mocks or fakes that
return controlled responses.

This is valuable for exactly one category of problem: logic that lives
in pure functions, isolated from external state, where the relationship
between input and output is the entire thing being tested.

from decimal import Decimal

# This is worth unit testing. The logic is the point.
def calculate_discount(
    base_price: Decimal,
    customer_tier: str,
    order_quantity: int,
) -> Decimal:
    if customer_tier == "enterprise":
        tier_discount = Decimal("0.20")
    elif customer_tier == "pro":
        tier_discount = Decimal("0.10")
    else:
        tier_discount = Decimal("0.00")

    quantity_discount = Decimal("0.05") if order_quantity >= 100 else Decimal("0.00")
    total_discount = min(tier_discount + quantity_discount, Decimal("0.25"))

    return base_price * (1 - total_discount)

A unit test for this function is testing the right thing. The function
is pure. Its behavior is entirely determined by its inputs. There are
no external dependencies to mock. The test directly validates the
business logic.
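
As a rough sketch, tests for this function might look like the following. The function is repeated so the snippet runs on its own; the test names and the specific prices are illustrative choices, not taken from a real suite.

```python
from decimal import Decimal


def calculate_discount(base_price: Decimal, customer_tier: str, order_quantity: int) -> Decimal:
    # Same logic as above, repeated so this snippet is self-contained.
    if customer_tier == "enterprise":
        tier_discount = Decimal("0.20")
    elif customer_tier == "pro":
        tier_discount = Decimal("0.10")
    else:
        tier_discount = Decimal("0.00")

    quantity_discount = Decimal("0.05") if order_quantity >= 100 else Decimal("0.00")
    total_discount = min(tier_discount + quantity_discount, Decimal("0.25"))
    return base_price * (1 - total_discount)


def test_unknown_tier_small_order_pays_full_price():
    assert calculate_discount(Decimal("100"), "free", 1) == Decimal("100")


def test_pro_tier_gets_ten_percent():
    assert calculate_discount(Decimal("100"), "pro", 1) == Decimal("90")


def test_quantity_discount_starts_at_exactly_100_units():
    # Boundary check: 99 units gets no quantity discount, 100 units does.
    assert calculate_discount(Decimal("100"), "pro", 99) == Decimal("90")
    assert calculate_discount(Decimal("100"), "pro", 100) == Decimal("85")


def test_enterprise_bulk_order_reaches_the_25_percent_cap():
    assert calculate_discount(Decimal("100"), "enterprise", 100) == Decimal("75")
```

No mocks, no fixtures, and each failure points at one specific rule.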

Now look at what most unit tests actually test:

from unittest.mock import Mock, patch

# This is what unit tests look like in most codebases.
@patch('app.services.payment.stripe_client')
@patch('app.services.payment.db')
@patch('app.services.payment.email_service')
@patch('app.services.payment.inventory_service')
async def test_process_payment(
    mock_inventory,
    mock_email,
    mock_db,
    mock_stripe,
):
    mock_stripe.create_payment_intent.return_value = Mock(id="pi_123", status="succeeded")
    mock_db.get_order.return_value = Mock(id="order_1", total=49.99, user_id="user_1")
    mock_inventory.reserve.return_value = True
    mock_email.send.return_value = None

    result = await process_payment("order_1")

    assert result.status == "completed"
    mock_stripe.create_payment_intent.assert_called_once()
    mock_inventory.reserve.assert_called_once()
    mock_email.send.assert_called_once()

What is this test actually testing? It is testing that when everything
works exactly as mocked, the function calls the mocked Stripe, inventory,
and email services once each and returns the expected result. It does not
check the arguments or the order of those calls.

It is not testing what happens when Stripe returns an error. It is not
testing what happens when the database is unavailable. It is not testing
what happens when inventory reservation fails after payment succeeds,
leaving a paid order in a broken state. It is not testing the actual
integration between these components.

It is testing that the code is wired together the way it was wired
together when the test was written. It is a snapshot of the implementation
masquerading as a validation of the behavior.

And it will pass green on every run until the day something real breaks
in production, at which point it will still pass green because the mocks
are still returning what you told them to return.

The mock problem

Mocks are the original sin of unit testing culture. They were created
to solve a real problem — tests that depend on external services are
slow, unreliable, and hard to set up — and they solved that problem
by replacing the external service with a fake version that does whatever
the test needs it to do.

The consequence is that your test suite no longer tests your software.
It tests your software's interaction with your software's assumptions
about how its dependencies behave. When those assumptions are wrong —
when the real Stripe API returns a response shape that's slightly
different from what you mocked, when the real database has a different
transaction isolation level than your mock assumes, when the real email
service deduplicates in a way your mock doesn't — your tests pass and
your production breaks.

I have debugged more production incidents that were caused by the gap
between mocked behavior and real behavior than I can count. The test
said it worked. The mock said the API returned this. The real API
does not return this. The test was wrong about the contract, and because
the test was wrong, the code was deployed with a broken assumption that
nobody caught.

The more you mock, the less your tests tell you about whether the
software works. This is not a design smell to be managed — it's a
fundamental property of mocking. Every mock is a place where reality
has been replaced with assumption.
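
A minimal sketch of that gap, using only the standard library. `is_paid` and `DictReturningClient` are hypothetical names invented for this illustration, and the dict-shaped response is an assumed example of a contract mismatch, not a claim about what any real payment SDK returns.

```python
from unittest.mock import Mock


def is_paid(client, order_id: str) -> bool:
    # Code under test: assumes the response object exposes a `.status` attribute.
    intent = client.create_payment_intent(order_id)
    return intent.status == "succeeded"


class DictReturningClient:
    """Hypothetical stand-in for a real client whose response is a dict."""

    def create_payment_intent(self, order_id: str) -> dict:
        return {"id": "pi_123", "status": "succeeded"}


def mocked_check_passes() -> bool:
    # The mock returns exactly the shape the code already assumes.
    client = Mock()
    client.create_payment_intent.return_value = Mock(status="succeeded")
    return is_paid(client, "order_1")


def real_shape_breaks() -> bool:
    # Same code, but the response is shaped differently from the mock.
    try:
        is_paid(DictReturningClient(), "order_1")
    except AttributeError:
        return True  # a dict has no `.status` attribute
    return False
```

The mocked check stays green forever, because the mock encodes the same assumption as the code it is supposed to verify.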

The coverage lie

Coverage is the most destructive metric in software engineering.

Not because high coverage is bad. Because coverage as a target produces
the wrong behavior. When coverage is a goal, developers write tests to
cover code rather than to validate behavior. These are different
activities that produce very different tests.

A test written to cover code asks: how do I execute this line?
A test written to validate behavior asks: what should this system do,
and how do I know it's doing it?

Tests written to cover code tend to be thin — they call the function
with happy-path inputs and assert that it doesn't throw. They increase
coverage. They do not increase confidence.

import pytest

# Written to cover code. Gets you to 100% on this function.
def test_create_user():
    user = create_user(email="test@example.com", password="password123")
    assert user is not None

# Written to validate behavior. Tests what actually matters.
def test_create_user_hashes_password():
    user = create_user(email="test@example.com", password="password123")
    assert user.password_hash != "password123"
    assert verify_password("password123", user.password_hash)

def test_create_user_rejects_duplicate_email():
    create_user(email="test@example.com", password="password123")
    with pytest.raises(DuplicateEmailError):
        create_user(email="test@example.com", password="different")

def test_create_user_sends_verification_email(fake_email_sender):
    create_user(email="test@example.com", password="password123")
    assert any(
        email.to == "test@example.com" and "verify" in email.subject.lower()
        for email in fake_email_sender.sent
    )

def test_create_user_with_invalid_email():
    with pytest.raises(ValidationError, match="invalid email"):
        create_user(email="not-an-email", password="password123")

If the function is simple, the second set reports the same line coverage
as the first, yet it tests fundamentally different things. A system with
the first kind of tests has coverage. A system with the second kind has
confidence.

Coverage rewards quantity. Confidence comes from quality. High coverage
does not imply high confidence, and treating the two as interchangeable
has produced an industry-wide habit of writing many low-value tests
instead of fewer high-value ones.
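
To make the difference concrete, here is a small sketch with a deliberate bug. `normalize_email` is a hypothetical helper invented for this example. Both checks execute every line of it, so both would report 100% line coverage, but only one notices the bug.

```python
def normalize_email(raw: str) -> str:
    # Deliberate bug for illustration: strips whitespace but never lowercases.
    email = raw.strip()
    return email


def coverage_style_check() -> bool:
    # Executes every line of normalize_email and asserts almost nothing.
    return normalize_email("  Ada@Example.COM ") is not None


def behavior_style_check() -> bool:
    # Executes the same lines, but states what the result should be.
    return normalize_email("  Ada@Example.COM ") == "ada@example.com"
```

The coverage-style check passes and the behavior-style check fails, which is the only useful signal here.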

What actually breaks in production

Here is a list of things that unit tests, as typically practiced,
will never catch:

The query that works correctly against your test database with twenty rows and times out against production with two million rows
The race condition that only manifests when two requests hit the same endpoint within fifty milliseconds of each other
The API response from your payment provider that changed shape slightly in a minor version update
The session expiry behavior that's different in the production Redis configuration than in the in-memory fake you test against
The cascade delete behavior that your ORM handles differently than the raw SQL you use in the migration script
The encoding issue that only appears when a user's name contains a character outside the ASCII range
The timeout that is set correctly in the service but not propagated to the client that calls it

Every item on this list is a production incident I have personally
been part of. None of them was caught by unit tests. Most of them
would have been caught by integration tests that weren't written
because the team was busy maintaining the unit test suite.

This is the trade you make when you prioritize unit testing: you get
fast, reliable tests that validate your assumptions, and you skip the
slower, harder tests that would challenge them.

What to do instead

I am not arguing for no tests. I'm arguing for tests calibrated to
where the real risk is.

Integration tests over unit tests for anything with dependencies.

If a function touches a database, a cache, a message queue, or an
external service — test it against the real thing, or as close to
the real thing as you can get. Not a mock. Not an in-memory fake
that you wrote. A real database with a real schema and real data
volumes. A real Redis instance. A real queue.

Yes, these tests are slower. Run them in CI, not on every save.
They are dramatically more valuable than unit tests that mock the
same dependencies because they test what the code actually does,
not what you assumed the code would do.
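
One way to wire up "in CI, not on every save" with pytest is a `conftest.py` that skips integration tests unless a flag is passed. This is a sketch adapted from the pattern the pytest documentation shows for optional slow tests. The `--run-integration` flag name is an assumption, and the marker name is assumed to be `integration`.

```python
# conftest.py
import pytest


def pytest_addoption(parser):
    # Hypothetical flag name; a CI job would run `pytest --run-integration`.
    parser.addoption(
        "--run-integration",
        action="store_true",
        default=False,
        help="also run tests marked with @pytest.mark.integration",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", "integration: test talks to real external dependencies"
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-integration"):
        return  # flag given: run everything
    skip_integration = pytest.mark.skip(reason="needs --run-integration")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip_integration)
```

A plain `pytest` run then stays fast locally, and the CI job opts in with the flag.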

from decimal import Decimal

import pytest

# This is worth the setup cost. It catches real problems.
@pytest.mark.integration
async def test_process_payment_handles_stripe_card_declined(
    test_db,       # Real PostgreSQL, real schema
    stripe_mock,   # Fixture for Stripe's own test mode, not a hand-written mock
):
    order = await create_test_order(test_db, total=Decimal("49.99"))

    # Stripe's test mode provides test tokens that trigger specific behaviors
    result = await process_payment(
        order_id=order.id,
        card_token="tok_chargeDeclined",  # Stripe test token for declines
    )

    assert result.status == "failed"
    assert result.failure_code == "card_declined"

    # Verify the order status was updated correctly in the real database
    updated_order = await test_db.fetch_one(
        "SELECT status FROM orders WHERE id = $1",
        order.id
    )
    assert updated_order["status"] == "payment_failed"

    # Verify no inventory was reserved for a failed payment
    reservation = await test_db.fetch_one(
        "SELECT id FROM inventory_reservations WHERE order_id = $1",
        order.id
    )
    assert reservation is None

This test uses a real database and Stripe's test environment. It is
slower than a mocked unit test. It tests whether the actual system
behaves correctly when a real dependency reports a failure, and whether
the data it leaves behind is consistent. It is the test you actually need.
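
The same idea can be sketched in a form that runs without any setup. SQLite, from the Python standard library, stands in for the database here purely so the snippet is runnable. It is a real SQL engine, but the argument above is for testing against the engine you run in production. The `users` table and `register_user` are hypothetical.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
    )


def register_user(conn: sqlite3.Connection, email: str) -> int:
    # Hypothetical code under test: relies on the database to enforce uniqueness.
    cursor = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    conn.commit()
    return cursor.lastrowid


def test_duplicate_email_is_rejected_by_the_real_constraint():
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    register_user(conn, "ada@example.com")

    try:
        register_user(conn, "ada@example.com")
    except sqlite3.IntegrityError:
        pass  # the engine itself refused the duplicate
    else:
        raise AssertionError("duplicate email was accepted")

    # Assert on the state the database actually ended up in.
    count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    assert count == 1
    conn.close()
```

A mocked `db.insert` would accept the duplicate without complaint, because nothing in a mock knows the schema has a UNIQUE constraint.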

Test behavior at the system boundary, not implementation in the middle.

The most valuable tests are the ones that call your API, your
message handler, your batch job — the public interface of your
system — and assert on the observable output. Not which functions
were called, not which mocks were invoked. What came out.

@pytest.mark.integration
async def test_order_api_returns_correct_status_after_payment(
    client,
    test_db,
):
    # Create an order through the API
    create_response = await client.post("/orders", json={
        "items": [{"product_id": "prod_1", "quantity": 2}]
    })
    assert create_response.status_code == 201
    order_id = create_response.json()["id"]

    # Process payment through the API
    payment_response = await client.post(f"/orders/{order_id}/pay", json={
        "card_token": "tok_visa"
    })
    assert payment_response.status_code == 200

    # Verify the order status reflects the payment
    order_response = await client.get(f"/orders/{order_id}")
    assert order_response.json()["status"] == "confirmed"
    assert order_response.json()["payment"]["status"] == "succeeded"

This test goes through the API, through the service layer, through
the database, and back. It validates the entire vertical slice. It
would catch a bug in the API handler, a bug in the service logic,
a bug in the database query, or a bug in the response serializer.
A unit test that mocked all the layers would catch none of these
except the one in the specific layer being tested.
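
A small sketch of why asserting on output ages better than asserting on calls. Both functions below are hypothetical and return the same total; the second is a harmless refactor of the first. A check on the result accepts both, while a check on how many times the dependency was called rejects one of them even though its behavior is identical.

```python
from decimal import Decimal
from unittest.mock import Mock


def order_total_per_item(prices, tax_service):
    # Original version: asks for the tax rate once per item.
    return sum((p * (1 + tax_service.rate()) for p in prices), Decimal("0"))


def order_total_single_lookup(prices, tax_service):
    # Refactored version: asks for the rate once. Same result.
    rate = tax_service.rate()
    return sum(prices, Decimal("0")) * (1 + rate)


def make_tax_service():
    service = Mock()
    service.rate.return_value = Decimal("0.10")
    return service


PRICES = [Decimal("10.00"), Decimal("20.00")]


def output_check(total_fn) -> bool:
    # Asserts only on what came out.
    return total_fn(PRICES, make_tax_service()) == Decimal("33.00")


def interaction_check(total_fn) -> bool:
    # Asserts on how the result was produced.
    service = make_tax_service()
    total_fn(PRICES, service)
    return service.rate.call_count == 1
```

`output_check` holds for both versions. `interaction_check` holds only for the single-lookup version, so a test written that way would turn a behavior-preserving refactor into a red build.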

Reserve unit tests for pure logic.

Unit tests are excellent for exactly what they're suited for:
pure functions with complex branching logic where the relationship
between input and output is the whole point. Discount calculations.
Validation rules. Data transformations. Parsing logic. Algorithms.

These are worth unit testing because the test is actually testing
the logic. There's nothing to mock. The test runs in microseconds.
Failures tell you exactly what's wrong.

For everything else — anything that touches infrastructure, anything
that coordinates between components, anything that talks to external
systems — integration tests are not just better, they're the only
tests that tell you anything true.

The heresy in full

Here is the position I'm staking out, clearly, so it can be clearly
disagreed with:

A codebase with forty percent coverage from integration tests that
test real behavior against real dependencies is more reliable than
a codebase with ninety percent coverage from unit tests that mock
every external interaction.

Coverage is not quality. Mocks are not validation. A green test suite
is not a guarantee that the software works — it's a guarantee that
the software works according to the assumptions baked into the tests,
which may or may not match reality.

The software quality crisis is not a testing crisis. We test more
than we ever have. The crisis is a misalignment between what we test
and what breaks. We test isolated units obsessively and integration
boundaries barely. The bugs live at the integration boundaries. They
always have.

The counterargument I hear most often: integration tests are slow.
Yes. They are. They are slow because they do real things. Real things
take time. The alternative is fast tests that don't do real things
and therefore don't tell you whether the real things work.

Speed is not a virtue in a test suite. Accuracy is.

I expect this to generate disagreement. That's fine. The developers
most likely to disagree are the ones who have invested the most in
unit testing culture, which makes their disagreement somewhat
self-interested. The developers most likely to agree quietly are the
ones who have been paged at 3am because a perfectly unit-tested
function didn't work the way its mocks said it would.

Those developers know. They've always known. This is just someone
finally saying it.