Mike Davis
Why I Stopped Mocking 2FA in Tests (And What I Do Instead)

I mocked OTP verification for two years. Then a bug hit production that my test suite should have caught. Here's what went wrong and how I fixed the approach.


I'll start with the bug that changed my mind.

Last summer, we pushed a release that broke the phone number change flow in our app. A user would enter their new number, receive an OTP, type it in — and get a 400 error. The code was correct in every way except one: someone refactored the verification endpoint and accidentally changed the OTP field name from otp_code to verification_code in the request body. The frontend still sent otp_code. The backend no longer recognized it.

Our test suite had 100% coverage on this flow. Every test passed. Green across the board.

Why? Because we were mocking the entire OTP layer.

The Mock That Lies

Here's roughly what our test setup looked like:

# conftest.py — the old way
import pytest

@pytest.fixture
def mock_otp(monkeypatch):
    """Skip OTP verification entirely in tests."""
    monkeypatch.setenv("SKIP_OTP_VERIFICATION", "true")

# views.py — the actual verification view:
import os
from rest_framework.response import Response

def verify_otp(request):
    if os.environ.get("SKIP_OTP_VERIFICATION"):
        return Response({"status": "verified"}, status=200)

    otp_code = request.data.get("otp_code")
    # ... actual verification logic

You see the problem. When SKIP_OTP_VERIFICATION is set, the function returns early. It never reads request.data. It never checks field names. It never calls the downstream SMS provider. The entire code path that actually matters — the one users hit in production — is invisible to the test suite.

We weren't testing our verification flow. We were testing a shortcut.

"But Everyone Does This"

I know, I know. Mocking external dependencies is Testing 101. And to be fair, there are legitimate reasons people mock OTP:

Speed. Waiting for a real SMS adds 5–15 seconds per test. If you have 30 verification tests, that's up to 7 extra minutes per CI run.

Cost. Services like Twilio charge per SMS. At $0.0079/message for US numbers, 30 tests × 5 CI runs/day × 22 days = $25/month just for tests. Not huge, but it adds up.

Flakiness. Real SMS delivery is non-deterministic. Networks hiccup. Carrier filtering eats messages. Timeouts happen. Nobody wants a red build because Vodafone's gateway in Frankfurt had a bad morning.

These are valid concerns. I'm not going to pretend otherwise. But there's a spectrum between "mock everything" and "mock nothing," and I was way too far on the wrong end.

What the Mock Was Actually Hiding

After the otp_code → verification_code incident, I did an audit. I went through every test that involved OTP and asked: "What production behavior is this test actually exercising?"

The answer was depressing. Here's what our mocks were hiding:

Field name mismatches. The bug that started this whole thing. Mock returns early, never reads the request body, never catches field renames.

Timing issues. Our OTP codes expire after 5 minutes. We had a bug where the expiry check used datetime.now() instead of datetime.utcnow(), causing failures for users in certain timezones. The mock bypassed the expiry check entirely.

Rate limiting. We rate-limit OTP requests to 3 per phone number per hour. A refactor accidentally reset the counter on every request instead of incrementing it. No test caught it — the mock never touched the rate limiter.

Error handling. What happens when the SMS provider returns a 503? When the phone number format is invalid? When the country code doesn't match the number? All of these paths were invisible because the mock replaced the entire function.
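The timing bug in that list is worth spelling out, because it's so easy to write. A minimal sketch of a timezone-safe expiry check (the function name and signature are illustrative, not our actual module):

```python
from datetime import datetime, timedelta, timezone

OTP_TTL = timedelta(minutes=5)

def is_expired(created_at, now=None):
    """True if the OTP is older than OTP_TTL.

    Both datetimes are timezone-aware UTC. The original bug compared
    against naive datetime.now(), which drifts by the server's local
    offset and silently expired codes for users in some timezones.
    """
    now = now or datetime.now(timezone.utc)
    return now - created_at > OTP_TTL
```

With the mock returning early, a test can never hit this function, so no assertion on freshly-created vs. six-minute-old codes ever runs.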

I counted seven distinct bugs in the previous six months that our tests should have caught but didn't, all because of the OTP mock.

The Middle Ground: Mock the Network, Not the Logic

My first instinct was to rip out all mocks and hit a real SMS provider in every test. That's the other extreme, and it's not great either — you get slow, flaky, expensive tests that fail for reasons unrelated to your code.

Instead, I landed on a layered approach:

Layer 1: Mock at the HTTP boundary (unit/integration tests)

Instead of skipping verification logic, I mock the HTTP call to the SMS provider. The key difference: all of our code still runs. Field parsing, validation, rate limiting, expiry checks — everything executes normally. Only the final httpx.post() to Twilio (or whoever) gets intercepted.

# conftest.py — the new way
import httpx
import pytest
import respx

@pytest.fixture
def mock_sms_provider():
    """Mock the external SMS API, but run all our verification logic."""
    with respx.mock:
        # Mock the "send OTP" endpoint (any account SID)
        respx.post(
            url__regex=r"https://api\.twilio\.com/2010-04-01/Accounts/.+/Messages\.json"
        ).mock(
            return_value=httpx.Response(200, json={
                "sid": "SM_test_123",
                "status": "queued",
            })
        )
        yield

@pytest.fixture
def known_otp(mock_sms_provider, monkeypatch):
    """Fix the OTP value so tests can 'enter' it, but keep all logic."""
    monkeypatch.setattr("myapp.auth.otp.generate_code", lambda: "482901")
    return "482901"

Now the test actually sends a request with the correct field names, our view parses request.data, validates the format, checks the rate limiter, calls generate_code() (which returns a fixed value), and stores the OTP in Redis with a TTL. The only thing that doesn't happen is a real HTTP call to Twilio.

The verification side works symmetrically — the test submits the known OTP value, and our code runs the full check: look up the stored code, compare, check expiry, invalidate after use.
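That full check is small enough to sketch. Here a plain dict stands in for Redis, and the function names are illustrative rather than our production module:

```python
import time

_codes = {}  # stand-in for Redis: phone -> (code, expires_at)

def store_code(phone, code, ttl=300):
    # Equivalent of a Redis SETEX: store the code with a TTL.
    _codes[phone] = (code, time.time() + ttl)

def verify_code(phone, code):
    entry = _codes.get(phone)
    if entry is None:
        return False              # never requested, or already used
    stored, expires_at = entry
    if time.time() > expires_at:
        _codes.pop(phone, None)
        return False              # expired
    if stored != code:
        return False              # wrong code; keep it for retries
    _codes.pop(phone, None)       # invalidate after first successful use
    return True
```

Every branch here — missing code, expiry, mismatch, single-use invalidation — is exactly the logic the old `SKIP_OTP_VERIFICATION` shortcut jumped over.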

This approach caught three of the seven bugs I mentioned within the first week of switching.

Layer 2: Real SMS in a nightly E2E suite

For the full end-to-end confidence, I run a smaller test suite once per night that actually sends and receives real SMS messages. This suite covers the critical paths: sign-up, password reset, phone change, 2FA enable/disable.

I won't go deep into the provider setup here — that's a whole separate post I'm working on — but the key idea is:

@pytest.fixture
def real_phone_number(sms_provider):
    """Get a real temporary number from an external provider."""
    # result is a provider payload; the "number"/"id" keys are provider-specific
    result = sms_provider.get_number(country="US", service="myapp")
    yield result["number"], result
    sms_provider.release(result["id"])

def test_full_signup_with_real_otp(page, sms_provider, real_phone_number):
    """E2E: complete signup with actual SMS delivery."""
    phone, activation = real_phone_number

    page.goto("/signup")
    page.fill("[name=phone]", phone)
    page.click("button[type=submit]")

    # Wait for real OTP from the provider
    otp = sms_provider.wait_for_code(activation["id"], timeout=20)

    page.fill("[name=otp]", otp)
    page.click("button#verify")

    expect(page.locator(".welcome-message")).to_be_visible()

This runs against staging, uses real phone numbers, and receives real OTP codes. It's slow (~45 seconds per test) and costs a few cents per run. But it runs once a day, covers 8 critical paths, and has caught two carrier-related delivery issues that no amount of mocking would have revealed.
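Under the hood, wait_for_code is essentially a polling loop. A hedged sketch of the idea — fetch_messages is an injected callable here, while the real provider client obviously has its own API:

```python
import re
import time

def wait_for_code(fetch_messages, timeout=20, interval=2.0):
    """Poll fetch_messages() until an SMS containing a 6-digit code arrives."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for text in fetch_messages():
            match = re.search(r"\b(\d{6})\b", text)
            if match:
                return match.group(1)
        time.sleep(interval)
    raise TimeoutError(f"no OTP received within {timeout}s")
```

The timeout is the knob that matters: too short and carrier delays turn into flaky reds, too long and a genuinely broken flow blocks the suite for minutes.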

The Split in Practice

Here's how it breaks down in our CI:

# .github/workflows/ci.yml
on:
  push:
  schedule:
    - cron: "0 2 * * *"  # nightly

jobs:
  unit-and-integration:
    # Runs on every push — fast, mocked at HTTP boundary
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - run: pytest tests/ -m "not e2e" --timeout=30

  nightly-e2e:
    # Runs once per day — slow, real SMS
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - run: pytest tests/e2e/ -m "e2e" --timeout=120

The daily CI takes ~4 minutes. The nightly E2E takes ~12 minutes. Both are stable — the nightly suite has had 3 flaky failures in the last two months, all due to SMS delivery timeouts on the provider side, not our code.

The Refactoring Trick That Made This Possible

One thing that made the transition easier: I extracted all OTP logic into a single module. Before, verification code was scattered across three views, two serializers, and a middleware. Mocking was the only sane option because there was no single seam to intercept.

After refactoring:

myapp/
  auth/
    otp.py          # generate, store, verify, expire — all OTP logic lives here
    providers/
      base.py       # abstract SMS provider interface
      twilio.py     # production provider
      mock.py       # test provider (for Layer 1)

The otp.py module exposes four functions: generate_code(), send_code(phone, code), verify_code(phone, code), and check_rate_limit(phone). Every view calls these functions instead of implementing its own OTP handling. This means I have exactly one place to mock (send_code's HTTP call) and one place to fix when something breaks.
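The providers/ split is a plain strategy pattern. A minimal sketch of what base.py and mock.py might contain — the interface details are illustrative, not our actual code:

```python
from abc import ABC, abstractmethod

class SMSProvider(ABC):
    """The seam: otp.send_code depends on this, never on Twilio directly."""

    @abstractmethod
    def send(self, phone: str, message: str) -> str:
        """Send an SMS; return a provider-side message id."""

class MockSMSProvider(SMSProvider):
    """Test double: records messages instead of sending them."""

    def __init__(self):
        self.sent = []

    def send(self, phone, message):
        self.sent.append((phone, message))
        return f"mock-{len(self.sent)}"
```

In Layer 1 tests, the mock provider also lets you assert on *what* was sent (phone number, message body), not just that something was sent.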

If your OTP logic is spread across your codebase, start here. The mock cleanup becomes trivial once the logic is consolidated.

What I Wish Someone Had Told Me

Mocking isn't the problem — mocking at the wrong layer is. There's a massive difference between "skip this entire feature in tests" and "intercept the external HTTP call but run everything else." The first gives you false confidence. The second gives you real coverage with controlled dependencies.

The SKIP_* env var pattern is a code smell. If your production code has if TEST_MODE: return early, you're not testing your production code. You're testing a fork of it that only exists in CI. Treat any such pattern as tech debt.

You don't need real SMS for every test. Layer 1 (mock at HTTP boundary) catches 90% of bugs. Layer 2 (real SMS) catches the remaining carrier/delivery edge cases. Running real SMS on every push is overkill and will make your team hate the test suite.

Track what your mocks hide. After every production bug, ask: "Would our test suite have caught this?" If the answer is "no, because the mock bypasses that code path," you have a mock problem. I keep a running doc of these incidents — it's been the best argument for investing in better test infrastructure.


Next up, I'm going to write about the practical side of integrating real SMS providers into a CI/CD pipeline — which services work, which ones don't, and the pytest fixtures that tie it all together. If you've dealt with this problem, I'd love to hear how you approached it.

All code examples are simplified. In reality, you'll want proper async handling, retry logic, and secret management. But the architecture holds.
