
mgoolden17-cyber

What 22 CI runs taught me about the gap between local dev and production

I built a multi-tenant DNS/email security audit app — dnslint — as a portfolio piece over a few months. FastAPI, async SQLAlchemy against Postgres, deployed to Render's free tier. The interesting part of building it wasn't the application code. It was that my local test suite passed cleanly while CI kept failing.

Not one or two failures. Twenty-two CI runs over the lifespan of the feature branch. Some were the same bug showing up twice while I tried fixes. Most were distinct bugs that my local environment was hiding from me.

This is a writeup of four of those failures. They're not glamorous bugs. They're the kind of thing that's obvious in hindsight and infuriating in the moment. The pattern across them is the same: my local dev environment was shaped differently than production in ways I hadn't noticed, and each shape difference hid a class of bug until CI or the deployed environment forced me to see it.

Bug 1: A trailing newline in DATABASE_URL

The first CI failure I want to talk about is a one-character bug. My GitHub Actions workflow defined DATABASE_URL as a multiline YAML string:

env:
  DATABASE_URL: |
    postgresql+asyncpg://user:pass@host:5432/dbname

That pipe (|) is YAML's "literal block scalar" syntax. It preserves newlines, including the one that ends the block: with the default "clip" chomping, the value keeps a single trailing newline. So the actual value of DATABASE_URL inside the CI environment wasn't postgresql+asyncpg://... but postgresql+asyncpg://...\n. The trailing newline broke the connection string parser in a way that produced an opaque error: "could not translate host name."

Locally, I read DATABASE_URL from a .env file via python-dotenv, which trims whitespace. So the local code path implicitly sanitized something that CI did not. My tests never had a chance to catch this because they ran against a value that had been silently cleaned before it reached SQLAlchemy.

The fix was trivial — change the YAML to a quoted single-line string. The lesson was bigger: anything that touches my code through a different runtime than the one I develop in is a potential source of bugs my tests can't catch. Environment variable parsing, file path handling, signal delivery, process startup order — every interface between my code and the world is a place where the local and deployed environments can disagree.
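As a complement to fixing the YAML, a startup guard can make this class of bug fail loudly instead of surfacing as a host name error. The sketch below is an assumption, not code from dnslint: the helper name read_database_url and its error messages are made up for illustration.

```python
import os


def read_database_url(environ=os.environ):
    """Return DATABASE_URL, rejecting values that carry stray whitespace.

    Hypothetical helper: the name and messages are illustrative only.
    """
    raw = environ.get("DATABASE_URL")
    if raw is None or not raw.strip():
        raise RuntimeError("DATABASE_URL is not set")
    stripped = raw.strip()
    if raw != stripped:
        # Report lengths rather than the value, so credentials stay out of logs.
        raise ValueError(
            "DATABASE_URL has leading or trailing whitespace "
            f"({len(raw)} characters, {len(stripped)} after stripping)"
        )
    return raw
```

Raising instead of silently stripping is a deliberate choice in this sketch: silent stripping is what python-dotenv did locally, and that is what hid the problem.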

Bug 2: unittest.mock.patch is not thread-safe

I have a test that exercises a concurrent DNS resolver. It spawns two threads, each scanning a different domain with a different resolver IP, and asserts that each thread's resolver was correctly isolated via Python's ContextVar. Locally, it passed. In CI, it failed intermittently with an assertion that looked like this:

AssertionError: assert None == '1.1.1.1'
 where None = {'t2': '1.1.1.1'}.get('t1')

What was happening: only one of the two threads was recording its resolver IP. The other thread's result was just gone.

I assumed at first that the ContextVar isolation was broken. It wasn't. The bug was somewhere else entirely.

The test was structured like this — each thread called unittest.mock.patch inside its own scope to install a side-effect function:

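from unittest.mock import patch

# seen, _resolver_var, and _mock_checks are defined elsewhere in the test module.
def _capture_resolver(name, ip):
    def _checks(domain):
        # Record which resolver the calling thread's context currently holds.
        seen[name] = _resolver_var.get().nameservers[0]
        return _mock_checks(domain)

    # Each thread installs its own patch on the same module attribute.
    with patch("dnslint_core.run_all_checks", side_effect=_checks):
        run_scan("example.com", resolver=ip)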

The problem is in what unittest.mock.patch actually does under the hood. It doesn't install a thread-local mock. It mutates a module-global attribute — literally dnslint_core.run_all_checks = mock_fn on entry, and restores the original on exit. When two threads enter their own with patch(...) blocks at overlapping times, the second thread's __enter__ overwrites the first thread's mock function while the first thread is still inside its with block.

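So Thread 1's _checks closure (which would have written to seen["t1"]) got replaced by Thread 2's _checks closure (which writes to seen["t2"]) before Thread 1's run_scan call actually invoked the mock. By the time Thread 1's scan triggered the mocked function, it was running Thread 2's closure. The seen["t2"] key got written twice, once by each thread, and seen["t1"] was never written. That matches the assertion output: the only entry is 't2', and it holds 1.1.1.1, the resolver IP the test expected under 't1'. The ContextVar was doing its job; the write just landed under the wrong key.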

Locally, the threads happened to interleave in a way that worked. In CI, on a faster runner with different scheduling, they didn't.
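The overwrite can be reproduced deterministically without any DNS code. In this minimal sketch, fake_core is a hypothetical stand-in for dnslint_core, and two barriers force the overlap that CI produced by chance: both threads enter their own patch before either one calls the patched function.

```python
import threading
import types
from unittest.mock import patch

# Hypothetical stand-in for the dnslint_core module.
fake_core = types.ModuleType("fake_core")
fake_core.run_all_checks = lambda domain: "real"

both_patched = threading.Barrier(2)
both_called = threading.Barrier(2)
results = {}


def worker(name):
    # Each thread installs "its own" mock, which simply returns the thread's name.
    with patch.object(fake_core, "run_all_checks", side_effect=lambda domain: name):
        both_patched.wait()  # both threads are now inside their with blocks
        results[name] = fake_core.run_all_checks("example.com")
        both_called.wait()   # keep both patches active until both calls finish


threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both entries hold the same name: whichever thread patched last won for everyone.
print(results)
```

Which name wins depends on which thread patched last, but the two entries always match, because there is only one module attribute to overwrite. The exit order matters too: a thread that entered second saved the first thread's mock as the "original" and puts it back on exit, so the module can be left patched after both with blocks have closed.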

The fix was to install one patch outside the threads, with a single dispatcher that reads the ContextVar to figure out which resolver is active for the calling thread:

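import threading
from unittest.mock import patch

# seen, lock, runner, _resolver_var, and _mock_checks are defined elsewhere in the test module.
def shared_checks(domain):
    # The ContextVar is per-thread, so this reads the calling thread's resolver.
    ip = _resolver_var.get().nameservers[0]
    with lock:
        seen[threading.get_ident()] = ip
    return _mock_checks(domain)

# One patch, installed before any thread starts and removed after both finish.
with patch("dnslint_core.run_all_checks", side_effect=shared_checks):
    t1 = threading.Thread(target=runner, args=("1.1.1.1",))
    t2 = threading.Thread(target=runner, args=("8.8.8.8",))
    t1.start()
    t2.start()
    t1.join()
    t2.join()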

One patch, never re-entered. The dispatcher uses the genuinely thread-local ContextVar to figure out what each thread should see.
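For readers who want to run the pattern end to end, here is a self-contained sketch of the same idea. Everything in it is a stand-in: fake_core replaces dnslint_core, resolver_ip is a simplified ContextVar holding a plain string, and this run_scan is a toy that only sets the variable and calls the patched function. It assumes nothing about how the real run_scan works.

```python
import contextvars
import threading
import types
from unittest.mock import patch

# Hypothetical stand-ins for the real module and resolver ContextVar.
fake_core = types.ModuleType("fake_core")
fake_core.run_all_checks = lambda domain: {"domain": domain}
resolver_ip = contextvars.ContextVar("resolver_ip", default=None)


def run_scan(domain, resolver):
    # A value set here is only visible in the calling thread's context.
    resolver_ip.set(resolver)
    return fake_core.run_all_checks(domain)


seen = {}
lock = threading.Lock()


def shared_checks(domain):
    # One dispatcher for every thread; the ContextVar says who is calling.
    ip = resolver_ip.get()
    with lock:
        seen[threading.current_thread().name] = ip
    return {"domain": domain}


# A single patch wraps the whole threaded section, so nothing is overwritten.
with patch.object(fake_core, "run_all_checks", side_effect=shared_checks):
    threads = [
        threading.Thread(target=run_scan, args=("example.com", "1.1.1.1"), name="t1"),
        threading.Thread(target=run_scan, args=("example.org", "8.8.8.8"), name="t2"),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print(seen)
```

Keying seen by thread name rather than threading.get_ident() is just a readability choice for the sketch; either works as long as the assertion looks up the same key.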

The lesson here is about what testing tools actually do. mock.patch is the most-used mocking utility in the Python standard library. It's documented as thread-unsafe, but the docs are easy to miss, and the failure mode is statistical — the patch overwrites happen during a short window, and most test runs schedule around them by accident. Reading the source of your own tools, especially the ones that mutate global state, is the only reliable way to know what they'll do under concurrency you didn't design around. My laptop hid this for months; CI's different scheduler made it reliably visible within a few runs.

Bug 3: SQLite vs Postgres engine config

This one cost me three CI runs and an hour of confused debugging.

My SQLAlchemy engine factory looked roughly like this:

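from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    settings.database_url,
    pool_pre_ping=True,  # test each connection before handing it out
    pool_size=5,
    max_overflow=10,
)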

That looks fine. It's the default-ish configuration for an async Postgres connection pool. The problem: my local tests use SQLite (because spinning up a Postgres container for every test run is slow), and in my setup SQLAlchemy paired the async SQLite driver, aiosqlite, with a StaticPool or NullPool. Those pool classes don't accept pool_size or max_overflow, so creating the engine raises an error if you pass them.

Locally I'd configured the test fixtures to skip the engine factory entirely and build their own SQLite engine. So my code path that built the real engine was never exercised by my test suite. CI ran the tests against SQLite the same way, but the integration test I'd added that hit the real engine factory blew up with "invalid argument 'pool_size' for SQLite dialect."

The fix was a dialect check in the factory:

kwargs = {"pool_pre_ping": True}
if not settings.database_url.startswith("sqlite"):
    kwargs.update(pool_size=5, max_overflow=10)
engine = create_async_engine(settings.database_url, **kwargs)

Ugly, but correct. The lesson is about test infrastructure that fakes too much. I'd designed my tests to avoid exercising the engine factory because doing so was inconvenient. That convenience came with a cost: a real code path went un-tested for the entire development cycle until CI exercised it as part of an integration test I hadn't been running locally. The bug was findable; I just hadn't been looking.
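One way to keep that branch from going untested again is to pull it into a pure function that can be unit-tested without opening a connection. This is a sketch under assumptions: pool_kwargs is a hypothetical name, and it uses plain string handling only to stay dependency-free. SQLAlchemy's own URL parser (make_url) would be a sturdier way to get the backend name.

```python
def pool_kwargs(database_url):
    """Return engine keyword arguments suited to the URL's backend.

    Hypothetical helper, not part of dnslint or SQLAlchemy.
    """
    # "postgresql+asyncpg://..." -> scheme "postgresql+asyncpg" -> backend "postgresql"
    scheme = database_url.split("://", 1)[0]
    backend = scheme.split("+", 1)[0].lower()

    kwargs = {"pool_pre_ping": True}
    if backend != "sqlite":
        # Queue-pool sizing only applies to backends that use a queue pool.
        kwargs.update(pool_size=5, max_overflow=10)
    return kwargs


print(pool_kwargs("sqlite+aiosqlite:///./test.db"))
print(pool_kwargs("postgresql+asyncpg://user:pass@host:5432/dbname"))
```

The factory then becomes create_async_engine(url, **pool_kwargs(url)), and both branches can be asserted in an ordinary unit test on any machine.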

Bug 4: Migrations blocking port binding on Render

This is the bug that taught me the most about deployment, and it's the one that took me longest to recognize as a bug at all.

Render's free tier expects a web service container to bind a port within a startup timeout — I think it's ten seconds. If your container doesn't open the port in that window, Render kills it and marks the deploy as failed.

My FastAPI app's startup hook ran Alembic migrations synchronously before yielding to the Uvicorn server:

from contextlib import asynccontextmanager
from alembic import command
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    command.upgrade(alembic_config, "head")  # blocks
    yield

command.upgrade is a synchronous call. Inside an async def context, that means it blocks the entire event loop. Uvicorn can't start serving on the port until lifespan yields, which can't happen until migrations finish.
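The blocking effect is easy to see in isolation. In this sketch, slow_sync_call is a stand-in for command.upgrade (a 0.2 second sleep, an arbitrary figure), and a heartbeat task counts how often the event loop gets to run while the call is in progress.

```python
import asyncio
import time


async def heartbeat(counter):
    # Ticks only when the event loop is free to schedule this task.
    while True:
        counter["ticks"] += 1
        await asyncio.sleep(0.01)


def slow_sync_call():
    time.sleep(0.2)  # stand-in for command.upgrade(alembic_config, "head")


async def measure(run_in_thread):
    counter = {"ticks": 0}
    beat = asyncio.create_task(heartbeat(counter))
    await asyncio.sleep(0)  # give the heartbeat its first turn
    start = counter["ticks"]
    if run_in_thread:
        await asyncio.to_thread(slow_sync_call)  # loop keeps running meanwhile
    else:
        slow_sync_call()  # loop is frozen until this returns
    ticks = counter["ticks"] - start
    beat.cancel()
    return ticks


blocked_ticks = asyncio.run(measure(run_in_thread=False))
threaded_ticks = asyncio.run(measure(run_in_thread=True))
print(blocked_ticks, threaded_ticks)
```

The direct call leaves the heartbeat at zero ticks for the whole duration, while the asyncio.to_thread version lets it keep ticking. In the lifespan hook, a frozen loop means the server cannot finish startup and bind its port.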

Locally this was invisible. My dev database had no migrations to run after the first time, so command.upgrade was nearly instant. On Render, the first deploy had a few hundred milliseconds of real migration work to do, and the next deploy had to acquire a connection through the free-tier database's startup latency. The combined startup time crept past Render's ten-second window. Container killed. Deploy marked failed. Render's error message was the deeply unhelpful "exited before binding to port."

The fix had three parts:

  1. Run migrations as a background task, not in the lifespan critical path:
   migration_task = asyncio.create_task(asyncio.to_thread(command.upgrade, alembic_config, "head"))
  2. Wrap them in a Postgres advisory lock so that if I ever scale to multiple workers, they don't race each other on migrations:
   await conn.execute(text("SELECT pg_advisory_lock(12345)"))
  3. Split the health checks: /health returns 200 immediately with migration status in the body (so Render's liveness probe is happy), and /health/ready returns 503 until migrations complete (so monitoring tools that care about real readiness get the truth).

That last part is the bit I'm proudest of, because it required me to think about what liveness and readiness actually mean. Liveness is "this container is alive and not stuck." Readiness is "this container can serve real traffic right now." Conflating them — which is the default for most apps — means your readiness signal is too coarse to be useful and your liveness signal is too strict to survive realistic startup latency.
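The split can be sketched without a web framework. Everything here is illustrative: MigrationState, liveness, and readiness are hypothetical names, the handlers return plain (status, body) tuples instead of real HTTP responses, and blocking_migration is a short sleep standing in for command.upgrade.

```python
import asyncio
import time


class MigrationState:
    """Shared flag that both health handlers read. Hypothetical name."""

    def __init__(self):
        self.done = False
        self.error = None


def blocking_migration():
    time.sleep(0.2)  # stand-in for command.upgrade(alembic_config, "head")


async def run_migrations(state):
    try:
        # Runs in a worker thread, so the event loop stays responsive.
        await asyncio.to_thread(blocking_migration)
        state.done = True
    except Exception as exc:
        state.error = repr(exc)


def liveness(state):
    # Always 200: the process is up, whatever the migrations are doing.
    return 200, {"migrations": "done" if state.done else "pending"}


def readiness(state):
    # 503 until the schema is in place and real traffic can be served.
    return (200 if state.done else 503), {"error": state.error}


async def main():
    state = MigrationState()
    task = asyncio.create_task(run_migrations(state))
    await asyncio.sleep(0)  # let the background task start
    before = (liveness(state)[0], readiness(state)[0])
    await task
    after = (liveness(state)[0], readiness(state)[0])
    return before, after


before, after = asyncio.run(main())
print(before, after)
```

While the migration is in flight the pair reads (200, 503); once it finishes, (200, 200). A failed migration stays at 503 with the error in the body, which is the case a single merged endpoint tends to hide.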

The lesson: production-shaped constraints surface architectural decisions you didn't know you were making. I hadn't decided that migrations should block the event loop. I'd just written the most obvious code, and the obvious code happened to be wrong in a way that only mattered when the deployment environment had a startup timeout. Without that constraint, the bug would still be there, just invisible.

What I'd actually take away from this

If you read the four sections above and shrugged, you've probably been an engineer for a while and seen each of these classes of bug before. They're not novel. The novelty, for me, was in seeing all four in the same project over a short window and noticing the shared shape.

The pattern: every one of these bugs existed because my development environment was different from production in a way I hadn't accounted for. The differences were in places I'd been treating as invisible plumbing:

  • Environment variable parsing (Bug 1)
  • Process scheduling and concurrency (Bug 2)
  • Database dialect (Bug 3)
  • Startup sequencing and runtime constraints (Bug 4)

These aren't application concerns. They're environment concerns, and I'd been writing code as if the environment was a stable layer underneath my application — which it isn't. The environment is part of the code path. My local environment was running a slightly different program than my production environment, and the differences between those programs were exactly where bugs lived.

What I'd do differently next time:

  1. Pick one production-shaped layer to run locally, early. I picked SQLite for local tests, which was a convenience I paid for in Bug 3. A Postgres container via Docker Compose costs me thirty seconds at startup and would have caught the engine config bug on my laptop instead of in CI.

  2. Treat CI failures as data, not as obstacles. I caught myself, more than once, trying to "make CI green" rather than understand why it was red. The pattern of "this passed locally so the CI must be misconfigured" is a tell — it usually means I'm wrong, not the CI.

  3. Write the integration test before the unit test. Unit tests confirm that the function I just wrote does what I think. Integration tests confirm that the system I just wrote can actually start and serve a request. Bug 4 would have been caught by a smoke test against the real Render-shaped startup path, weeks before I noticed it on a deploy.

  4. Health checks aren't a checkbox; they're an interface design problem. Liveness vs readiness is a real distinction. So is "responds to TCP" vs "is functionally serving requests." The right split depends on what's consuming the signal.
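The smoke test from item 3 mostly comes down to one question: does the process accept a TCP connection before the platform's deadline? A minimal sketch of that check follows. The helper name wait_for_port is made up, and the deadline you pass in is whatever your host enforces, not a value taken from Render's documentation.

```python
import socket
import time


def wait_for_port(host, port, deadline_seconds, poll_interval=0.1):
    """Return True if host:port accepts a TCP connection before the deadline.

    Hypothetical helper for a startup smoke test.
    """
    give_up_at = time.monotonic() + deadline_seconds
    while time.monotonic() < give_up_at:
        try:
            # A successful connect means something has bound and is listening.
            with socket.create_connection((host, port), timeout=poll_interval):
                return True
        except OSError:
            # Not listening yet (refused or timed out); wait and retry.
            time.sleep(poll_interval)
    return False
```

In a test, this would sit right after launching the app with its production start command (for example via subprocess.Popen), with an assertion that wait_for_port returns True inside the deadline. That exercises the startup path, migrations included, instead of only the request handlers.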

CI was supposed to be the boring last step before a deploy. Instead, it was the most informative debugging tool I had — better than my local IDE, better than my test suite, better than reading logs after the fact. Every red run was a free lesson, and most of them taught me something about my own assumptions that I wouldn't have caught any other way.

If you're building a portfolio project and find your CI is flaky or your deploys are weird, don't paper over it. The annoyance is the signal.


The full project is at github.com/mgoolden17-cyber/dnslint with a live demo and a full known-issues log. I built this while finishing my cybersecurity degree at SUNY Canton; I write occasionally about engineering practice.
