<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mgoolden17-cyber</title>
    <description>The latest articles on DEV Community by mgoolden17-cyber (@mgoolden17cyber).</description>
    <link>https://dev.to/mgoolden17cyber</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3952363%2F2f5be186-e7a5-4c78-8006-1d370a1488b6.png</url>
      <title>DEV Community: mgoolden17-cyber</title>
      <link>https://dev.to/mgoolden17cyber</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mgoolden17cyber"/>
    <language>en</language>
    <item>
      <title>What 22 CI runs taught me about the gap between local dev and production</title>
      <dc:creator>mgoolden17-cyber</dc:creator>
      <pubDate>Tue, 26 May 2026 10:54:18 +0000</pubDate>
      <link>https://dev.to/mgoolden17cyber/what-22-ci-runs-taught-me-about-the-gap-between-local-dev-and-production-29ac</link>
      <guid>https://dev.to/mgoolden17cyber/what-22-ci-runs-taught-me-about-the-gap-between-local-dev-and-production-29ac</guid>
      <description>&lt;p&gt;I built a multi-tenant DNS/email security audit app — &lt;a href="https://github.com/mgoolden17-cyber/dnslint" rel="noopener noreferrer"&gt;dnslint&lt;/a&gt; — as a portfolio piece over a few months. FastAPI, async SQLAlchemy against Postgres, deployed to Render's free tier. The interesting part of building it wasn't the application code. It was that my local test suite passed cleanly while CI kept failing.&lt;/p&gt;

&lt;p&gt;Not one or two failures. Twenty-two CI runs over the lifespan of the feature branch. Some were the same bug showing up twice while I tried fixes. Most were distinct bugs that my local environment was hiding from me.&lt;/p&gt;

&lt;p&gt;This is a writeup of four of those failures. They're not glamorous bugs. They're the kind of thing that's obvious in hindsight and infuriating in the moment. The pattern across them is the same: my local dev environment was &lt;em&gt;shaped differently&lt;/em&gt; than production in ways I hadn't noticed, and each shape difference hid a class of bug until CI or the deployed environment forced me to see it.&lt;/p&gt;

&lt;h2&gt;Bug 1: A trailing newline in &lt;code&gt;DATABASE_URL&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;The first CI failure I want to talk about is a one-character bug. My GitHub Actions workflow defined &lt;code&gt;DATABASE_URL&lt;/code&gt; as a multiline YAML string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;postgresql+asyncpg://user:pass@host:5432/dbname&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That pipe (&lt;code&gt;|&lt;/code&gt;) is YAML's "literal block scalar" syntax. It preserves line breaks, and under the default chomping rule it also keeps a single trailing newline. So the actual value of &lt;code&gt;DATABASE_URL&lt;/code&gt; inside the CI environment wasn't &lt;code&gt;postgresql+asyncpg://...&lt;/code&gt;; it was &lt;code&gt;postgresql+asyncpg://...\n&lt;/code&gt;. The trailing newline broke the connection string parser in a way that produced an opaque error: "could not translate host name."&lt;/p&gt;

&lt;p&gt;Locally, I read &lt;code&gt;DATABASE_URL&lt;/code&gt; from a &lt;code&gt;.env&lt;/code&gt; file via &lt;code&gt;python-dotenv&lt;/code&gt;, which trims whitespace. So the local code path &lt;em&gt;implicitly&lt;/em&gt; sanitized something that CI did not. My tests never had a chance to catch this because they ran against a value that had been silently cleaned before it reached SQLAlchemy.&lt;/p&gt;

&lt;p&gt;The fix was trivial: change the YAML to a quoted single-line string (the strip-chomping form &lt;code&gt;|-&lt;/code&gt; would also drop the trailing newline). The lesson was bigger: &lt;strong&gt;anything that touches my code through a different runtime than the one I develop in is a potential source of bugs my tests can't catch.&lt;/strong&gt; Environment variable parsing, file path handling, signal delivery, process startup order: every interface between my code and the world is a place where the local and deployed environments can disagree.&lt;/p&gt;
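
&lt;p&gt;One way to stop depending on whichever loader happens to clean the value is to normalize it at the point where the application reads it. This is a minimal sketch, not the code in dnslint: the helper name &lt;code&gt;read_database_url&lt;/code&gt; and the &lt;code&gt;ci_env&lt;/code&gt; dictionary are hypothetical, and the dictionary only simulates what the block scalar handed to the CI process.&lt;/p&gt;

```python
import os


def read_database_url(env=None, key="DATABASE_URL"):
    """Return the configured URL with surrounding whitespace removed.

    Hypothetical helper: it raises instead of returning an unusable value.
    """
    source = os.environ if env is None else env
    raw = source.get(key)
    if raw is None:
        raise RuntimeError(f"{key} is not set")
    cleaned = raw.strip()
    if not cleaned:
        raise RuntimeError(f"{key} is empty")
    return cleaned


# Simulates the value the literal block scalar produced in CI,
# including the trailing newline.
ci_env = {"DATABASE_URL": "postgresql+asyncpg://user:pass@host:5432/dbname\n"}
url = read_database_url(ci_env)
```

&lt;p&gt;With this in place, the local &lt;code&gt;.env&lt;/code&gt; path and the CI path hand the same string to SQLAlchemy, so a test that passes locally is exercising the same value CI would see.&lt;/p&gt;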

&lt;h2&gt;Bug 2: &lt;code&gt;unittest.mock.patch&lt;/code&gt; is not thread-safe&lt;/h2&gt;

&lt;p&gt;I have a test that exercises a concurrent DNS resolver. It spawns two threads, each scanning a different domain with a different resolver IP, and asserts that each thread's resolver was correctly isolated via Python's &lt;code&gt;ContextVar&lt;/code&gt;. Locally, it passed. In CI, it failed intermittently with an assertion that looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AssertionError: assert None == '1.1.1.1'
 where None = {'t2': '1.1.1.1'}.get('t1')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What was happening: only one of the two threads was recording its resolver IP. The other thread's result was just gone.&lt;/p&gt;

&lt;p&gt;I assumed at first that the &lt;code&gt;ContextVar&lt;/code&gt; isolation was broken. It wasn't. The bug was somewhere else entirely.&lt;/p&gt;

&lt;p&gt;The test was structured like this — each thread called &lt;code&gt;unittest.mock.patch&lt;/code&gt; inside its own scope to install a side-effect function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_capture_resolver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_checks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_resolver_var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;nameservers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_mock_checks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dnslint_core.run_all_checks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;side_effect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_checks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;run_scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resolver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem is in what &lt;code&gt;unittest.mock.patch&lt;/code&gt; actually does under the hood. It doesn't install a thread-local mock. It mutates a module-global attribute — literally &lt;code&gt;dnslint_core.run_all_checks = mock_fn&lt;/code&gt; on entry, and restores the original on exit. When two threads enter their own &lt;code&gt;with patch(...)&lt;/code&gt; blocks at overlapping times, the second thread's &lt;code&gt;__enter__&lt;/code&gt; overwrites the first thread's mock function while the first thread is still inside its &lt;code&gt;with&lt;/code&gt; block.&lt;/p&gt;

&lt;p&gt;So Thread 1's &lt;code&gt;_checks&lt;/code&gt; closure (which would have written to &lt;code&gt;seen["t1"]&lt;/code&gt;) got replaced by Thread 2's &lt;code&gt;_checks&lt;/code&gt; closure (which writes to &lt;code&gt;seen["t2"]&lt;/code&gt;) before Thread 1's &lt;code&gt;run_scan&lt;/code&gt; call actually invoked the mock. By the time Thread 1's scan triggered the mocked function, it was running Thread 2's closure. Thread 2's result got recorded twice; Thread 1's result was never written.&lt;/p&gt;

&lt;p&gt;Locally, the threads happened to interleave in a way that worked. In CI, on a faster runner with different scheduling, they didn't.&lt;/p&gt;
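
&lt;p&gt;The overwrite can be reproduced deterministically by forcing the bad interleaving with events instead of waiting for the scheduler to produce it. This is a sketch under assumptions: &lt;code&gt;fake_module&lt;/code&gt; is a hypothetical stand-in for &lt;code&gt;dnslint_core&lt;/code&gt;, and the three events exist only to pin down the ordering described above.&lt;/p&gt;

```python
import threading
import types
from unittest.mock import patch

# Hypothetical stand-in for a real module such as dnslint_core.
fake_module = types.SimpleNamespace(run_all_checks=lambda domain: "real")

seen = {}
first_patched = threading.Event()
second_patched = threading.Event()
first_done = threading.Event()


def first_thread():
    with patch.object(fake_module, "run_all_checks", side_effect=lambda d: "t1"):
        first_patched.set()
        # Wait until the other thread has replaced the shared attribute.
        second_patched.wait(timeout=5)
        seen["t1"] = fake_module.run_all_checks("example.com")
        first_done.set()


def second_thread():
    first_patched.wait(timeout=5)
    with patch.object(fake_module, "run_all_checks", side_effect=lambda d: "t2"):
        second_patched.set()
        seen["t2"] = fake_module.run_all_checks("example.com")
        # Stay inside the block until the first thread has made its call.
        first_done.wait(timeout=5)


t1 = threading.Thread(target=first_thread)
t2 = threading.Thread(target=second_thread)
t1.start()
t2.start()
t1.join()
t2.join()
```

&lt;p&gt;After both threads finish, &lt;code&gt;seen["t1"]&lt;/code&gt; holds the value produced by the second thread's mock, because both threads were looking up the same attribute on the same object.&lt;/p&gt;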

&lt;p&gt;The fix was to install one &lt;code&gt;patch&lt;/code&gt; outside the threads, with a single dispatcher that reads the &lt;code&gt;ContextVar&lt;/code&gt; to figure out which resolver is active for the calling thread:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;shared_checks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_resolver_var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;nameservers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ident&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_mock_checks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dnslint_core.run_all_checks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;side_effect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shared_checks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.1.1.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
    &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8.8.8.8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
    &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One patch, never re-entered. Each thread runs in its own context, so the dispatcher can read the &lt;code&gt;ContextVar&lt;/code&gt; to figure out what the calling thread should see.&lt;/p&gt;
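
&lt;p&gt;For reference, here is a self-contained version of the same idea that runs without the rest of the project. The names &lt;code&gt;fake_module&lt;/code&gt; and &lt;code&gt;resolver_ip&lt;/code&gt; are hypothetical stand-ins for the real module and resolver variable, and the results are keyed by thread name rather than thread id only to make them easy to assert on.&lt;/p&gt;

```python
import threading
import types
from contextvars import ContextVar
from unittest.mock import patch

# Hypothetical stand-ins for the real module and the resolver ContextVar.
fake_module = types.SimpleNamespace(run_all_checks=lambda domain: "real")
resolver_ip = ContextVar("resolver_ip")

seen = {}
lock = threading.Lock()


def shared_checks(domain):
    # Runs in the calling thread, so it reads that thread's ContextVar value.
    with lock:
        seen[threading.current_thread().name] = resolver_ip.get()
    return "mocked"


def runner(ip):
    resolver_ip.set(ip)  # only visible inside this thread's context
    fake_module.run_all_checks("example.com")


# A single patch is installed before any thread starts and removed after
# every thread has joined, so no thread can overwrite another's mock.
with patch.object(fake_module, "run_all_checks", side_effect=shared_checks):
    threads = [
        threading.Thread(target=runner, args=("1.1.1.1",), name="t1"),
        threading.Thread(target=runner, args=("8.8.8.8",), name="t2"),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

&lt;p&gt;Each thread records its own resolver IP regardless of how the two threads interleave, because the only shared mutable state is the lock-protected dictionary.&lt;/p&gt;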

&lt;p&gt;The lesson here is about &lt;strong&gt;what testing tools actually do.&lt;/strong&gt; &lt;code&gt;mock.patch&lt;/code&gt; is the most-used mocking utility in the Python standard library. Its documentation says the target is patched inside the block and restored on exit, which amounts to a process-wide mutation, but that implication is easy to miss. The failure mode is also statistical: the patch overwrites happen during a short window, and most test runs schedule around them by accident. Reading the source of your own tools, especially the ones that mutate global state, is the most reliable way to know what they'll do under concurrency you didn't design around. My laptop hid this for months; CI's different scheduler made it reliably visible within a few runs.&lt;/p&gt;

&lt;h2&gt;Bug 3: SQLite vs Postgres engine config&lt;/h2&gt;

&lt;p&gt;This one cost me three CI runs and an hour of confused debugging.&lt;/p&gt;

&lt;p&gt;My SQLAlchemy engine factory looked roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_async_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pool_pre_ping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pool_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_overflow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That looks fine. It's the default-ish configuration for an async Postgres connection pool. The problem: my local tests use SQLite (because spinning up a Postgres container for every test run is slow) through the async driver &lt;code&gt;aiosqlite&lt;/code&gt;. For SQLite URLs, SQLAlchemy may select &lt;code&gt;StaticPool&lt;/code&gt; or &lt;code&gt;NullPool&lt;/code&gt;, depending on the SQLAlchemy version and whether the database is in memory, and those pool classes &lt;em&gt;raise an error if you pass &lt;code&gt;pool_size&lt;/code&gt; or &lt;code&gt;max_overflow&lt;/code&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Locally I'd configured the test fixtures to skip the engine factory entirely and build their own SQLite engine. So my code path that built the real engine was never exercised by my test suite. CI ran the tests against SQLite the same way, but the &lt;em&gt;integration test&lt;/em&gt; I'd added that hit the real engine factory blew up with "invalid argument 'pool_size' for SQLite dialect."&lt;/p&gt;

&lt;p&gt;The fix was a dialect check in the factory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pool_pre_ping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pool_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_overflow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_async_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ugly, but correct. The lesson is about &lt;strong&gt;test infrastructure that fakes too much.&lt;/strong&gt; I'd designed my tests to &lt;em&gt;avoid&lt;/em&gt; exercising the engine factory because doing so was inconvenient. That convenience came with a cost: a real code path went untested for the entire development cycle until CI exercised it as part of an integration test I hadn't been running locally. The bug was findable; I just hadn't been looking.&lt;/p&gt;
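
&lt;p&gt;One way to keep that branch under test without creating an engine is to pull the keyword selection into a pure function. This is a sketch, not the code in the repository: &lt;code&gt;engine_kwargs&lt;/code&gt; is a hypothetical helper, and the pool numbers are simply the ones from the factory above.&lt;/p&gt;

```python
def engine_kwargs(database_url):
    """Return keyword arguments suitable for create_async_engine.

    Hypothetical helper: pool sizing is only added for non-SQLite backends.
    """
    # "postgresql+asyncpg://..." gives "postgresql";
    # "sqlite+aiosqlite://..." gives "sqlite".
    backend = database_url.split(":", 1)[0].split("+", 1)[0]
    kwargs = {"pool_pre_ping": True}
    if backend != "sqlite":
        kwargs.update(pool_size=5, max_overflow=10)
    return kwargs


postgres_kwargs = engine_kwargs("postgresql+asyncpg://user:pass@host:5432/dbname")
sqlite_kwargs = engine_kwargs("sqlite+aiosqlite:///:memory:")
```

&lt;p&gt;The factory then becomes &lt;code&gt;create_async_engine(url, **engine_kwargs(url))&lt;/code&gt;, and both dialect branches can be asserted on in an ordinary unit test, whichever database the fixtures use.&lt;/p&gt;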

&lt;h2&gt;Bug 4: Migrations blocking port binding on Render&lt;/h2&gt;

&lt;p&gt;This is the bug that taught me the most about deployment, and it's the one that took me longest to recognize as a bug at all.&lt;/p&gt;

&lt;p&gt;Render's free tier expects a web service container to bind a port within a startup timeout — I think it's ten seconds. If your container doesn't open the port in that window, Render kills it and marks the deploy as failed.&lt;/p&gt;

&lt;p&gt;My FastAPI app's startup hook ran Alembic migrations synchronously before yielding to the Uvicorn server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asynccontextmanager&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lifespan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upgrade&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alembic_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;head&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# blocks
&lt;/span&gt;    &lt;span class="k"&gt;yield&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;command.upgrade&lt;/code&gt; is a synchronous call. Called directly inside an &lt;code&gt;async def&lt;/code&gt; function, it blocks the entire event loop until it returns. Uvicorn can't start serving on the port until &lt;code&gt;lifespan&lt;/code&gt; yields, which can't happen until migrations finish.&lt;/p&gt;
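
&lt;p&gt;The blocking effect is easy to see in isolation with nothing but the standard library. In this sketch, &lt;code&gt;slow_migration&lt;/code&gt; is a hypothetical stand-in for the synchronous Alembic call, and a background ticker counts how often the event loop gets to run while the migration is in progress.&lt;/p&gt;

```python
import asyncio
import time


def slow_migration():
    # Hypothetical stand-in for a synchronous call such as command.upgrade.
    time.sleep(0.2)


async def count_ticks(counter):
    # Increments only when the event loop is free to schedule this task.
    while True:
        await asyncio.sleep(0.01)
        counter["ticks"] += 1


async def measure(run_in_thread):
    counter = {"ticks": 0}
    ticker = asyncio.create_task(count_ticks(counter))
    await asyncio.sleep(0)  # let the ticker reach its first sleep
    if run_in_thread:
        await asyncio.to_thread(slow_migration)  # loop stays responsive
    else:
        slow_migration()  # loop is frozen until this returns
    ticks = counter["ticks"]
    ticker.cancel()
    return ticks


blocked_ticks = asyncio.run(measure(run_in_thread=False))
threaded_ticks = asyncio.run(measure(run_in_thread=True))
```

&lt;p&gt;When the call runs directly, the ticker never advances while it is in progress. When it runs through &lt;code&gt;asyncio.to_thread&lt;/code&gt;, the ticker keeps counting, and that difference is what lets the server bind its port while migrations are still running.&lt;/p&gt;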

&lt;p&gt;Locally this was invisible. My dev database had no migrations to run after the first time, so &lt;code&gt;command.upgrade&lt;/code&gt; was nearly instant. On Render, the first deploy had a few hundred milliseconds of real migration work to do, and the next deploy had to acquire a connection through the free-tier database's startup latency. The combined startup time crept past Render's startup window. Container killed. Deploy marked failed. Render's error message was the deeply unhelpful "exited before binding to port."&lt;/p&gt;

&lt;p&gt;The fix had three parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run migrations as a background task, not in the lifespan critical path:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;migration_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alembic_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;head&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



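&lt;ol start="2"&gt;
&lt;li&gt;Wrap them in a Postgres advisory lock so that if I ever scale to multiple workers, they don't race each other on migrations. The lock is session-level, so it is held until it is released with &lt;code&gt;pg_advisory_unlock&lt;/code&gt; or the connection closes: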
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT pg_advisory_lock(12345)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Split the health checks: &lt;code&gt;/health&lt;/code&gt; returns 200 immediately with migration status in the body (so Render's liveness probe is happy), and &lt;code&gt;/health/ready&lt;/code&gt; returns 503 until migrations complete (so monitoring tools that care about real readiness get the truth).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last part is the bit I'm proudest of, because it required me to think about &lt;em&gt;what liveness and readiness actually mean.&lt;/em&gt; Liveness is "this container is alive and not stuck." Readiness is "this container can serve real traffic right now." Conflating them — which is the default for most apps — means your readiness signal is too coarse to be useful and your liveness signal is too strict to survive realistic startup latency.&lt;/p&gt;
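
&lt;p&gt;The liveness and readiness split can be sketched without any web framework. Everything here is hypothetical naming: &lt;code&gt;MigrationState&lt;/code&gt;, &lt;code&gt;liveness&lt;/code&gt;, and &lt;code&gt;readiness&lt;/code&gt; stand in for whatever the real route handlers do, and each returns a status code and body that a handler for &lt;code&gt;/health&lt;/code&gt; or &lt;code&gt;/health/ready&lt;/code&gt; could pass straight through.&lt;/p&gt;

```python
import asyncio


class MigrationState:
    """Tracks background migration progress for the health endpoints."""

    def __init__(self):
        self.status = "pending"  # one of: pending, done, failed

    async def run(self, migrate):
        # Run the synchronous migration off the event loop and record the outcome.
        try:
            await asyncio.to_thread(migrate)
            self.status = "done"
        except Exception:
            self.status = "failed"


def liveness(state):
    # Always 200: the process is up, whatever the migrations are doing.
    return 200, {"status": "alive", "migrations": state.status}


def readiness(state):
    # 503 until migrations have completed successfully.
    code = 200 if state.status == "done" else 503
    return code, {"migrations": state.status}


async def demo():
    state = MigrationState()
    before = (liveness(state)[0], readiness(state)[0])
    await state.run(lambda: None)  # placeholder for the real migration call
    after = (liveness(state)[0], readiness(state)[0])
    return before, after


before, after = asyncio.run(demo())
```

&lt;p&gt;Before the migration finishes, liveness reports 200 while readiness reports 503; afterwards both report 200. A failed migration keeps readiness at 503 without taking the liveness signal down with it.&lt;/p&gt;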

&lt;p&gt;&lt;strong&gt;The lesson: production-shaped constraints surface architectural decisions you didn't know you were making.&lt;/strong&gt; I hadn't &lt;em&gt;decided&lt;/em&gt; that migrations should block the event loop. I'd just written the most obvious code, and the obvious code happened to be wrong in a way that only mattered when the deployment environment had a startup timeout. Without that constraint, the bug would still be there, just invisible.&lt;/p&gt;

&lt;h2&gt;What I'd actually take away from this&lt;/h2&gt;

&lt;p&gt;If you read the four sections above and shrugged, you've probably been an engineer for a while and seen each of these classes of bug before. They're not novel. The novelty, for me, was in seeing all four in the same project over a short window and noticing the shared shape.&lt;/p&gt;

&lt;p&gt;The pattern: every one of these bugs existed because my development environment was &lt;em&gt;different from&lt;/em&gt; production in a way I hadn't accounted for. The differences were in places I'd been treating as invisible plumbing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment variable parsing (Bug 1)&lt;/li&gt;
&lt;li&gt;Process scheduling and concurrency (Bug 2)&lt;/li&gt;
&lt;li&gt;Database dialect (Bug 3)&lt;/li&gt;
&lt;li&gt;Startup sequencing and runtime constraints (Bug 4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't application concerns. They're &lt;em&gt;environment&lt;/em&gt; concerns, and I'd been writing code as if the environment were a stable layer underneath my application, which it isn't. The environment is part of the code path. My local environment was running a slightly different program than my production environment, and the differences between those programs were exactly where bugs lived.&lt;/p&gt;

&lt;p&gt;What I'd do differently next time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pick one production-shaped layer to run locally, early.&lt;/strong&gt; I picked SQLite for local tests, which was a convenience I paid for in Bug 3. A Postgres container via Docker Compose costs me thirty seconds at startup and would have caught the engine config bug on my laptop instead of in CI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Treat CI failures as data, not as obstacles.&lt;/strong&gt; I caught myself, more than once, trying to "make CI green" rather than understand why it was red. The pattern of "this passed locally so the CI must be misconfigured" is a tell — it usually means I'm wrong, not the CI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write the integration test before the unit test.&lt;/strong&gt; Unit tests confirm that the function I just wrote does what I think. Integration tests confirm that the &lt;em&gt;system&lt;/em&gt; I just wrote can actually start and serve a request. Bug 4 would have been caught by a smoke test against the real Render-shaped startup path, weeks before I noticed it on a deploy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Health checks aren't a checkbox; they're an interface design problem.&lt;/strong&gt; Liveness vs readiness is a real distinction. So is "responds to TCP" vs "is functionally serving requests." The right split depends on what's consuming the signal.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CI was supposed to be the boring last step before a deploy. Instead, it was the most informative debugging tool I had — better than my local IDE, better than my test suite, better than reading logs after the fact. Every red run was a free lesson, and most of them taught me something about my own assumptions that I wouldn't have caught any other way.&lt;/p&gt;

&lt;p&gt;If you're building a portfolio project and find your CI is flaky or your deploys are weird, don't paper over it. The annoyance is the signal.&lt;/p&gt;




&lt;p&gt;The full project is at &lt;a href="https://github.com/mgoolden17-cyber/dnslint" rel="noopener noreferrer"&gt;github.com/mgoolden17-cyber/dnslint&lt;/a&gt; with a live demo and a full known-issues log. I built this while finishing my cybersecurity degree at SUNY Canton; I write occasionally about engineering practice.&lt;/p&gt;

</description>
      <category>python</category>
      <category>fastapi</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
