My headless Chrome mirror kept dying on busy mornings. Here is the fix.

#automation #debugging #performance #webdev

Cold Chrome launch blew the 15 second deadline. Not every time. Just when the machine was already running content generation and publishing jobs at the same time, which is exactly when the mirror needs to work.

The failure mode was silent in the worst way: session() would throw a RuntimeError at startup, the whole hourly mirror run would abort, and nothing would make it to Medium. No retry, no recovery. Just a dead run in the logs.

Why the default timeout was wrong

Fifteen seconds is fine for a warm machine with nothing else going on. Headless Chrome on macOS typically comes up in 3 to 5 seconds under light load. But this runner shares a schedule with content jobs that are CPU and memory intensive. On a busy morning the OS is already swapping, Chrome's startup sequence takes longer, and 15 seconds is just not enough.

The naive fix is to bump the number. But that only gets you so far, because there is a second failure mode the timeout does not cover: a previous run that died mid-session sometimes leaves a Chrome process squatting on the debugging port or the profile directory. The next run tries to bring up a fresh instance, hits a conflict, and dies with a RuntimeError even if you give it all the time in the world.
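
One cheap way to see whether a leftover process is still holding the debugging port is to try connecting to it. The sketch below is illustrative, not the check the runner uses: port_in_use() is a made-up helper, and the port number is whatever your launcher passes to Chrome (9222 in the usage line is only a common choice, not something this setup is known to use).

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout_s: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 9222 is an assumed port; substitute the one your launcher uses.
    print(port_in_use(9222))
```

This only covers the port half of the problem. A stale lock on the profile directory would need a separate check.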

What I actually changed

Two things, in order:

First, raise the launch deadline to 45 seconds. This is not elegant but it is correct. The deadline needs to cover the realistic worst case, which on a shared schedule with content jobs is somewhere between 20 and 40 seconds for a cold launch. 45 is the number that clears that window with margin.
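
To make the deadline concrete, here is a minimal sketch of a launch wait with a hard cutoff. It assumes the launcher polls some readiness check in a loop; wait_until_ready() and is_ready are hypothetical names, and only the 45 second value and the RuntimeError come from the actual change.

```python
import time
from typing import Callable

LAUNCH_DEADLINE_S = 45.0  # the raised deadline; it was 15 before


def wait_until_ready(
    is_ready: Callable[[], bool],
    deadline_s: float = LAUNCH_DEADLINE_S,
    poll_s: float = 0.25,
) -> float:
    """Poll is_ready() until it returns True or the deadline passes.

    Returns the elapsed seconds on success and raises RuntimeError on
    timeout. is_ready is a placeholder for whatever readiness check the
    launcher has, such as probing the debugging port.
    """
    start = time.monotonic()  # monotonic clock: immune to wall-clock jumps
    while True:
        if is_ready():
            return time.monotonic() - start
        if time.monotonic() - start >= deadline_s:
            raise RuntimeError(f"Chrome not ready after {deadline_s:.0f}s")
        time.sleep(poll_s)
```

A fast launch still returns as soon as the check passes. The larger number only changes how long a genuinely slow or stuck launch is allowed to take.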

Second, add a recovery retry on RuntimeError. If the launch fails, the handler shuts down any half-dead Chrome process holding the port or profile, then tries once more. The retry is single-shot: attempt, fail, clean up, attempt again. If the second attempt fails, it surfaces the error and stops. No infinite loops, no exponential backoff theater for a problem that either resolves in one retry or does not resolve at all.

The cleanup step is the important part. Without it the retry just runs into the same zombie process and fails identically. The sequence matters: kill first, then launch.
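
The attempt, fail, clean up, attempt again sequence is small enough to sketch in a few lines. This is an illustration of the shape, not the code from the runner: launch and cleanup are placeholder callables standing in for whatever starts Chrome and whatever kills the leftover process.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def launch_with_recovery(launch: Callable[[], T], cleanup: Callable[[], None]) -> T:
    """Try launch() once; on RuntimeError, clean up and try exactly once more.

    Both callables are hypothetical stand-ins. A failure on the second
    attempt propagates to the caller unchanged.
    """
    try:
        return launch()
    except RuntimeError:
        # Kill first, then launch. Retrying without cleanup would hit
        # the same leftover process and fail the same way.
        cleanup()
        return launch()
```

There is no loop and no counter, so the worst case is bounded at two launch attempts and one cleanup.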

The tradeoff worth naming

A 45 second deadline means a bad launch scenario blocks for up to 45 seconds before you know it failed. For an hourly job that is acceptable. If this were a user-facing flow I would not do it this way. I would run Chrome in a dedicated process, keep it warm between requests, and connect over a persistent socket instead of relaunching every time.

But this is a background mirror job. It runs once an hour. The blast radius of a slow startup is one mirror cycle, not a user request. Keeping Chrome warm across runs would add state management complexity for a problem that now only surfaces when the machine is genuinely overloaded, which is rare. The 45 second deadline plus one recovery retry costs almost nothing to reason about and covers the real failure cases.

What I would do differently

The retry logic belongs in a context manager, not scattered across the session setup function. Right now the cleanup and relaunch are inline. That works, but it means the next person reading the code has to follow the control flow to understand what is happening. A managed_chrome_session() context manager that handles launch, the 45 second deadline, and the single recovery retry would make the intent obvious and make testing each piece in isolation straightforward.
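
A rough sketch of what that context manager could look like is below. Only the name managed_chrome_session() comes from the idea above; the launch, cleanup, and shutdown parameters are hypothetical hooks I am assuming for illustration, and the refactor itself has not been written.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def managed_chrome_session(
    launch: Callable[[float], T],    # hypothetical: starts Chrome, raises RuntimeError on timeout
    cleanup: Callable[[], None],     # hypothetical: kills a leftover Chrome process
    shutdown: Callable[[T], None],   # hypothetical: closes the browser that was launched
    deadline_s: float = 45.0,
) -> Iterator[T]:
    """Launch with a deadline and one recovery retry, then always shut down."""
    try:
        browser = launch(deadline_s)
    except RuntimeError:
        cleanup()                     # kill first, then launch
        browser = launch(deadline_s)  # a second failure propagates
    try:
        yield browser
    finally:
        shutdown(browser)             # runs even if the body raises
```

Call sites would then read as `with managed_chrome_session(launch, cleanup, shutdown) as browser:` and each hook can be swapped for a fake in tests.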

I did not refactor it to that shape because the change I needed was a targeted fix, not a cleanup. But if this file gets touched again, that is the first thing I would do.