The obvious fix when headless Chrome times out on bring-up is to increase the deadline. Triple it, call it done. That intuition is half right and fully insufficient.
I have an hourly mirror job that syncs posts to Medium. It drives a headless Chrome instance over CDP using Playwright's connect_over_cdp. The sandbox Chrome lives on a fixed port (9223) with a dedicated user data directory. When the machine is idle, bring-up is fast. When it is busy, because the publish and generate jobs share the same launchd schedule window, a cold Chrome launch can blow past a 15 second deadline before it hands back the WebSocket URL. The whole run aborts at the session() call. No article gets mirrored.
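For context, the deadline guards a wait loop of roughly this shape. This is a minimal sketch, not the actual sandbox code: `wait_for_ws_url`, `fetch_version`, and the injectable `fetch` parameter are hypothetical names. It assumes Chrome was started with `--remote-debugging-port=9223`, in which case it serves a JSON document at `/json/version` that includes a `webSocketDebuggerUrl` field.

```python
import json
import time
import urllib.request


def fetch_version(port: int) -> dict:
    """Read Chrome's /json/version document from the remote debugging port."""
    url = f"http://127.0.0.1:{port}/json/version"
    with urllib.request.urlopen(url, timeout=1.0) as resp:
        return json.load(resp)


def wait_for_ws_url(port: int = 9223, deadline_seconds: float = 15.0,
                    fetch=fetch_version, poll_interval: float = 0.25) -> str:
    """Poll until Chrome reports its WebSocket URL or the deadline passes."""
    give_up_at = time.monotonic() + deadline_seconds
    while True:
        try:
            return fetch(port)["webSocketDebuggerUrl"]
        except (OSError, KeyError, ValueError):
            # Not listening yet, or the response is incomplete: keep waiting.
            pass
        if time.monotonic() >= give_up_at:
            raise RuntimeError(
                f"Chrome on port {port} not ready after {deadline_seconds}s"
            )
        time.sleep(poll_interval)
```

On a busy machine the loop simply runs out of time while Chrome is still starting, which is where the RuntimeError comes from.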
The fix looks simple: raise the deadline to 45 seconds. Here is the actual change:
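
    try:
        return sandbox.ensure(headless=headless, deadline_seconds=45.0)
    except RuntimeError:
        # First attempt timed out: tear down whatever is half-started,
        # then retry exactly once from a clean state.
        sandbox.shutdown()
        return sandbox.ensure(headless=headless, deadline_seconds=45.0)
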
The deadline_seconds bump is the less interesting half. The critical part is the sandbox.shutdown() call inside the except block.
When Chrome fails to bring up in time, it is not cleanly absent. It is partially started: process running, port 9223 possibly bound, user data directory locked. A naive retry into that state fails immediately or hangs on a stale lock. The only path forward is to kill whatever is squatting the port and the profile, then try again.
Without the shutdown, a retry would not get 45 more seconds. It would get zero seconds and a connection refused.
This is what gets missed when you read "just increase the timeout." You also have to make the retry start from a clean state. Shutdown is not optional. It is what makes the retry meaningful.
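The pattern generalizes beyond Chrome. Here is a minimal sketch with hypothetical names (`acquire_with_reset`, `acquire`, and `teardown` are mine, not part of the sandbox module): try to acquire the resource, and on failure tear it down before the single retry.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def acquire_with_reset(acquire: Callable[[], T], teardown: Callable[[], None]) -> T:
    """Try acquire(); on RuntimeError, tear down and try exactly once more."""
    try:
        return acquire()
    except RuntimeError:
        # The first attempt may have left a half-started process, a bound
        # port, or a locked profile behind. Clear that before retrying.
        teardown()
        # No second except: if this also fails, the error propagates.
        return acquire()
```

With the sandbox from this post, the call would be `acquire_with_reset(lambda: sandbox.ensure(headless=headless, deadline_seconds=45.0), sandbox.shutdown)`.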
A few notes on the design, including what I would do differently starting from scratch:
First, the retry is capped at one. One retry after cleanup is enough. If the second attempt fails, something is genuinely wrong beyond scheduling jitter, and the right answer is to surface the error, not spin forever.
Second, the 45 second deadline is a guess calibrated to "seems long enough on a busy Mac." The real number should come from measuring actual cold start times across a week of launchd logs. I have not done that. If the machine gets busier, 45 seconds may not hold, and I will find out the hard way when the mirror starts silently skipping articles again.
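Picking the deadline from data could look roughly like this. It is a sketch only: `suggest_deadline` is a hypothetical helper, the sample durations are made up for illustration rather than pulled from real logs, and the 1.5x headroom factor is an arbitrary assumption.

```python
import statistics


def suggest_deadline(cold_start_seconds: list[float], headroom: float = 1.5) -> float:
    """Suggest a deadline: the 99th percentile of observed starts, plus headroom."""
    if len(cold_start_seconds) < 2:
        raise ValueError("need at least two measurements")
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(cold_start_seconds, n=100, method="inclusive")[98]
    return round(p99 * headroom, 1)


# Illustrative numbers only, not real measurements.
samples = [2.1, 2.4, 3.0, 2.2, 11.8, 2.6, 19.5, 2.3]
print(suggest_deadline(samples))
```

The point of using a high percentile rather than the mean is that the slow starts are the ones that matter: the fast majority never hit the deadline anyway.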
Third, the failure mode is silent. When the hourly run aborts at session(), the article just does not get mirrored. No notification, no retry queue. A missed mirror is discovered on the next manual status check, not immediately. That is the part worth fixing next.
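One low-effort way to make the failure visible is to record the outcome of every run somewhere a monitor can read. The sketch below is an assumption about what that might look like, not something the mirror job does today: `run_and_record`, the status file, and its JSON fields are all hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable


def run_and_record(job: Callable[[], None], status_path: Path) -> bool:
    """Run job(), write the outcome to status_path, and report success."""
    status = {"finished_at": None, "ok": True, "error": None}
    try:
        job()
    except Exception as exc:  # record any failure instead of dying silently
        status["ok"] = False
        status["error"] = f"{type(exc).__name__}: {exc}"
    status["finished_at"] = time.time()
    status_path.write_text(json.dumps(status))
    return status["ok"]
```

A caller could exit with a non-zero status when this returns False, so the failure is at least visible to whatever supervises the job, and a separate check could alert when the file shows `"ok": false` or has not been updated for over an hour.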
The architectural lesson is narrower than it sounds: when you own a singleton resource (a fixed port, a locked directory, a named pipe), recovery is not just "wait longer." It is "tear it down cleanly and rebuild it." Extra time only matters if the second attempt starts from a genuinely clean state.