Most fixes for CDP connection failures start the same way: bump the timeout. More seconds, less drama. But if Chrome is launching on a machine that is already doing other work, the number is not the problem. The assumption that the machine is idle is.
A Chrome DevTools Protocol (CDP) client attaches to headless Chrome during session setup. On a cold, idle machine that happens fast. On a machine already running generate and publish jobs on overlapping cron schedules, it can take a lot longer.
My mirror job publishes posts to Medium by driving headless Chrome through a CDP session. The hourly cron fired, the session setup timed out at 15 seconds, and the whole run aborted with nothing sent. This happened maybe once every few hours, always at the same line:
session()
The 15-second default was the symptom. The real cause was contention: a generate job and the mirror job sharing the machine and sometimes firing at the same time. A cold Chrome launch does not get a quiet machine; it gets a busy one.
The obvious fix is more time. The less obvious part is what happens when Chrome does not just time out but stalls: a process still holding the port and the profile directory. If you attempt a new launch without clearing that first, you will fail immediately, no matter how generous your new timeout is.
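A minimal sketch of how that stalled state can be detected before a relaunch, assuming the CDP endpoint listens on localhost. The function name `port_in_use` is hypothetical, and port 9222 in the usage note is only a common choice for Chrome's `--remote-debugging-port` flag; the port this job actually uses is not stated here. The check covers the port only, not the profile directory.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise,
        # so it does not raise when the connection is refused.
        return sock.connect_ex((host, port)) == 0
```

Calling `port_in_use(9222)` before launching would tell you whether an earlier Chrome is still listening. If it is, a fresh launch on the same port is likely to fail no matter how long the timeout is.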
Here is what I changed:
Extended the deadline to 45 seconds. That covers cold launch on a busy machine with margin left over, without making a real failure just sit there burning time.
On RuntimeError, the code now shuts down any Chrome process squatting on the CDP port and the profile path, then retries exactly once. One retry. If that also fails, the error surfaces and the run stops. No silent swallowing, no infinite loop.
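The retry-once shape described above can be sketched as follows. This is an illustration, not the job's actual code: `with_one_retry`, `start`, and `cleanup` are hypothetical names, and the cleanup step (shutting down whatever Chrome process holds the port and profile path) is passed in as a callable because how you do that depends on your platform.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_one_retry(start: Callable[[], T], cleanup: Callable[[], None]) -> T:
    """Call start(); on RuntimeError, run cleanup() and try exactly once more."""
    try:
        return start()
    except RuntimeError:
        # Clear the stale state first, otherwise the second attempt
        # hits the same held port and profile directory.
        cleanup()
        # No try/except here: a second failure propagates to the caller,
        # so the run stops loudly instead of looping or swallowing the error.
        return start()
```

In this sketch `start` would wrap the `session()` call with the 45-second deadline, and `cleanup` would be whatever routine kills the stale Chrome. Only `RuntimeError` triggers the retry; any other exception surfaces immediately.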
The tradeoff: a failed run now takes up to about 90 seconds (two 45-second attempts, plus cleanup) before giving up, instead of 15. For an hourly job that is fine. For a real-time system it would not be. Know your cadence before you copy this.
What I would do differently: log Chrome launch duration from day one. The tail distribution on a busy machine looked nothing like what I saw in development. If I had been tracking that number, I would have found this before the cron job found it for me.
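One way to track that number, sketched with the standard library only. The names `timed` and `chrome_launch` are hypothetical, and the optional `record` list is just a convenience for collecting samples in memory; a real job would more likely ship the value to whatever metrics or log pipeline it already has.

```python
import logging
import time
from contextlib import contextmanager
from typing import Iterator, List, Optional

logger = logging.getLogger("chrome_launch")


@contextmanager
def timed(label: str, record: Optional[List[float]] = None) -> Iterator[None]:
    """Log how long the wrapped block took, whether it succeeded or raised."""
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock changes
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        logger.info("%s took %.2fs", label, elapsed)
        if record is not None:
            record.append(elapsed)
```

Wrapping the session setup in `with timed("chrome launch"):` logs the duration of every run, including the ones that time out, which is exactly the tail you want to see.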
Four lines of code. The actual lesson is that any startup sequence depending on a resource you do not own exclusively needs a recovery path, not just a longer timeout.