DEV Community

Deva
Deva

Posted on

Why Your Headless Chrome Session Fails Silently on a Busy Machine

What kills an hourly automation job that works fine in isolation but fails one in five times under real load?

For me it was a 15 second timeout on a cold Chrome launch. Not a bug in the scraping logic. Not a broken selector. Just a number baked in as a default that was never tested against a machine doing other things at the same time.

Here is what was happening. I run a Medium mirror job every hour. It opens a headless Chrome via CDP, logs into Medium, scrapes the post, and publishes it. The job shares a schedule with content generation and publish jobs. When those fire at the same time, the machine is already maxed out: ffmpeg frames being encoded, claude p shells spawned, SQLite writes happening. Cold Chrome launch on a loaded CPU can take 30 to 45 seconds. My session setup was timing out at 15 and raising a RuntimeError that killed the whole run at the session() call, before any mirroring happened.

The fix has two parts.

First, bump the deadline from 15s to 45s. That is the obvious part. If your machine can be busy, your timeout has to reflect worst case startup time, not average case. Average is a lie.

Second, handle the half dead Chrome problem. When CDP brings up a browser and the connection times out, Chrome is not necessarily dead. It may still be sitting on the port or holding the profile directory lock. If you retry immediately without cleaning up, the new launch fails too because the old one is still there. So on RuntimeError, I now shut down any Chrome process squatting the same port and profile path before retrying. One retry. If it fails again, the job logs and exits cleanly rather than wedging the next run.

The tradeoff worth naming: a 45s deadline means the outer job now blocks for up to 45 seconds before it knows something is wrong. In a system that runs hourly that is acceptable. In a system that needs sub minute turnaround it would not be. The right ceiling depends entirely on your cadence.

The retry once pattern is also a deliberate constraint. I could retry three times, back off exponentially, and keep going. I did not, because two consecutive failures on a port and profile pair means something is actually broken, not just slow. A retry loop that keeps spinning on a genuinely dead Chrome wastes time and leaves zombie processes. One retry catches transient load spikes. More than one masks real failures.

What I would do differently from the start: treat the initial timeout as a parameter surfaced in config, not a constant in the function. I only noticed the 15s default when production broke it. If it had been in a config file with a comment explaining it covered cold start time, the relationship between machine load and launch latency would have been obvious earlier.

The broader point: any automation that depends on an external process starting within a fixed window needs that window calibrated against real world conditions, not a development laptop. Flakiness at 20% load is often a timeout that was tuned at 0% load. Before you go hunting for logic bugs, check your deadlines.

Top comments (0)