DEV Community

Deva
Deva

Posted on

My headless Chrome timeout was 15 seconds and my machine needed 40

The hourly mirror job was dying silently. Not a network error, not a bad response from the target site. It was dying at session(), the line where Playwright connects to the CDP socket and hands back a browser context. Cold Chrome launch on a busy machine was taking longer than 15 seconds, and 15 seconds was the deadline I had set.

The machine runs content generation and publishing jobs on the same schedule. When those fire, the CPU spikes. Chrome, starting cold into a headless sandbox without a warm profile cache, can take 30 to 45 seconds to get its CDP socket listening. I had picked 15 seconds because that felt generous on an idle machine. It was not generous on a loaded one.

The fix has two parts.

First, raise the deadline to 45 seconds. That covers the actual worst case I observed. If you set a timeout by feel rather than by measurement, you will eventually see the feel case fail in production. The right way to pick a number is to instrument the launch, collect a sample across machine states, and pick the 99th percentile with a buffer. I did not do that the first time. I guessed, and the guess was wrong.

Second, on RuntimeError during bring up, kill any half dead Chrome that is squatting the port or the profile directory, then retry once. This matters because a failed launch does not always clean up after itself. If Chrome starts, acquires the port, then crashes before the CDP socket is ready, the next attempt will fail immediately with a port conflict rather than a timeout. You end up in a state where retrying naively does nothing. The recovery step is: find the process holding the port, kill it, wipe the lock files in the profile directory if they exist, then try again. One retry is enough. If it fails twice in a row the problem is something other than a race condition on bring up.

The tradeoff is latency. A 45s deadline means a failed launch now blocks the job for 45 seconds before erroring out. On an hourly schedule that is fine. On a high frequency job it would not be. If you care about fast failure, the right move is to separate the deadline for "waiting for Chrome to start" from the deadline for "Chrome started but CDP never responded," because those two failure modes have different acceptable wait budgets. I did not split them here. I made the whole window wider. For an hourly mirror job running on a shared machine, that tradeoff is worth it.

What I would do differently next time: instrument the bring up time from the first run. Log how long Chrome takes to get the CDP socket ready across different machine load states. Set the deadline from data, not intuition. And write the recovery path before you need it, not after you see the half dead Chrome squatting the port at 3am.

The failure mode here is not exotic. It shows up any time you run a resource hungry process on a machine that does other things. The fix is not complicated. The only mistake was assuming the happy path timing held under load.

Top comments (0)