DEV Community

Deva

The 15-Second CDP Timeout That Was Silently Killing My Mirror Run

The 15-second CDP timeout was wrong the moment I shared the machine with other jobs.

My Medium mirror runs hourly. It spins up a headless Chrome session, authenticates, scrapes the post, and ships it. On an idle machine the whole thing is fast. But this machine also runs content generation and publish jobs on overlapping schedules. When those fired first, the Chrome launch in session() would run past 15 seconds and abort, often enough to make the mirror unreliable.

The failure mode was clean but brutal: RuntimeError at session start, no post mirrored, no retry. An hour lost.

The fix has two parts. First, raise the deadline. 45 seconds is generous enough to survive a busy machine without being so long that you wait forever on a genuinely dead process. The number is not scientific: it is based on observing that cold Chrome launches under load peaked around 30 seconds, so 45 gives 50% headroom and still fails fast against actual hangs.
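A minimal sketch of a launch deadline like this, not the mirror's actual code. It assumes Chrome was started with remote debugging enabled and polls the DevTools HTTP endpoint `/json/version` until it answers. The function name `wait_for_cdp` and the constant names are hypothetical.

```python
import time
import urllib.request

OBSERVED_PEAK_SECONDS = 30.0                        # cold launch under load, from the post
CDP_DEADLINE_SECONDS = OBSERVED_PEAK_SECONDS * 1.5  # 50% headroom = 45 s

# Bypass any proxy settings in the environment; the endpoint is local.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_cdp(port: int, deadline: float = CDP_DEADLINE_SECONDS,
                 poll: float = 0.25) -> None:
    """Block until the DevTools endpoint answers, or raise RuntimeError."""
    url = f"http://127.0.0.1:{port}/json/version"
    give_up_at = time.monotonic() + deadline
    while time.monotonic() < give_up_at:
        try:
            with _opener.open(url, timeout=1.0) as resp:
                if resp.status == 200:
                    return
        except OSError:
            # URLError and socket timeouts are OSError subclasses:
            # nothing is listening yet, so keep polling.
            pass
        time.sleep(poll)
    raise RuntimeError(
        f"CDP endpoint on port {port} not ready after {deadline:g}s"
    )
```

Using `time.monotonic()` keeps the deadline honest if the wall clock is adjusted while the job waits.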

Second, the retry. On RuntimeError, the naive path is to just fail. But in practice a failed launch often leaves a Chrome process squatting the port and the profile directory in a broken state. If you retry immediately without cleaning that up, the next launch fails too, for a different reason. So the recovery step is: shut down whatever is squatting the port and profile, then try once more.

Once. Not in a loop.

This is the part worth thinking about. You could write a retry loop with backoff and call it robust. But a retry loop on a subprocess launch is a footgun. If something is genuinely wrong (wrong binary path, missing profile, OS-level resource exhaustion), a three-attempt loop just delays your discovery of it by three times the deadline, plus backoff. One retry is enough to handle the transient case (resource contention, flaky OS scheduler) while making permanent failures obvious quickly.
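The shape of "clean up, then try exactly once more" can be sketched as below. `launch_with_one_retry` is a hypothetical helper; `launch` and `cleanup` stand in for whatever starts the session and whatever clears the port and profile.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def launch_with_one_retry(launch: Callable[[], T],
                          cleanup: Callable[[], None]) -> T:
    """Call launch(); on RuntimeError, clean up and try exactly once more."""
    try:
        return launch()
    except RuntimeError:
        # The failed attempt may have left a process on the port and a
        # locked profile, so clear that state before the second attempt.
        cleanup()
        # No try/except here: a second failure propagates to the caller,
        # so permanent problems surface after two deadlines, not more.
        return launch()
```

There is deliberately no attempt counter or backoff parameter. Adding one would turn this back into the loop the post argues against.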

The cleanup itself is blunt: find processes on the CDP port, kill them, clear the profile lock file if it exists. Not elegant. This is not a place for elegance. You want state clean before the retry lands, and the shortest path to that is kill and clear.
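A rough sketch of kill and clear, under several assumptions: a Unix-like host with `lsof` installed, and a profile lock named `SingletonLock` in the profile directory (the name Chrome uses on Linux, assumed here rather than taken from the post). All function names are hypothetical.

```python
import os
import signal
import subprocess
from pathlib import Path


def parse_pids(lsof_output: str) -> list[int]:
    """Turn `lsof -t` output (one PID per line) into a list of ints."""
    return [int(tok) for tok in lsof_output.split() if tok.isdigit()]


def pids_listening_on(port: int) -> list[int]:
    """PIDs with a TCP listener on `port`; empty if none or lsof is missing."""
    try:
        result = subprocess.run(
            ["lsof", "-t", f"-iTCP:{port}", "-sTCP:LISTEN"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return []
    return parse_pids(result.stdout)


def clear_profile_lock(profile_dir: Path) -> None:
    """Remove the profile lock if present (assumed name: SingletonLock)."""
    (profile_dir / "SingletonLock").unlink(missing_ok=True)


def kill_and_clear(port: int, profile_dir: Path) -> list[int]:
    """Kill whatever listens on `port`, clear the lock, return killed PIDs."""
    killed = []
    for pid in pids_listening_on(port):
        try:
            os.kill(pid, signal.SIGKILL)
            killed.append(pid)
        except ProcessLookupError:
            pass  # the process exited between the lsof call and the kill
    clear_profile_lock(profile_dir)
    return killed
```

`SIGKILL` gives the process no chance to shut down cleanly, which matches the blunt intent here. Anything sharing that port gets killed too, so this only makes sense when the port is dedicated to the job.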

What I would do differently: the 45-second deadline should be configurable, not a literal. The mirror job runs on a machine I control with a known load profile, so a literal is fine for now. But the moment you run this on shared infrastructure or CI, a hardcoded 45 seconds is going to be wrong for someone. Make it an env var, default to something sane, document it.
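One way the env var could look, as a sketch. The variable name `CDP_LAUNCH_TIMEOUT_SECONDS` and the helper `cdp_deadline_seconds` are made up for illustration.

```python
import math
import os

ENV_VAR = "CDP_LAUNCH_TIMEOUT_SECONDS"  # hypothetical name


def cdp_deadline_seconds(default: float = 45.0) -> float:
    """Read the launch deadline from the environment, else use the default."""
    raw = os.environ.get(ENV_VAR, "").strip()
    if not raw:
        return default
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{ENV_VAR} must be a number, got {raw!r}") from None
    # Reject zero, negatives, NaN and infinity: none of them is a deadline.
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"{ENV_VAR} must be a positive number, got {raw!r}")
    return value
```

Failing loudly on a bad value is a design choice: silently falling back to 45 would hide a typo in the job's configuration.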

The other thing: log the process state before the kill. Right now the recovery step logs that it found a stale Chrome and killed it, but not what that process was doing or how long it had been alive. That information would matter the next time this fires and I am trying to understand whether contention is getting worse.
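A sketch of what that extra logging might capture, assuming a Unix-like host with `ps` available. It records elapsed time, CPU share, and the command line before the kill. The logger name and both function names are hypothetical.

```python
import logging
import subprocess

log = logging.getLogger("mirror.recovery")  # hypothetical logger name


def describe_process(pid: int) -> str:
    """One-line snapshot of a process: elapsed time, CPU %, command line."""
    try:
        result = subprocess.run(
            # Separate -o options with empty headers are the portable way
            # to print several columns with no header row.
            ["ps", "-o", "etime=", "-o", "pcpu=", "-o", "args=",
             "-p", str(pid)],
            capture_output=True, text=True, check=False,
        )
        detail = result.stdout.strip() if result.returncode == 0 else ""
    except FileNotFoundError:
        detail = ""  # ps is not installed on this host
    return f"pid={pid} {detail or '<no ps output>'}"


def log_before_kill(pid: int) -> None:
    """Record what the process looked like just before it is killed."""
    log.warning("killing stale Chrome: %s", describe_process(pid))
```

The elapsed-time column is the useful one for trend-watching: a process that has been alive for close to the launch deadline points at contention, while one alive for hours points at a leak from an earlier run.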

For a mirror job that runs every hour and needs to stay reliable without babysitting, 45 seconds and one retry is the right posture. The mirror has been clean since the fix landed.
