A production Celery task in Codens Green started returning this, intermittently, only under real load:
Control request timeout: initialize
The string is suspiciously specific. It looks like the kind of message you would see if Claude Code CLI's MCP initialization handshake had timed out on the other side of a pipe. That is what it sounds like. That is not what it was.
The task is analyze_code_specification. It spawns Claude Code CLI as a subprocess to analyze a repository against a PRD. It worked in staging, worked locally, worked in CI. It failed in production a few times a day, almost always when more than one analysis was running at the same time.
What we eventually shipped: route that task to a dedicated Celery queue, run that queue on a separate ECS Fargate worker tier with 8 GB of memory, pin concurrency to 1. The real bug was the Linux kernel OOM killer terminating Claude Code CLI partway through startup, before it could complete its handshake with the parent task. The misleading log line was just what survives when a child process is shot in the head mid-init.
This is the chase.
The wrong paths
I spent the better part of a day inside Claude Code CLI's initialization code path, because that is where the error string lived.
First theory: stdio buffering. The CLI talks to the parent over stdin/stdout. If the parent is not reading fast enough, the child can block on a full pipe and look like it is hanging. I added explicit buffer drains, raised the timeout, switched to line-buffered mode on both sides. The error still happened.
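For reference, the mitigation looked roughly like the sketch below: drain stdout and stderr concurrently so the child can never block on a full pipe, and put the timeout around the whole drain. The helper names (drain, run_cli) and the timeout value are illustrative, not our production code.

import asyncio

async def drain(stream: asyncio.StreamReader, sink: list) -> None:
    # Read line by line so the child can never block on a full pipe buffer.
    while True:
        line = await stream.readline()
        if not line:
            break
        sink.append(line)

async def run_cli(cmd: list, timeout: float = 300.0) -> int:
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = [], []
    # Drain both pipes concurrently, with the timeout around the whole drain.
    await asyncio.wait_for(
        asyncio.gather(drain(proc.stdout, out), drain(proc.stderr, err)),
        timeout=timeout,
    )
    return await proc.wait()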
Second theory: MCP protocol version mismatch. Maybe a recent Claude Code update changed the init handshake and our version pin was stale. I diffed the changelog, compared protocol versions across our deployed image and a known-good local environment. They matched.
Third theory: a bug in the agent SDK config. We pass a lot of options into the CLI. Maybe one of them was triggering a slow path during init that exceeded the handshake budget. I trimmed the config down to the smallest reproducible set, then to nothing. Same error in production. Still nothing in staging.
Fourth theory, the one I am least proud of: maybe Claude Code itself has an upstream init bug under concurrent load. I drafted half of a GitHub issue before I noticed I had no actual evidence and was just frustrated.
None of these held up. The fingerprint of the failure, intermittent, only under load, only in production, did not match any of them. Buffering bugs are deterministic. Protocol mismatches are deterministic. Config bugs are deterministic. This was load-correlated. That is a different shape of problem.
The exit code
The thing that finally cracked it was looking at the subprocess exit code instead of the log message. We were capturing the error string before we captured returncode, and the error string was so plausible it had crowded out the rest of the diagnostic surface.
proc = await asyncio.create_subprocess_exec(*cmd, ...)
stdout, stderr = await proc.communicate()
if proc.returncode != 0:
    logger.error("claude code failed rc=%s", proc.returncode)
The value coming out was -9.
On POSIX, when subprocess reports a negative return code, the absolute value is the signal that killed the child. Signal 9 is SIGKILL. SIGKILL cannot be caught, cannot be handled, cannot be cleaned up after. The process is removed from the run queue. There is exactly one common source of SIGKILL on Linux that arrives without a parent or operator sending it on purpose: the kernel OOM killer.
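In code, the check is one branch. A negative returncode means a signal, and the standard signal module turns it back into a name; a sketch of the guard, not our exact error path:

import signal

if proc.returncode is not None and proc.returncode < 0:
    # Negative return code: the child was killed by a signal, not by its own exit().
    sig = signal.Signals(-proc.returncode)
    logger.error("claude code killed by signal %s (%d)", sig.name, sig.value)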
That was the moment. This is no longer a Claude Code problem. This is an OS-level problem. The CLI had not timed out during initialization. The CLI had been shot during initialization, by the kernel, for using too much memory.
The "Control request timeout: initialize" message was a downstream symptom. The parent task was waiting for the child to finish its handshake. The child was killed mid-handshake. The parent eventually gave up waiting and surfaced the most specific thing it knew, which was that init had not completed in time. The error was technically true and completely misleading.
OOM math
Once you know the shape, the math is easy.
Claude Code CLI is not a small process. It boots a JavaScript runtime, loads the agent SDK, hydrates context, and prepares for tool calls. In our workload, resident memory per invocation sits between roughly 500 MB and 1.5 GB, peaking higher during initial context load.
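One way to get numbers like these for your own workload is to sample the child's resident set size while it runs. A sketch using psutil; the helper name and sampling interval are illustrative, and it only watches the direct child process:

import time
import subprocess

import psutil  # third-party; pip install psutil

def peak_rss_mb(cmd: list, interval: float = 0.5) -> float:
    # Launch the CLI and sample its resident set size until it exits.
    child = subprocess.Popen(cmd)
    proc = psutil.Process(child.pid)
    peak = 0
    while child.poll() is None:
        try:
            peak = max(peak, proc.memory_info().rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(interval)
    return peak / (1024 * 1024)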
Our Celery worker pool was the general-purpose one. Sized for the rest of our tasks, which are normal Python work: webhook fan-out, database writes, small HTTP calls. Those tasks live happily in well under 200 MB each. The worker host had memory headroom appropriate to that profile, with default Celery concurrency, which spins up multiple worker processes per host so several tasks run in parallel.
That is fine for normal traffic. It is not fine when two of those parallel tasks each decide to spawn a 1+ GB CLI subprocess.
Picture the failure mode. Two PRDs are submitted within the same minute. Two Celery workers pick up analyze_code_specification. Each launches Claude Code CLI. Both CLIs start allocating. The host's resident memory climbs past its limit. The kernel's OOM killer wakes up and picks a victim, typically the process with the largest memory footprint. Claude Code CLI dies with SIGKILL. The Celery task surfaces "Control request timeout: initialize" because that is what it saw from its end of the pipe. The other task may or may not also die, depending on timing.
The reason this never showed up in staging was simple: staging has one user, me, running one job at a time. Concurrency was always 1 by accident. The bug needed two simultaneous invocations on the same host to express itself.
The fix, in four parts
I did not want to over-engineer this. The fix is structurally small. It is mostly Celery routing and infra sizing.
1. Dedicated queue. analyze_code_specification got its own queue, separated from everything else.
# celery_app.py
task_routes = {
"tasks.analyze_code_specification": {"queue": "analysis"},
"tasks.run_fix": {"queue": "fixing"},
"tasks.control_plane.*": {"queue": "control_plane"},
"tasks.plan_monitor.*": {"queue": "plan_monitor"},
# everything else falls through to "default"
}
The point of the queue split is not load balancing. It is so we can attach a different worker profile to this task without changing anything about the others.
2. Dedicated ECS Fargate worker tier. The analysis queue gets its own worker service, on its own Fargate task definition, with 8 GB of memory. The rest of the workers stay on the smaller general-purpose host. One service, one queue, one process shape.
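For concreteness, this is the shape of that task definition if you register it through boto3. The family, image, and CPU value are illustrative; the 8 GB memory setting is the part that matters, however your infrastructure is actually defined:

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="analysis-worker",               # illustrative name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",                             # 1 vCPU pairs validly with 8 GB on Fargate
    memory="8192",                          # the 8 GB of headroom discussed above
    containerDefinitions=[
        {
            "name": "celery-analysis-worker",
            "image": "<worker-image>",      # placeholder
            # the worker command from part 3 below goes in "command"
        }
    ],
)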
3. Concurrency = 1. The worker for the analysis queue starts like this:
celery -A app worker -Q analysis --concurrency 1 --loglevel info
This is the load-bearing piece. Even on an 8 GB host, if you let two CLI invocations run in parallel, you can still blow past the limit when both peak at 1.5 GB at the same time and the OS plus worker plus everything else has its own footprint. Concurrency 1 means exactly one Claude Code CLI subprocess exists on this host at any time. Two analyses come in, the second one queues, waits, runs next. Slower, totally fine, never OOMs.
4. Memory headroom. 1 CLI × roughly 1.5 GB peak × concurrency 1, against 8 GB total, with the worker process and OS taking a few hundred MB. That gives more than 5 GB of headroom for a worst-case CLI invocation. If we ever needed to raise concurrency to 2, we would also need to either double the instance size or accept the OOM risk back. We chose not to.
We also added regression tests at the routing layer, asserting that analyze_code_specification resolves to the analysis queue, that control-plane tasks do not accidentally get rerouted there, and that plan-monitor isolation is preserved. The routing dict is the kind of thing that quietly bit-rots in a PR review, and a misroute would silently bring the bug back.
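A sketch of what those tests can look like against the routing dict above; file and test names are illustrative:

# test_task_routing.py
from celery_app import task_routes

def test_analysis_task_routes_to_analysis_queue():
    assert task_routes["tasks.analyze_code_specification"]["queue"] == "analysis"

def test_control_plane_tasks_are_not_rerouted():
    assert task_routes["tasks.control_plane.*"]["queue"] == "control_plane"

def test_plan_monitor_isolation_is_preserved():
    assert task_routes["tasks.plan_monitor.*"]["queue"] == "plan_monitor"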
Tradeoffs
The dedicated worker tier is more expensive per task than just bumping the general worker's RAM. It scales slower under burst load because the queue depth gates throughput. It is one more service to deploy, monitor, alert on, and update during a Claude Code CLI version bump. None of that is free.
What we got in return is that this failure mode cannot happen anymore for any reason that is not "we accidentally raised concurrency above 1." That is a single config line in one repo with a test guarding it. I will take that tradeoff.
What generalizes
Two things stuck with me after this.
One: when a child process surfaces a plausible-sounding error during a handshake, check returncode before you check the message. A negative return code on POSIX is a different category of failure from anything the application itself can report. A negative number is the OS telling you the application never got a chance.
Two: per-task memory profiles matter for Celery worker sizing in a way that defaults do not protect you from. A worker pool tuned for 200 MB tasks will silently kill a 1.5 GB task and tell you something else happened. If your task spawns a subprocess that is heavier than your worker, the right answer is almost always a separate queue with its own concurrency and its own host, not a bigger general-purpose host.
We build Codens, an AI dev harness with this kind of analysis baked in. https://www.codens.ai/en/