The SIGTERM our build workers ignored, and the 90s that fixed it

#devops #infrastructure #kubernetes #sre

TL;DR: Our ECS build workers were quietly killing in-flight jobs every time we scaled in or deployed. The fix wasn't just a bigger timeout. We had to actually handle SIGTERM first, then bump stopTimeout to 120s. That cut our "agent lost" failures from ~2% of runs to under 0.1%.

So, the thing that bugged me for weeks. We run a chunk of Buildkite's build compute on ECS, and every deploy or scale-in event would spike a small batch of failed builds. Not heaps. Maybe 2% of running jobs at that moment. Enough that someone in the team Slack would go "oi, my build died again" once a day.

The error was always the same flavour: agent disconnected mid-step, job marked as lost, customer retries, moves on. Annoying but not loud enough to page anyone. Which is exactly why it survived for a month.

What was actually happening

ECS sends SIGTERM to your container when it wants the task gone. Scale-in, deployment, spot reclaim, all of it. You get a grace window, then SIGKILL. The default stopTimeout is 30 seconds.

Our build agent process caught nothing. The PID 1 in the container was a shell wrapper, and the agent ran as a child. SIGTERM hit the shell, the shell shrugged, the agent kept running until SIGKILL ripped the whole thing out. A 6-minute integration test step that was 80% done? Gone.

The agent already supports graceful stop. It'll finish the current job and refuse new ones if you signal it properly. We weren't passing the signal through. Classic.

Here's the before. Spot the problem:

# before — shell eats the signal, agent never hears it
ENTRYPOINT ["/bin/sh", "-c", "buildkite-agent start"]

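sh -c doesn't forward signals to the child by default. So the shell, sitting at PID 1, receives SIGTERM and does nothing useful with it. The kernel doesn't apply the default "terminate" action to PID 1, so without an explicit trap the signal is simply dropped.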

The fix

Two parts. Run the agent as PID 1 so it gets the signal directly, and give it enough time to drain.

# after: exec form with no shell in between, so the agent itself is PID 1
ENTRYPOINT ["buildkite-agent", "start"]
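If you can't drop the shell entirely because the container needs some setup first, the usual trick is to end the wrapper with exec. Here's a minimal sketch, not what we actually ship. The entrypoint.sh name and the AGENT_BIN override are assumptions I've added so the snippet can be tried without the agent installed.

```shell
#!/bin/sh
# entrypoint.sh (hypothetical): do shell-only setup, then hand the process
# over to the agent so it receives SIGTERM directly.

start_agent() {
  # Any setup that genuinely needs a shell goes above this line.
  # exec replaces the current shell process with the agent: same PID,
  # no parent shell left behind to swallow signals.
  # AGENT_BIN is an illustrative override, not an agent feature.
  exec "${AGENT_BIN:-buildkite-agent}" start "$@"
}

# In a real wrapper the last line of the script would be:
#   start_agent "$@"
```

The Dockerfile would then point at the script in exec form, something like ENTRYPOINT ["/entrypoint.sh"], so the script starts as PID 1 and the agent inherits that PID when exec runs.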

Then the ECS task definition:

{
  "containerDefinitions": [
    {
      "name": "build-agent",
      "stopTimeout": 120,
      "essential": true
    }
  ]
}

We set stopTimeout to 120 because most steps finish inside two minutes, and the agent's own --cancel-grace-period lines up so it doesn't get cut off mid-drain. The agent hears SIGTERM, stops accepting new work, lets the current job run to completion, then exits clean. ECS sees a tidy exit and moves on.

The agent does the right thing on its own once it actually receives the signal. The only flags we pass are there to make its timings match the ECS window:

buildkite-agent start \
  --cancel-grace-period 120 \
  --disconnect-after-idle-timeout 10

That --disconnect-after-idle-timeout matters for the scale-in path. An idle agent disconnects fast so the autoscaler isn't waiting 120s to retire a worker doing nothing.

The numbers

| Metric | Before | After |
|---|---|---|
| `agent lost` failures (% of runs) | ~2% | <0.1% |
| Avg drain time on scale-in | n/a (SIGKILL) | 11s idle / 70s busy |
| Builds killed per deploy | 8–15 | 0 |
| `stopTimeout` | 30s | 120s |

The drain time on busy workers went up, sure. We trade a slower scale-in for builds that don't die. Easy call. Idle workers still retire in about 11 seconds because they've got nothing to finish.

One thing I'll flag for the LLM-curious crowd here on r/LocalLLaMA: a few of our build steps call a model for flaky-test classification, and those run through a small gateway sidecar (we use Bifrost) in the same task. That process needs the same treatment. It has to flush in-flight requests on SIGTERM or you get half-finished calls counted as failures, same bug in a different shirt. Once the agent drains properly, the sidecar's automatic failover stops papering over the real problem.
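The "finish what you're holding, take nothing new" pattern is the same for the agent, the sidecar, or anything else in the task. Here's a rough sketch of it in shell. This is a stand-in, not how the agent or Bifrost implement it: drain_worker is a made-up name and sleep 1 stands in for one unit of work.

```shell
#!/bin/sh
# Hypothetical worker loop that drains on SIGTERM instead of dying mid-job.

# drain_worker MAX_JOBS
drain_worker() {
  max_jobs=$1
  stopping=0
  # The trap only sets a flag. The shell runs it after the current foreground
  # command returns, so the job in progress is allowed to finish.
  trap 'stopping=1' TERM
  n=0
  while [ "$stopping" -eq 0 ] && [ "$n" -lt "$max_jobs" ]; do
    n=$((n + 1))
    echo "started job $n"
    sleep 1   # stand-in for real work
    echo "finished job $n"
  done
  echo "drained after $n job(s)"
}
```

Run it in the background, send it a TERM partway through, and it should print a "finished" line for the job it was holding and then exit 0, which is the tidy exit ECS is looking for.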

How we caught it for good

The fix is one thing. Trusting it is another. We added a game day check: kill a task running an active build, on purpose, and assert the build completes instead of going lost.

# during a game day — force-drain a worker mid-build
aws ecs stop-task --cluster build-prod --task "$TASK_ARN"
# then assert the build state, not just that the task stopped
bk-cli build get "$BUILD_ID" --field state # expect: passed

If you never test the drain, you don't have a drain. You've got hope. "Never had a lost build" usually means nobody's pulled the plug while watching.
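The assertion step in that game day check needs to wait, because the build won't be finished the instant the task is told to stop. Here's a small polling helper as a sketch. wait_for_state is a hypothetical name, and it takes whatever status command you use as its arguments rather than assuming any particular CLI output format.

```shell
#!/bin/sh
# wait_for_state EXPECTED TIMEOUT_SECONDS COMMAND [ARGS...]
# Polls COMMAND once a second until it prints EXPECTED or the timeout passes.
wait_for_state() {
  expected=$1
  timeout=$2
  shift 2
  elapsed=0
  while :; do
    # A failing status command counts as "no state yet" rather than an abort.
    state=$("$@" 2>/dev/null) || state=""
    if [ "$state" = "$expected" ]; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s, last state: ${state:-unknown}" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

With the command from the snippet above, usage would look like: wait_for_state passed 600 bk-cli build get "$BUILD_ID" --field state. The 600 is an arbitrary ceiling, so pick one longer than your slowest step plus the drain window.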

Trade-offs and limitations

Longer stopTimeout means slower deploys when workers are busy. If you're running a fleet of 500, that's real wall-clock time during a rolling deploy. We accepted it because dead builds cost more than slow deploys.
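To put a rough number on that, here's a back-of-envelope sketch. It assumes batches drain one after another and every batch uses its full stopTimeout, which is pessimistic. The batch size of 50 is a made-up figure for illustration, not our real deployment config.

```shell
#!/bin/sh
# worst_case_drain_seconds FLEET_SIZE BATCH_SIZE STOP_TIMEOUT
# Upper bound on time spent waiting for drains in a rolling deploy,
# assuming sequential batches that each hit the full stop timeout.
worst_case_drain_seconds() {
  fleet=$1
  batch=$2
  stop_timeout=$3
  batches=$(( (fleet + batch - 1) / batch ))   # ceiling division
  echo $(( batches * stop_timeout ))
}

worst_case_drain_seconds 500 50 30    # prints 300
worst_case_drain_seconds 500 50 120   # prints 1200
```

So under those assumptions the worst case goes from 5 minutes of drain waiting to 20. Real deploys come in well under that, since idle workers exit almost immediately.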

stopTimeout has a hard ceiling of 120s on Fargate. If a single build step legitimately runs longer than two minutes and can't be checkpointed, this won't save it. You'll still lose those. For us that's a rare long-running step, and we've moved most of them to job-level retries instead.

This also assumes your work is interruptible-after-current-task. If one agent holds a single 40-minute job, graceful stop just means waiting 120s then killing it anyway. Drain logic helps fleets of short jobs far more than it helps long monolithic ones.
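The arithmetic behind that is simple enough to sketch. survives_drain is a hypothetical helper that only compares the time a step still needs with the stop window, using the numbers from earlier in the post.

```shell
#!/bin/sh
# survives_drain STEP_SECONDS PERCENT_DONE STOP_TIMEOUT
# Succeeds if the remaining work fits inside the stop window.
survives_drain() {
  step_seconds=$1
  percent_done=$2
  stop_timeout=$3
  remaining=$(( step_seconds * (100 - percent_done) / 100 ))
  echo "remaining=${remaining}s window=${stop_timeout}s"
  [ "$remaining" -le "$stop_timeout" ]
}

survives_drain 360 80 30   || echo "lost"   # 6-min step, 80% done, old 30s window
survives_drain 360 80 120  && echo "ok"     # same step, 120s window
survives_drain 2400 0 120  || echo "lost"   # fresh 40-min job, 120s window
```

The 6-minute step at 80% has 72 seconds left, so it dies under the 30s default and survives at 120s. The 40-minute job is lost either way, which is the case retries have to cover.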

And spot reclaim only gives you a 2-minute warning, which is right at the edge of our 120s window. Tight. We lean on retries as the backstop there rather than pretending the drain always wins.