DEV Community

Cover image for I Spent 40 Minutes at 11pm Debugging a Deploy That Wasn't Broken
Anguishe
Anguishe

Posted on • Originally published at bashsnippets.xyz

I Spent 40 Minutes at 11pm Debugging a Deploy That Wasn't Broken

I once spent forty minutes at eleven at night debugging a deploy that wasn't broken. The release script ran the database migration, the migration threw connection refused, the script exited non-zero, the deploy rolled itself back, and I got paged.

So I did the things you do. I read the migration. I read the logs. I checked the database — it was up, it was healthy, it accepted my connection instantly. I re-ran the deploy and it worked. I chalked it up to gremlins and went to bed, which is the part I'm not proud of, because it happened again two days later. That time I watched the timing: the script brought up a fresh database container and started the migration about six seconds before Postgres finished initializing and began accepting connections. The migration was racing the database's boot. Most of the time it won. The times it lost, I lost forty minutes.

The script wasn't wrong about anything except one assumption: that a dependency is ready the instant you ask for it.

In production, dependencies are eventually ready

That's the mental model shift. Networks blip. A service you call returns a 503 for the two seconds it takes to finish a rolling restart. An API rate-limits you with a 429 it fully expects you to retry. A fresh container's database isn't accepting connections for its first few seconds. Treating the first failure as fatal turns every one of these normal, transient conditions into a paged engineer — and the script that handles them isn't smarter — it declines to give up on the first try.

But retrying naively is its own trap. Retry instantly and you hammer a recovering service into staying down. Retry forever and a genuinely dead dependency hangs your script indefinitely. Retry a 404 and you wait a minute to confirm what you already knew. Good retries are bounded, backed off, and selective.

A retry function you can reuse anywhere

#!/bin/bash
# Purpose: survive transient failures instead of dying on the first error
set -euo pipefail

CHECK="✓"
CROSS="✗"

# retry <max_attempts> <command> [args...]
retry() {
    local max_attempts="$1"; shift
    local attempt=1
    local delay=1            # base delay — doubles each round
    local max_delay=30       # cap so the backoff never runs away

    until "$@"; do
        if (( attempt >= max_attempts )); then
            echo "$CROSS '$*' failed after $attempt attempts" >&2
            return 1
        fi
        # 0–2s of jitter so parallel callers don't all retry on the same beat
        local pause=$(( delay + RANDOM % 3 ))
        echo "$CROSS attempt $attempt failed — retrying in ${pause}s" >&2
        sleep "$pause"
        attempt=$(( attempt + 1 ))
        delay=$(( delay * 2 ))
        (( delay > max_delay )) && delay=$max_delay
    done

    echo "$CHECK '$*' succeeded on attempt $attempt"
}

# The actual fix for my 11pm deploy: wait for Postgres to accept connections.
retry 6 nc -z -w 2 db.internal 5432
echo "$CHECK database reachable — running migration"
Enter fullscreen mode Exit fullscreen mode

The whole engine is until "$@"; do ... done. until runs the command and executes the loop body only when it fails, exiting the instant it succeeds. Passing the command as "$@" (after shift-ing past the attempt count) means the function retries anything — a curl, an ssh, a port check, your own script — without caring what it is.

The backoff is the three lines at the bottom of the loop: sleep for the current delay plus a little jitter, then double the delay, capped at max_delay. That gives you 1s, 2s, 4s, 8s, 16s, 30s, 30s…

The jitter is not decoration

RANDOM % 3 looks trivial, and it's the line people delete to "clean up." Don't. Without jitter, a fleet of machines that all failed at the same instant — because the same service went down — will all retry at the same instant, and the same instant after that, producing a synchronized thundering herd that knocks the recovering service straight back over on every round. A few hundred milliseconds of randomness per client spreads the retries out so the service actually gets room to recover. At one machine it does nothing; at fifty it's the difference between recovery and a retry storm.

The mistake that makes retries dangerous

# Good: a transient failure that retrying can fix
retry 6 nc -z -w 2 db.internal 5432

# Bad: retrying a deterministic failure just delays the error 30 seconds
retry 6 curl -fsS https://api.example.com/v1/thing-that-returns-404
Enter fullscreen mode Exit fullscreen mode

A retry loop is only as smart as what you point it at. A port check belongs in a loop because the answer changes — "no" until the database boots, then "yes." A request that returns 404 returns 404 on attempt one and attempt six; the loop just postpones the failure and buries the real status under retry noise. Retry transient failures — timeouts, connection-refused, 429, 5xx, DNS hiccups. Don't retry deterministic ones — a 404, a 401, a syntax error, a missing file. When you can, branch on the exit code or HTTP status and loop only on the codes worth looping on.

For plain curl, its built-in --retry 5 --retry-delay 2 does most of this and is simpler; reach for the function when the thing you're retrying isn't curl, or when you want one backoff policy across a database probe, an ssh call, and a download at once.

Back to 11pm

That deploy never paged me again once the migration waited for the port instead of assuming it. The database still took its six seconds to boot, the network still blipped occasionally — retrying didn't make the dependencies faster. It stopped a normal, transient slowness from being treated as a fatal error, which is most of what "production-ready" means for a script.

Full function with the wait-for-port pattern and the guidance on which failures to retry: https://bashsnippets.xyz/snippets/bash-retry-with-backoff

Retries are the third leg of an unattended job: flock stops overlap, timeout stops hangs, retry rides out the blip. The Hardened Cron Wrapper Generator wires all three into one wrapper, Bash Scripts That Survive Cron is the end-to-end version, and the rest of the library is at https://bashsnippets.xyz

Top comments (0)