Symptom
everything looks fine in staging. then prod rolls and you see ghosts
- webhooks arrive before vector stores hydrate. first searches return empty even though data was uploaded
- agents call tools before secrets or policies load. 401 then silent retries
- queues scale while migrations are mid-flight. partial writes. compensations everywhere
- canary passes locally then stalls in prod because workers race each other
What is actually breaking
this is No 14 · Bootstrap ordering. there is no shared definition of ready. services boot in parallel without dependency fences. health checks return “alive” but not “serving with correct schema and indexes”. feature flags open early. the first real users become unpaid testers.
Before vs After
before
patch after execution. sleeps, exponential backoff, ad hoc retries, manual compensations. the same glitches return on every deploy.
after
install a before-execution firewall. each step must pass a readiness contract and an idempotency gate before it can run. warm the path, verify the store, pin versions, then open traffic. the fix becomes structural and it sticks.
60-second triage
- empty index probe: check `count(index_docs)` and `last_ingest_ts` right before the first search. if the count is near zero or the timestamp predates deploy start, search fired before ingest
- idempotency sanity: send the same webhook body twice. if you see two side effects, the edge is open without a dedupe key
- schema pin check: compute a `schema_hash` at boot for every service. if consumers disagree with producers, a migration ran without a barrier
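the schema pin check can be sketched as a stable hash over a canonical schema dump. a minimal sketch only; the schema dicts and field names below are illustrative assumptions, not a prescribed format:

```python
import hashlib
import json

def schema_hash(schema: dict) -> str:
    # canonical JSON (sorted keys, no whitespace) so the hash is order-stable
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# hypothetical producer and consumer schemas
producer = {"doc": {"id": "str", "text": "str", "embedding": "float[768]"}}
consumer = {"doc": {"id": "str", "text": "str"}}  # drifted: missing a field

if schema_hash(producer) != schema_hash(consumer):
    print("schema drift: a migration ran without a barrier")
```

compare the hashes at boot, not at first failure. a mismatch is a reason to refuse traffic, not a log line.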
Minimal fix · declare readiness and fence the edge
- ready is not alive: expose `/ready` that returns `schema_hash`, `index_ready`, `secrets_loaded`, `migrations_done`, `version_tag`. gate traffic on this endpoint, not on `/health`
- idempotency at the frontier: require an `Idempotency-Key` for all external triggers. record first seen. dedupe all retries
- warm the critical path: pre-create indexes and caches. upload one smoke doc. verify a search that must return it. only then open canary
- boot as a DAG: express startup as a small dependency graph. ingest waits for storage. search waits for ingest. agents wait for tools and secrets. no step runs until its parents are green
- fail-closed flags: the router stays closed until all ready bits are true. no “probably fine” leaks
- rollback order: declare the reverse graph for rollback. close router → drain queues → disable writers → revert schema → replay compensations if needed
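the ready-is-not-alive contract can be sketched as a status function over the ready bits. the bit names match the list above; everything else here is an assumption about your stack:

```python
# hypothetical ready bits, each populated by the boot step that owns it
ready_bits = {
    "schema_hash": "a1b2c3d4",   # informational, not a gate
    "version_tag": "v2025.1",    # informational, not a gate
    "index_ready": False,
    "secrets_loaded": False,
    "migrations_done": False,
}

def ready_status(bits: dict) -> int:
    # gate on the boolean bits only; stay 503 until every one is true
    gates = [v for v in bits.values() if isinstance(v, bool)]
    return 200 if gates and all(gates) else 503
```

serve this from `/ready` and point the router's traffic gate at it. `/health` can stay a cheap liveness check.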
Quick checks you can run today
- first search after deploy returns the smoke doc with stable citation ids
- vector store has non-zero size before the first user request
- logs carry a single boot_id across services so you can trace order
- duplicate external events never produce two side effects
- migrations are version-pinned and guarded by a barrier, not sleeps
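the duplicate-event check can be exercised with a tiny dedupe gate. a sketch only: in production the seen-set would live in a shared store such as a database table, not process memory:

```python
seen_keys = set()  # stand-in for a shared, durable store

def handle_once(idempotency_key: str, side_effect) -> bool:
    # record first seen; any retry with the same key is a no-op
    if idempotency_key in seen_keys:
        return False
    seen_keys.add(idempotency_key)
    side_effect()
    return True
```

send the same webhook body twice through this gate: exactly one side effect should fire.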
Tiny probe script
```python
from time import sleep, time

def wait_ready(checks, timeout=120, interval=2):
    start = time()
    while time() - start < timeout:
        if all(fn() for fn in checks):
            return True
        sleep(interval)
    return False

# supply your own checks:
# has_index(), has_smoke_doc(), secrets_loaded(), schema_hash_ok()
# usage:
# if not wait_ready([has_index, has_smoke_doc, secrets_loaded, schema_hash_ok]):
#     raise SystemExit("not ready, refuse to serve")
```
drop a small barrier like this in workers. refuse traffic if any check is false.
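the boot-as-a-DAG rule pairs well with the barrier: derive the startup order from an explicit graph instead of hoping services race correctly. the stdlib has a topological sorter for exactly this; the step names below are assumptions, not a prescribed graph:

```python
from graphlib import TopologicalSorter

# hypothetical boot graph: each step maps to the parents it waits on
boot_graph = {
    "storage": set(),
    "secrets": set(),
    "migrations": {"storage"},
    "ingest": {"storage", "migrations"},
    "search": {"ingest"},
    "agents": {"secrets", "search"},
    "router": {"search", "agents"},
}

# a valid startup order: no step appears before its parents
boot_order = list(TopologicalSorter(boot_graph).static_order())
```

run `wait_ready` at each step boundary and the router opens last by construction, not by luck.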
Hard fixes when minimal is not enough
- two-phase open: phase A warm and verify. phase B route a 1 percent canary. auto close if errors cross a threshold
- queue fences: consume only when `producer_version == consumer_version`. otherwise park messages in a holding queue
- migration contracts: forward and backward compatible schema with an explicit cutover time. refuse writes that straddle both worlds
- global idempotency tokens: one external event id can trigger at most one side effect across the graph
- backpressure ceilings: bounded concurrency during warmup so autoscalers do not stampede a cold dependency
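the queue fence can be sketched as a consume wrapper. the message shape and field names are assumptions about your broker payloads:

```python
holding_queue = []  # parked messages, replayed after cutover

def consume(msg: dict, consumer_version: str):
    # fence: process only version-matched messages, park the rest
    if msg.get("producer_version") != consumer_version:
        holding_queue.append(msg)
        return None
    return msg["payload"]
```

after cutover, drain `holding_queue` through the new consumer instead of dropping or double-processing mid-flight messages.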
WFGY guardrails that help here
- traceability contract: every request carries `boot_id`, `version_tag`, and the ready bits. you can prove the system was actually ready
- A B C acceptance: baseline vs with-firewall vs with-firewall plus canary. measure the first hour post-deploy. reject on empty-index queries or duplicate effects
- variance clamp for early traffic: during warmup use conservative decoding and strict tool fences. widen only after stability is proven
Acceptance targets before you call it fixed
- first search after deploy returns the smoke doc within one second and keeps stable citation ids
- duplicate external events produce exactly one side effect
- zero empty-index queries in hour one
- rollback completes without governance or rate-limit races
- three redeploys in a row show identical ready-bit order with a single boot graph in logs
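the first acceptance target can be automated as a probe you run right after deploy. `search_fn` and the result shape are assumptions about your stack, so adapt the field access:

```python
import time

def first_search_acceptance(search_fn, smoke_doc_id: str) -> bool:
    # first post-deploy search must return the smoke doc within one second
    start = time.monotonic()
    results = search_fn("smoke probe query")
    elapsed = time.monotonic() - start
    return elapsed < 1.0 and any(r["id"] == smoke_doc_id for r in results)
```

wire this into the deploy pipeline so a failing probe closes the router instead of paging a human.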
References you can use now
- Automation · Global Fix Map
- OpsDeploy · Global Fix Map
- RAG Architecture and Recovery
- Retrieval traceability
- Chunking to Embedding Contract
p.s. if you want a quick triage on a live trace, i keep an always-on “dr wfgy” mode. drop the shortest repro and i’ll map it to a No 14 fix and point at the exact page. no spam, minimal fix only.