Symptom
everything looks fine in staging. then prod rolls and you see ghosts
- webhooks arrive before vector stores hydrate. first searches return empty even though data was uploaded
- agents call tools before secrets or policies load. 401 then silent retries
- queues scale while migrations are mid-flight. partial writes. compensations everywhere
- canary passes locally then stalls in prod because workers race each other
What is actually breaking
this is No 14 · Bootstrap ordering. there is no shared definition of ready. services boot in parallel without dependency fences. health checks return “alive” but not “serving with correct schema and indexes”. feature flags open early. the first real users become unpaid testers.
Before vs After
before
patch after execution. sleeps, exponential backoff, ad hoc retries, manual compensations. the same glitches return on every deploy.
after
install a before-execution firewall. each step must pass a readiness contract and an idempotency gate before it can run. warm the path, verify the store, pin versions, then open traffic. the fix becomes structural and it sticks.
60-second triage
- empty index probe: check `count(index_docs)` and `last_ingest_ts` right before the first search. if the count is near zero or the timestamp predates deploy start, search fired before ingest
- idempotency sanity: send the same webhook body twice. if you see two side effects, the edge is open without a dedupe key
- schema pin check: compute a `schema_hash` at boot for every service. if consumers disagree with producers, a migration ran without a barrier
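the schema pin check can be sketched as a stable hash over a canonical schema dump. a minimal sketch only; the schema dicts and field names below are illustrative assumptions, not a prescribed format:

```python
import hashlib
import json

def schema_hash(schema: dict) -> str:
    # canonical JSON (sorted keys, no whitespace) so the hash is order-stable
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# hypothetical producer and consumer schemas
producer = {"doc": {"id": "str", "text": "str", "embedding": "float[768]"}}
consumer = {"doc": {"id": "str", "text": "str"}}  # drifted: missing a field

if schema_hash(producer) != schema_hash(consumer):
    print("schema drift: a migration ran without a barrier")
```

compare the hashes at boot, not at first failure. a mismatch is a reason to refuse traffic, not a log line.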
Minimal fix · declare readiness and fence the edge
- ready is not alive: expose `/ready` that returns `schema_hash`, `index_ready`, `secrets_loaded`, `migrations_done`, `version_tag`. gate traffic on this endpoint, not on `/health`
- idempotency at the frontier: require an `Idempotency-Key` for all external triggers. record first seen. dedupe all retries
- warm the critical path: pre-create indexes and caches. upload one smoke doc. verify a search that must return it. only then open canary
- boot as a DAG: express startup as a small dependency graph. ingest waits for storage. search waits for ingest. agents wait for tools and secrets. no step runs until its parents are green
- fail-closed flags: the router stays closed until all ready bits are true. no “probably fine” leaks
- rollback order: declare the reverse graph for rollback. close router → drain queues → disable writers → revert schema → replay compensations if needed
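the ready-is-not-alive contract can be sketched as a status function over the ready bits. the bit names match the list above; everything else here is an assumption about your stack:

```python
# hypothetical ready bits, each populated by the boot step that owns it
ready_bits = {
    "schema_hash": "a1b2c3d4",   # informational, not a gate
    "version_tag": "v2025.1",    # informational, not a gate
    "index_ready": False,
    "secrets_loaded": False,
    "migrations_done": False,
}

def ready_status(bits: dict) -> int:
    # gate on the boolean bits only; stay 503 until every one is true
    gates = [v for v in bits.values() if isinstance(v, bool)]
    return 200 if gates and all(gates) else 503
```

serve this from `/ready` and point the router's traffic gate at it. `/health` can stay a cheap liveness check.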
Quick checks you can run today
- first search after deploy returns the smoke doc with stable citation ids
- vector store has non-zero size before the first user request
- logs carry a single boot_id across services so you can trace order
- duplicate external events never produce two side effects
- migrations are version-pinned and guarded by a barrier, not sleeps
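the duplicate-event check can be exercised with a tiny dedupe gate. a sketch only: in production the seen-set would live in a shared store such as a database table, not process memory:

```python
seen_keys = set()  # stand-in for a shared, durable store

def handle_once(idempotency_key: str, side_effect) -> bool:
    # record first seen; any retry with the same key is a no-op
    if idempotency_key in seen_keys:
        return False
    seen_keys.add(idempotency_key)
    side_effect()
    return True
```

send the same webhook body twice through this gate: exactly one side effect should fire.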
Tiny probe script
```python
from time import sleep, time

def wait_ready(checks, timeout=120, interval=2):
    start = time()
    while time() - start < timeout:
        if all(fn() for fn in checks):
            return True
        sleep(interval)
    return False

# supply your own checks:
# has_index(), has_smoke_doc(), secrets_loaded(), schema_hash_ok()
# usage:
# if not wait_ready([has_index, has_smoke_doc, secrets_loaded, schema_hash_ok]):
#     raise SystemExit("not ready, refuse to serve")
```
drop a small barrier like this in workers. refuse traffic if any check is false.
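the boot-as-a-DAG rule pairs well with the barrier: derive the startup order from an explicit graph instead of hoping services race correctly. the stdlib has a topological sorter for exactly this; the step names below are assumptions, not a prescribed graph:

```python
from graphlib import TopologicalSorter

# hypothetical boot graph: each step maps to the parents it waits on
boot_graph = {
    "storage": set(),
    "secrets": set(),
    "migrations": {"storage"},
    "ingest": {"storage", "migrations"},
    "search": {"ingest"},
    "agents": {"secrets", "search"},
    "router": {"search", "agents"},
}

# a valid startup order: no step appears before its parents
boot_order = list(TopologicalSorter(boot_graph).static_order())
```

run `wait_ready` at each step boundary and the router opens last by construction, not by luck.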
Hard fixes when minimal is not enough
- two-phase open: phase A warm and verify. phase B route a 1 percent canary. auto close if errors cross a threshold
- queue fences: consume only when `producer_version == consumer_version`. otherwise park messages in a holding queue
- migration contracts: forward and backward compatible schema with an explicit cutover time. refuse writes that straddle both worlds
- global idempotency tokens: one external event id can trigger at most one side effect across the graph
- backpressure ceilings: bounded concurrency during warmup so autoscalers do not stampede a cold dependency
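the queue fence can be sketched as a consume wrapper. the message shape and field names are assumptions about your broker payloads:

```python
holding_queue = []  # parked messages, replayed after cutover

def consume(msg: dict, consumer_version: str):
    # fence: process only version-matched messages, park the rest
    if msg.get("producer_version") != consumer_version:
        holding_queue.append(msg)
        return None
    return msg["payload"]
```

after cutover, drain `holding_queue` through the new consumer instead of dropping or double-processing mid-flight messages.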
WFGY guardrails that help here
- traceability contract: every request carries `boot_id`, `version_tag`, and the ready bits. you can prove the system was actually ready
- A B C acceptance: baseline vs with-firewall vs with-firewall plus canary. measure the first hour post-deploy. reject on empty-index queries or duplicate effects
- variance clamp for early traffic: during warmup use conservative decoding and strict tool fences. widen only after stability is proven
Acceptance targets before you call it fixed
- first search after deploy returns the smoke doc within one second and keeps stable citation ids
- duplicate external events produce exactly one side effect
- zero empty-index queries in hour one
- rollback completes without governance or rate-limit races
- three redeploys in a row show identical ready-bit order with a single boot graph in logs
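the first acceptance target can be automated as a probe you run right after deploy. `search_fn` and the result shape are assumptions about your stack, so adapt the field access:

```python
import time

def first_search_acceptance(search_fn, smoke_doc_id: str) -> bool:
    # first post-deploy search must return the smoke doc within one second
    start = time.monotonic()
    results = search_fn("smoke probe query")
    elapsed = time.monotonic() - start
    return elapsed < 1.0 and any(r["id"] == smoke_doc_id for r in results)
```

wire this into the deploy pipeline so a failing probe closes the router instead of paging a human.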
References you can use now
- Automation · Global Fix Map
- OpsDeploy · Global Fix Map
- RAG Architecture and Recovery
- Retrieval traceability
- Chunking to Embedding Contract
p.s. if you want a quick triage on a live trace, i keep an always-on “dr wfgy” mode. drop the shortest repro and i’ll map it to a No 14 fix and point at the exact page. no spam, minimal fix only.