tldr
automation looks solid in a demo, then quietly duplicates work, drops states, or loops until quotas die. most teams don’t have idempotency and contract checks wired in. this page gives a minimal set of fences you can paste into real pipelines and a target to verify they hold under load.
Full index of the series:
Global Fix Map README
https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/README.md
Why automations break after the demo
In a happy path run, every step reports “success.”
Under concurrency or retries, three things fail silently:
1) no global request identity that survives retries.
2) no once-and-only-once write contract.
3) side stores get out of sync and nobody checks drift.
Result: the same trigger fires more than once, a half-written state is treated as complete, or a self-triggered loop eats your quota.
Common failure modes
- Double-fire triggers: webhook replay or network retry executes the same job twice.
- Phantom success: upstream says ok, downstream state misses one write.
- State drift: DB updated, search index or cache not updated.
- Dead queue: retry forever without backoff, jobs pile up, then drop.
- Edge-loop recursion: flow triggers itself through a side effect.
What is actually breaking
Not Zapier. Not n8n. Not GitHub Actions. The contracts are missing.
- request identity not stable across hops
- idempotency keys not enforced on writes
- no backpressure rules for retries
- dual writes lack a commit token
- no loop detector for self-triggered flows
Minimal fixes — copyable checklist
Identity
- create a `req_id` once at the entry gateway. carry it through every step.
- add a `parent_id` if a step can spawn children. keep a flat log. a sketch follows.
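a minimal identity sketch, assuming a plain dict context; `new_context`, `spawn_child`, and the field names are illustrative:

```python
import uuid

def new_context() -> dict:
    # mint req_id once at the entry gateway; never regenerate it on retry
    return {"req_id": str(uuid.uuid4()), "parent_id": None, "hops": 0}

def spawn_child(ctx: dict) -> dict:
    # a child gets its own id but points back to its parent, flat log style
    return {"req_id": str(uuid.uuid4()), "parent_id": ctx["req_id"],
            "hops": ctx["hops"] + 1}
```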
Idempotency
- derive an `idempotency_key = H(flow_name, req_id, normalized_payload)`.
- reject or no-op when the key already exists with a completed state.
Backpressure
- retry policy: exponential backoff, cap, and a kill-switch flag.
- detect hot error rates. trip the breaker, drain, then resume. a sketch follows.
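a minimal backoff sketch, assuming the same `TemporaryError` as the webhook fence below; `bounded_retry` and the breaker flag are illustrative names:

```python
import random
import time

MAX_RETRIES = 5     # retry cap
CAP_SECONDS = 60.0  # backoff ceiling

def bounded_retry(fn, breaker_tripped=lambda: False):
    # exponential backoff with jitter, a cap, and a kill-switch flag
    for attempt in range(MAX_RETRIES):
        if breaker_tripped():
            raise RuntimeError("breaker open: drain, then resume")
        try:
            return fn()
        except TemporaryError:
            time.sleep(min(CAP_SECONDS, 2 ** attempt + random.random()))
    raise RuntimeError("retry budget exhausted")
```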
Dual write fences
- write DB first, then cache or index with a `commit_token`.
- if the second write fails, roll back using the token. a sketch follows.
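a minimal commit token sketch; the `db` and `cache` clients with `write` and `rollback` methods are assumed, not a real API:

```python
import uuid

def dual_write(record: dict) -> None:
    token = str(uuid.uuid4())             # commit_token ties both writes together
    db.write(record, commit_token=token)  # primary write lands first
    try:
        cache.write(record, commit_token=token)
    except Exception:
        db.rollback(commit_token=token)   # undo the primary write by token
        raise
```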
Loop detector
- attach a hop counter. if `hops > k`, stop and log the `loop_chain`. a sketch follows.
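a minimal loop detector sketch, reusing the `hops` counter from the identity sketch above:

```python
import logging

log = logging.getLogger("loop-detector")
MAX_HOPS = 8  # the k from the checklist; tune per flow depth

def check_hops(ctx: dict, loop_chain: list) -> None:
    # stop a flow that re-triggers itself through a side effect
    if ctx["hops"] > MAX_HOPS:
        log.error("loop detected: %s", loop_chain)
        raise RuntimeError("edge-loop recursion stopped")
```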
Reference snippets
Idempotency key builder
a runnable version; the transient field list is an assumption you tune per flow:

```python
import json
from hashlib import sha256

TRANSIENT = {"ts", "trace_id"}  # assumed transient fields, adjust per flow

def normalize(payload: dict) -> dict:
    # drop transient fields, lower-case strings, clamp float noise.
    # keep this tiny and explicit.
    return {k: v.lower() if isinstance(v, str) else round(v, 6) if isinstance(v, float) else v
            for k, v in payload.items() if k not in TRANSIENT}

def idem_key(flow_name: str, req_id: str, payload: dict) -> str:
    base = f"{flow_name}:{req_id}:{json.dumps(normalize(payload), sort_keys=True)}"
    return sha256(base.encode()).hexdigest()
```
Webhook replay fence
```python
def handle_webhook(event):
    key = idem_key("order.created", event["req_id"], event["body"])
    if store.exists(key):
        return {"status": "ok", "reason": "replay-noop"}
    try:
        result = apply_business_write(event["body"])
        store.set(key, {"done": True, "ts": now(), "result": lite(result)})
        return {"status": "ok"}
    except TemporaryError:
        retry_with_backoff(event)  # bounded, not infinite
    except Exception:  # never bare except: record the failure, then re-raise
        store.set(key, {"done": False, "ts": now(), "error": "fatal"})
        raise
```
Queue policy example (GitHub Actions)
```yaml
concurrency:
  group: order-sync-${{ github.ref }}
  cancel-in-progress: false

jobs:
  sync:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      # note: Actions has no native job-level `retries` key.
      # keep bounded retries inside the step script instead.
      - name: backoff gate
        run: ./scripts/backoff.sh --window 60 --limit 100
```
Acceptance targets
Use these to decide if your repair held.
- duplicate execution rate under load ≤ 1 percent
- ΔS drift across stores ≤ 0.40 for the same `req_id`
- queue retry convergence ≥ 0.80 over a 10k job run
- no uncontrolled recursion after 10k events
If you miss any target, instrument the fence that failed, not the whole pipeline.
One-minute self-test
- pick a flow that writes to two places, like DB and cache.
- send the same trigger three times with the same `req_id`.
- verify: single final state, both stores consistent, retries bounded. a replay sketch follows this list.
- flip the second write to fail randomly. confirm rollback leaves you clean.
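a minimal replay sketch for the middle two steps, assuming the `handle_webhook` fence above and a hypothetical `count_writes` reader:

```python
def replay_self_test(event: dict) -> None:
    # fire the same trigger three times with one req_id
    for _ in range(3):
        handle_webhook(event)
    # count_writes is a hypothetical reader over your business store
    assert count_writes(event["req_id"]) == 1, "duplicate execution leaked"
```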
Postmortem checklist
- Did the failing request carry a stable `req_id` from the first hop?
- Was an idempotency key computed from normalized fields?
- Are success and failure states both recorded under the same key?
- Is there a breaker that stops retries during hot failures?
- Can you diff DB vs cache by `req_id` and compute drift? a sketch follows.
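a minimal drift sketch for that last check, assuming hypothetical `db.get` and `cache.get` readers; this field-level ratio is a simple proxy, not the ΔS metric itself:

```python
def drift(req_id: str) -> float:
    # fraction of fields on which the two stores disagree, 0.0 to 1.0
    a = db.get(req_id) or {}
    b = cache.get(req_id) or {}
    keys = set(a) | set(b)
    return sum(a.get(k) != b.get(k) for k in keys) / len(keys) if keys else 0.0
```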