A/B tests and staged rollouts break in predictable ways:
- You optimize a single KPI (CTR/CVR/revenue), Goodhart kicks in, and you quietly destroy SLOs, safety, fairness, or policy compliance.
- You can’t replay what happened, so postmortems become “guessing + meetings.”
- The automation boundary is vague, and you only notice the damage after 100% rollout.
This post compresses the fix into one minimal pattern:
Evaluation (Goal Surface: multi-objective + constraints)
× Evidence (deterministic logs: replayable)
× Execution (safe automation: dry-run / shadow / canary / rollout)
0) Fix one failure story first
A common real incident:
- You ship new logic (recommendation/search/billing/risk/scoring/routing) via A/B.
- A single KPI (e.g., CTR) improves, so you expand the rollout.
- Meanwhile error_rate / p95_latency / cost is getting worse.
- Logs are thin, so you can’t answer “which input / which branch caused the regression.”
- You notice only at 100%, and now you can’t even roll back with confidence.
This is not three independent mistakes.
Evaluation, logging, and automation boundaries are entangled.
1) Evaluation is not a scalar: Goal Surface (multi-objective + constraints)
1.1 What is a Goal Surface?
A Goal Surface is a contract that prevents “winning” by a single number.
- Primary (what you want to improve): CVR, churn, search success rate…
- Guardrails (SLO gates): error_rate, p95_latency, crash_rate…
- Constraints (must not violate): policy violations = 0, PII leaks = 0, forbidden features = 0, (optionally) imbalance / disparate impact monitoring…
- Budgets (allowed degradation): p95 +2% max, error_rate +0.1% max…
“Win” means: Primary improves while Guardrails + Constraints hold.
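As a sketch (the names here are illustrative, not part of the policy format below): the “win” predicate is a conjunction, never a single comparison.

```python
# Hypothetical sketch: "win" = primary lift AND guardrails AND constraints.
def is_win(primary_lift: float, min_lift: float,
           guardrails_ok: bool, constraints_ok: bool) -> bool:
    return primary_lift >= min_lift and guardrails_ok and constraints_ok

# CTR lift looks great, but a guardrail broke: not a win.
assert is_win(0.004, 0.002, guardrails_ok=False, constraints_ok=True) is False
```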
1.2 Minimal Goal Surface format (JSON policy)
Keep evaluation rules outside code and pin them with policy_version.
{
  "policy_id": "exp-eval-policy",
  "policy_version": "2026-02-18",
  "primary": [
    { "metric": "conversion_rate", "direction": "up", "min_lift": 0.002 }
  ],
  "guardrails": [
    { "metric": "error_rate_5m", "op": "<=", "threshold": 0.01, "budget_delta": 0.001 },
    { "metric": "p95_latency_ms_5m", "op": "<=", "threshold": 400, "budget_delta": 20 }
  ],
  "constraints": [
    { "kind": "must_be_zero", "metric": "pii_leak_incidents" },
    { "kind": "must_be_false", "metric": "prohibited_feature_present" }
  ],
  "missing_data_policy": {
    "required_metrics": ["conversion_rate", "error_rate_5m", "p95_latency_ms_5m", "pii_leak_incidents", "prohibited_feature_present"],
    "on_missing": "DEGRADE"
  }
}
Key move: on_missing = DEGRADE.
If you can’t observe the required evidence, you stop safely instead of “rolling forward on vibes.”
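A minimal sketch of that move, assuming the policy shape above:

```python
# Sketch: required metrics not present in the snapshot force the on_missing verdict.
def missing_required(policy: dict, snapshot: dict) -> list:
    req = policy.get("missing_data_policy", {}).get("required_metrics", [])
    return [m for m in req if m not in snapshot]

policy = {"missing_data_policy": {"required_metrics": ["conversion_rate", "error_rate_5m"],
                                  "on_missing": "DEGRADE"}}
snapshot = {"conversion_rate": 0.13}  # error_rate_5m never arrived

missing = missing_required(policy, snapshot)
verdict = policy["missing_data_policy"]["on_missing"] if missing else "ACCEPT"
# verdict is "DEGRADE": hold safely instead of rolling forward on vibes
```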
2) Make failures replayable: a minimal deterministic log spec
A/B tests become painful not because they fail, but because you can’t replay what happened.
So logs are not “for humans.” Logs are for replay.
2.1 Minimal fields you should not skip
- run_id: join key for the entire rollout/experiment execution (stable across stages)
- event_id: unique per event (often run_id + stage)
- experiment_id / variant_id
- policy_id / policy_version (the contract)
- input_digest: fingerprint of the evaluation input (so you can prove “same input” later)
- state_transition: from/to (stage changes)
- decision: ACCEPT / REJECT / DEGRADE + reason codes
- evidence_refs: dashboard IDs, alert policy IDs, tickets, approvals
- metrics_snapshot: the gate evidence (at least the required metrics)
2.2 NDJSON append-only example
{"ts":"2026-02-18T10:00:00+09:00","run_id":"exp-42:B:2026-02-18","event_id":"exp-42:B:2026-02-18:CANARY_10","experiment_id":"exp-42","variant_id":"B","policy_id":"exp-eval-policy","policy_version":"2026-02-18","stage":"CANARY_10","state_transition":{"from":"SHADOW","to":"CANARY_10"},"input_digest":"sha256:...","metrics_snapshot":{"conversion_rate":0.132,"error_rate_5m":0.008,"p95_latency_ms_5m":360,"pii_leak_incidents":0,"prohibited_feature_present":false},"decision":{"verdict":"ACCEPT","reason_codes":[],"missing":[]},"evidence_refs":{"dashboard_id":"dash-foo","alert_policy_id":"alert-slo-foo","change_id":"CHG-123"}}
{"ts":"2026-02-18T10:15:00+09:00","run_id":"exp-42:B:2026-02-18","event_id":"exp-42:B:2026-02-18:CANARY_25","experiment_id":"exp-42","variant_id":"B","policy_id":"exp-eval-policy","policy_version":"2026-02-18","stage":"CANARY_25","state_transition":{"from":"CANARY_10","to":"CANARY_25"},"input_digest":"sha256:...","metrics_snapshot":{"conversion_rate":0.131,"error_rate_5m":0.013,"p95_latency_ms_5m":420,"pii_leak_incidents":0,"prohibited_feature_present":false},"decision":{"verdict":"REJECT","reason_codes":["guardrail_violation:error_rate_5m"],"missing":[]},"evidence_refs":{"dashboard_id":"dash-foo","alert_policy_id":"alert-slo-foo","change_id":"CHG-123"}}
With logs like this, you don’t “discuss.” You query.
3) Safe automation boundary: dry-run / shadow / canary / rollout
Once you have Goal Surface + deterministic logs, execution becomes a stage model.
3.1 Recommended stages
- dry-run: show plan only (do not execute)
- shadow: run in the background, collect comparisons (no user impact)
- canary: small percent (10% → 25% → 50% → 100%)
- rollout: staged expansion (auto-stop / auto-rollback on gate failures)
flowchart LR
DR[Dry-run] --> SH[Shadow]
SH --> C10[Canary 10%]
C10 --> C25[Canary 25%]
C25 --> C50[Canary 50%]
C50 --> R[Rollout 100%]
C10 -.gate fail.-> RB[Rollback/Stop]
C25 -.gate fail.-> RB
C50 -.gate fail.-> RB
R -.gate fail.-> RB
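The same diagram as a table-driven sketch (stage names match the driver in section 4; the ROLLBACK handling is illustrative):

```python
# Illustrative transition table: REJECT always falls back, DEGRADE holds in place.
NEXT_STAGE = {
    "DRY_RUN": "SHADOW",
    "SHADOW": "CANARY_10",
    "CANARY_10": "CANARY_25",
    "CANARY_25": "CANARY_50",
    "CANARY_50": "ROLLOUT_100",
}

def next_stage(stage: str, verdict: str) -> str:
    if verdict == "REJECT":
        return "ROLLBACK"   # gate fail -> rollback/stop
    if verdict == "DEGRADE":
        return stage        # hold: re-evaluate the same stage later
    return NEXT_STAGE.get(stage, "ROLLOUT_100")
```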
3.2 Stop conditions must be deterministic
- Observation is missing/stale → DEGRADE (hold)
- Guardrail violation → REJECT (stop/rollback)
- Constraint violation → REJECT (immediate stop)
This removes “judgment by mood.”
3.3 Turn DEGRADE into an operational metric (SLO)
DEGRADE is not failure. It’s a safe branch.
But if DEGRADE keeps increasing, operations get stuck.
So you manage DEGRADE with SLOs—not feelings.
3.3.1 Reason code taxonomy (aggregate-able granularity)
Avoid free-text. Prefer categories that map to “one action to fix.”
A) observation_missing:* (telemetry missing / delayed)
- observation_missing:required_metric
- observation_missing:delayed_pipeline
- observation_missing:broken_dashboard_ref
B) evidence_missing:* (audit grounds missing)
- evidence_missing:change_id
- evidence_missing:approval
- evidence_missing:ticket_link
C) precondition_unknown:* (preconditions not satisfied)
- precondition_unknown:assignment_not_stable
- precondition_unknown:sample_size_not_reached
- precondition_unknown:variant_mapping_ambiguous
3.3.2 Two SLOs that actually work
- DEGRADE rate SLO
  Example: DEGRADE / (ACCEPT + REJECT + DEGRADE) <= 1% (rolling 7d)
- Time-to-resume SLO
  Example: P95 time_to_resume <= 30m for observation issues
  Example: P95 time_to_resume <= 24h for approval waits
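Both SLOs fall straight out of the decision log; a sketch over a list of verdicts and resume times (field names follow the log spec in section 2):

```python
import math

def degrade_rate(verdicts) -> float:
    # DEGRADE / (ACCEPT + REJECT + DEGRADE) over the window.
    total = len([v for v in verdicts if v in ("ACCEPT", "REJECT", "DEGRADE")])
    return len([v for v in verdicts if v == "DEGRADE"]) / total if total else 0.0

def p95(seconds) -> float:
    # Nearest-rank P95 over observed time_to_resume values.
    s = sorted(seconds)
    return s[max(0, math.ceil(0.95 * len(s)) - 1)]

verdicts = ["ACCEPT"] * 98 + ["REJECT"] + ["DEGRADE"]
rate = degrade_rate(verdicts)   # 0.01 -> inside a 1% SLO, barely
```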
3.3.3 Make DEGRADE re-enterable (resume-friendly)
A DEGRADE event should include:
- reason_codes + missing
- resume_token (opaque is fine)
- requested_actions (typed “next steps”)
Example:
{
  "ts": "2026-02-18T10:30:00+09:00",
  "run_id": "exp-42:B:2026-02-18",
  "event_id": "exp-42:B:2026-02-18:CANARY_25",
  "stage": "CANARY_25",
  "decision": {
    "verdict": "DEGRADE",
    "reason_codes": ["observation_missing:required_metric"],
    "missing": ["p95_latency_ms_5m"]
  },
  "resume": {
    "resume_token": "opaque:rsn_7f3a9c...",
    "requested_actions": [
      {"name": "collect_metric", "params": {"metric": "p95_latency_ms_5m"}},
      {"name": "rerun_gate_check", "params": {"stage": "CANARY_25"}}
    ]
  }
}
Now DEGRADE becomes “a queue,” not “a meeting.”
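A sketch of the consuming side, assuming the event shape above: a worker turns requested_actions into typed work items that carry the resume_token.

```python
def to_work_items(degrade_event: dict) -> list:
    # Each requested_action becomes one queue entry carrying the resume_token,
    # so whoever completes it can trigger re-evaluation of the same stage.
    resume = degrade_event.get("resume", {})
    token = resume.get("resume_token")
    return [{"action": a["name"], "params": a.get("params", {}), "resume_token": token}
            for a in resume.get("requested_actions", [])]

event = {"resume": {"resume_token": "opaque:rsn_7f3a9c",
                    "requested_actions": [
                        {"name": "collect_metric", "params": {"metric": "p95_latency_ms_5m"}},
                        {"name": "rerun_gate_check", "params": {"stage": "CANARY_25"}}]}}
items = to_work_items(event)   # two work items, one resume_token
```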
4) Minimal implementation (stdlib-only): gate evaluation + append logs + replay
You can run this pattern without external libraries.
The purpose is to show the operational skeleton.
(The code below uses PEP 604 union types like float | None, so it requires Python 3.10+.)
4.1 Stable input fingerprint: input_digest (sha256)
# digest.py
from __future__ import annotations
import hashlib
import json
from typing import Any, Dict

def canonical_json(obj: Any) -> str:
    return json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))

def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def input_digest(inp: Dict[str, Any]) -> str:
    return "sha256:" + sha256_hex(canonical_json(inp))
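The point of the canonical form is that key order must not change the fingerprint. A self-contained check (inlining the two helpers above):

```python
import hashlib
import json

def canonical_json(obj) -> str:
    # sort_keys + fixed separators -> byte-identical output for equal dicts.
    return json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))

def input_digest(inp: dict) -> str:
    return "sha256:" + hashlib.sha256(canonical_json(inp).encode("utf-8")).hexdigest()

a = input_digest({"stage": "CANARY_10", "variant_id": "B"})
b = input_digest({"variant_id": "B", "stage": "CANARY_10"})
assert a == b                    # same input, same fingerprint
assert a.startswith("sha256:")
```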
4.2 Gate evaluation: Goal Surface → Verdict
Important: the returned ACCEPT/REJECT/DEGRADE is an operational gate, not “statistical winner.”
- ACCEPT = guardrails/constraints are satisfied; safe to proceed to the next stage
- REJECT = stop/rollback
- DEGRADE = hold (missing observation/evidence/preconditions)
# gate_eval.py
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Tuple

Verdict = str  # "ACCEPT" | "REJECT" | "DEGRADE"

@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    reason_codes: Tuple[str, ...]
    missing: Tuple[str, ...]

def _missing_required(policy: Dict[str, Any], metrics: Dict[str, Any]) -> List[str]:
    req = policy.get("missing_data_policy", {}).get("required_metrics", [])
    out: List[str] = []
    for k in req:
        if k not in metrics:
            out.append(k)
    return out

def evaluate(policy: Dict[str, Any], metrics: Dict[str, Any]) -> Decision:
    mdp = policy.get("missing_data_policy", {})

    # Wrapper-friendly normalization:
    # - {"metrics": {...}, "_meta": {...}} OR a flat dict.
    snapshot = metrics if isinstance(metrics, dict) else {}
    if isinstance(snapshot.get("metrics"), dict):
        data = snapshot["metrics"]
        # Copy so the caller's _meta dict is never mutated.
        meta = dict(snapshot["_meta"]) if isinstance(snapshot.get("_meta"), dict) else {}
        # Pull a few optional fields up if present (minimal).
        for k in ("freshness_seconds", "watermark_event_time", "sources"):
            if k in snapshot and k not in meta:
                meta[k] = snapshot[k]
    else:
        data = snapshot
        meta = snapshot.get("_meta") if isinstance(snapshot.get("_meta"), dict) else {}

    # 0) Missing required metrics → DEGRADE/REJECT per policy
    missing = _missing_required(policy, data)
    if missing:
        on_missing = mdp.get("on_missing", "DEGRADE")
        rc = ("observation_missing:required_metric",)
        if on_missing == "REJECT":
            return Decision("REJECT", rc, tuple(missing))
        return Decision("DEGRADE", rc, tuple(missing))

    # 0.5) Freshness gate (optional)
    max_fresh = mdp.get("max_freshness_seconds")
    if max_fresh is not None:
        fs = meta.get("freshness_seconds")
        if fs is None:
            on_stale = mdp.get("on_stale", "DEGRADE")
            rc = ("observation_missing:freshness_seconds",)
            if on_stale == "REJECT":
                return Decision("REJECT", rc, ("freshness_seconds",))
            return Decision("DEGRADE", rc, ("freshness_seconds",))
        try:
            fs_v = float(fs)
            thr_v = float(max_fresh)
        except (TypeError, ValueError):
            return Decision("DEGRADE", ("observation_invalid:freshness_seconds",), ("freshness_seconds",))
        if fs_v > thr_v:
            on_stale = mdp.get("on_stale", "DEGRADE")
            rc = ("observation_stale:over_freshness_budget",)
            if on_stale == "REJECT":
                return Decision("REJECT", rc, ("freshness_seconds",))
            return Decision("DEGRADE", rc, ("freshness_seconds",))

    # 0.6) Watermark skew gate (optional): sources[*].watermark_event_time
    max_skew = mdp.get("max_watermark_skew_seconds")
    if max_skew is not None:
        def _parse_iso(ts: Any) -> float | None:
            if isinstance(ts, str):
                try:
                    return datetime.fromisoformat(ts).timestamp()
                except ValueError:
                    return None
            return None

        wms: List[float] = []
        sources = meta.get("sources")
        if isinstance(sources, list):
            for s in sources:
                if isinstance(s, dict):
                    t = _parse_iso(s.get("watermark_event_time"))
                    if t is not None:
                        wms.append(t)
        if len(wms) < 2:
            return Decision("DEGRADE", ("observation_missing:watermark_event_time",), ("watermark_event_time",))
        skew = float(max(wms) - min(wms))
        if skew > float(max_skew):
            return Decision("DEGRADE", ("observation_stale:watermark_skew",), ("watermark_event_time",))

    # 1) Constraints (must not violate)
    reasons: List[str] = []
    for c in policy.get("constraints", []):
        kind = c.get("kind")
        metric = c.get("metric")
        if not metric:
            return Decision("REJECT", ("invalid_constraint:missing_metric",), ())
        if metric not in data:
            return Decision("DEGRADE", ("observation_missing:constraint_metric",), (metric,))
        v = data[metric]
        if kind == "must_be_zero":
            try:
                if float(v) != 0.0:
                    reasons.append(f"constraint_violation:{metric}")
            except (TypeError, ValueError):
                return Decision("DEGRADE", ("observation_invalid:constraint_value",), (metric,))
        elif kind == "must_be_false":
            if not isinstance(v, bool):
                return Decision("DEGRADE", ("observation_invalid:constraint_value",), (metric,))
            if v is True:
                reasons.append(f"constraint_violation:{metric}")
        else:
            return Decision("REJECT", (f"unknown_constraint_kind:{kind}",), ())
    if reasons:
        return Decision("REJECT", tuple(reasons), ())

    # 2) Guardrails (SLO gates)
    for g in policy.get("guardrails", []):
        m = g["metric"]
        op = g["op"]
        thr = g["threshold"]
        if m not in data:
            return Decision("DEGRADE", ("observation_missing:guardrail_metric",), (m,))
        try:
            v = float(data[m])
            t = float(thr)
        except (TypeError, ValueError):
            return Decision("DEGRADE", ("observation_invalid:guardrail_value",), (m,))
        ok = True
        if op == "<=":
            ok = v <= t
        elif op == "<":
            ok = v < t
        elif op == ">=":
            ok = v >= t
        elif op == ">":
            ok = v > t
        else:
            return Decision("REJECT", (f"unknown_guardrail_op:{op}",), ())
        if not ok:
            return Decision("REJECT", (f"guardrail_violation:{m}",), ())

    # 3) Primary “winner” belongs to a separate stats layer.
    return Decision("ACCEPT", (), ())
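A self-contained check mirroring the guardrail branch above (simplified; reason-code strings follow the taxonomy in 3.3.1), applied to the numbers from the NDJSON example in section 2.2:

```python
import operator

OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}

def check_guardrails(guardrails, data):
    # First missing metric -> DEGRADE; first violated threshold -> REJECT.
    for g in guardrails:
        m = g["metric"]
        if m not in data:
            return ("DEGRADE", f"observation_missing:guardrail_metric:{m}")
        if not OPS[g["op"]](float(data[m]), float(g["threshold"])):
            return ("REJECT", f"guardrail_violation:{m}")
    return ("ACCEPT", "")

guardrails = [
    {"metric": "error_rate_5m", "op": "<=", "threshold": 0.01},
    {"metric": "p95_latency_ms_5m", "op": "<=", "threshold": 400},
]
# CANARY_25 snapshot from section 2.2: error_rate_5m crossed the threshold.
verdict = check_guardrails(guardrails, {"error_rate_5m": 0.013, "p95_latency_ms_5m": 420})
# -> ("REJECT", "guardrail_violation:error_rate_5m")
```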
4.3 Append-only NDJSON logging
# append_log.py
from __future__ import annotations
import json
from datetime import datetime, timezone, timedelta
from pathlib import Path
from typing import Any, Dict

JST = timezone(timedelta(hours=9))

def now_iso() -> str:
    return datetime.now(JST).isoformat()

def append_ndjson(path: Path, event: Dict[str, Any]) -> None:
    line = json.dumps(event, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")
4.4 Stage driver (minimal)
# rollout_driver.py
from __future__ import annotations
import json
from pathlib import Path
from typing import Any, Dict

from digest import input_digest
from gate_eval import evaluate
from append_log import append_ndjson, now_iso

def load_json(path: Path) -> Dict[str, Any]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)

def main() -> int:
    policy = load_json(Path("policy.json"))
    log_path = Path("logs/decision.ndjson")
    experiment_id = "exp-42"
    variant_id = "B"
    stages = ["DRY_RUN", "SHADOW", "CANARY_10", "CANARY_25", "CANARY_50", "ROLLOUT_100"]
    prev_stage = None
    for stage in stages:
        # In production: fetch metrics snapshot from Prometheus/Cloud Monitoring/warehouse.
        metrics = load_json(Path(f"snapshots/{stage}.json"))
        inp = {
            "experiment_id": experiment_id,
            "variant_id": variant_id,
            "stage": stage,
            "policy_id": policy["policy_id"],
            "policy_version": policy["policy_version"],
            "metrics_snapshot": metrics,
        }
        d = evaluate(policy, metrics)
        run_id = f"{experiment_id}:{variant_id}:{policy['policy_version']}"
        event_id = f"{run_id}:{stage}"
        event = {
            "ts": now_iso(),
            "run_id": run_id,
            "event_id": event_id,
            "experiment_id": experiment_id,
            "variant_id": variant_id,
            "policy_id": policy["policy_id"],
            "policy_version": policy["policy_version"],
            "stage": stage,
            "state_transition": {"from": prev_stage, "to": stage},
            "input_digest": input_digest(inp),
            "metrics_snapshot": metrics,
            "decision": {"verdict": d.verdict, "reason_codes": list(d.reason_codes), "missing": list(d.missing)},
            "evidence_refs": {"dashboard_id": "dash-foo", "alert_policy_id": "alert-slo-foo", "change_id": "CHG-123"},
        }
        append_ndjson(log_path, event)
        if d.verdict == "REJECT":
            print(f"[STOP] stage={stage} reasons={d.reason_codes}")
            return 1
        if d.verdict == "DEGRADE":
            print(f"[HOLD] stage={stage} missing={d.missing}")
            return 2
        prev_stage = stage
        print(f"[OK] stage={stage}")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
4.5 Replay (“what happened?” becomes queryable)
# replay.py
from __future__ import annotations
import json
from pathlib import Path
from typing import Any, Dict, Iterable

def read_ndjson(path: Path) -> Iterable[Dict[str, Any]]:
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def main() -> int:
    path = Path("logs/decision.ndjson")
    events = list(read_ndjson(path))
    rejects = [e for e in events if e.get("decision", {}).get("verdict") == "REJECT"]
    for e in rejects:
        print(json.dumps({
            "ts": e.get("ts"),
            "run_id": e.get("run_id"),
            "event_id": e.get("event_id"),
            "stage": e.get("stage"),
            "reason_codes": e.get("decision", {}).get("reason_codes", []),
            "policy_version": e.get("policy_version"),
        }, ensure_ascii=False))
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
Now postmortems are replayable.
4.6 The missing link from PoC to production (the real constraints)
The minimal code above hides the hardest parts. This section makes them explicit.
4.6.1 Treat metrics_snapshot as evidence (with freshness)
The hard part is aligning multiple sources (Prometheus/logs/warehouse) to the “same window.” In practice, perfect alignment is unrealistic.
So treat the snapshot as evidence with meta:
- collected_at, window
- watermark_event_time (event-time “how far data is complete”), per-source watermarks
- freshness_seconds (how late is this snapshot?)
Then gate freshness in policy:
{
  "missing_data_policy": {
    "required_metrics": ["conversion_rate", "error_rate_5m", "p95_latency_ms_5m"],
    "on_missing": "DEGRADE",
    "max_freshness_seconds": 120,
    "max_watermark_skew_seconds": 60,
    "on_stale": "DEGRADE"
  }
}
Freshness DEGRADE reason codes become SLO targets:
- observation_stale:over_freshness_budget
- observation_stale:watermark_skew
- observation_missing:required_metric
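A sketch of the skew computation, assuming per-source watermark_event_time strings in ISO 8601:

```python
from datetime import datetime

def watermark_skew_seconds(sources):
    # Seconds between the most- and least-complete sources;
    # None when fewer than two watermarks are observable.
    ts = [datetime.fromisoformat(s["watermark_event_time"]).timestamp()
          for s in sources if isinstance(s, dict) and "watermark_event_time" in s]
    if len(ts) < 2:
        return None
    return max(ts) - min(ts)

sources = [
    {"name": "prometheus", "watermark_event_time": "2026-02-18T10:15:00+09:00"},
    {"name": "warehouse", "watermark_event_time": "2026-02-18T10:13:30+09:00"},
]
skew = watermark_skew_seconds(sources)                 # 90.0 seconds
verdict = "DEGRADE" if skew is None or skew > 60 else "ACCEPT"
```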
4.6.2 Don’t mix “promotion (winner)” with “safety gate”
The gate evaluator above answers: “Is it safe to proceed?”
It does not answer: “Did B win statistically?”
Separate layers:
- Safety gate: Guardrails + Constraints (canary progression)
- Promotion gate: statistics / sample size / sequential testing (100% rollout)
When promotion isn’t ready, DEGRADE (precondition unknown):
- precondition_unknown:sample_size_not_reached
- precondition_unknown:assignment_not_stable
- precondition_unknown:significance_not_ready
Policy can make this explicit:
{
  "promotion_gate": {
    "required_for_stages": ["ROLLOUT_100"],
    "min_sample_size": 100000,
    "require_ci_low_gt": 0.0,
    "on_not_ready": "DEGRADE"
  }
}
4.6.3 Rollback atomicity: automate only reversible changes
Auto-rollback is safest for:
- feature flags / routing (stateless switches)
- backward-compatible config
Auto-rollback is dangerous for:
- schema changes
- irreversible migrations
- external side effects (emails, billing)
So classify changes:
- reversible → allow auto rollout
- stateful_or_irreversible → dry-run/shadow only; require human approval after DEGRADE
- unsafe_for_experiment → don’t A/B it
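A sketch of that classification as a gate (the class names follow the list above; the mapping itself is illustrative):

```python
# Illustrative mapping from change class to the stages automation may enter.
AUTO_ALLOWED = {
    "reversible": {"DRY_RUN", "SHADOW", "CANARY_10", "CANARY_25", "CANARY_50", "ROLLOUT_100"},
    "stateful_or_irreversible": {"DRY_RUN", "SHADOW"},   # human approval beyond shadow
    "unsafe_for_experiment": set(),                      # don't A/B it
}

def automation_allowed(change_class: str, stage: str) -> bool:
    # Unknown classes get no automation at all (fail safe).
    return stage in AUTO_ALLOWED.get(change_class, set())
```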
4.6.4 Align by watermark contract, not “perfect 5-min windows”
Don’t chase perfect synchronization across sources. Use an explicit watermark contract:
- watermark_event_time says “complete up to here”
- allow only bounded skew; otherwise DEGRADE
This makes “data pipeline delay” a designed state, not a surprise.
4.6.5 Implement the driver as an orchestrator (state machine)
Real rollouts run for hours/days. Don’t use a long-running loop.
Use:
- scheduler (cron/workflow engine), or
- event-driven triggers (snapshot arrival, approval granted, alert fired)
Split the “driver” into two layers. This small separation makes the implementation stable.
4.6.5.1 Split into two layers: Orchestrator vs Evaluator
In production, the driver is much more stable if you separate:
(A) Orchestrator (state + re-entry management)
Persist run_id state and manage “when this run can be evaluated again.”
Typical fields:
- current stage
- resume_token (pointer needed to resume)
- last snapshot_id / snapshot_digest
- last decision (ACCEPT/REJECT/DEGRADE + reasons)
- next evaluation condition (e.g., fresh snapshot arrival / approval granted / time elapsed)
(B) Evaluator (the deterministic judge)
This is the pure-ish part: evaluate(policy, metrics_snapshot) -> verdict.
Keep it as close to a pure function as possible.
This separation removes the unrealistic assumptions of a “single long-running process”:
crashes, restarts, duplicated execution, and partial progress become normal—and safe.
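A sketch of the orchestrator’s persisted record (field names are illustrative): everything needed to resume after a crash lives in the state, not in process memory.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RunState:
    run_id: str
    stage: str
    last_verdict: str          # ACCEPT / REJECT / DEGRADE
    last_snapshot_digest: str
    resume_token: str          # "" when nothing is pending
    next_eval_condition: str   # e.g. "on_fresh_snapshot" / "on_approval" / ISO deadline

def save(state: RunState) -> str:
    # In production this row would go to a DB; JSON keeps the sketch stdlib-only.
    return json.dumps(asdict(state), sort_keys=True)

def load(raw: str) -> RunState:
    return RunState(**json.loads(raw))

s = RunState("exp-42:B:2026-02-18", "CANARY_25", "DEGRADE",
             "sha256:abc", "opaque:rsn_7f3a9c", "on_fresh_snapshot")
assert load(save(s)) == s   # state survives a process restart
```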
4.6.5.2 DEGRADE is not “stop”—it’s “stop in a re-enterable way”
To make DEGRADE operable, you must be able to re-evaluate the same stage under the same rules after missing grounds are resolved.
So on DEGRADE, always persist (and log):
- reason_codes / missing
- resume_token (opaque is fine; a DB pointer is often enough)
- requested_actions (machine-readable next steps)
Then the Orchestrator can trigger re-evaluation when:
- a fresh snapshot arrives
- an approval is granted
- a deadline passes (scheduled re-check)
DEGRADE becomes a queue with re-entry—not a meeting.
4.6.6 Make Promotion Gate mandatory for 100% rollout (not only safety gates)
As discussed in 4.6.2, gate_eval answers:
“Is it safe to proceed?”
It does not answer:
“Did the variant actually win (business decision)?”
The side effect of this separation is real:
If you reach ROLLOUT_100 with “safe to proceed” only, you can end up fully rolling out a variant that is slightly worse (statistical noise, but real business loss).
The minimal practical fix is: requirements differ by stage.
4.6.6.1 Canary uses Safety Gate; 100% requires Promotion Gate
- CANARY (10/25/50)
  - Required: Safety Gate (Guardrails / Constraints)
  - Statistics: optional (log as signal)
- ROLLOUT_100
  - Required: Safety Gate (of course)
  - Required: Promotion Gate (winning condition)
Treat Promotion Gate as a separate layer and include “promotion evidence” inside the snapshot, for example:
- sample_size
- effect_estimate
- ci_low / ci_high (confidence interval)
- p_value (if needed)
- sequential_test_state (for sequential tests)
If Promotion Gate is not satisfied, fail safe with DEGRADE:
- precondition_unknown:sample_size_not_reached
- precondition_unknown:significance_not_ready
- precondition_unknown:assignment_not_stable
4.6.6.2 Make “100% promotion conditions” explicit in policy
Operationally, the clearest approach is to express promotion conditions directly in policy:
{
  "promotion_gate": {
    "required_for_stages": ["ROLLOUT_100"],
    "min_sample_size": 100000,
    "require_ci_low_gt": 0.0,
    "on_not_ready": "DEGRADE"
  }
}
Now “Safety Gate passed, but we are not winning, so we do not go to 100%” becomes a deterministic decision—not vibes.
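A sketch of that check, assuming the promotion_gate shape above and promotion evidence carried inside the snapshot (the REJECT reason code is illustrative):

```python
def promotion_verdict(gate: dict, evidence: dict):
    # Not enough evidence yet -> DEGRADE (fail safe, re-check later).
    if evidence.get("sample_size", 0) < gate["min_sample_size"]:
        return ("DEGRADE", "precondition_unknown:sample_size_not_reached")
    ci_low = evidence.get("ci_low")
    if ci_low is None:
        return ("DEGRADE", "precondition_unknown:significance_not_ready")
    # Winning condition: lower CI bound strictly above the policy threshold.
    if ci_low > gate["require_ci_low_gt"]:
        return ("ACCEPT", "")
    return ("REJECT", "promotion_not_won:ci_low_not_above_threshold")

gate = {"min_sample_size": 100000, "require_ci_low_gt": 0.0, "on_not_ready": "DEGRADE"}
v1 = promotion_verdict(gate, {"sample_size": 42000})
# -> ("DEGRADE", "precondition_unknown:sample_size_not_reached")
v2 = promotion_verdict(gate, {"sample_size": 120000, "ci_low": 0.0004})
# -> ("ACCEPT", "")
```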
Tip:
- Canary stages stop on safety.
- 100% promotion stops on winning conditions (Promotion Gate).
This two-layer stop mechanism makes the whole rollout dramatically easier to operate.
5) Closing: three pieces, one operable system
- Goal Surface prevents single-metric Goodhart.
- Deterministic logs make incidents replayable.
- Safe automation makes staged rollout enforceable with REJECT/DEGRADE gates.
Ultimately, turning experiments into operations is not about “smartness.”
It’s about determinism—contracts, evidence, and safe boundaries.