DEV Community

kanaria007


Make A/B Tests Operable: Goal Surface, Deterministic Logs, Safe Automation (dry-run / shadow / canary / rollout)

A/B tests and staged rollouts break in predictable ways:

  • You optimize a single KPI (CTR/CVR/revenue), Goodhart kicks in, and you quietly destroy SLOs, safety, fairness, or policy compliance.
  • You can’t replay what happened, so postmortems become “guessing + meetings.”
  • The automation boundary is vague, and you only notice the damage after 100% rollout.

This post compresses the fix into one minimal pattern:

Evaluation (Goal Surface: multi-objective + constraints)
× Evidence (deterministic logs: replayable)
× Execution (safe automation: dry-run / shadow / canary / rollout)


0) Fix one failure story first

A common real incident:

  1. You ship new logic (recommendation/search/billing/risk/scoring/routing) via A/B.
  2. A single KPI (e.g., CTR) improves, so you expand the rollout.
  3. Meanwhile error_rate / p95_latency / cost is getting worse.
  4. Logs are thin, so you can’t answer “which input / which branch caused the regression.”
  5. You notice only at 100%, and now you can’t even roll back with confidence.

This is not three independent mistakes.
Evaluation, logging, and automation boundaries are entangled.


1) Evaluation is not a scalar: Goal Surface (multi-objective + constraints)

1.1 What is a Goal Surface?

A Goal Surface is a contract that prevents “winning” by a single number.

  • Primary (what you want to improve): CVR, churn, search success rate…
  • Guardrails (SLO gates): error_rate, p95_latency, crash_rate…
  • Constraints (must not violate): policy violations = 0, PII leaks = 0, forbidden features = 0, (optionally) imbalance / disparate impact monitoring…
  • Budgets (allowed degradation): p95 +2% max, error_rate +0.1% max…

“Win” means: Primary improves while Guardrails + Constraints hold.
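In code, that definition of "win" is a conjunction, not a single comparison. A minimal sketch (the lift value and gate verdict are assumed inputs for illustration, not part of any API defined in this post):

```python
def is_win(primary_lift: float, min_lift: float, gate_verdict: str) -> bool:
    """'Win' = primary improves by at least min_lift AND the safety gate holds.

    gate_verdict is the operational verdict ("ACCEPT"/"REJECT"/"DEGRADE")
    from evaluating guardrails + constraints.
    """
    return primary_lift >= min_lift and gate_verdict == "ACCEPT"

# A +0.3% lift with a clean gate wins; the same lift with a breach does not.
assert is_win(0.003, 0.002, "ACCEPT") is True
assert is_win(0.003, 0.002, "REJECT") is False
assert is_win(0.001, 0.002, "ACCEPT") is False
```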

1.2 Minimal Goal Surface format (JSON policy)

Keep evaluation rules outside code and pin them with policy_version.

{
  "policy_id": "exp-eval-policy",
  "policy_version": "2026-02-18",
  "primary": [
    { "metric": "conversion_rate", "direction": "up", "min_lift": 0.002 }
  ],
  "guardrails": [
    { "metric": "error_rate_5m", "op": "<=", "threshold": 0.01, "budget_delta": 0.001 },
    { "metric": "p95_latency_ms_5m", "op": "<=", "threshold": 400, "budget_delta": 20 }
  ],
  "constraints": [
    { "kind": "must_be_zero", "metric": "pii_leak_incidents" },
    { "kind": "must_be_false", "metric": "prohibited_feature_present" }
  ],
  "missing_data_policy": {
    "required_metrics": ["conversion_rate","error_rate_5m","p95_latency_ms_5m","pii_leak_incidents","prohibited_feature_present"],
    "on_missing": "DEGRADE"
  }
}

Key move: on_missing = DEGRADE.
If you can’t observe the required evidence, you stop safely instead of “rolling forward on vibes.”


2) Make failures replayable: a minimal deterministic log spec

A/B tests become painful not because they fail, but because you can’t replay what happened.

So logs are not “for humans.” Logs are for replay.

2.1 Minimal fields you should not skip

  • run_id: join key for the entire rollout/experiment execution (stable across stages)
  • event_id: unique per event (often run_id + stage)
  • experiment_id / variant_id
  • policy_id / policy_version (the contract)
  • input_digest: fingerprint of the evaluation input (so you can prove “same input” later)
  • state_transition: from/to (stage changes)
  • decision: ACCEPT / REJECT / DEGRADE + reason codes
  • evidence_refs: dashboard IDs, alert policy IDs, tickets, approvals
  • metrics_snapshot: the gate evidence (at least required metrics)
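One cheap way to enforce this list is a schema check before the event ever hits the log. A minimal sketch (field names follow the bullets above):

```python
# Replay-critical fields: if any of these is absent, the event cannot be
# joined, attributed, or replayed later.
REQUIRED_EVENT_FIELDS = (
    "run_id", "event_id", "experiment_id", "variant_id",
    "policy_id", "policy_version", "input_digest",
    "state_transition", "decision", "evidence_refs", "metrics_snapshot",
)

def missing_event_fields(event: dict) -> list:
    """Return the replay-critical fields absent from a log event."""
    return [k for k in REQUIRED_EVENT_FIELDS if k not in event]

# An event without input_digest is not replayable; reject it at write time.
assert "input_digest" in missing_event_fields({"run_id": "r", "event_id": "e"})
```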

2.2 NDJSON append-only example

{"ts":"2026-02-18T10:00:00+09:00","run_id":"exp-42:B:2026-02-18","event_id":"exp-42:B:2026-02-18:CANARY_10","experiment_id":"exp-42","variant_id":"B","policy_id":"exp-eval-policy","policy_version":"2026-02-18","stage":"CANARY_10","state_transition":{"from":"SHADOW","to":"CANARY_10"},"input_digest":"sha256:...","metrics_snapshot":{"conversion_rate":0.132,"error_rate_5m":0.008,"p95_latency_ms_5m":360,"pii_leak_incidents":0,"prohibited_feature_present":false},"decision":{"verdict":"ACCEPT","reason_codes":[],"missing":[]},"evidence_refs":{"dashboard_id":"dash-foo","alert_policy_id":"alert-slo-foo","change_id":"CHG-123"}}
{"ts":"2026-02-18T10:15:00+09:00","run_id":"exp-42:B:2026-02-18","event_id":"exp-42:B:2026-02-18:CANARY_25","experiment_id":"exp-42","variant_id":"B","policy_id":"exp-eval-policy","policy_version":"2026-02-18","stage":"CANARY_25","state_transition":{"from":"CANARY_10","to":"CANARY_25"},"input_digest":"sha256:...","metrics_snapshot":{"conversion_rate":0.131,"error_rate_5m":0.013,"p95_latency_ms_5m":420,"pii_leak_incidents":0,"prohibited_feature_present":false},"decision":{"verdict":"REJECT","reason_codes":["guardrail_violation:error_rate_5m"],"missing":[]},"evidence_refs":{"dashboard_id":"dash-foo","alert_policy_id":"alert-slo-foo","change_id":"CHG-123"}}

With logs like this, you don’t “discuss.” You query.


3) Safe automation boundary: dry-run / shadow / canary / rollout

Once you have Goal Surface + deterministic logs, execution becomes a stage model.

3.1 Recommended stages

  • dry-run: show plan only (do not execute)
  • shadow: run in the background, collect comparisons (no user impact)
  • canary: small percent (10% → 25% → 50% → 100%)
  • rollout: staged expansion (auto-stop / auto-rollback on gate failures)
flowchart LR
  DR[Dry-run] --> SH[Shadow]
  SH --> C10[Canary 10%]
  C10 --> C25[Canary 25%]
  C25 --> C50[Canary 50%]
  C50 --> R[Rollout 100%]
  C10 -.gate fail.-> RB[Rollback/Stop]
  C25 -.gate fail.-> RB
  C50 -.gate fail.-> RB
  R -.gate fail.-> RB

3.2 Stop conditions must be deterministic

  • Observation is missing/stale → DEGRADE (hold)
  • Guardrail violation → REJECT (stop/rollback)
  • Constraint violation → REJECT (immediate stop)

This removes “judgment by mood.”
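Deterministic here means the verdict-to-action mapping is a fixed lookup with no operator in the loop. A minimal sketch (the action names are illustrative):

```python
# Verdict -> action is a fixed table, not a judgment call.
STOP_ACTIONS = {
    "ACCEPT": "proceed_to_next_stage",
    "DEGRADE": "hold_and_wait_for_evidence",  # safe branch, re-enterable
    "REJECT": "stop_and_rollback",            # guardrail/constraint breach
}

def next_action(verdict: str) -> str:
    # Unknown verdicts fail safe: treat them like missing evidence.
    return STOP_ACTIONS.get(verdict, "hold_and_wait_for_evidence")

assert next_action("REJECT") == "stop_and_rollback"
assert next_action("???") == "hold_and_wait_for_evidence"
```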


3.3 Turn DEGRADE into an operational metric (SLO)

DEGRADE is not failure. It’s a safe branch.
But if DEGRADE keeps increasing, operations get stuck.

So you manage DEGRADE with SLOs—not feelings.

3.3.1 Reason code taxonomy (aggregate-able granularity)

Avoid free-text. Prefer categories that map to “one action to fix.”

A) observation_missing:* (telemetry missing / delayed)

  • observation_missing:required_metric
  • observation_missing:delayed_pipeline
  • observation_missing:broken_dashboard_ref

B) evidence_missing:* (audit grounds missing)

  • evidence_missing:change_id
  • evidence_missing:approval
  • evidence_missing:ticket_link

C) precondition_unknown:* (preconditions not satisfied)

  • precondition_unknown:assignment_not_stable
  • precondition_unknown:sample_size_not_reached
  • precondition_unknown:variant_mapping_ambiguous
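Because every code is `category:detail`, aggregation is a prefix split. A minimal sketch over the taxonomy above:

```python
from collections import Counter

def by_category(reason_codes) -> Counter:
    """Group reason codes by top-level category (the part before ':')."""
    return Counter(code.split(":", 1)[0] for code in reason_codes)

codes = [
    "observation_missing:required_metric",
    "observation_missing:delayed_pipeline",
    "evidence_missing:approval",
]
assert by_category(codes) == {"observation_missing": 2, "evidence_missing": 1}
```

Each category then maps to one owner and one class of fix (unblock the pipeline, chase the approval, fix the assignment), which is exactly what free-text reasons cannot do.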

3.3.2 Two SLOs that actually work

  1. DEGRADE rate SLO
    Example: DEGRADE / (ACCEPT + REJECT + DEGRADE) <= 1% (rolling 7d)

  2. Time-to-resume SLO
    Example: P95 time_to_resume <= 30m for observation issues
    Example: P95 time_to_resume <= 24h for approval waits
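Both SLOs are computable straight from the decision log. A minimal sketch (thresholds follow the examples above; a per-event resume duration in seconds is an assumed input, not a field defined earlier):

```python
def degrade_rate(verdicts) -> float:
    """DEGRADE / (ACCEPT + REJECT + DEGRADE) over a window of decisions."""
    verdicts = list(verdicts)
    return verdicts.count("DEGRADE") / len(verdicts) if verdicts else 0.0

def p95(values) -> float:
    """Nearest-rank P95; good enough for an SLO check, no interpolation."""
    s = sorted(values)
    return s[max(0, int(0.95 * len(s)) - 1)] if s else 0.0

# 1 DEGRADE out of 100 decisions stays within the 1% rolling-window SLO.
assert degrade_rate(["ACCEPT"] * 99 + ["DEGRADE"]) <= 0.01
# Resume times in seconds: P95 within the 30-minute observation-issue SLO.
assert p95([60, 120, 300, 1500, 1700]) <= 30 * 60
```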

3.3.3 Make DEGRADE re-enterable (resume-friendly)

A DEGRADE event should include:

  • reason_codes + missing
  • resume_token (opaque is fine)
  • requested_actions (typed “next steps”)

Example:

{
  "ts": "2026-02-18T10:30:00+09:00",
  "run_id": "exp-42:B:2026-02-18",
  "event_id": "exp-42:B:2026-02-18:CANARY_25",
  "stage": "CANARY_25",
  "decision": {
    "verdict": "DEGRADE",
    "reason_codes": ["observation_missing:required_metric"],
    "missing": ["p95_latency_ms_5m"]
  },
  "resume": {
    "resume_token": "opaque:rsn_7f3a9c...",
    "requested_actions": [
      {"name": "collect_metric", "params": {"metric": "p95_latency_ms_5m"}},
      {"name": "rerun_gate_check", "params": {"stage": "CANARY_25"}}
    ]
  }
}

Now DEGRADE becomes “a queue,” not “a meeting.”


4) Minimal implementation (stdlib-only): gate evaluation + append logs + replay

You can run this pattern without external libraries.
The purpose is to show the operational skeleton.

(Code below uses PEP 604 union types like float | None, so it requires Python 3.10+.)

4.1 Stable input fingerprint: input_digest (sha256)

# digest.py
from __future__ import annotations

import hashlib
import json
from typing import Any, Dict

def canonical_json(obj: Any) -> str:
    return json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))

def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def input_digest(inp: Dict[str, Any]) -> str:
    return "sha256:" + sha256_hex(canonical_json(inp))

4.2 Gate evaluation: Goal Surface → Verdict

Important: the returned ACCEPT/REJECT/DEGRADE is an operational gate, not “statistical winner.”

  • ACCEPT = guardrails/constraints are satisfied; safe to proceed to next stage
  • REJECT = stop/rollback
  • DEGRADE = hold (missing observation/evidence/preconditions)
# gate_eval.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Tuple

Verdict = str  # "ACCEPT" | "REJECT" | "DEGRADE"

@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    reason_codes: Tuple[str, ...]
    missing: Tuple[str, ...]

def _missing_required(policy: Dict[str, Any], metrics: Dict[str, Any]) -> List[str]:
    req = policy.get("missing_data_policy", {}).get("required_metrics", [])
    out: List[str] = []
    for k in req:
        if k not in metrics:
            out.append(k)
    return out

def evaluate(policy: Dict[str, Any], metrics: Dict[str, Any]) -> Decision:
    mdp = policy.get("missing_data_policy", {})

    # Wrapper-friendly normalization:
    # - {"metrics": {...}, "_meta": {...}} OR a flat dict.
    snapshot = metrics if isinstance(metrics, dict) else {}
    if isinstance(snapshot.get("metrics"), dict):
        data = snapshot["metrics"]
        meta = snapshot.get("_meta") if isinstance(snapshot.get("_meta"), dict) else {}
        # Pull a few optional fields up if present (minimal).
        for k in ("freshness_seconds", "watermark_event_time", "sources"):
            if k in snapshot and k not in meta:
                meta[k] = snapshot[k]
    else:
        data = snapshot
        meta = snapshot.get("_meta") if isinstance(snapshot.get("_meta"), dict) else {}

    # 0) Missing required metrics → DEGRADE/REJECT per policy
    missing = _missing_required(policy, data)
    if missing:
        on_missing = mdp.get("on_missing", "DEGRADE")
        rc = ("observation_missing:required_metric",)
        if on_missing == "REJECT":
            return Decision("REJECT", rc, tuple(missing))
        return Decision("DEGRADE", rc, tuple(missing))

    # 0.5) Freshness gate (optional)
    max_fresh = mdp.get("max_freshness_seconds")
    if max_fresh is not None:
        fs = meta.get("freshness_seconds")
        if fs is None:
            on_stale = mdp.get("on_stale", "DEGRADE")
            rc = ("observation_missing:freshness_seconds",)
            if on_stale == "REJECT":
                return Decision("REJECT", rc, ("freshness_seconds",))
            return Decision("DEGRADE", rc, ("freshness_seconds",))
        try:
            fs_v = float(fs)
            thr_v = float(max_fresh)
        except (TypeError, ValueError):
            return Decision("DEGRADE", ("observation_invalid:freshness_seconds",), ("freshness_seconds",))
        if fs_v > thr_v:
            on_stale = mdp.get("on_stale", "DEGRADE")
            rc = ("observation_stale:over_freshness_budget",)
            if on_stale == "REJECT":
                return Decision("REJECT", rc, ("freshness_seconds",))
            return Decision("DEGRADE", rc, ("freshness_seconds",))

    # 0.6) Watermark skew gate (optional): sources[*].watermark_event_time
    max_skew = mdp.get("max_watermark_skew_seconds")
    if max_skew is not None:
        def _parse_iso(ts: Any) -> float | None:
            if isinstance(ts, str):
                try:
                    return datetime.fromisoformat(ts).timestamp()
                except ValueError:
                    return None
            return None

        wms: List[float] = []
        sources = meta.get("sources")
        if isinstance(sources, list):
            for s in sources:
                if isinstance(s, dict):
                    t = _parse_iso(s.get("watermark_event_time"))
                    if t is not None:
                        wms.append(t)

        if len(wms) < 2:
            return Decision("DEGRADE", ("observation_missing:watermark_event_time",), ("watermark_event_time",))

        skew = float(max(wms) - min(wms))
        if skew > float(max_skew):
            return Decision("DEGRADE", ("observation_stale:watermark_skew",), ("watermark_event_time",))

    # 1) Constraints (must not violate)
    reasons: List[str] = []
    for c in policy.get("constraints", []):
        kind = c.get("kind")
        metric = c.get("metric")
        if not metric:
            return Decision("REJECT", ("invalid_constraint:missing_metric",), ())
        if metric not in data:
            return Decision("DEGRADE", ("observation_missing:constraint_metric",), (metric,))
        v = data[metric]

        if kind == "must_be_zero":
            try:
                if float(v) != 0.0:
                    reasons.append(f"constraint_violation:{metric}")
            except (TypeError, ValueError):
                return Decision("DEGRADE", ("observation_invalid:constraint_value",), (metric,))
        elif kind == "must_be_false":
            if not isinstance(v, bool):
                return Decision("DEGRADE", ("observation_invalid:constraint_value",), (metric,))
            if v is True:
                reasons.append(f"constraint_violation:{metric}")
        else:
            return Decision("REJECT", (f"unknown_constraint_kind:{kind}",), ())

    if reasons:
        return Decision("REJECT", tuple(reasons), ())

    # 2) Guardrails (SLO gates)
    for g in policy.get("guardrails", []):
        m = g["metric"]
        op = g["op"]
        thr = g["threshold"]

        if m not in data:
            return Decision("DEGRADE", ("observation_missing:guardrail_metric",), (m,))

        try:
            v = float(data[m])
            t = float(thr)
        except (TypeError, ValueError):
            return Decision("DEGRADE", ("observation_invalid:guardrail_value",), (m,))

        ok = True
        if op == "<=":
            ok = v <= t
        elif op == "<":
            ok = v < t
        elif op == ">=":
            ok = v >= t
        elif op == ">":
            ok = v > t
        else:
            return Decision("REJECT", (f"unknown_guardrail_op:{op}",), ())

        if not ok:
            return Decision("REJECT", (f"guardrail_violation:{m}",), ())

    # 3) Primary “winner” belongs to a separate stats layer.
    return Decision("ACCEPT", (), ())

4.3 Append-only NDJSON logging

# append_log.py
from __future__ import annotations

import json
from datetime import datetime, timezone, timedelta
from pathlib import Path
from typing import Any, Dict

JST = timezone(timedelta(hours=9))

def now_iso() -> str:
    return datetime.now(JST).isoformat()

def append_ndjson(path: Path, event: Dict[str, Any]) -> None:
    line = json.dumps(event, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")

4.4 Stage driver (minimal)

# rollout_driver.py
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Dict

from digest import input_digest
from gate_eval import evaluate
from append_log import append_ndjson, now_iso

def load_json(path: Path) -> Dict[str, Any]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)

def main() -> int:
    policy = load_json(Path("policy.json"))
    log_path = Path("logs/decision.ndjson")

    experiment_id = "exp-42"
    variant_id = "B"

    stages = ["DRY_RUN", "SHADOW", "CANARY_10", "CANARY_25", "CANARY_50", "ROLLOUT_100"]

    prev_stage = None
    for stage in stages:
        # In production: fetch metrics snapshot from Prometheus/Cloud Monitoring/warehouse.
        metrics = load_json(Path(f"snapshots/{stage}.json"))

        inp = {
            "experiment_id": experiment_id,
            "variant_id": variant_id,
            "stage": stage,
            "policy_id": policy["policy_id"],
            "policy_version": policy["policy_version"],
            "metrics_snapshot": metrics,
        }
        d = evaluate(policy, metrics)

        run_id = f"{experiment_id}:{variant_id}:{policy['policy_version']}"
        event_id = f"{run_id}:{stage}"

        event = {
            "ts": now_iso(),
            "run_id": run_id,
            "event_id": event_id,
            "experiment_id": experiment_id,
            "variant_id": variant_id,
            "policy_id": policy["policy_id"],
            "policy_version": policy["policy_version"],
            "stage": stage,
            "state_transition": {"from": prev_stage, "to": stage},
            "input_digest": input_digest(inp),
            "metrics_snapshot": metrics,
            "decision": {"verdict": d.verdict, "reason_codes": list(d.reason_codes), "missing": list(d.missing)},
            "evidence_refs": {"dashboard_id": "dash-foo", "alert_policy_id": "alert-slo-foo", "change_id": "CHG-123"},
        }
        append_ndjson(log_path, event)

        if d.verdict == "REJECT":
            print(f"[STOP] stage={stage} reasons={d.reason_codes}")
            return 1

        if d.verdict == "DEGRADE":
            print(f"[HOLD] stage={stage} missing={d.missing}")
            return 2

        prev_stage = stage
        print(f"[OK] stage={stage}")

    return 0

if __name__ == "__main__":
    raise SystemExit(main())

4.5 Replay (“what happened?” becomes queryable)

# replay.py
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Dict, Iterable, List

def read_ndjson(path: Path) -> Iterable[Dict[str, Any]]:
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def main() -> int:
    path = Path("logs/decision.ndjson")
    events = list(read_ndjson(path))

    rejects = [e for e in events if e.get("decision", {}).get("verdict") == "REJECT"]
    for e in rejects:
        print(json.dumps({
            "ts": e.get("ts"),
            "run_id": e.get("run_id"),
            "event_id": e.get("event_id"),
            "stage": e.get("stage"),
            "reason_codes": e.get("decision", {}).get("reason_codes", []),
            "policy_version": e.get("policy_version"),
        }, ensure_ascii=False))
    return 0

if __name__ == "__main__":
    raise SystemExit(main())

Now postmortems are replayable.


4.6 The missing link from PoC to production (the real constraints)

The minimal code above hides the hardest parts. This section makes them explicit.

4.6.1 Treat metrics_snapshot as evidence (with freshness)

The hard part is aligning multiple sources (Prometheus/logs/warehouse) to the “same window.” In practice, perfect alignment is unrealistic.

So treat the snapshot as evidence with meta:

  • collected_at, window
  • watermark_event_time (event-time “how far data is complete”)
  • per-source watermarks
  • freshness_seconds (how late is this snapshot?)
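Concretely, a snapshot carried as evidence wraps the raw numbers with that meta. The shape below is the one the evaluator in 4.2 already normalizes (it reads `freshness_seconds` and `sources[*].watermark_event_time` from `_meta`); the values are illustrative:

```python
# A metrics snapshot as evidence: the numbers plus "how trustworthy is this
# observation" meta, in the wrapper shape gate_eval.evaluate() accepts.
snapshot = {
    "metrics": {
        "conversion_rate": 0.131,
        "error_rate_5m": 0.009,
        "p95_latency_ms_5m": 380,
    },
    "_meta": {
        "collected_at": "2026-02-18T10:15:30+09:00",
        "window": "5m",
        "freshness_seconds": 45,
        "sources": [
            {"name": "prometheus", "watermark_event_time": "2026-02-18T10:15:00+09:00"},
            {"name": "warehouse",  "watermark_event_time": "2026-02-18T10:14:30+09:00"},
        ],
    },
}

# 45s of lateness and 30s of cross-source skew both fit the budgets below.
assert snapshot["_meta"]["freshness_seconds"] <= 120
assert len(snapshot["_meta"]["sources"]) >= 2
```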

Then gate freshness in policy:

{
  "missing_data_policy": {
    "required_metrics": ["conversion_rate","error_rate_5m","p95_latency_ms_5m"],
    "on_missing": "DEGRADE",
    "max_freshness_seconds": 120,
    "max_watermark_skew_seconds": 60,
    "on_stale": "DEGRADE"
  }
}

Freshness DEGRADE reason codes become SLO targets:

  • observation_stale:over_freshness_budget
  • observation_stale:watermark_skew
  • observation_missing:required_metric

4.6.2 Don’t mix “promotion (winner)” with “safety gate”

The gate evaluator above answers: “Is it safe to proceed?”
It does not answer: “Did B win statistically?”

Separate layers:

  • Safety gate: Guardrails + Constraints (canary progression)
  • Promotion gate: statistics / sample size / sequential testing (100% rollout)

When promotion isn’t ready, DEGRADE (precondition unknown):

  • precondition_unknown:sample_size_not_reached
  • precondition_unknown:assignment_not_stable
  • precondition_unknown:significance_not_ready

Policy can make this explicit:

{
  "promotion_gate": {
    "required_for_stages": ["ROLLOUT_100"],
    "min_sample_size": 100000,
    "require_ci_low_gt": 0.0,
    "on_not_ready": "DEGRADE"
  }
}

4.6.3 Rollback atomicity: automate only reversible changes

Auto-rollback is safest for:

  • feature flags / routing (stateless switches)
  • backward-compatible config

Auto-rollback is dangerous for:

  • schema changes
  • irreversible migrations
  • external side effects (emails, billing)

So classify changes:

  • reversible → allow auto rollout
  • stateful_or_irreversible → dry-run/shadow only; require human approval after DEGRADE
  • unsafe_for_experiment → don’t A/B it
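A minimal sketch of encoding that classification so the driver can refuse automation it must not attempt (the class names follow the bullets above; the mapping itself is a suggested convention, not a fixed rule):

```python
# Reversibility class -> what the automation may do without a human.
AUTOMATION_POLICY = {
    "reversible": {"auto_rollout": True, "auto_rollback": True},
    # dry-run/shadow only; a human approves after any DEGRADE:
    "stateful_or_irreversible": {"auto_rollout": False, "auto_rollback": False},
    "unsafe_for_experiment": None,  # don't A/B it at all
}

def allowed(change_class: str, action: str) -> bool:
    policy = AUTOMATION_POLICY.get(change_class)
    return bool(policy and policy.get(action))

assert allowed("reversible", "auto_rollback") is True
assert allowed("stateful_or_irreversible", "auto_rollout") is False
assert allowed("unsafe_for_experiment", "auto_rollout") is False
```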

4.6.4 Align by watermark contract, not “perfect 5-min windows”

Don’t chase perfect synchronization across sources. Use an explicit watermark contract:

  • watermark_event_time says “complete up to here”
  • allow only bounded skew; otherwise DEGRADE

This makes “data pipeline delay” a designed state, not a surprise.

4.6.5 Implement the driver as an orchestrator (state machine)

Real rollouts run for hours/days. Don’t use a long-running loop.

Use:

  • scheduler (cron/workflow engine), or
  • event-driven triggers (snapshot arrival, approval granted, alert fired)

Split the “driver” into two layers. This small separation makes the implementation stable.

4.6.5.1 Split into two layers: Orchestrator vs Evaluator

In production, the driver is much more stable if you separate:

(A) Orchestrator (state + re-entry management)
Persist run_id state and manage “when this run can be evaluated again.”

Typical fields:

  • current stage
  • resume_token (pointer needed to resume)
  • last snapshot_id / snapshot_digest
  • last decision (ACCEPT/REJECT/DEGRADE + reasons)
  • next evaluation condition (e.g., fresh snapshot arrival / approval granted / time elapsed)
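A minimal sketch of that persisted state (field names follow the bullets above; in production this would be a database row keyed by run_id, not an in-memory object):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class RunState:
    """Per-run orchestrator state: everything needed to re-enter a rollout
    after a crash, a restart, or a DEGRADE hold."""
    run_id: str
    current_stage: str
    last_decision: Optional[Dict[str, Any]] = None   # verdict + reason_codes
    last_snapshot_digest: Optional[str] = None
    resume_token: Optional[str] = None
    next_eval_condition: str = "fresh_snapshot"      # or "approval_granted", "deadline"

state = RunState(run_id="exp-42:B:2026-02-18", current_stage="CANARY_25")
state.last_decision = {"verdict": "DEGRADE",
                       "reason_codes": ["observation_missing:required_metric"]}
assert state.next_eval_condition == "fresh_snapshot"
```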

(B) Evaluator (the deterministic judge)
This is the pure-ish part: evaluate(policy, metrics_snapshot) -> verdict.
Keep it as close to a pure function as possible.

This separation removes the unrealistic assumptions of a “single long-running process”:
crashes, restarts, duplicated execution, and partial progress become normal—and safe.

4.6.5.2 DEGRADE is not “stop”—it’s “stop in a re-enterable way”

To make DEGRADE operable, you must be able to re-evaluate the same stage under the same rules after missing grounds are resolved.

So on DEGRADE, always persist (and log):

  • reason_codes / missing
  • resume_token (opaque is fine; a DB pointer is often enough)
  • requested_actions (machine-readable next steps)

Then the Orchestrator can trigger re-evaluation when:

  • a fresh snapshot arrives
  • an approval is granted
  • a deadline passes (scheduled re-check)

DEGRADE becomes a queue with re-entry—not a meeting.


4.6.6 Make Promotion Gate mandatory for 100% rollout (not only safety gates)

As discussed in 4.6.2, gate_eval answers:

“Is it safe to proceed?”

It does not answer:

“Did the variant actually win (business decision)?”

The side effect of this separation is real:

If you reach ROLLOUT_100 with “safe to proceed” only,
you can end up fully rolling out a variant that is slightly worse (statistical noise, but real business loss).

The minimal practical fix is: requirements differ by stage.

4.6.6.1 Canary uses Safety Gate; 100% requires Promotion Gate

  • CANARY (10/25/50)
    • Required: Safety Gate (Guardrails / Constraints)
    • Statistics: optional (log as signal)
  • ROLLOUT_100
    • Required: Safety Gate (of course)
    • Required: Promotion Gate (winning condition)

Treat Promotion Gate as a separate layer and include “promotion evidence” inside the snapshot, for example:

  • sample_size
  • effect_estimate
  • ci_low / ci_high (confidence interval)
  • p_value (if needed)
  • sequential_test_state (for sequential tests)

If Promotion Gate is not satisfied, fail safe with DEGRADE:

  • precondition_unknown:sample_size_not_reached
  • precondition_unknown:significance_not_ready
  • precondition_unknown:assignment_not_stable

4.6.6.2 Make “100% promotion conditions” explicit in policy

Operationally, the clearest approach is to express promotion conditions directly in policy:

{
  "promotion_gate": {
    "required_for_stages": ["ROLLOUT_100"],
    "min_sample_size": 100000,
    "require_ci_low_gt": 0.0,
    "on_not_ready": "DEGRADE"
  }
}
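A minimal sketch of enforcing this policy block (the evidence keys like sample_size and ci_low follow the promotion-evidence list in 4.6.6.1):

```python
def promotion_verdict(policy: dict, stage: str, evidence: dict) -> str:
    """ACCEPT if promotion evidence satisfies the policy, otherwise the
    configured fail-safe verdict (DEGRADE by default)."""
    gate = policy.get("promotion_gate", {})
    if stage not in gate.get("required_for_stages", []):
        return "ACCEPT"  # promotion gate not required at this stage
    not_ready = gate.get("on_not_ready", "DEGRADE")
    n = evidence.get("sample_size")
    ci_low = evidence.get("ci_low")
    if n is None or ci_low is None:
        return not_ready  # precondition_unknown: evidence missing
    if n < gate.get("min_sample_size", 0):
        return not_ready  # precondition_unknown:sample_size_not_reached
    if ci_low <= gate.get("require_ci_low_gt", 0.0):
        return not_ready  # precondition_unknown:significance_not_ready
    return "ACCEPT"

policy = {"promotion_gate": {"required_for_stages": ["ROLLOUT_100"],
                             "min_sample_size": 100000,
                             "require_ci_low_gt": 0.0,
                             "on_not_ready": "DEGRADE"}}
assert promotion_verdict(policy, "CANARY_25", {}) == "ACCEPT"
assert promotion_verdict(policy, "ROLLOUT_100", {"sample_size": 5000, "ci_low": 0.001}) == "DEGRADE"
assert promotion_verdict(policy, "ROLLOUT_100", {"sample_size": 200000, "ci_low": 0.001}) == "ACCEPT"
```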

Now “Safety Gate passed, but we are not winning, so we do not go to 100%” becomes a deterministic decision—not vibes.

Tip:

  • Canary stages stop on safety.
  • 100% promotion stops on winning conditions (Promotion Gate).

This two-layer stop mechanism makes the whole rollout dramatically easier to operate.


5) Closing: three pieces, one operable system

  • Goal Surface prevents single-metric Goodhart.
  • Deterministic logs make incidents replayable.
  • Safe automation makes staged rollout enforceable with REJECT/DEGRADE gates.

Ultimately, turning experiments into operations is not about “smartness.”
It’s about determinism—contracts, evidence, and safe boundaries.
