A/B tests and staged rollouts break in predictable ways:
- You optimize a single KPI (CTR/CVR/revenue), Goodhart kicks in, and you quietly destroy SLOs, safety, fairness, or policy compliance.
- You can’t replay what happened, so postmortems become “guessing + meetings.”
- The automation boundary is vague, and you only notice the damage after 100% rollout.
This post compresses the fix into one minimal pattern:
Evaluation (Goal Surface: multi-objective + constraints)
× Evidence (deterministic logs: replayable)
× Execution (safe automation: dry-run / shadow / canary / rollout)
0) Fix one failure story first
A common real incident:
- You ship new logic (recommendation/search/billing/risk/scoring/routing) via A/B.
- A single KPI (e.g., CTR) improves, so you expand the rollout.
- Meanwhile error_rate / p95_latency / cost is getting worse.
- Logs are thin, so you can’t answer “which input / which branch caused the regression.”
- You notice only at 100%, and now you can’t even roll back with confidence.
This is not three independent mistakes.
Evaluation, logging, and automation boundaries are entangled.
1) Evaluation is not a scalar: Goal Surface (multi-objective + constraints)
1.1 What is a Goal Surface?
A Goal Surface is a contract that prevents “winning” by a single number.
- Primary (what you want to improve): CVR, churn, search success rate…
- Guardrails (SLO gates): error_rate, p95_latency, crash_rate…
- Constraints (must not violate): policy violations = 0, PII leaks = 0, forbidden features = 0, (optionally) imbalance / disparate impact monitoring…
- Budgets (allowed degradation): p95 +2% max, error_rate +0.1% max…
“Win” means: Primary improves while Guardrails + Constraints hold.
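As a sketch (the names here are illustrative, not part of the policy format below): the “win” predicate is a conjunction, never a single comparison.

```python
# Hypothetical sketch: "win" = primary lift AND guardrails AND constraints.
def is_win(primary_lift: float, min_lift: float,
           guardrails_ok: bool, constraints_ok: bool) -> bool:
    return primary_lift >= min_lift and guardrails_ok and constraints_ok

# CTR lift looks great, but a guardrail broke: not a win.
assert is_win(0.004, 0.002, guardrails_ok=False, constraints_ok=True) is False
```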
1.2 Minimal Goal Surface format (JSON policy)
Keep evaluation rules outside code and pin them with policy_version.
{
  "policy_id": "exp-eval-policy",
  "policy_version": "2026-02-18",
  "primary": [
    { "metric": "conversion_rate", "direction": "up", "min_lift": 0.002 }
  ],
  "guardrails": [
    { "metric": "error_rate_5m", "op": "<=", "threshold": 0.01, "budget_delta": 0.001 },
    { "metric": "p95_latency_ms_5m", "op": "<=", "threshold": 400, "budget_delta": 20 }
  ],
  "constraints": [
    { "kind": "must_be_zero", "metric": "pii_leak_incidents" },
    { "kind": "must_be_false", "metric": "prohibited_feature_present" }
  ],
  "missing_data_policy": {
    "required_metrics": ["conversion_rate", "error_rate_5m", "p95_latency_ms_5m", "pii_leak_incidents", "prohibited_feature_present"],
    "on_missing": "DEGRADE"
  }
}
Key move: on_missing = DEGRADE.
If you can’t observe the required evidence, you stop safely instead of “rolling forward on vibes.”
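A minimal sketch of that move, assuming the policy shape above:

```python
# Sketch: required metrics not present in the snapshot force the on_missing verdict.
def missing_required(policy: dict, snapshot: dict) -> list:
    req = policy.get("missing_data_policy", {}).get("required_metrics", [])
    return [m for m in req if m not in snapshot]

policy = {"missing_data_policy": {"required_metrics": ["conversion_rate", "error_rate_5m"],
                                  "on_missing": "DEGRADE"}}
snapshot = {"conversion_rate": 0.13}  # error_rate_5m never arrived

missing = missing_required(policy, snapshot)
verdict = policy["missing_data_policy"]["on_missing"] if missing else "ACCEPT"
# verdict is "DEGRADE": hold safely instead of rolling forward on vibes
```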
2) Make failures replayable: a minimal deterministic log spec
A/B tests become painful not because they fail, but because you can’t replay what happened.
So logs are not “for humans.” Logs are for replay.
2.1 Minimal fields you should not skip
- run_id: join key for the entire rollout/experiment execution (stable across stages)
- event_id: unique per event (often run_id + stage)
- experiment_id / variant_id
- policy_id / policy_version (the contract)
- input_digest: fingerprint of the evaluation input (so you can prove “same input” later)
- state_transition: from/to (stage changes)
- decision: ACCEPT / REJECT / DEGRADE + reason codes
- evidence_refs: dashboard IDs, alert policy IDs, tickets, approvals
- metrics_snapshot: the gate evidence (at least the required metrics)
2.2 NDJSON append-only example
{"ts":"2026-02-18T10:00:00+09:00","run_id":"exp-42:B:2026-02-18","event_id":"exp-42:B:2026-02-18:CANARY_10","experiment_id":"exp-42","variant_id":"B","policy_id":"exp-eval-policy","policy_version":"2026-02-18","stage":"CANARY_10","state_transition":{"from":"SHADOW","to":"CANARY_10"},"input_digest":"sha256:...","metrics_snapshot":{"conversion_rate":0.132,"error_rate_5m":0.008,"p95_latency_ms_5m":360,"pii_leak_incidents":0,"prohibited_feature_present":false},"decision":{"verdict":"ACCEPT","reason_codes":[],"missing":[]},"evidence_refs":{"dashboard_id":"dash-foo","alert_policy_id":"alert-slo-foo","change_id":"CHG-123"}}
{"ts":"2026-02-18T10:15:00+09:00","run_id":"exp-42:B:2026-02-18","event_id":"exp-42:B:2026-02-18:CANARY_25","experiment_id":"exp-42","variant_id":"B","policy_id":"exp-eval-policy","policy_version":"2026-02-18","stage":"CANARY_25","state_transition":{"from":"CANARY_10","to":"CANARY_25"},"input_digest":"sha256:...","metrics_snapshot":{"conversion_rate":0.131,"error_rate_5m":0.013,"p95_latency_ms_5m":420,"pii_leak_incidents":0,"prohibited_feature_present":false},"decision":{"verdict":"REJECT","reason_codes":["guardrail_violation:error_rate_5m"],"missing":[]},"evidence_refs":{"dashboard_id":"dash-foo","alert_policy_id":"alert-slo-foo","change_id":"CHG-123"}}
With logs like this, you don’t “discuss.” You query.
3) Safe automation boundary: dry-run / shadow / canary / rollout
Once you have Goal Surface + deterministic logs, execution becomes a stage model.
3.1 Recommended stages
- dry-run: show plan only (do not execute)
- shadow: run in the background, collect comparisons (no user impact)
- canary: small percent (10% → 25% → 50% → 100%)
- rollout: staged expansion (auto-stop / auto-rollback on gate failures)
flowchart LR
DR[Dry-run] --> SH[Shadow]
SH --> C10[Canary 10%]
C10 --> C25[Canary 25%]
C25 --> C50[Canary 50%]
C50 --> R[Rollout 100%]
C10 -.gate fail.-> RB[Rollback/Stop]
C25 -.gate fail.-> RB
C50 -.gate fail.-> RB
R -.gate fail.-> RB
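The same diagram as a table-driven sketch (stage names match the driver in section 4; the ROLLBACK handling is illustrative):

```python
# Illustrative transition table: REJECT always falls back, DEGRADE holds in place.
NEXT_STAGE = {
    "DRY_RUN": "SHADOW",
    "SHADOW": "CANARY_10",
    "CANARY_10": "CANARY_25",
    "CANARY_25": "CANARY_50",
    "CANARY_50": "ROLLOUT_100",
}

def next_stage(stage: str, verdict: str) -> str:
    if verdict == "REJECT":
        return "ROLLBACK"   # gate fail -> rollback/stop
    if verdict == "DEGRADE":
        return stage        # hold: re-evaluate the same stage later
    return NEXT_STAGE.get(stage, "ROLLOUT_100")
```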
3.2 Stop conditions must be deterministic
- Observation is missing/stale → DEGRADE (hold)
- Guardrail violation → REJECT (stop/rollback)
- Constraint violation → REJECT (immediate stop)
This removes “judgment by mood.”
3.3 Turn DEGRADE into an operational metric (SLO)
DEGRADE is not failure. It’s a safe branch.
But if DEGRADE keeps increasing, operations get stuck.
So you manage DEGRADE with SLOs—not feelings.
3.3.1 Reason code taxonomy (aggregate-able granularity)
Avoid free-text. Prefer categories that map to “one action to fix.”
A) observation_missing:* (telemetry missing / delayed)
- observation_missing:required_metric
- observation_missing:delayed_pipeline
- observation_missing:broken_dashboard_ref
B) evidence_missing:* (audit grounds missing)
- evidence_missing:change_id
- evidence_missing:approval
- evidence_missing:ticket_link
C) precondition_unknown:* (preconditions not satisfied)
- precondition_unknown:assignment_not_stable
- precondition_unknown:sample_size_not_reached
- precondition_unknown:variant_mapping_ambiguous
3.3.2 Two SLOs that actually work
- DEGRADE rate SLO
  Example: DEGRADE / (ACCEPT + REJECT + DEGRADE) <= 1% (rolling 7d)
- Time-to-resume SLO
  Example: P95 time_to_resume <= 30m for observation issues
  Example: P95 time_to_resume <= 24h for approval waits
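Both SLOs fall straight out of the decision log; a sketch over a list of verdicts and resume times (field names follow the log spec in section 2):

```python
import math

def degrade_rate(verdicts) -> float:
    # DEGRADE / (ACCEPT + REJECT + DEGRADE) over the window.
    total = len([v for v in verdicts if v in ("ACCEPT", "REJECT", "DEGRADE")])
    return len([v for v in verdicts if v == "DEGRADE"]) / total if total else 0.0

def p95(seconds) -> float:
    # Nearest-rank P95 over observed time_to_resume values.
    s = sorted(seconds)
    return s[max(0, math.ceil(0.95 * len(s)) - 1)]

verdicts = ["ACCEPT"] * 98 + ["REJECT"] + ["DEGRADE"]
rate = degrade_rate(verdicts)   # 0.01 -> inside a 1% SLO, barely
```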
3.3.3 Make DEGRADE re-enterable (resume-friendly)
A DEGRADE event should include:
- reason_codes + missing
- resume_token (opaque is fine)
- requested_actions (typed “next steps”)
Example:
{
  "ts": "2026-02-18T10:30:00+09:00",
  "run_id": "exp-42:B:2026-02-18",
  "event_id": "exp-42:B:2026-02-18:CANARY_25",
  "stage": "CANARY_25",
  "decision": {
    "verdict": "DEGRADE",
    "reason_codes": ["observation_missing:required_metric"],
    "missing": ["p95_latency_ms_5m"]
  },
  "resume": {
    "resume_token": "opaque:rsn_7f3a9c...",
    "requested_actions": [
      {"name": "collect_metric", "params": {"metric": "p95_latency_ms_5m"}},
      {"name": "rerun_gate_check", "params": {"stage": "CANARY_25"}}
    ]
  }
}
Now DEGRADE becomes “a queue,” not “a meeting.”
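A sketch of the consuming side, assuming the event shape above: a worker turns requested_actions into typed work items that carry the resume_token.

```python
def to_work_items(degrade_event: dict) -> list:
    # Each requested_action becomes one queue entry carrying the resume_token,
    # so whoever completes it can trigger re-evaluation of the same stage.
    resume = degrade_event.get("resume", {})
    token = resume.get("resume_token")
    return [{"action": a["name"], "params": a.get("params", {}), "resume_token": token}
            for a in resume.get("requested_actions", [])]

event = {"resume": {"resume_token": "opaque:rsn_7f3a9c",
                    "requested_actions": [
                        {"name": "collect_metric", "params": {"metric": "p95_latency_ms_5m"}},
                        {"name": "rerun_gate_check", "params": {"stage": "CANARY_25"}}]}}
items = to_work_items(event)   # two work items, one resume_token
```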
4) Minimal implementation (stdlib-only): gate evaluation + append logs + replay
You can run this pattern without external libraries.
The purpose is to show the operational skeleton.
(The code below uses PEP 604 union types like float | None, so it requires Python 3.10+.)
4.1 Stable input fingerprint: input_digest (sha256)
# digest.py
from __future__ import annotations
import hashlib
import json
from typing import Any, Dict

def canonical_json(obj: Any) -> str:
    return json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))

def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def input_digest(inp: Dict[str, Any]) -> str:
    return "sha256:" + sha256_hex(canonical_json(inp))
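The point of the canonical form is that key order must not change the fingerprint. A self-contained check (inlining the two helpers above):

```python
import hashlib
import json

def canonical_json(obj) -> str:
    # sort_keys + fixed separators -> byte-identical output for equal dicts.
    return json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))

def input_digest(inp: dict) -> str:
    return "sha256:" + hashlib.sha256(canonical_json(inp).encode("utf-8")).hexdigest()

a = input_digest({"stage": "CANARY_10", "variant_id": "B"})
b = input_digest({"variant_id": "B", "stage": "CANARY_10"})
assert a == b                    # same input, same fingerprint
assert a.startswith("sha256:")
```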
4.2 Gate evaluation: Goal Surface → Verdict
Important: the returned ACCEPT/REJECT/DEGRADE is an operational gate, not “statistical winner.”
- ACCEPT = guardrails/constraints are satisfied; safe to proceed to the next stage
- REJECT = stop/rollback
- DEGRADE = hold (missing observation/evidence/preconditions)
# gate_eval.py
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Tuple

Verdict = str  # "ACCEPT" | "REJECT" | "DEGRADE"

@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    reason_codes: Tuple[str, ...]
    missing: Tuple[str, ...]

def _missing_required(policy: Dict[str, Any], metrics: Dict[str, Any]) -> List[str]:
    req = policy.get("missing_data_policy", {}).get("required_metrics", [])
    out: List[str] = []
    for k in req:
        if k not in metrics:
            out.append(k)
    return out

def evaluate(policy: Dict[str, Any], metrics: Dict[str, Any]) -> Decision:
    mdp = policy.get("missing_data_policy", {})

    # Wrapper-friendly normalization:
    # - {"metrics": {...}, "_meta": {...}} OR a flat dict.
    snapshot = metrics if isinstance(metrics, dict) else {}
    if isinstance(snapshot.get("metrics"), dict):
        data = snapshot["metrics"]
        # Copy so the caller's _meta dict is never mutated.
        meta = dict(snapshot["_meta"]) if isinstance(snapshot.get("_meta"), dict) else {}
        # Pull a few optional fields up if present (minimal).
        for k in ("freshness_seconds", "watermark_event_time", "sources"):
            if k in snapshot and k not in meta:
                meta[k] = snapshot[k]
    else:
        data = snapshot
        meta = snapshot.get("_meta") if isinstance(snapshot.get("_meta"), dict) else {}

    # 0) Missing required metrics → DEGRADE/REJECT per policy
    missing = _missing_required(policy, data)
    if missing:
        on_missing = mdp.get("on_missing", "DEGRADE")
        rc = ("observation_missing:required_metric",)
        if on_missing == "REJECT":
            return Decision("REJECT", rc, tuple(missing))
        return Decision("DEGRADE", rc, tuple(missing))

    # 0.5) Freshness gate (optional)
    max_fresh = mdp.get("max_freshness_seconds")
    if max_fresh is not None:
        fs = meta.get("freshness_seconds")
        if fs is None:
            on_stale = mdp.get("on_stale", "DEGRADE")
            rc = ("observation_missing:freshness_seconds",)
            if on_stale == "REJECT":
                return Decision("REJECT", rc, ("freshness_seconds",))
            return Decision("DEGRADE", rc, ("freshness_seconds",))
        try:
            fs_v = float(fs)
            thr_v = float(max_fresh)
        except (TypeError, ValueError):
            return Decision("DEGRADE", ("observation_invalid:freshness_seconds",), ("freshness_seconds",))
        if fs_v > thr_v:
            on_stale = mdp.get("on_stale", "DEGRADE")
            rc = ("observation_stale:over_freshness_budget",)
            if on_stale == "REJECT":
                return Decision("REJECT", rc, ("freshness_seconds",))
            return Decision("DEGRADE", rc, ("freshness_seconds",))

    # 0.6) Watermark skew gate (optional): sources[*].watermark_event_time
    max_skew = mdp.get("max_watermark_skew_seconds")
    if max_skew is not None:
        def _parse_iso(ts: Any) -> float | None:
            if isinstance(ts, str):
                try:
                    return datetime.fromisoformat(ts).timestamp()
                except ValueError:
                    return None
            return None

        wms: List[float] = []
        sources = meta.get("sources")
        if isinstance(sources, list):
            for s in sources:
                if isinstance(s, dict):
                    t = _parse_iso(s.get("watermark_event_time"))
                    if t is not None:
                        wms.append(t)
        if len(wms) < 2:
            return Decision("DEGRADE", ("observation_missing:watermark_event_time",), ("watermark_event_time",))
        skew = float(max(wms) - min(wms))
        if skew > float(max_skew):
            return Decision("DEGRADE", ("observation_stale:watermark_skew",), ("watermark_event_time",))

    # 1) Constraints (must not violate)
    reasons: List[str] = []
    for c in policy.get("constraints", []):
        kind = c.get("kind")
        metric = c.get("metric")
        if not metric:
            return Decision("REJECT", ("invalid_constraint:missing_metric",), ())
        if metric not in data:
            return Decision("DEGRADE", ("observation_missing:constraint_metric",), (metric,))
        v = data[metric]
        if kind == "must_be_zero":
            try:
                if float(v) != 0.0:
                    reasons.append(f"constraint_violation:{metric}")
            except (TypeError, ValueError):
                return Decision("DEGRADE", ("observation_invalid:constraint_value",), (metric,))
        elif kind == "must_be_false":
            if not isinstance(v, bool):
                return Decision("DEGRADE", ("observation_invalid:constraint_value",), (metric,))
            if v is True:
                reasons.append(f"constraint_violation:{metric}")
        else:
            return Decision("REJECT", (f"unknown_constraint_kind:{kind}",), ())
    if reasons:
        return Decision("REJECT", tuple(reasons), ())

    # 2) Guardrails (SLO gates)
    for g in policy.get("guardrails", []):
        m = g["metric"]
        op = g["op"]
        thr = g["threshold"]
        if m not in data:
            return Decision("DEGRADE", ("observation_missing:guardrail_metric",), (m,))
        try:
            v = float(data[m])
            t = float(thr)
        except (TypeError, ValueError):
            return Decision("DEGRADE", ("observation_invalid:guardrail_value",), (m,))
        ok = True
        if op == "<=":
            ok = v <= t
        elif op == "<":
            ok = v < t
        elif op == ">=":
            ok = v >= t
        elif op == ">":
            ok = v > t
        else:
            return Decision("REJECT", (f"unknown_guardrail_op:{op}",), ())
        if not ok:
            return Decision("REJECT", (f"guardrail_violation:{m}",), ())

    # 3) Primary “winner” belongs to a separate stats layer.
    return Decision("ACCEPT", (), ())
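A self-contained check mirroring the guardrail branch above (simplified; reason-code strings follow the taxonomy in 3.3.1), applied to the numbers from the NDJSON example in section 2.2:

```python
import operator

OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}

def check_guardrails(guardrails, data):
    # First missing metric -> DEGRADE; first violated threshold -> REJECT.
    for g in guardrails:
        m = g["metric"]
        if m not in data:
            return ("DEGRADE", f"observation_missing:guardrail_metric:{m}")
        if not OPS[g["op"]](float(data[m]), float(g["threshold"])):
            return ("REJECT", f"guardrail_violation:{m}")
    return ("ACCEPT", "")

guardrails = [
    {"metric": "error_rate_5m", "op": "<=", "threshold": 0.01},
    {"metric": "p95_latency_ms_5m", "op": "<=", "threshold": 400},
]
# CANARY_25 snapshot from section 2.2: error_rate_5m crossed the threshold.
verdict = check_guardrails(guardrails, {"error_rate_5m": 0.013, "p95_latency_ms_5m": 420})
# -> ("REJECT", "guardrail_violation:error_rate_5m")
```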
4.3 Append-only NDJSON logging
# append_log.py
from __future__ import annotations
import json
from datetime import datetime, timezone, timedelta
from pathlib import Path
from typing import Any, Dict

JST = timezone(timedelta(hours=9))

def now_iso() -> str:
    return datetime.now(JST).isoformat()

def append_ndjson(path: Path, event: Dict[str, Any]) -> None:
    line = json.dumps(event, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")
4.4 Stage driver (minimal)
# rollout_driver.py
from __future__ import annotations
import json
from pathlib import Path
from typing import Any, Dict

from digest import input_digest
from gate_eval import evaluate
from append_log import append_ndjson, now_iso

def load_json(path: Path) -> Dict[str, Any]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)

def main() -> int:
    policy = load_json(Path("policy.json"))
    log_path = Path("logs/decision.ndjson")
    experiment_id = "exp-42"
    variant_id = "B"
    stages = ["DRY_RUN", "SHADOW", "CANARY_10", "CANARY_25", "CANARY_50", "ROLLOUT_100"]
    prev_stage = None
    for stage in stages:
        # In production: fetch metrics snapshot from Prometheus/Cloud Monitoring/warehouse.
        metrics = load_json(Path(f"snapshots/{stage}.json"))
        inp = {
            "experiment_id": experiment_id,
            "variant_id": variant_id,
            "stage": stage,
            "policy_id": policy["policy_id"],
            "policy_version": policy["policy_version"],
            "metrics_snapshot": metrics,
        }
        d = evaluate(policy, metrics)
        run_id = f"{experiment_id}:{variant_id}:{policy['policy_version']}"
        event_id = f"{run_id}:{stage}"
        event = {
            "ts": now_iso(),
            "run_id": run_id,
            "event_id": event_id,
            "experiment_id": experiment_id,
            "variant_id": variant_id,
            "policy_id": policy["policy_id"],
            "policy_version": policy["policy_version"],
            "stage": stage,
            "state_transition": {"from": prev_stage, "to": stage},
            "input_digest": input_digest(inp),
            "metrics_snapshot": metrics,
            "decision": {"verdict": d.verdict, "reason_codes": list(d.reason_codes), "missing": list(d.missing)},
            "evidence_refs": {"dashboard_id": "dash-foo", "alert_policy_id": "alert-slo-foo", "change_id": "CHG-123"},
        }
        append_ndjson(log_path, event)
        if d.verdict == "REJECT":
            print(f"[STOP] stage={stage} reasons={d.reason_codes}")
            return 1
        if d.verdict == "DEGRADE":
            print(f"[HOLD] stage={stage} missing={d.missing}")
            return 2
        prev_stage = stage
        print(f"[OK] stage={stage}")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
4.5 Replay (“what happened?” becomes queryable)
# replay.py
from __future__ import annotations
import json
from pathlib import Path
from typing import Any, Dict, Iterable

def read_ndjson(path: Path) -> Iterable[Dict[str, Any]]:
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def main() -> int:
    path = Path("logs/decision.ndjson")
    events = list(read_ndjson(path))
    rejects = [e for e in events if e.get("decision", {}).get("verdict") == "REJECT"]
    for e in rejects:
        print(json.dumps({
            "ts": e.get("ts"),
            "run_id": e.get("run_id"),
            "event_id": e.get("event_id"),
            "stage": e.get("stage"),
            "reason_codes": e.get("decision", {}).get("reason_codes", []),
            "policy_version": e.get("policy_version"),
        }, ensure_ascii=False))
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
Now postmortems are replayable.
4.6 The missing link from PoC to production (the real constraints)
The minimal code above hides the hardest parts. This section makes them explicit.
4.6.1 Treat metrics_snapshot as evidence (with freshness)
The hard part is aligning multiple sources (Prometheus/logs/warehouse) to the “same window.” In practice, perfect alignment is unrealistic.
So treat the snapshot as evidence with meta:
- collected_at, window
- watermark_event_time (event-time “how far data is complete”), per-source watermarks
- freshness_seconds (how late is this snapshot?)
Then gate freshness in policy:
{
  "missing_data_policy": {
    "required_metrics": ["conversion_rate", "error_rate_5m", "p95_latency_ms_5m"],
    "on_missing": "DEGRADE",
    "max_freshness_seconds": 120,
    "max_watermark_skew_seconds": 60,
    "on_stale": "DEGRADE"
  }
}
Freshness DEGRADE reason codes become SLO targets:
- observation_stale:over_freshness_budget
- observation_stale:watermark_skew
- observation_missing:required_metric
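A sketch of the skew computation, assuming per-source watermark_event_time strings in ISO 8601:

```python
from datetime import datetime

def watermark_skew_seconds(sources):
    # Seconds between the most- and least-complete sources;
    # None when fewer than two watermarks are observable.
    ts = [datetime.fromisoformat(s["watermark_event_time"]).timestamp()
          for s in sources if isinstance(s, dict) and "watermark_event_time" in s]
    if len(ts) < 2:
        return None
    return max(ts) - min(ts)

sources = [
    {"name": "prometheus", "watermark_event_time": "2026-02-18T10:15:00+09:00"},
    {"name": "warehouse", "watermark_event_time": "2026-02-18T10:13:30+09:00"},
]
skew = watermark_skew_seconds(sources)                 # 90.0 seconds
verdict = "DEGRADE" if skew is None or skew > 60 else "ACCEPT"
```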
4.6.2 Don’t mix “promotion (winner)” with “safety gate”
The gate evaluator above answers: “Is it safe to proceed?”
It does not answer: “Did B win statistically?”
Separate layers:
- Safety gate: Guardrails + Constraints (canary progression)
- Promotion gate: statistics / sample size / sequential testing (100% rollout)
When promotion isn’t ready, DEGRADE (precondition unknown):
- precondition_unknown:sample_size_not_reached
- precondition_unknown:assignment_not_stable
- precondition_unknown:significance_not_ready
Policy can make this explicit:
{
  "promotion_gate": {
    "required_for_stages": ["ROLLOUT_100"],
    "min_sample_size": 100000,
    "require_ci_low_gt": 0.0,
    "on_not_ready": "DEGRADE"
  }
}
4.6.3 Rollback atomicity: automate only reversible changes
Auto-rollback is safest for:
- feature flags / routing (stateless switches)
- backward-compatible config
Auto-rollback is dangerous for:
- schema changes
- irreversible migrations
- external side effects (emails, billing)
So classify changes:
- reversible → allow auto rollout
- stateful_or_irreversible → dry-run/shadow only; require human approval after DEGRADE
- unsafe_for_experiment → don’t A/B it
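A sketch of that classification as a gate (the class names follow the list above; the mapping itself is illustrative):

```python
# Illustrative mapping from change class to the stages automation may enter.
AUTO_ALLOWED = {
    "reversible": {"DRY_RUN", "SHADOW", "CANARY_10", "CANARY_25", "CANARY_50", "ROLLOUT_100"},
    "stateful_or_irreversible": {"DRY_RUN", "SHADOW"},   # human approval beyond shadow
    "unsafe_for_experiment": set(),                      # don't A/B it
}

def automation_allowed(change_class: str, stage: str) -> bool:
    # Unknown classes get no automation at all (fail safe).
    return stage in AUTO_ALLOWED.get(change_class, set())
```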
4.6.4 Align by watermark contract, not “perfect 5-min windows”
Don’t chase perfect synchronization across sources. Use an explicit watermark contract:
- watermark_event_time says “complete up to here”
- allow only bounded skew; otherwise DEGRADE
This makes “data pipeline delay” a designed state, not a surprise.
4.6.5 Implement the driver as an orchestrator (state machine)
Real rollouts run for hours/days. Don’t use a long-running loop.
Use:
- scheduler (cron/workflow engine), or
- event-driven triggers (snapshot arrival, approval granted, alert fired)
Split the “driver” into two layers. This small separation makes the implementation stable.
4.6.5.1 Split into two layers: Orchestrator vs Evaluator
In production, the driver is much more stable if you separate:
(A) Orchestrator (state + re-entry management)
Persist run_id state and manage “when this run can be evaluated again.”
Typical fields:
- current stage
- resume_token (pointer needed to resume)
- last snapshot_id / snapshot_digest
- last decision (ACCEPT/REJECT/DEGRADE + reasons)
- next evaluation condition (e.g., fresh snapshot arrival / approval granted / time elapsed)
(B) Evaluator (the deterministic judge)
This is the pure-ish part: evaluate(policy, metrics_snapshot) -> verdict.
Keep it as close to a pure function as possible.
This separation removes the unrealistic assumptions of a “single long-running process”:
crashes, restarts, duplicated execution, and partial progress become normal—and safe.
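A sketch of the orchestrator’s persisted record (field names are illustrative): everything needed to resume after a crash lives in the state, not in process memory.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RunState:
    run_id: str
    stage: str
    last_verdict: str          # ACCEPT / REJECT / DEGRADE
    last_snapshot_digest: str
    resume_token: str          # "" when nothing is pending
    next_eval_condition: str   # e.g. "on_fresh_snapshot" / "on_approval" / ISO deadline

def save(state: RunState) -> str:
    # In production this row would go to a DB; JSON keeps the sketch stdlib-only.
    return json.dumps(asdict(state), sort_keys=True)

def load(raw: str) -> RunState:
    return RunState(**json.loads(raw))

s = RunState("exp-42:B:2026-02-18", "CANARY_25", "DEGRADE",
             "sha256:abc", "opaque:rsn_7f3a9c", "on_fresh_snapshot")
assert load(save(s)) == s   # state survives a process restart
```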
4.6.5.2 DEGRADE is not “stop”—it’s “stop in a re-enterable way”
To make DEGRADE operable, you must be able to re-evaluate the same stage under the same rules after missing grounds are resolved.
So on DEGRADE, always persist (and log):
- reason_codes / missing
- resume_token (opaque is fine; a DB pointer is often enough)
- requested_actions (machine-readable next steps)
Then the Orchestrator can trigger re-evaluation when:
- a fresh snapshot arrives
- an approval is granted
- a deadline passes (scheduled re-check)
DEGRADE becomes a queue with re-entry—not a meeting.
4.6.6 Make Promotion Gate mandatory for 100% rollout (not only safety gates)
As discussed in 4.6.2, gate_eval answers:
“Is it safe to proceed?”
It does not answer:
“Did the variant actually win (business decision)?”
The side effect of this separation is real:
If you reach ROLLOUT_100 with “safe to proceed” only, you can end up fully rolling out a variant that is slightly worse (statistical noise, but real business loss).
The minimal practical fix is: requirements differ by stage.
4.6.6.1 Canary uses Safety Gate; 100% requires Promotion Gate
- CANARY (10/25/50)
  - Required: Safety Gate (Guardrails / Constraints)
  - Statistics: optional (log as signal)
- ROLLOUT_100
  - Required: Safety Gate (of course)
  - Required: Promotion Gate (winning condition)
Treat Promotion Gate as a separate layer and include “promotion evidence” inside the snapshot, for example:
- sample_size
- effect_estimate
- ci_low / ci_high (confidence interval)
- p_value (if needed)
- sequential_test_state (for sequential tests)
If Promotion Gate is not satisfied, fail safe with DEGRADE:
- precondition_unknown:sample_size_not_reached
- precondition_unknown:significance_not_ready
- precondition_unknown:assignment_not_stable
4.6.6.2 Make “100% promotion conditions” explicit in policy
Operationally, the clearest approach is to express promotion conditions directly in policy:
{
  "promotion_gate": {
    "required_for_stages": ["ROLLOUT_100"],
    "min_sample_size": 100000,
    "require_ci_low_gt": 0.0,
    "on_not_ready": "DEGRADE"
  }
}
Now “Safety Gate passed, but we are not winning, so we do not go to 100%” becomes a deterministic decision—not vibes.
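A sketch of that check, assuming the promotion_gate shape above and promotion evidence carried inside the snapshot (the REJECT reason code is illustrative):

```python
def promotion_verdict(gate: dict, evidence: dict):
    # Not enough evidence yet -> DEGRADE (fail safe, re-check later).
    if evidence.get("sample_size", 0) < gate["min_sample_size"]:
        return ("DEGRADE", "precondition_unknown:sample_size_not_reached")
    ci_low = evidence.get("ci_low")
    if ci_low is None:
        return ("DEGRADE", "precondition_unknown:significance_not_ready")
    # Winning condition: lower CI bound strictly above the policy threshold.
    if ci_low > gate["require_ci_low_gt"]:
        return ("ACCEPT", "")
    return ("REJECT", "promotion_not_won:ci_low_not_above_threshold")

gate = {"min_sample_size": 100000, "require_ci_low_gt": 0.0, "on_not_ready": "DEGRADE"}
v1 = promotion_verdict(gate, {"sample_size": 42000})
# -> ("DEGRADE", "precondition_unknown:sample_size_not_reached")
v2 = promotion_verdict(gate, {"sample_size": 120000, "ci_low": 0.0004})
# -> ("ACCEPT", "")
```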
Tip:
- Canary stages stop on safety.
- 100% promotion stops on winning conditions (Promotion Gate).
This two-layer stop mechanism makes the whole rollout dramatically easier to operate.
5) Closing: three pieces, one operable system
- Goal Surface prevents single-metric Goodhart.
- Deterministic logs make incidents replayable.
- Safe automation makes staged rollout enforceable with REJECT/DEGRADE gates.
Ultimately, turning experiments into operations is not about “smartness.”
It’s about determinism—contracts, evidence, and safe boundaries.