The moment you put an AI agent into a real workflow, two realities show up fast:
- Models and prompts wobble (updates, infra, tools, inputs)
- Most failures are “missing grounds,” not “rule violations.”
In the previous posts, we fixed the division of labor:
- LLM generates a proposal (a plan)
- Verifier deterministically returns ACCEPT / REJECT / DEGRADE, and may normalize the plan
- Executor runs Typed Actions only (dry-run → approval → production)
Now the real question:
What do you “grow” so the system doesn’t collapse in ops?
My answer is simple:
Don’t start by tuning the prompt.
Start by freezing 10 golden cases for the verifier.
0) The premise (re-stated)
LLMs are probabilistic. Output variance is not evil.
What’s evil is executing variance.
Also: LLM-generated “explanations” (including chain-of-thought-like text) are not audit-grade grounds. What you should pin is:
- input schema (what counts as admissible grounds)
- policy_id / policy_version
- deterministic rule-evaluation logs
- evidence / trace IDs
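Pinned together, those fields make a decision log you can audit and aggregate. A sketch of one log entry (field names are illustrative, not from any specific tool):

```json
{
  "trace_id": "tr-example-001",
  "policy_id": "iam-jit-access",
  "policy_version": "2026-01-20",
  "verdict": "DEGRADE",
  "reasons": ["missing_security_approval"],
  "missing": ["approvals.security_approved"],
  "evidence": { "runbook_id": "rbk-prod-db-read", "ticket_id": "T-2026-004512" }
}
```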
1) Why “10 cases” is enough to start
You’re not trying to get coverage.
You’re trying to freeze the handful of patterns that kill you in production:
- Boundaries: deadlines, caps, percentage steps, state transitions
- Forbidden: privilege, SoD violations, legal holds, prohibited fields
- Missing: approvals, identity verification, evidence links, observability
- Ordering: grant without revoke, retry without idempotency, double execution
- Exceptions become the norm: no DEGRADE path → humans keep rescuing via interpretation
A small number of “representative accidents” eliminates most catastrophic behavior. After that, you grow from incidents and DEGRADE logs.
So: 10 is not a magic number.
It’s “the smallest number that forces an operational skeleton.”
2) What a golden case actually freezes
Do not freeze LLM output.
Freeze verifier output.
A golden case is a contract:
- Schema’d input + proposed plan
- Expected verifier outputs:
  - verdict (ACCEPT / REJECT / DEGRADE)
  - reasons (typed reason codes)
  - missing (machine-readable missing list, for DEGRADE)
  - normalized_plan (the executable, verified Typed Actions)
If the model changes, prompts change, tools change—your ops can still survive as long as the verifier continues to:
- stop correctly,
- request missing grounds correctly,
- normalize into safe executable plans.
3) Minimal case format (portable)
YAML is nicer to read, but if you want a portable “single truth,” JSON is the safest. You can author in YAML and convert later.
3.1 One-case JSON example
{
"name": "jit_access_missing_security_approval_degrade",
"input": {
"policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
"access_request": {
"request_id": "AR-2026-00077",
"requester_user_id": "u-1234",
"target_resource": "prod-db:billing",
"requested_role": "db.readonly",
"requested_duration_minutes": 60,
"reason_code": "INCIDENT_RESPONSE",
"incident_id": "INC-88921",
"ticket_id": "T-2026-004512"
},
"approvals": { "manager_approved": true, "security_approved": false },
"context": { "on_call": true, "break_glass": false },
"evidence": { "runbook_id": "rbk-prod-db-read" }
},
"proposed_plan": {
"actions": [
{
"name": "iam.grant_temporary_role",
"params": {
"user_id": "u-1234",
"resource": "prod-db:billing",
"role": "db.readonly",
"duration_minutes": 60
}
},
{
"name": "iam.revoke_role",
"params": {
"user_id": "u-1234",
"resource": "prod-db:billing",
"role": "db.readonly"
}
}
]
},
"expect": {
"verdict": "DEGRADE",
"reasons": ["missing_security_approval"],
"missing": ["approvals.security_approved"],
"normalized_plan": []
}
}
Key point: missing is not prose. It’s a machine-readable list (paths). That’s what makes DEGRADE operable.
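Because entries in missing are dotted paths, a DEGRADE handler can resolve them against the input mechanically instead of parsing prose. A minimal sketch (the helper name is mine, not part of the harness):

```python
from typing import Any, Dict, List


def resolve_missing(inp: Dict[str, Any], missing: List[str]) -> Dict[str, Any]:
    """Map each dotted missing path to its current value in the input (None if absent)."""
    out: Dict[str, Any] = {}
    for path in missing:
        node: Any = inp
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
        out[path] = node
    return out


inp = {"approvals": {"manager_approved": True, "security_approved": False}}
print(resolve_missing(inp, ["approvals.security_approved"]))
# {'approvals.security_approved': False}
```

A handler can then route each unresolved path to the right human or system (e.g. re-request the security approval) without ever interpreting free text.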
4) Which 10 cases to pick (template)
A practical starting distribution:
- ACCEPT: 3
  - minimal happy path
  - boundary-but-OK path
  - happy path where normalization happens (proposal → normalized plan)
- DEGRADE: 4
  - missing approval
  - missing evidence
  - state uncertain
  - observability missing (SLO/metric not available)
- REJECT: 3
  - clear forbidden (privilege/SoD/legal hold)
  - window/time violation
  - prohibited field or process violation
This set forces you to implement:
- stopping safely (DEGRADE),
- denying deterministically (REJECT),
- producing an executable “true plan” (normalized_plan).
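One way to keep that 3/4/3 distribution honest as the suite grows is a small check over the loaded case dicts (the function name is mine, a sketch):

```python
from collections import Counter
from typing import Any, Dict, List


def verdict_distribution(cases: List[Dict[str, Any]]) -> Counter:
    """Count expected verdicts across golden cases, e.g. to flag a suite with no DEGRADE coverage."""
    return Counter(case["expect"]["verdict"] for case in cases)


cases = [
    {"expect": {"verdict": "ACCEPT"}},
    {"expect": {"verdict": "DEGRADE"}},
    {"expect": {"verdict": "DEGRADE"}},
    {"expect": {"verdict": "REJECT"}},
]
print(verdict_distribution(cases))
```

A CI assertion like "at least one case per verdict" catches the common failure mode of a suite that only ever tests the happy path.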
5) Ten starter cases (names + intent)
You can cut/paste these ideas into JSON files and expand your schemas:
JIT Access (3 cases)
1. jit_access_accept_minimal (ACCEPT)
{
"name": "jit_access_accept_minimal",
"input": {
"policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
"access_request": {
"request_id": "AR-1",
"requester_user_id": "u-1234",
"target_resource": "prod-db:billing",
"requested_role": "db.readonly",
"requested_duration_minutes": 60,
"incident_id": "INC-1",
"ticket_id": "T-1"
},
"approvals": { "manager_approved": true, "security_approved": true },
"context": { "break_glass": false }
},
"proposed_plan": {
"actions": [
{
"name": "iam.grant_temporary_role",
"params": {
"user_id": "u-1234",
"resource": "prod-db:billing",
"role": "db.readonly",
"duration_minutes": 60
}
},
{
"name": "iam.revoke_role",
"params": {
"user_id": "u-1234",
"resource": "prod-db:billing",
"role": "db.readonly"
}
}
]
},
"expect": {
"verdict": "ACCEPT",
"reasons": [],
"missing": [],
"normalized_plan": [
{
"name": "iam.grant_temporary_role",
"params": {
"user_id": "u-1234",
"resource": "prod-db:billing",
"role": "db.readonly",
"duration_minutes": 60
}
},
{
"name": "iam.revoke_role",
"params": {
"user_id": "u-1234",
"resource": "prod-db:billing",
"role": "db.readonly"
}
}
]
}
}
2. jit_access_degrade_missing_security_approval (DEGRADE)
{
"name": "jit_access_degrade_missing_security_approval",
"input": {
"policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
"access_request": {
"request_id": "AR-2",
"requester_user_id": "u-1234",
"target_resource": "prod-db:billing",
"requested_role": "db.readonly",
"requested_duration_minutes": 60,
"incident_id": "INC-2",
"ticket_id": "T-2"
},
"approvals": { "manager_approved": true, "security_approved": false },
"context": { "break_glass": false }
},
"proposed_plan": {
"actions": [
{
"name": "iam.grant_temporary_role",
"params": {
"user_id": "u-1234",
"resource": "prod-db:billing",
"role": "db.readonly",
"duration_minutes": 60
}
}
]
},
"expect": {
"verdict": "DEGRADE",
"reasons": ["missing_security_approval"],
"missing": ["approvals.security_approved"],
"normalized_plan": []
}
}
3. jit_access_reject_admin_role_without_break_glass (REJECT)
{
"name": "jit_access_reject_admin_role_without_break_glass",
"input": {
"policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
"access_request": {
"request_id": "AR-3",
"requester_user_id": "u-1234",
"target_resource": "prod-db:billing",
"requested_role": "db.admin",
"requested_duration_minutes": 60,
"incident_id": "INC-3",
"ticket_id": "T-3"
},
"approvals": { "manager_approved": true, "security_approved": true },
"context": { "break_glass": false }
},
"proposed_plan": {
"actions": [
{
"name": "iam.grant_temporary_role",
"params": {
"user_id": "u-1234",
"resource": "prod-db:billing",
"role": "db.admin",
"duration_minutes": 60
}
}
]
},
"expect": {
"verdict": "REJECT",
"reasons": ["admin_role_requires_break_glass"],
"missing": [],
"normalized_plan": []
}
}
Production Change Management (3 cases)
4. change_degrade_missing_rollback_plan (DEGRADE)
{
"name": "change_degrade_missing_rollback_plan",
"input": {
"policy": { "policy_id": "prod-change-policy", "policy_version": "2026-01-10" },
"change_request": {
"change_type": "feature_flag_rollout",
"flag_key": "new_invoice_flow",
"to": { "percent": 10 },
"rollback_plan_id": null
},
"guardrails": {
"canary": { "step_percent": [10, 25, 50, 100] },
"slo_gates": [{ "metric": "error_rate_5m", "op": "<=", "threshold": 0.01 }]
},
"approvals": { "owner_approved": true, "sre_approved": true }
},
"proposed_plan": {
"actions": [
{
"name": "feature_flag.set_percent",
"params": { "flag_key": "new_invoice_flow", "percent": 10 }
}
]
},
"expect": {
"verdict": "DEGRADE",
"reasons": ["missing_rollback_plan"],
"missing": ["change_request.rollback_plan_id"],
"normalized_plan": []
}
}
5. change_reject_no_canary_steps (REJECT)
{
"name": "change_reject_no_canary_steps",
"input": {
"policy": {"policy_id": "prod-change-policy", "policy_version": "2026-01-10"},
"change_request": {
"change_type": "feature_flag_rollout",
"flag_key": "new_invoice_flow",
"rollback_plan_id": "rb-1"
},
"guardrails": {
"canary": {"step_percent": [100]},
"slo_gates": [{"metric": "error_rate_5m", "op": "<=", "threshold": 0.01}]
},
"approvals": {"owner_approved": true, "sre_approved": true}
},
"proposed_plan": {"actions": [{"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 100}}]},
"expect": {
"verdict": "REJECT",
"reasons": ["canary_steps_required"],
"missing": [],
"normalized_plan": []
}
}
6. change_accept_normalize_force_rollback_hook (ACCEPT + normalize)
{
"name": "change_accept_normalize_force_rollback_hook",
"input": {
"policy": {"policy_id": "prod-change-policy", "policy_version": "2026-01-10"},
"change_request": {
"change_type": "feature_flag_rollout",
"flag_key": "new_invoice_flow",
"risk_level": "MEDIUM",
"rollback_plan_id": "rb-2026-0091"
},
"guardrails": {
"canary": {"step_percent": [10, 25, 50, 100], "step_wait_minutes": 15},
"slo_gates": [{"metric": "error_rate_5m", "op": "<=", "threshold": 0.01}],
"rollback": {"auto_rollback_enabled": true}
},
"approvals": {"owner_approved": true, "sre_approved": true}
},
"proposed_plan": {
"actions": [
{"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 10}},
{"name": "slo_gate.check", "params": {"window_minutes": 15}}
]
},
"expect": {
"verdict": "ACCEPT",
"reasons": [],
"missing": [],
"normalized_plan": [
{"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 10}},
{"name": "slo_gate.check", "params": {"window_minutes": 15}},
{"name": "rollback.hook.ensure", "params": {"rollback_plan_id": "rb-2026-0091"}}
]
}
}
Personal Data Erasure (3 cases)
7. erasure_degrade_identity_not_verified (DEGRADE)
{
"name": "erasure_degrade_identity_not_verified",
"input": {
"policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
"erasure_request": {"subject_user_id": "C-1", "identity_verification": {"verified": false}},
"holds": {"legal_hold": false}
},
"proposed_plan": {"actions": [{"name": "privacy.delete", "params": {"system": "crm", "subject_user_id": "C-1"}}]},
"expect": {
"verdict": "DEGRADE",
"reasons": ["identity_verification_required"],
"missing": ["erasure_request.identity_verification.verified"],
"normalized_plan": []
}
}
8. erasure_reject_legal_hold (REJECT)
{
"name": "erasure_reject_legal_hold",
"input": {
"policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
"erasure_request": {"subject_user_id": "C-2", "identity_verification": {"verified": true}},
"holds": {"legal_hold": true}
},
"proposed_plan": {"actions": [{"name": "privacy.delete", "params": {"system": "crm", "subject_user_id": "C-2"}}]},
"expect": {
"verdict": "REJECT",
"reasons": ["legal_hold_blocks_erasure"],
"missing": [],
"normalized_plan": []
}
}
9. erasure_accept_normalize_retention_to_redact (ACCEPT + normalize)
{
"name": "erasure_accept_normalize_retention_to_redact",
"input": {
"policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
"erasure_request": {"subject_user_id": "C-3", "identity_verification": {"verified": true}},
"holds": {"legal_hold": false, "accounting_retention_required": true}
},
"proposed_plan": {
"actions": [
{"name": "privacy.delete", "params": {"system": "billing", "subject_user_id": "C-3"}},
{"name": "privacy.tombstone.write", "params": {"subject_user_id": "C-3"}}
]
},
"expect": {
"verdict": "ACCEPT",
"reasons": [],
"missing": [],
"normalized_plan": [
{"name": "privacy.redact", "params": {"system": "billing", "subject_user_id": "C-3", "mode": "accounting_retention"}},
{"name": "privacy.tombstone.write", "params": {"subject_user_id": "C-3"}}
]
}
}
Underwriting packet (1 case)
10. uw_degrade_missing_employment_proof (DEGRADE)
(This is explicitly about preventing the LLM from becoming the decision-maker.)
{
"name": "uw_degrade_missing_employment_proof",
"input": {
"policy": {"policy_id": "credit-underwriting-policy", "policy_version": "2026-01-01"},
"application": {"application_id": "APP-1", "requested_amount_jpy": 500000},
"documents": {"identity_verified": true, "income_proof": {"provided": true}, "employment_proof": {"provided": false}},
"fairness_controls": {"prohibited_fields_present": false}
},
"proposed_plan": {"actions": [{"name": "uw.emit_decision", "params": {"decision": "APPROVE", "reason": "looks good"}}]},
"expect": {
"verdict": "DEGRADE",
"reasons": ["missing_required_documents"],
"missing": ["documents.employment_proof.provided"],
"normalized_plan": [
{"name": "uw.request_more_documents", "params": {"missing": ["employment_proof"]}}
]
}
}
The point is not “these domains.”
The point is the shape of failures: missing/forbidden/boundary/normalization.
6) A minimal golden harness (stdlib-only)
This is the part that makes everything real: run the cases in CI.
6.1 Directory layout
golden/
  cases/
    01_jit_access_accept.json
    02_jit_access_degrade_missing_approval.json
    ...
    10_uw_degrade_missing_docs.json
  run_golden.py
  verifier_stub.py  # replace with your real verifier
6.2 Harness (run_golden.py)
(Python 3.10+; stdlib only.)
from __future__ import annotations

import json
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Tuple

from verifier_stub import verify  # replace with your verifier


@dataclass(frozen=True)
class Expect:
    verdict: str
    reasons: Tuple[str, ...]
    missing: Tuple[str, ...]
    normalized_plan: Tuple[Dict[str, Any], ...]


def load_case(path: Path) -> Dict[str, Any]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def normalize_plan(plan: Any) -> Tuple[Dict[str, Any], ...]:
    if plan is None:
        return ()
    if not isinstance(plan, list):
        raise TypeError(f"normalized_plan must be list[dict], got {type(plan)}")
    out: List[Dict[str, Any]] = []
    for i, a in enumerate(plan):
        if not isinstance(a, dict):
            raise TypeError(f"normalized_plan[{i}] must be dict, got {type(a)}")
        out.append(a)
    return tuple(out)


def to_expect(d: Dict[str, Any]) -> Expect:
    return Expect(
        verdict=str(d.get("verdict")),
        reasons=tuple(d.get("reasons", [])),
        missing=tuple(d.get("missing", [])),
        normalized_plan=normalize_plan(d.get("normalized_plan", [])),
    )


def diff(a: Any, b: Any) -> str:
    ja = json.dumps(a, ensure_ascii=False, sort_keys=True, indent=2)
    jb = json.dumps(b, ensure_ascii=False, sort_keys=True, indent=2)
    return f"--- expected\n{ja}\n--- actual\n{jb}"


def main() -> int:
    cases_dir = Path(__file__).parent / "cases"
    paths = sorted(cases_dir.glob("*.json"))
    if not paths:
        print("No golden cases found.", file=sys.stderr)
        return 2
    failed: List[str] = []
    for p in paths:
        case = load_case(p)
        name = case.get("name", p.name)
        inp = case["input"]
        proposed = case["proposed_plan"]
        exp = to_expect(case["expect"])
        actual = verify(inp, proposed)
        verdict = actual.get("verdict")
        if not isinstance(verdict, str) or not verdict:
            raise KeyError(f"verify() must return non-empty 'verdict' for case={name}")
        act = Expect(
            verdict=verdict,
            reasons=tuple(actual.get("reasons", [])),
            missing=tuple(actual.get("missing", [])),
            normalized_plan=normalize_plan(actual.get("normalized_plan", [])),
        )
        if exp != act:
            failed.append(name)
            print(f"\n[FAIL] {name}")
            print(diff(exp.__dict__, act.__dict__))
    if failed:
        print(f"\nFAILED {len(failed)}/{len(paths)} cases: {', '.join(failed)}", file=sys.stderr)
        return 1
    print(f"OK {len(paths)}/{len(paths)} cases")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
6.3 Verifier “plug-in” shape (verifier_stub.py)
from __future__ import annotations

from typing import Any, Dict


def verify(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    """
    Replace this with your real verifier.

    Required return shape:
        verdict: "ACCEPT" | "REJECT" | "DEGRADE"
        reasons: [reason_code...]
        missing: [path...]
        normalized_plan: [typed_action...]
    """
    return {
        "verdict": "DEGRADE",
        "reasons": ["stub"],
        "missing": ["replace_with_real_verifier"],
        "normalized_plan": [],
    }
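To make the shape concrete, here is what one real rule from the JIT access cases could look like in place of the stub. This is a sketch of a single rule under the schemas in section 5, not a complete policy:

```python
from typing import Any, Dict


def verify(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    """Sketch: a single JIT-access rule. A real verifier evaluates the full policy deterministically."""
    approvals = inp.get("approvals", {})
    if not approvals.get("security_approved", False):
        # Missing ground, not a rule violation: degrade and name the path.
        return {
            "verdict": "DEGRADE",
            "reasons": ["missing_security_approval"],
            "missing": ["approvals.security_approved"],
            "normalized_plan": [],
        }
    # All checked rules passed: the proposal becomes the normalized plan.
    return {
        "verdict": "ACCEPT",
        "reasons": [],
        "missing": [],
        "normalized_plan": list(proposed_plan.get("actions", [])),
    }
```

Each additional golden case forces one more rule (or one more normalization step) into this function, which is exactly the growth loop described below in section 8.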
7) Run it in CI (minimal GitHub Actions)
name: golden
on: [push, pull_request]

jobs:
  golden:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python golden/run_golden.py
Now PRs will fail if:
- you accidentally changed DEGRADE to ACCEPT,
- normalization disappeared,
- reason codes drifted without a deliberate update,
- missing lists broke.
That’s how “agent ops” becomes engineering—not vibes.
8) How to grow the verifier (the operational loop)
Golden cases are not write-once artifacts. Grow them like this:
- Aggregate DEGRADE logs (top missing paths / top reason codes)
- If a missing pattern repeats, add one golden case
- When a golden case breaks:
- bug (unintended drift) → fix verifier
- policy change (intended) → update case and record the change reason
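The aggregation step can be as small as a Counter over decision-log records. A sketch, assuming each log record carries the machine-readable missing list the verifier emitted:

```python
from collections import Counter
from typing import Any, Dict, List


def top_missing(degrade_logs: List[Dict[str, Any]], n: int = 5) -> List[tuple]:
    """Rank the most frequent missing paths across DEGRADE decision logs."""
    counts: Counter = Counter()
    for rec in degrade_logs:
        counts.update(rec.get("missing", []))
    return counts.most_common(n)


logs = [
    {"verdict": "DEGRADE", "missing": ["approvals.security_approved"]},
    {"verdict": "DEGRADE", "missing": ["approvals.security_approved"]},
    {"verdict": "DEGRADE", "missing": ["change_request.rollback_plan_id"]},
]
print(top_missing(logs))
# [('approvals.security_approved', 2), ('change_request.rollback_plan_id', 1)]
```

The top entry of that ranking is a direct candidate for the next golden case; this only works because missing is typed paths, not prose.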
Only after this skeleton exists does “prompt improvement” become meaningful.
Otherwise, prompt tuning becomes untestable, vibe-driven optimization.
9) Common traps
- Trying to freeze LLM output: freeze verifier outputs instead. LLM proposals can wobble.
- Trying to operate with REJECT only: real ops fails on missing grounds. Without DEGRADE, humans rescue everything forever.
- Reason codes as free text: if you can’t aggregate it, you can’t SLO it.
- No normalization: if the verifier can’t emit the executable “true plan,” ops gets pulled by whatever the LLM proposed.
Summary
- Grow the verifier, not the prompt.
- Freeze verifier outputs: verdict / reasons / missing / normalized_plan
- Start with 10 golden cases that represent the ways you die in ops
- Run them in CI
- Grow from DEGRADE logs + incidents, not from vibes
If you treat an LLM as a proposer, you can even ask it to suggest new golden cases.
But first, build the harness and lock the initial ten.