kanaria007

Grow the Verifier, Not the Prompt: Run Production with 10 Golden Cases

The moment you put an AI agent into a real workflow, two realities show up fast:

  1. Models and prompts wobble (updates, infra, tools, inputs)
  2. Most failures are “missing grounds,” not “rule violations.”

In the previous posts, we fixed the division of labor:

  • LLM generates a proposal (a plan)
  • Verifier deterministically returns ACCEPT / REJECT / DEGRADE, and may normalize the plan
  • Executor runs Typed Actions only (dry-run → approval → production)
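
As a rough sketch of that division of labor, the wiring could look like the following. The names `run_pipeline`, `propose`, `verify`, and `execute` are hypothetical placeholders, and the choice to execute only on ACCEPT is an assumption made for illustration.

```python
from typing import Any, Callable, Dict, List

Plan = Dict[str, Any]
Result = Dict[str, Any]


def run_pipeline(
    inp: Dict[str, Any],
    propose: Callable[[Dict[str, Any]], Plan],        # hypothetical LLM wrapper
    verify: Callable[[Dict[str, Any], Plan], Result],  # deterministic verifier
    execute: Callable[[List[Dict[str, Any]]], None],   # hypothetical executor
) -> Result:
    """Route a proposal through the verifier before anything is executed."""
    proposal = propose(inp)          # probabilistic: may differ from run to run
    result = verify(inp, proposal)   # deterministic: same input, same verdict
    if result["verdict"] == "ACCEPT":
        # Assumption: only ACCEPT triggers execution, and only the
        # verifier's normalized_plan is run, never the raw proposal.
        execute(result["normalized_plan"])
    return result
```

The executor never sees the LLM's proposal directly, so variance in the proposal cannot become variance in what runs.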

Now the real question:

What do you “grow” so the system doesn’t collapse in ops?

My answer is simple:

Don’t start by tuning the prompt.
Start by freezing 10 golden cases for the verifier.


0) The premise (re-stated)

LLMs are probabilistic. Output variance is not evil.

What’s evil is executing variance.

Also: LLM-generated “explanations” (including chain-of-thought-like text) are not audit-grade grounds. What you should pin is:

  • input schema (what counts as admissible grounds)
  • policy_id / policy_version
  • deterministic rule-evaluation logs
  • evidence / trace IDs
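
One way to pin those items is a small immutable record written next to every verdict. This is only a sketch: `AuditRecord`, `to_audit_line`, and the `input_schema_version` and `trace_id` field names are assumptions for illustration, not part of any existing library.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class AuditRecord:
    policy_id: str
    policy_version: str
    input_schema_version: str      # hypothetical: which input schema was admitted
    rule_log: Tuple[str, ...]      # deterministic rule-evaluation entries, in order
    evidence_ids: Tuple[str, ...]  # links to the grounds that were checked
    trace_id: str                  # hypothetical correlation ID


def to_audit_line(record: AuditRecord) -> str:
    """Serialize with sorted keys so identical records give identical lines."""
    return json.dumps(asdict(record), sort_keys=True)
```

Nothing the LLM wrote as an "explanation" appears in the record; it only holds what the verifier evaluated.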

1) Why “10 cases” is enough to start

You’re not trying to get coverage.

You’re trying to freeze the handful of patterns that kill you in production:

  • Boundaries: deadlines, caps, percentage steps, state transitions
  • Forbidden: privilege, SoD violations, legal holds, prohibited fields
  • Missing: approvals, identity verification, evidence links, observability
  • Ordering: no revoke, retry without idempotency, double execution
  • Exceptions become normal: with no DEGRADE path, humans keep rescuing requests by ad hoc interpretation

A small number of “representative accidents” eliminates most catastrophic behavior. After that, you grow from incidents and DEGRADE logs.

So: 10 is not a magic number.
It’s “the smallest number that forces an operational skeleton.”


2) What a golden case actually freezes

Do not freeze LLM output.
Freeze verifier output.

A golden case is a contract:

  • Schema’d input + proposed plan
  • Expected verifier outputs:

    • verdict (ACCEPT / REJECT / DEGRADE)
    • reasons (typed reason codes)
    • missing (machine-readable missing list, for DEGRADE)
    • normalized_plan (the executable, verified Typed Actions)

If the model changes, prompts change, tools change—your ops can still survive as long as the verifier continues to:

  • stop correctly,
  • request missing grounds correctly,
  • normalize into safe executable plans.
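
Because the contract is the verifier's output, it can help to check the shape of that output before comparing values. The sketch below is hypothetical (`check_result_shape` is not from the harness in this post), and the rule that DEGRADE must name at least one missing path is an assumption drawn from the example cases.

```python
from typing import Any, Dict, List

VERDICTS = ("ACCEPT", "REJECT", "DEGRADE")


def check_result_shape(result: Dict[str, Any]) -> List[str]:
    """Return a list of shape problems; an empty list means the shape is OK."""
    problems: List[str] = []
    if result.get("verdict") not in VERDICTS:
        problems.append(f"verdict must be one of {VERDICTS}")
    for key in ("reasons", "missing"):
        value = result.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{key} must be a list of strings")
    plan = result.get("normalized_plan")
    if not isinstance(plan, list) or not all(isinstance(a, dict) for a in plan):
        problems.append("normalized_plan must be a list of action objects")
    # Assumption: a DEGRADE that names nothing missing is not operable.
    if result.get("verdict") == "DEGRADE" and not result.get("missing"):
        problems.append("DEGRADE should name at least one missing path")
    return problems
```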

3) Minimal case format (portable)

YAML is nicer to read, but if you want a portable single source of truth, JSON is the safest choice. You can author in YAML and convert later.

3.1 One-case JSON example

{
  "name": "jit_access_missing_security_approval_degrade",
  "input": {
    "policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
    "access_request": {
      "request_id": "AR-2026-00077",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.readonly",
      "requested_duration_minutes": 60,
      "reason_code": "INCIDENT_RESPONSE",
      "incident_id": "INC-88921",
      "ticket_id": "T-2026-004512"
    },
    "approvals": { "manager_approved": true, "security_approved": false },
    "context": { "on_call": true, "break_glass": false },
    "evidence": { "runbook_id": "rbk-prod-db-read" }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      },
      {
        "name": "iam.revoke_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly"
        }
      }
    ]
  },
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_security_approval"],
    "missing": ["approvals.security_approved"],
    "normalized_plan": []
  }
}

Key point: missing is not prose. It’s a machine-readable list (paths). That’s what makes DEGRADE operable.
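
Because the entries are paths, a tool can resolve them against the input and report which grounds are still outstanding. A minimal sketch, assuming dotted paths and assuming that an absent key, `null`, or `false` all count as "not yet satisfied" (that convention is inferred from the example cases, not stated by any spec); `resolve_path` and `still_missing` are hypothetical helper names.

```python
from typing import Any, Dict, Iterable, List

_ABSENT = object()  # sentinel: the path does not exist in the input


def resolve_path(inp: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'approvals.security_approved'."""
    node: Any = inp
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _ABSENT
        node = node[key]
    return node


def still_missing(inp: Dict[str, Any], missing: Iterable[str]) -> List[str]:
    """Return the paths that are absent, None, or False in the input."""
    out: List[str] = []
    for path in missing:
        value = resolve_path(inp, path)
        # Identity checks on purpose: 0 or "" must not count as missing.
        if value is _ABSENT or value is None or value is False:
            out.append(path)
    return out
```

An approval UI or ticket bot could call `still_missing` again after a human supplies grounds, and resubmit only when the list is empty.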


4) Which 10 cases to pick (template)

A practical starting distribution:

  • ACCEPT: 3

    • minimal happy path
    • boundary-but-OK path
    • happy path where normalization happens (proposal → normalized plan)
  • DEGRADE: 4

    • missing approval
    • missing evidence
    • state uncertain
    • observability missing (SLO/metric not available)
  • REJECT: 3

    • clear forbidden (privilege/SoD/legal hold)
    • window/time violation
    • prohibited field or process violation

This set forces you to implement:

  • stopping safely (DEGRADE),
  • denying deterministically (REJECT),
  • producing an executable “true plan” (normalized_plan).
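
To keep an eye on that distribution as the set grows, a small check can count expected verdicts across the case files. This is a sketch with hypothetical helper names (`verdict_distribution`, `distribution_gaps`); it assumes each case is a dict with an `expect.verdict` field, as in the JSON example above.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def verdict_distribution(cases: Sequence[Dict[str, Any]]) -> Counter:
    """Count how many cases expect each verdict."""
    return Counter(case["expect"]["verdict"] for case in cases)


def distribution_gaps(cases: Sequence[Dict[str, Any]], minimum: int = 1) -> List[str]:
    """List verdicts that have fewer than `minimum` golden cases."""
    counts = verdict_distribution(cases)
    return [v for v in ("ACCEPT", "DEGRADE", "REJECT") if counts[v] < minimum]
```

A set with no DEGRADE case at all would show up here as `["DEGRADE"]`, which is usually a sign that the "stop safely" path is untested.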

5) Ten starter cases (names + intent)

You can cut/paste these ideas into JSON files and expand your schemas:

JIT Access (3 cases)

1. jit_access_accept_minimal (ACCEPT)

{
  "name": "jit_access_accept_minimal",
  "input": {
    "policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
    "access_request": {
      "request_id": "AR-1",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.readonly",
      "requested_duration_minutes": 60,
      "incident_id": "INC-1",
      "ticket_id": "T-1"
    },
    "approvals": { "manager_approved": true, "security_approved": true },
    "context": { "break_glass": false }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      },
      {
        "name": "iam.revoke_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly"
        }
      }
    ]
  },
  "expect": {
    "verdict": "ACCEPT",
    "reasons": [],
    "missing": [],
    "normalized_plan": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      },
      {
        "name": "iam.revoke_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly"
        }
      }
    ]
  }
}

2. jit_access_degrade_missing_security_approval (DEGRADE)

{
  "name": "jit_access_degrade_missing_security_approval",
  "input": {
    "policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
    "access_request": {
      "request_id": "AR-2",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.readonly",
      "requested_duration_minutes": 60,
      "incident_id": "INC-2",
      "ticket_id": "T-2"
    },
    "approvals": { "manager_approved": true, "security_approved": false },
    "context": { "break_glass": false }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      }
    ]
  },
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_security_approval"],
    "missing": ["approvals.security_approved"],
    "normalized_plan": []
  }
}

3. jit_access_reject_admin_role_without_break_glass (REJECT)

{
  "name": "jit_access_reject_admin_role_without_break_glass",
  "input": {
    "policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
    "access_request": {
      "request_id": "AR-3",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.admin",
      "requested_duration_minutes": 60,
      "incident_id": "INC-3",
      "ticket_id": "T-3"
    },
    "approvals": { "manager_approved": true, "security_approved": true },
    "context": { "break_glass": false }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.admin",
          "duration_minutes": 60
        }
      }
    ]
  },
  "expect": {
    "verdict": "REJECT",
    "reasons": ["admin_role_requires_break_glass"],
    "missing": [],
    "normalized_plan": []
  }
}

Production Change Management (3 cases)

4. change_degrade_missing_rollback_plan (DEGRADE)

{
  "name": "change_degrade_missing_rollback_plan",
  "input": {
    "policy": { "policy_id": "prod-change-policy", "policy_version": "2026-01-10" },
    "change_request": {
      "change_type": "feature_flag_rollout",
      "flag_key": "new_invoice_flow",
      "to": { "percent": 10 },
      "rollback_plan_id": null
    },
    "guardrails": {
      "canary": { "step_percent": [10, 25, 50, 100] },
      "slo_gates": [{ "metric": "error_rate_5m", "op": "<=", "threshold": 0.01 }]
    },
    "approvals": { "owner_approved": true, "sre_approved": true }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "feature_flag.set_percent",
        "params": { "flag_key": "new_invoice_flow", "percent": 10 }
      }
    ]
  },
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_rollback_plan"],
    "missing": ["change_request.rollback_plan_id"],
    "normalized_plan": []
  }
}

5. change_reject_no_canary_steps (REJECT)

{
  "name": "change_reject_no_canary_steps",
  "input": {
    "policy": {"policy_id": "prod-change-policy", "policy_version": "2026-01-10"},
    "change_request": {
      "change_type": "feature_flag_rollout",
      "flag_key": "new_invoice_flow",
      "rollback_plan_id": "rb-1"
    },
    "guardrails": {
      "canary": {"step_percent": [100]},
      "slo_gates": [{"metric": "error_rate_5m", "op": "<=", "threshold": 0.01}]
    },
    "approvals": {"owner_approved": true, "sre_approved": true}
  },
  "proposed_plan": {"actions": [{"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 100}}]},
  "expect": {
    "verdict": "REJECT",
    "reasons": ["canary_steps_required"],
    "missing": [],
    "normalized_plan": []
  }
}

6. change_accept_normalize_force_rollback_hook (ACCEPT + normalize)

{
  "name": "change_accept_normalize_force_rollback_hook",
  "input": {
    "policy": {"policy_id": "prod-change-policy", "policy_version": "2026-01-10"},
    "change_request": {
      "change_type": "feature_flag_rollout",
      "flag_key": "new_invoice_flow",
      "risk_level": "MEDIUM",
      "rollback_plan_id": "rb-2026-0091"
    },
    "guardrails": {
      "canary": {"step_percent": [10, 25, 50, 100], "step_wait_minutes": 15},
      "slo_gates": [{"metric": "error_rate_5m", "op": "<=", "threshold": 0.01}],
      "rollback": {"auto_rollback_enabled": true}
    },
    "approvals": {"owner_approved": true, "sre_approved": true}
  },
  "proposed_plan": {
    "actions": [
      {"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 10}},
      {"name": "slo_gate.check", "params": {"window_minutes": 15}}
    ]
  },
  "expect": {
    "verdict": "ACCEPT",
    "reasons": [],
    "missing": [],
    "normalized_plan": [
      {"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 10}},
      {"name": "slo_gate.check", "params": {"window_minutes": 15}},
      {"name": "rollback.hook.ensure", "params": {"rollback_plan_id": "rb-2026-0091"}}
    ]
  }
}

Personal Data Erasure (3 cases)

7. erasure_degrade_identity_not_verified (DEGRADE)

{
  "name": "erasure_degrade_identity_not_verified",
  "input": {
    "policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
    "erasure_request": {"subject_user_id": "C-1", "identity_verification": {"verified": false}},
    "holds": {"legal_hold": false}
  },
  "proposed_plan": {"actions": [{"name": "privacy.delete", "params": {"system": "crm", "subject_user_id": "C-1"}}]},
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["identity_verification_required"],
    "missing": ["erasure_request.identity_verification.verified"],
    "normalized_plan": []
  }
}

8. erasure_reject_legal_hold (REJECT)

{
  "name": "erasure_reject_legal_hold",
  "input": {
    "policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
    "erasure_request": {"subject_user_id": "C-2", "identity_verification": {"verified": true}},
    "holds": {"legal_hold": true}
  },
  "proposed_plan": {"actions": [{"name": "privacy.delete", "params": {"system": "crm", "subject_user_id": "C-2"}}]},
  "expect": {
    "verdict": "REJECT",
    "reasons": ["legal_hold_blocks_erasure"],
    "missing": [],
    "normalized_plan": []
  }
}

9. erasure_accept_normalize_retention_to_redact (ACCEPT + normalize)

{
  "name": "erasure_accept_normalize_retention_to_redact",
  "input": {
    "policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
    "erasure_request": {"subject_user_id": "C-3", "identity_verification": {"verified": true}},
    "holds": {"legal_hold": false, "accounting_retention_required": true}
  },
  "proposed_plan": {
    "actions": [
      {"name": "privacy.delete", "params": {"system": "billing", "subject_user_id": "C-3"}},
      {"name": "privacy.tombstone.write", "params": {"subject_user_id": "C-3"}}
    ]
  },
  "expect": {
    "verdict": "ACCEPT",
    "reasons": [],
    "missing": [],
    "normalized_plan": [
      {"name": "privacy.redact", "params": {"system": "billing", "subject_user_id": "C-3", "mode": "accounting_retention"}},
      {"name": "privacy.tombstone.write", "params": {"subject_user_id": "C-3"}}
    ]
  }
}

Underwriting packet (1 case)

10. uw_degrade_missing_employment_proof (DEGRADE)

(This is explicitly about preventing the LLM from becoming the decision-maker.)

{
  "name": "uw_degrade_missing_employment_proof",
  "input": {
    "policy": {"policy_id": "credit-underwriting-policy", "policy_version": "2026-01-01"},
    "application": {"application_id": "APP-1", "requested_amount_jpy": 500000},
    "documents": {"identity_verified": true, "income_proof": {"provided": true}, "employment_proof": {"provided": false}},
    "fairness_controls": {"prohibited_fields_present": false}
  },
  "proposed_plan": {"actions": [{"name": "uw.emit_decision", "params": {"decision": "APPROVE", "reason": "looks good"}}]},
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_required_documents"],
    "missing": ["documents.employment_proof.provided"],
    "normalized_plan": [
      {"name": "uw.request_more_documents", "params": {"missing": ["employment_proof"]}}
    ]
  }
}

The point is not “these domains.”
The point is the shape of failures: missing/forbidden/boundary/normalization.
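
The normalization shape is the least familiar of the four, so here is a sketch of what the rule behind case 9 might look like. `normalize_erasure_plan` is a hypothetical function, and applying the rewrite to every `privacy.delete` action (rather than only to specific systems) is an assumption made to keep the example short.

```python
from typing import Any, Dict, List


def normalize_erasure_plan(
    inp: Dict[str, Any], proposed_plan: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Rewrite delete into redact when accounting retention applies."""
    retention = inp.get("holds", {}).get("accounting_retention_required", False)
    normalized: List[Dict[str, Any]] = []
    for action in proposed_plan.get("actions", []):
        params = dict(action.get("params", {}))  # copy; never mutate the proposal
        if retention and action["name"] == "privacy.delete":
            # Assumption: retained records are redacted instead of deleted.
            params["mode"] = "accounting_retention"
            normalized.append({"name": "privacy.redact", "params": params})
        else:
            normalized.append({"name": action["name"], "params": params})
    return normalized
```

The LLM asked for a delete; the verifier emits the plan that policy actually allows. The golden case freezes that rewritten plan, not the request.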


6) A minimal golden harness (stdlib-only)

This is the part that makes everything real: run the cases in CI.

6.1 Directory layout

golden/
  cases/
    01_jit_access_accept.json
    02_jit_access_degrade_missing_approval.json
    ...
    10_uw_degrade_missing_docs.json
  run_golden.py
  verifier_stub.py   # replace with your real verifier

6.2 Harness (run_golden.py)

(Python 3.10+; stdlib only.)

from __future__ import annotations

import json
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Tuple

from verifier_stub import verify  # replace with your verifier


@dataclass(frozen=True)
class Expect:
    verdict: str
    reasons: Tuple[str, ...]
    missing: Tuple[str, ...]
    normalized_plan: Tuple[Dict[str, Any], ...]


def load_case(path: Path) -> Dict[str, Any]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def normalize_plan(plan: Any) -> Tuple[Dict[str, Any], ...]:
    """Coerce a plan (list of action dicts, or None) into a tuple for comparison.

    This only validates shape; it does not apply any policy normalization.
    """
    if plan is None:
        return ()
    if not isinstance(plan, list):
        raise TypeError(f"normalized_plan must be list[dict], got {type(plan)}")

    out: List[Dict[str, Any]] = []
    for i, a in enumerate(plan):
        if not isinstance(a, dict):
            raise TypeError(f"normalized_plan[{i}] must be dict, got {type(a)}")
        out.append(a)
    return tuple(out)


def to_expect(d: Dict[str, Any]) -> Expect:
    return Expect(
        verdict=str(d.get("verdict")),
        reasons=tuple(d.get("reasons", [])),
        missing=tuple(d.get("missing", [])),
        normalized_plan=normalize_plan(d.get("normalized_plan", [])),
    )


def diff(a: Any, b: Any) -> str:
    ja = json.dumps(a, ensure_ascii=False, sort_keys=True, indent=2)
    jb = json.dumps(b, ensure_ascii=False, sort_keys=True, indent=2)
    return f"--- expected\n{ja}\n--- actual\n{jb}"


def main() -> int:
    cases_dir = Path(__file__).parent / "cases"
    paths = sorted(cases_dir.glob("*.json"))
    if not paths:
        print("No golden cases found.", file=sys.stderr)
        return 2

    failed: List[str] = []

    for p in paths:
        case = load_case(p)
        name = case.get("name", p.name)
        inp = case["input"]
        proposed = case["proposed_plan"]
        exp = to_expect(case["expect"])

        actual = verify(inp, proposed)
        verdict = actual.get("verdict")
        # A missing or non-string verdict is a malformed result, so raise ValueError.
        if not isinstance(verdict, str) or not verdict:
            raise ValueError(f"verify() must return non-empty 'verdict' for case={name}")

        act = Expect(
            verdict=verdict,
            reasons=tuple(actual.get("reasons", [])),
            missing=tuple(actual.get("missing", [])),
            normalized_plan=normalize_plan(actual.get("normalized_plan", [])),
        )

        if exp != act:
            failed.append(name)
            print(f"\n[FAIL] {name}")
            print(diff(exp.__dict__, act.__dict__))

    if failed:
        print(f"\nFAILED {len(failed)}/{len(paths)} cases: {', '.join(failed)}", file=sys.stderr)
        return 1

    print(f"OK {len(paths)}/{len(paths)} cases")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

6.3 Verifier “plug-in” shape (verifier_stub.py)

from __future__ import annotations

from typing import Any, Dict


def verify(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    """
    Replace this with your real verifier.

    Required return shape:
      verdict: "ACCEPT" | "REJECT" | "DEGRADE"
      reasons: [reason_code...]
      missing: [path...]
      normalized_plan: [typed_action...]
    """
    return {
        "verdict": "DEGRADE",
        "reasons": ["stub"],
        "missing": ["replace_with_real_verifier"],
        "normalized_plan": [],
    }
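
To show what replacing the stub could look like, here is a minimal sketch that covers only the three JIT access cases (1 to 3). `verify_jit_access` is a hypothetical name, the hard-coded `db.admin` check and the rule ordering (forbidden before missing) are assumptions, and a real verifier would also need rules the three cases do not exercise, such as manager approval or a required revoke step.

```python
from typing import Any, Dict, Sequence


def _result(
    verdict: str,
    reasons: Sequence[str] = (),
    missing: Sequence[str] = (),
    plan: Sequence[Dict[str, Any]] = (),
) -> Dict[str, Any]:
    """Build the return shape the harness expects."""
    return {
        "verdict": verdict,
        "reasons": list(reasons),
        "missing": list(missing),
        "normalized_plan": list(plan),
    }


def verify_jit_access(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    approvals = inp.get("approvals", {})
    context = inp.get("context", {})
    role = inp.get("access_request", {}).get("requested_role")

    # Forbidden: checked first, so no amount of approval can unlock it.
    if role == "db.admin" and not context.get("break_glass", False):
        return _result("REJECT", reasons=["admin_role_requires_break_glass"])

    # Missing grounds: stop safely and say exactly what is missing.
    if not approvals.get("security_approved", False):
        return _result(
            "DEGRADE",
            reasons=["missing_security_approval"],
            missing=["approvals.security_approved"],
        )

    # Nothing to rewrite in these cases: pass the proposed actions through.
    return _result("ACCEPT", plan=proposed_plan.get("actions", []))
```

With cases 1 to 3 in `golden/cases/` and `verify` pointed at this function, those three should pass while the other seven fail, which is a reasonable first checkpoint.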

7) Run it in CI (minimal GitHub Actions)

name: golden
on: [push, pull_request]
jobs:
  golden:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python golden/run_golden.py

Now PRs will fail if:

  • you accidentally changed DEGRADE to ACCEPT,
  • normalization disappeared,
  • reason codes drifted without a deliberate update,
  • missing lists broke.

That’s how “agent ops” becomes engineering—not vibes.


8) How to grow the verifier (the operational loop)

Golden cases are not write-once artifacts. Grow them like this:

  1. Aggregate DEGRADE logs (top missing paths / top reason codes)
  2. If a missing pattern repeats, add one golden case
  3. When a golden case breaks:

    • bug (unintended drift) → fix verifier
    • policy change (intended) → update case and record the change reason
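
Step 1 of the loop can be sketched in a few lines. This assumes the DEGRADE logs are available as records in the verifier's own output shape (`verdict`, `reasons`, `missing`); how they are stored and loaded is left out, and `top_degrade_patterns` is a hypothetical helper name.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_degrade_patterns(
    records: Iterable[Dict[str, Any]], n: int = 5
) -> Dict[str, List[Tuple[str, int]]]:
    """Count the most frequent missing paths and reason codes among DEGRADEs."""
    missing: Counter = Counter()
    reasons: Counter = Counter()
    for rec in records:
        if rec.get("verdict") != "DEGRADE":
            continue  # only DEGRADE records carry "missing grounds"
        missing.update(rec.get("missing", []))
        reasons.update(rec.get("reasons", []))
    return {"missing": missing.most_common(n), "reasons": reasons.most_common(n)}
```

A path that keeps appearing at the top of `missing` without a matching golden case is a candidate for step 2.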

Only after this skeleton exists does “prompt improvement” become meaningful.
Otherwise, prompt tuning becomes untestable, vibe-driven optimization.


9) Common traps

  • Trying to freeze LLM output: freeze verifier outputs instead. LLM proposals can wobble.
  • Trying to operate with REJECT only: real ops fails on missing grounds. Without DEGRADE, humans rescue everything forever.
  • Reason codes as free text: if you can’t aggregate it, you can’t SLO it.
  • No normalization: if the verifier can’t emit the executable “true plan,” ops gets pulled by whatever the LLM proposed.

Summary

  • Grow the verifier, not the prompt.
  • Freeze verifier outputs: verdict / reasons / missing / normalized_plan
  • Start with 10 golden cases that represent the ways you die in ops
  • Run them in CI
  • Grow from DEGRADE logs + incidents, not from vibes

If you treat an LLM as a proposer, you can even ask it to suggest new golden cases.
But first, build the harness and lock the initial ten.
