kanaria007

Posted on Apr 6

Grow the Verifier, Not the Prompt: Run Production with 10 Golden Cases

#ai #testing #sre #llm

The moment you put an AI agent into a real workflow, two realities show up fast:

Models and prompts wobble (updates, infra, tools, inputs)
Most failures are “missing grounds,” not “rule violations.”

In the previous posts, we fixed the division of labor:

LLM generates a proposal (a plan)
Verifier deterministically returns ACCEPT / REJECT / DEGRADE, and may normalize the plan
Executor runs Typed Actions only (dry-run → approval → production)

Now the real question:

What do you “grow” so the system doesn’t collapse in ops?

My answer is simple:

Don’t start by tuning the prompt.
Start by freezing 10 golden cases for the verifier.

0) The premise (re-stated)

LLMs are probabilistic. Output variance is not evil.

What’s evil is executing variance.

Also: LLM-generated “explanations” (including chain-of-thought-like text) are not audit-grade grounds. What you should pin is:

input schema (what counts as admissible grounds)
policy_id / policy_version
deterministic rule-evaluation logs
evidence / trace IDs
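One way to make those pinned items concrete is a small immutable record written once per verifier decision. This is only a sketch: `AuditRecord`, `canonical_hash`, `make_audit_record`, and the `trace_id` argument are hypothetical names, and hashing the input is an assumption about how you might pin it, not something the earlier posts prescribe.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict, Sequence, Tuple


@dataclass(frozen=True)
class AuditRecord:
    """What gets pinned for one verifier decision (illustrative field names)."""

    policy_id: str
    policy_version: str
    input_sha256: str          # hash of the schema-validated input
    rule_log: Tuple[str, ...]  # deterministic rule evaluations, in order
    trace_id: str              # evidence / trace ID for later lookup


def canonical_hash(obj: Dict[str, Any]) -> str:
    """Hash a JSON-serializable object independently of key order."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def make_audit_record(inp: Dict[str, Any], rule_log: Sequence[str], trace_id: str) -> AuditRecord:
    """Build the record from the verified input; no LLM text is stored."""
    policy = inp["policy"]
    return AuditRecord(
        policy_id=policy["policy_id"],
        policy_version=policy["policy_version"],
        input_sha256=canonical_hash(inp),
        rule_log=tuple(rule_log),
        trace_id=trace_id,
    )
```

Nothing in the record comes from the model's explanation. Everything in it can be recomputed from the input and the policy version.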

1) Why “10 cases” is enough to start

You’re not trying to get coverage.

You’re trying to freeze the handful of patterns that kill you in production:

Boundaries: deadlines, caps, percentage steps, state transitions
Forbidden: privilege, SoD violations, legal holds, prohibited fields
Missing: approvals, identity verification, evidence links, observability
Ordering: no revoke, retry without idempotency, double execution
Exceptions become normal: with no DEGRADE path, humans keep rescuing cases by interpretation

A small number of “representative accidents” eliminates most catastrophic behavior. After that, you grow from incidents and DEGRADE logs.

So: 10 is not a magic number.
It’s “the smallest number that forces an operational skeleton.”

2) What a golden case actually freezes

Do not freeze LLM output.
Freeze verifier output.

A golden case is a contract:

Schema’d input + proposed plan
Expected verifier outputs:
- verdict (ACCEPT / REJECT / DEGRADE)
- reasons (typed reason codes)
- missing (machine-readable missing list, for DEGRADE)
- normalized_plan (the executable, verified Typed Actions)

If the model, the prompts, or the tools change, your ops can still survive as long as the verifier continues to:

stop correctly,
request missing grounds correctly,
normalize into safe executable plans.
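The contract above can also be checked mechanically before any golden comparison runs. The sketch below is an assumption about how you might do that. `check_result_shape` is a hypothetical helper, and the rule that DEGRADE must name at least one missing path is my reading of "for DEGRADE", not a stated requirement.

```python
from typing import Any, Dict, List

VERDICTS = ("ACCEPT", "REJECT", "DEGRADE")


def _is_str_list(value: Any) -> bool:
    return isinstance(value, list) and all(isinstance(x, str) for x in value)


def check_result_shape(result: Dict[str, Any]) -> List[str]:
    """Return shape problems; an empty list means the result honors the contract."""
    problems: List[str] = []
    verdict = result.get("verdict")
    if verdict not in VERDICTS:
        problems.append("verdict must be ACCEPT, REJECT, or DEGRADE")
    for key in ("reasons", "missing"):
        if not _is_str_list(result.get(key)):
            problems.append(f"{key} must be a list of strings")
    plan = result.get("normalized_plan")
    if not isinstance(plan, list) or not all(isinstance(a, dict) for a in plan):
        problems.append("normalized_plan must be a list of action objects")
    # Assumption: a DEGRADE that names nothing missing cannot be acted on.
    if verdict == "DEGRADE" and result.get("missing") == []:
        problems.append("DEGRADE requires a non-empty missing list")
    return problems
```

Running a check like this on every verifier result catches a free-text reason or a dropped field before it reaches a dashboard or an executor.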

3) Minimal case format (portable)

YAML is nicer to read, but if you want a portable “single truth,” JSON is the safest. You can author in YAML and convert later.

3.1 One-case JSON example

{
  "name": "jit_access_missing_security_approval_degrade",
  "input": {
    "policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
    "access_request": {
      "request_id": "AR-2026-00077",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.readonly",
      "requested_duration_minutes": 60,
      "reason_code": "INCIDENT_RESPONSE",
      "incident_id": "INC-88921",
      "ticket_id": "T-2026-004512"
    },
    "approvals": { "manager_approved": true, "security_approved": false },
    "context": { "on_call": true, "break_glass": false },
    "evidence": { "runbook_id": "rbk-prod-db-read" }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      },
      {
        "name": "iam.revoke_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly"
        }
      }
    ]
  },
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_security_approval"],
    "missing": ["approvals.security_approved"],
    "normalized_plan": []
  }
}

Key point: missing is not prose. It’s a machine-readable list (paths). That’s what makes DEGRADE operable.
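Because the entries are paths, a program can re-check them against a later version of the input. A minimal sketch, assuming dotted paths into the input object and assuming that "missing" means absent, null, or false (which matches the cases in this post); `resolve` and `still_missing` are hypothetical helper names:

```python
from typing import Any, Dict, List

_ABSENT = object()  # sentinel: the path does not exist in the input


def resolve(inp: Dict[str, Any], path: str) -> Any:
    """Walk a dotted path such as "approvals.security_approved"."""
    node: Any = inp
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _ABSENT
        node = node[key]
    return node


def still_missing(inp: Dict[str, Any], missing: List[str]) -> List[str]:
    """Return the paths that are still absent, null, or false in the input."""
    out: List[str] = []
    for path in missing:
        value = resolve(inp, path)
        # Identity checks on purpose: 0 or "" are values, not missing grounds.
        if value is _ABSENT or value is None or value is False:
            out.append(path)
    return out
```

Once a human supplies the approval, `still_missing` returns an empty list and the request can go back through the verifier without anyone re-reading prose.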

4) Which 10 cases to pick (template)

A practical starting distribution:

ACCEPT: 3
- minimal happy path
- boundary-but-OK path
- happy path where normalization happens (proposal → normalized plan)
DEGRADE: 4
- missing approval
- missing evidence
- state uncertain
- observability missing (SLO/metric not available)
REJECT: 3
- clear forbidden (privilege/SoD/legal hold)
- window/time violation
- prohibited field or process violation

This set forces you to implement:

stopping safely (DEGRADE),
denying deterministically (REJECT),
producing an executable “true plan” (normalized_plan).

5) Ten starter cases (names + intent)

You can cut/paste these ideas into JSON files and expand your schemas:

JIT Access (3 cases)

1. `jit_access_accept_minimal` (ACCEPT)

{
  "name": "jit_access_accept_minimal",
  "input": {
    "policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
    "access_request": {
      "request_id": "AR-1",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.readonly",
      "requested_duration_minutes": 60,
      "incident_id": "INC-1",
      "ticket_id": "T-1"
    },
    "approvals": { "manager_approved": true, "security_approved": true },
    "context": { "break_glass": false }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      },
      {
        "name": "iam.revoke_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly"
        }
      }
    ]
  },
  "expect": {
    "verdict": "ACCEPT",
    "reasons": [],
    "missing": [],
    "normalized_plan": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      },
      {
        "name": "iam.revoke_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly"
        }
      }
    ]
  }
}

2. `jit_access_degrade_missing_security_approval` (DEGRADE)

{
  "name": "jit_access_degrade_missing_security_approval",
  "input": {
    "policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
    "access_request": {
      "request_id": "AR-2",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.readonly",
      "requested_duration_minutes": 60,
      "incident_id": "INC-2",
      "ticket_id": "T-2"
    },
    "approvals": { "manager_approved": true, "security_approved": false },
    "context": { "break_glass": false }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      }
    ]
  },
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_security_approval"],
    "missing": ["approvals.security_approved"],
    "normalized_plan": []
  }
}

3. `jit_access_reject_admin_role_without_break_glass` (REJECT)

{
  "name": "jit_access_reject_admin_role_without_break_glass",
  "input": {
    "policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
    "access_request": {
      "request_id": "AR-3",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.admin",
      "requested_duration_minutes": 60,
      "incident_id": "INC-3",
      "ticket_id": "T-3"
    },
    "approvals": { "manager_approved": true, "security_approved": true },
    "context": { "break_glass": false }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.admin",
          "duration_minutes": 60
        }
      }
    ]
  },
  "expect": {
    "verdict": "REJECT",
    "reasons": ["admin_role_requires_break_glass"],
    "missing": [],
    "normalized_plan": []
  }
}
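For orientation, here is a rule set small enough to satisfy the three JIT cases above. It is a sketch under assumptions, not the real policy: `verify_jit_access` and `_result` are hypothetical names, the "forbidden before missing" rule order is my choice, and treating any role ending in `.admin` as an admin role is a guess. Manager approval is not checked because none of the three cases exercises it.

```python
from typing import Any, Dict, List, Optional


def _result(verdict: str, reasons: Optional[List[str]] = None,
            missing: Optional[List[str]] = None,
            normalized_plan: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    return {
        "verdict": verdict,
        "reasons": reasons or [],
        "missing": missing or [],
        "normalized_plan": normalized_plan or [],
    }


def verify_jit_access(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    req = inp["access_request"]
    approvals = inp.get("approvals", {})
    context = inp.get("context", {})

    # Forbidden first: no extra evidence can turn this into ACCEPT.
    if req["requested_role"].endswith(".admin") and not context.get("break_glass", False):
        return _result("REJECT", reasons=["admin_role_requires_break_glass"])

    # Missing grounds: stop and name the exact path that must be filled.
    if not approvals.get("security_approved", False):
        return _result("DEGRADE", reasons=["missing_security_approval"],
                       missing=["approvals.security_approved"])

    # Normalize: rebuild the actions from the verified request rather than
    # from the proposal's params, and always pair the grant with a revoke.
    target = {
        "user_id": req["requester_user_id"],
        "resource": req["target_resource"],
        "role": req["requested_role"],
    }
    plan = [
        {"name": "iam.grant_temporary_role",
         "params": {**target, "duration_minutes": req["requested_duration_minutes"]}},
        {"name": "iam.revoke_role", "params": dict(target)},
    ]
    return _result("ACCEPT", normalized_plan=plan)
```

Note that `proposed_plan` is never copied into the result. In this sketch the proposal can only be replaced by what the request and policy justify.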

Production Change Management (3 cases)

4. `change_degrade_missing_rollback_plan` (DEGRADE)

{
  "name": "change_degrade_missing_rollback_plan",
  "input": {
    "policy": { "policy_id": "prod-change-policy", "policy_version": "2026-01-10" },
    "change_request": {
      "change_type": "feature_flag_rollout",
      "flag_key": "new_invoice_flow",
      "to": { "percent": 10 },
      "rollback_plan_id": null
    },
    "guardrails": {
      "canary": { "step_percent": [10, 25, 50, 100] },
      "slo_gates": [{ "metric": "error_rate_5m", "op": "<=", "threshold": 0.01 }]
    },
    "approvals": { "owner_approved": true, "sre_approved": true }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "feature_flag.set_percent",
        "params": { "flag_key": "new_invoice_flow", "percent": 10 }
      }
    ]
  },
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_rollback_plan"],
    "missing": ["change_request.rollback_plan_id"],
    "normalized_plan": []
  }
}

5. `change_reject_no_canary_steps` (REJECT)

{
  "name": "change_reject_no_canary_steps",
  "input": {
    "policy": {"policy_id": "prod-change-policy", "policy_version": "2026-01-10"},
    "change_request": {
      "change_type": "feature_flag_rollout",
      "flag_key": "new_invoice_flow",
      "rollback_plan_id": "rb-1"
    },
    "guardrails": {
      "canary": {"step_percent": [100]},
      "slo_gates": [{"metric": "error_rate_5m", "op": "<=", "threshold": 0.01}]
    },
    "approvals": {"owner_approved": true, "sre_approved": true}
  },
  "proposed_plan": {"actions": [{"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 100}}]},
  "expect": {
    "verdict": "REJECT",
    "reasons": ["canary_steps_required"],
    "missing": [],
    "normalized_plan": []
  }
}

6. `change_accept_normalize_force_rollback_hook` (ACCEPT + normalize)

{
  "name": "change_accept_normalize_force_rollback_hook",
  "input": {
    "policy": {"policy_id": "prod-change-policy", "policy_version": "2026-01-10"},
    "change_request": {
      "change_type": "feature_flag_rollout",
      "flag_key": "new_invoice_flow",
      "risk_level": "MEDIUM",
      "rollback_plan_id": "rb-2026-0091"
    },
    "guardrails": {
      "canary": {"step_percent": [10, 25, 50, 100], "step_wait_minutes": 15},
      "slo_gates": [{"metric": "error_rate_5m", "op": "<=", "threshold": 0.01}],
      "rollback": {"auto_rollback_enabled": true}
    },
    "approvals": {"owner_approved": true, "sre_approved": true}
  },
  "proposed_plan": {
    "actions": [
      {"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 10}},
      {"name": "slo_gate.check", "params": {"window_minutes": 15}}
    ]
  },
  "expect": {
    "verdict": "ACCEPT",
    "reasons": [],
    "missing": [],
    "normalized_plan": [
      {"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 10}},
      {"name": "slo_gate.check", "params": {"window_minutes": 15}},
      {"name": "rollback.hook.ensure", "params": {"rollback_plan_id": "rb-2026-0091"}}
    ]
  }
}
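The normalization in case 6 is a forced addition: the proposal omitted the rollback hook, so the verifier appends it. A minimal sketch of that one step, assuming the hook always goes last and that a hook supplied by the proposal is discarded in favor of the verifier's own (`ensure_rollback_hook` is a hypothetical name):

```python
from typing import Any, Dict, List

Action = Dict[str, Any]


def ensure_rollback_hook(actions: List[Action], rollback_plan_id: str) -> List[Action]:
    """Return a copy of the plan that ends with exactly one rollback hook."""
    hook = {"name": "rollback.hook.ensure", "params": {"rollback_plan_id": rollback_plan_id}}
    # Drop any hook the proposal supplied; the verifier's hook is authoritative.
    kept = [dict(a) for a in actions if a.get("name") != "rollback.hook.ensure"]
    return kept + [hook]
```

Applying the function twice yields the same plan, so a retry cannot stack up duplicate hooks. A real verifier would also validate the remaining actions, for example that the requested percent is one of the allowed canary steps.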

Personal Data Erasure (3 cases)

7. `erasure_degrade_identity_not_verified` (DEGRADE)

{
  "name": "erasure_degrade_identity_not_verified",
  "input": {
    "policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
    "erasure_request": {"subject_user_id": "C-1", "identity_verification": {"verified": false}},
    "holds": {"legal_hold": false}
  },
  "proposed_plan": {"actions": [{"name": "privacy.delete", "params": {"system": "crm", "subject_user_id": "C-1"}}]},
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["identity_verification_required"],
    "missing": ["erasure_request.identity_verification.verified"],
    "normalized_plan": []
  }
}

8. `erasure_reject_legal_hold` (REJECT)

{
  "name": "erasure_reject_legal_hold",
  "input": {
    "policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
    "erasure_request": {"subject_user_id": "C-2", "identity_verification": {"verified": true}},
    "holds": {"legal_hold": true}
  },
  "proposed_plan": {"actions": [{"name": "privacy.delete", "params": {"system": "crm", "subject_user_id": "C-2"}}]},
  "expect": {
    "verdict": "REJECT",
    "reasons": ["legal_hold_blocks_erasure"],
    "missing": [],
    "normalized_plan": []
  }
}

9. `erasure_accept_normalize_retention_to_redact` (ACCEPT + normalize)

{
  "name": "erasure_accept_normalize_retention_to_redact",
  "input": {
    "policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
    "erasure_request": {"subject_user_id": "C-3", "identity_verification": {"verified": true}},
    "holds": {"legal_hold": false, "accounting_retention_required": true}
  },
  "proposed_plan": {
    "actions": [
      {"name": "privacy.delete", "params": {"system": "billing", "subject_user_id": "C-3"}},
      {"name": "privacy.tombstone.write", "params": {"subject_user_id": "C-3"}}
    ]
  },
  "expect": {
    "verdict": "ACCEPT",
    "reasons": [],
    "missing": [],
    "normalized_plan": [
      {"name": "privacy.redact", "params": {"system": "billing", "subject_user_id": "C-3", "mode": "accounting_retention"}},
      {"name": "privacy.tombstone.write", "params": {"subject_user_id": "C-3"}}
    ]
  }
}
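Case 9 shows the other kind of normalization: rewriting an unsafe action instead of adding a missing one. A minimal sketch, assuming the retention flag applies to every `privacy.delete` in the plan (a real policy would probably scope it per system); `apply_accounting_retention` is a hypothetical name:

```python
from typing import Any, Dict, List

Action = Dict[str, Any]


def apply_accounting_retention(actions: List[Action], holds: Dict[str, Any]) -> List[Action]:
    """Rewrite deletes into redactions when accounting retention is required."""
    if not holds.get("accounting_retention_required", False):
        return [dict(a) for a in actions]
    out: List[Action] = []
    for action in actions:
        if action.get("name") == "privacy.delete":
            # Keep the original target, but change what is done to it.
            params = {**action.get("params", {}), "mode": "accounting_retention"}
            out.append({"name": "privacy.redact", "params": params})
        else:
            out.append(dict(action))
    return out
```

The verdict stays ACCEPT, but what executes is the retained-and-redacted version, not the deletion the model proposed.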

Underwriting packet (1 case)

10. `uw_degrade_missing_employment_proof` (DEGRADE)

(This is explicitly about preventing the LLM from becoming the decision-maker.)

{
  "name": "uw_degrade_missing_employment_proof",
  "input": {
    "policy": {"policy_id": "credit-underwriting-policy", "policy_version": "2026-01-01"},
    "application": {"application_id": "APP-1", "requested_amount_jpy": 500000},
    "documents": {"identity_verified": true, "income_proof": {"provided": true}, "employment_proof": {"provided": false}},
    "fairness_controls": {"prohibited_fields_present": false}
  },
  "proposed_plan": {"actions": [{"name": "uw.emit_decision", "params": {"decision": "APPROVE", "reason": "looks good"}}]},
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_required_documents"],
    "missing": ["documents.employment_proof.provided"],
    "normalized_plan": [
      {"name": "uw.request_more_documents", "params": {"missing": ["employment_proof"]}}
    ]
  }
}
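A rule that produces this result does not need to look at the proposal at all, which is what keeps the LLM out of the decision. A sketch under assumptions: `REQUIRED_DOCUMENTS` lists only the two documents visible in the case, and `degrade_if_documents_missing` is a hypothetical name that returns `None` when the rule has nothing to say.

```python
from typing import Any, Dict, List, Optional

# Assumption: which documents the policy requires; the case shows only these two.
REQUIRED_DOCUMENTS = ("income_proof", "employment_proof")


def degrade_if_documents_missing(inp: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a DEGRADE result when required documents are absent, else None."""
    documents = inp.get("documents", {})
    absent: List[str] = [
        doc for doc in REQUIRED_DOCUMENTS
        if not documents.get(doc, {}).get("provided", False)
    ]
    if not absent:
        return None  # nothing missing here; other rules decide the verdict
    return {
        "verdict": "DEGRADE",
        "reasons": ["missing_required_documents"],
        "missing": [f"documents.{doc}.provided" for doc in absent],
        # The proposed uw.emit_decision never appears here; the only
        # permitted next step is to ask for the missing documents.
        "normalized_plan": [
            {"name": "uw.request_more_documents", "params": {"missing": absent}}
        ],
    }
```

Unlike the earlier DEGRADE cases, the normalized plan is not empty: degrading can still carry a safe next step.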

The point is not “these domains.”
The point is the shape of failures: missing/forbidden/boundary/normalization.

6) A minimal golden harness (stdlib-only)

This is the part that makes everything real: run the cases in CI.

6.1 Directory layout

golden/
  cases/
    01_jit_access_accept.json
    02_jit_access_degrade_missing_approval.json
    ...
    10_uw_degrade_missing_docs.json
  run_golden.py
  verifier_stub.py   # replace with your real verifier

6.2 Harness (run_golden.py)

(Python 3.10+; stdlib only.)

from __future__ import annotations

import json
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Tuple

from verifier_stub import verify  # replace with your verifier


@dataclass(frozen=True)
class Expect:
    verdict: str
    reasons: Tuple[str, ...]
    missing: Tuple[str, ...]
    normalized_plan: Tuple[Dict[str, Any], ...]


def load_case(path: Path) -> Dict[str, Any]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


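def normalize_plan(plan: Any) -> Tuple[Dict[str, Any], ...]:
    """Coerce a plan (list of action dicts, or None) into a tuple for comparison.

    This only validates shape; it is unrelated to the verifier's own normalization.
    """
    if plan is None:
        return ()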
    if not isinstance(plan, list):
        raise TypeError(f"normalized_plan must be list[dict], got {type(plan)}")

    out: List[Dict[str, Any]] = []
    for i, a in enumerate(plan):
        if not isinstance(a, dict):
            raise TypeError(f"normalized_plan[{i}] must be dict, got {type(a)}")
        out.append(a)
    return tuple(out)


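def to_expect(d: Dict[str, Any]) -> Expect:
    """Build the expected result from a case's "expect" block."""
    return Expect(
        # A case without a verdict is a broken case; fail loudly instead of comparing against "None".
        verdict=str(d["verdict"]),
        reasons=tuple(d.get("reasons", [])),
        missing=tuple(d.get("missing", [])),
        normalized_plan=normalize_plan(d.get("normalized_plan", [])),
    )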


def diff(a: Any, b: Any) -> str:
    ja = json.dumps(a, ensure_ascii=False, sort_keys=True, indent=2)
    jb = json.dumps(b, ensure_ascii=False, sort_keys=True, indent=2)
    return f"--- expected\n{ja}\n--- actual\n{jb}"


def main() -> int:
    cases_dir = Path(__file__).parent / "cases"
    paths = sorted(cases_dir.glob("*.json"))
    if not paths:
        print("No golden cases found.", file=sys.stderr)
        return 2

    failed: List[str] = []

    for p in paths:
        case = load_case(p)
        name = case.get("name", p.name)
        inp = case["input"]
        proposed = case["proposed_plan"]
        exp = to_expect(case["expect"])

        actual = verify(inp, proposed)
        verdict = actual.get("verdict")
        if not isinstance(verdict, str) or not verdict:
            # A missing or non-string verdict is a contract violation, not a lookup failure.
            raise ValueError(f"verify() must return non-empty 'verdict' for case={name}")

        act = Expect(
            verdict=verdict,
            reasons=tuple(actual.get("reasons", [])),
            missing=tuple(actual.get("missing", [])),
            normalized_plan=normalize_plan(actual.get("normalized_plan", [])),
        )

        if exp != act:
            failed.append(name)
            print(f"\n[FAIL] {name}")
            print(diff(exp.__dict__, act.__dict__))

    if failed:
        print(f"\nFAILED {len(failed)}/{len(paths)} cases: {', '.join(failed)}", file=sys.stderr)
        return 1

    print(f"OK {len(paths)}/{len(paths)} cases")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

6.3 Verifier “plug-in” shape (verifier_stub.py)

from __future__ import annotations

from typing import Any, Dict


def verify(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    """
    Replace this with your real verifier.

    Required return shape:
      verdict: "ACCEPT" | "REJECT" | "DEGRADE"
      reasons: [reason_code...]
      missing: [path...]
      normalized_plan: [typed_action...]
    """
    return {
        "verdict": "DEGRADE",
        "reasons": ["stub"],
        "missing": ["replace_with_real_verifier"],
        "normalized_plan": [],
    }

7) Run it in CI (minimal GitHub Actions)

name: golden
on: [push, pull_request]
jobs:
  golden:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python golden/run_golden.py

Now PRs will fail if:

you accidentally changed DEGRADE to ACCEPT,
normalization disappeared,
reason codes drifted without a deliberate update,
missing lists broke.

That’s how “agent ops” becomes engineering—not vibes.

8) How to grow the verifier (the operational loop)

Golden cases are not write-once artifacts. Grow them like this:

Aggregate DEGRADE logs (top missing paths / top reason codes)
If a missing pattern repeats, add one golden case
When a golden case breaks:
- bug (unintended drift) → fix verifier
- policy change (intended) → update case and record the change reason
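The first step of that loop, aggregating DEGRADE logs, can be a few lines of stdlib code. A minimal sketch, assuming each log record is a stored verifier result with `verdict`, `reasons`, and `missing` keys; `top_degrade_patterns` is a hypothetical name and the log format is an assumption:

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_degrade_patterns(records: Iterable[Dict[str, Any]], n: int = 5) -> Dict[str, List[Tuple[str, int]]]:
    """Count the most frequent missing paths and reason codes among DEGRADE results."""
    missing: Counter = Counter()
    reasons: Counter = Counter()
    for rec in records:
        if rec.get("verdict") != "DEGRADE":
            continue  # only DEGRADE results say what grounds were absent
        missing.update(rec.get("missing", []))
        reasons.update(rec.get("reasons", []))
    return {"missing": missing.most_common(n), "reasons": reasons.most_common(n)}
```

A path that keeps appearing at the top of the `missing` list is a candidate for the next golden case. This aggregation only works because reason codes and missing entries are typed values, not free text.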

Only after this skeleton exists does “prompt improvement” become meaningful.
Otherwise, prompt tuning becomes untestable, vibe-driven optimization.

9) Common traps

Trying to freeze LLM output: Freeze verifier outputs instead. LLM proposals can wobble.
Trying to operate with REJECT only: Real ops fails on missing grounds. Without DEGRADE, humans rescue everything forever.
Reason codes as free text: If you can’t aggregate it, you can’t SLO it.
No normalization: If the verifier can’t emit the executable “true plan,” ops gets pulled by whatever the LLM proposed.

Summary

Grow the verifier, not the prompt.
Freeze verifier outputs: verdict / reasons / missing / normalized_plan
Start with 10 golden cases that represent the ways you die in ops
Run them in CI
Grow from DEGRADE logs + incidents, not from vibes

If you treat an LLM as a proposer, you can even ask it to suggest new golden cases.
But first, build the harness and lock the initial ten.

Top comments (6)

Lark Angel • Apr 10

You are assuming the LLM read function (intake) is trustworthy.

It is not. After repeated failures to access a file clearly, explicitly stored in memory, Claude finally admitted commands are merely "suggestions".

So you need to extend your deterministic logic to input as well. But even that doesn't really help because you are still processing LLM.

That is, relying on it to actually perform some task: read, write, analyze, report.

Recent papers showing that medical scanners weren't actually analyzing images, or that math functions were simply pattern matching, are just a few examples suggesting that no LLM action is trustworthy.

If we simply wrap the entire proposition in a deterministic shackle, then what is the point? Where are the efficiencies?

And if there are none, and the labs know this, then what they are really selling is a token maxing bait & switch.

Lark Angel • Apr 10

Even read is a problem. I tell Claude to bring up a file. It shows the wrong one. I ask how come? It's right there in memory.md or the database.

Once again it replies it violated its own rule that explicitly states load the mem before replying.

I ask how to fix the error and it replies there is no fix! Claude says ultimately it's a judgment call to follow or not; at heart they are merely suggestions, guidelines.

Ok ...

kanaria007 • Apr 12

You’re right about one important thing: intake/read itself is not something to trust blindly.

But that is exactly why I separate proposal, parse, gate, and effect.

My claim is not “LLM read/write/analyze is trustworthy.”
My claim is the opposite:

wrapper output is only a proposal
parse failure must be preserved as a first-class status
effect must remain under separate runtime authority
and what should be frozen is verifier output, not LLM output

So if a model brings the wrong file, misreads memory, or emits a malformed action, that is not a reason to give the model more authority.
It is a reason to ensure the system can only move from proposal to effect through explicit parse/gate/evidence checks.

In that sense, “even read is a problem” is not a counterexample to the verifier idea.
It is one of the reasons the verifier idea is needed in the first place.

Lark Angel • Apr 13 • Edited

We are both in agreement: as LLM mechanisms are probabilistic, we are held hostage to their subjective decisions. That is no way to run a business. Crafting elaborate, complicated harnesses seems to be the (short-term) answer, but they themselves become so large and unwieldy that you spend your time tweaking the harness, not the actual project.

kanaria007 • Apr 14

I think we’re still talking past each other.
You’re criticizing prompt theater, not the architecture I described.

What I’m describing is not a giant natural-language harness wrapped around the model.

It is a much smaller boundary discipline:

the model proposes
parse/gate checks whether the proposal is admissible
only typed, verified effects are allowed to execute

So the organizational rule is not “micro-manage every LLM behavior.”
It is more like:
what must be observed, what approvals are required, what is missing, what must degrade, and what must never commit.

That is not unnecessary complexity.
It is simply making effect boundaries explicit instead of leaving them to model judgment.

kanaria007 • Apr 14

At that point I think the issue is not disagreement but a failure to observe the actual structure of the post.

The post explicitly says:

the LLM proposes
the verifier returns ACCEPT / REJECT / DEGRADE
and golden cases freeze verifier output, not LLM output

So if someone still reads that as “trust the model and wrap it in a giant prompt harness,” they are not really arguing with the architecture I described.
They are arguing with a different one in their head.

And if an organization cannot make those effect boundaries explicit for a given class of decisions, that is not primarily an AI problem.

It means the organization itself cannot reliably determine what is admissible, what is forbidden, what is missing, who has authority, and what must never commit.

At that point, the real problem is not model behavior.
It is that the organization lacks a functioning judgment surface for that domain.

That is a pre-system problem.