DEV Community

kanaria007
Don’t “Execute” the LLM: Typed Actions + Verifiers for Safe Business Agents

More AI “agents” now look like they work in real systems.

But what actually makes them work is not model capability alone; it's a deterministic verifier, plus the operations around it, that decides what's allowed to run.

In a previous post I used a refund example.
In this one I’ll intentionally pick scarier scenarios—ones that make senior engineers’ blood run cold—and show a minimal design pattern:

Propose (probability) → Verify (determinism) → Execute (authority + audit)
…so you never “execute the LLM.”

This article is a minimal pattern. It’s not a complete product spec.


0) Why this design exists (the premise)

LLMs are probabilistic. Output variance itself isn’t the problem.
The real problem is executing wobbly output directly.

So split the roles:

  • LLM: propose a plan
  • Verifier: deterministically accept/reject (and optionally normalize the plan)
  • Executor: runs only verified Typed Actions (dry-run → approval → production)

The closer you get to “execute free text,” the more accidents you’ll have.
The more you close execution behind types, the more operable the system becomes.

Also: the LLM’s “explanation-like text” (including chain-of-thought style reasoning) is not audit-grade grounds.
What you should pin as grounds are: input schema, policy ID/version, rule-evaluation logs, evidence/trace IDs.
Explanations are at best UI helper text.


1) The minimal architecture that removes most accidents

1) Structure the input (Schema)

Fix “what counts as grounds.”
Free text becomes optional notes; decision inputs are typed.

2) Make output Typed Actions

The LLM outputs an action list (a plan), not prose.

3) Verifier returns ACCEPT / REJECT / DEGRADE

  • ACCEPT: verified OK → return normalized plan
  • REJECT: rule violation → return reason codes → LLM can re-propose
  • DEGRADE: insufficient grounds → branch to “request missing info” (LLMs are good at this)

4) Execute only a verified plan

The executor accepts nothing but typed actions.
That’s the guardrail core.

```mermaid
flowchart LR
  I[Schema Input] --> LLM[LLM: propose plan]
  LLM --> P[Plan: Typed Actions]
  P --> V[Deterministic Verifier]
  V -->|ACCEPT| E["Execute (dry-run/approve/prod)"]
  V -->|REJECT| R[Return reason codes]
  V -->|DEGRADE| D[Request missing evidence]
```
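The ACCEPT/REJECT/DEGRADE loop can be sketched in a few lines of Python. All names here (`Verdict`, `verify_plan`, the allowlist contents) are illustrative, not a real library:

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    DEGRADE = "DEGRADE"

@dataclass
class Result:
    verdict: Verdict
    reason_codes: list = field(default_factory=list)
    normalized_plan: list = field(default_factory=list)

# The executor accepts nothing but typed actions; this allowlist is the guardrail core.
ALLOWED_ACTIONS = {"iam.grant_temporary_role", "iam.revoke_role",
                   "ticket.comment", "audit.append"}

def verify_plan(plan: dict) -> Result:
    actions = plan.get("actions", [])
    if not actions:
        # No grounds to act on: stop safely and ask for more, don't guess.
        return Result(Verdict.DEGRADE, ["EMPTY_PLAN"])
    for a in actions:
        if a.get("name") not in ALLOWED_ACTIONS:
            # Any unknown action name is a hard REJECT, never "best effort".
            return Result(Verdict.REJECT, [f"UNKNOWN_ACTION:{a.get('name')}"])
    return Result(Verdict.ACCEPT, [], actions)
```

Reason codes (not prose) flow back to the LLM on REJECT so it can re-propose.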

2) Thought experiment 1: Just-In-Time production access (JIT Access)

“Need to read the prod DB during an incident.”
“Need to temporarily change a prod setting.”

If you let an LLM decide this, responsibility boundaries collapse—and everyone’s stomach drops.

So split roles:

  • LLM: summarize the situation, list missing info, propose a plan, draft request messages
  • Compiler-equivalent: deterministically decide eligibility / least privilege / expiration / approval / SoD
  • Executor: runs only approved typed actions (dry-run → approval → grant → auto-revoke)

1) Input schema (grounds)

```yaml
access_request:
  request_id: "AR-2026-00077"
  requester_user_id: "u-1234"
  target_resource: "prod-db:billing"
  requested_role: "db.readonly"
  requested_duration_minutes: 60
  reason_code: "INCIDENT_RESPONSE"
  incident_id: "INC-88921"
  ticket_id: "T-2026-004512"
  created_at: "2026-02-15T22:10:00+09:00"

approvals:
  manager_approved: true
  security_approved: false      # if required, false => DEGRADE/REJECT
  approver_user_ids: ["u-9001"]

context:
  requester_team: "SRE"
  requester_level: "L2"
  on_call: true
  break_glass: false

policy:
  policy_id: "iam-jit-access"
  policy_version: "2026-01-20"

evidence:
  runbook_id: "rbk-prod-db-read"
  logs_link: "log://..."
```

2) Typed Actions (execution closed by type)

  • iam.grant_temporary_role
  • iam.revoke_role
  • ticket.comment
  • audit.append
  • access.request_more_info (for DEGRADE)

3) Deterministic rules (examples)

  • Required grounds: missing incident_id / ticket_id / policy_version → DEGRADE
  • Approvals: manager approval required; security approval required for certain roles; missing → DEGRADE
  • Two-person approval: unless break_glass, fewer than 2 approvers → DEGRADE
  • Least privilege: db.admin forbidden (only allowed under break_glass) → REJECT
  • Expiration: max 120 minutes; unspecified → DEGRADE
  • SoD (segregation of duties): SRE must not hold “refund/payment-change” privileges on the billing prod DB → REJECT
  • Auto-revoke required: if a revoke can’t be scheduled at grant time → REJECT
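A few of these rules, sketched deterministically. Field names mirror the schema above; thresholds and reason-code strings are illustrative:

```python
MAX_DURATION_MIN = 120          # expiration rule: max 120 minutes
FORBIDDEN_ROLES = {"db.admin"}  # least privilege: admin only under break_glass

def check_jit(req: dict, approvals: dict, context: dict) -> tuple:
    """Return (verdict, reason_codes) for a JIT access request."""
    # Required grounds first: missing evidence is DEGRADE, not REJECT.
    missing = [f"MISSING:{f}" for f in ("incident_id", "ticket_id") if not req.get(f)]
    if missing:
        return ("DEGRADE", missing)
    if req["requested_role"] in FORBIDDEN_ROLES and not context.get("break_glass"):
        return ("REJECT", ["LEAST_PRIVILEGE:role_forbidden"])
    if not approvals.get("manager_approved"):
        return ("DEGRADE", ["MISSING_APPROVAL:manager"])
    if not context.get("break_glass") and len(approvals.get("approver_user_ids", [])) < 2:
        return ("DEGRADE", ["TWO_PERSON_RULE:insufficient_approvers"])
    if req.get("requested_duration_minutes") is None:
        return ("DEGRADE", ["MISSING:duration"])
    if req["requested_duration_minutes"] > MAX_DURATION_MIN:
        return ("REJECT", ["DURATION_EXCEEDS_MAX"])  # or clip during normalization
    return ("ACCEPT", [])
```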

4) LLM outputs a plan proposal only

```json
{
  "actions": [
    {
      "name": "iam.grant_temporary_role",
      "params": {
        "user_id": "u-1234",
        "resource": "prod-db:billing",
        "role": "db.readonly",
        "duration_minutes": 60
      }
    },
    {
      "name": "ticket.comment",
      "params": {
        "ticket_id": "T-2026-004512",
        "message": "JIT access requested for incident INC-88921. Duration 60m. Role db.readonly."
      }
    },
    {
      "name": "iam.revoke_role",
      "params": {
        "user_id": "u-1234",
        "resource": "prod-db:billing",
        "role": "db.readonly"
      }
    }
  ]
}
```

The verifier returns ACCEPT/REJECT/DEGRADE, and only the normalized plan is passed to execution
(e.g., normalize resource names, substitute role aliases, clip duration, force-add revoke).
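Normalization can also be deterministic. A minimal sketch (function and field names assumed, not from a real library) that clips durations and force-appends a matching revoke for every grant:

```python
def normalize_jit_plan(actions: list, max_duration: int = 120) -> list:
    """Clip grant durations to policy max and ensure every grant has a revoke."""
    out, grants = [], []
    for a in actions:
        if a["name"] == "iam.grant_temporary_role":
            p = dict(a["params"])
            # Clip rather than reject: the verifier returns a normalized plan.
            p["duration_minutes"] = min(p.get("duration_minutes", max_duration), max_duration)
            a = {"name": a["name"], "params": p}
            grants.append(p)
        out.append(a)
    revoked = {(a["params"]["user_id"], a["params"]["resource"], a["params"]["role"])
               for a in out if a["name"] == "iam.revoke_role"}
    for g in grants:
        if (g["user_id"], g["resource"], g["role"]) not in revoked:
            # Force-add revoke so auto-revocation is never left to the LLM.
            out.append({"name": "iam.revoke_role",
                        "params": {"user_id": g["user_id"], "resource": g["resource"],
                                   "role": g["role"]}})
    return out
```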


3) Thought experiment 2: production change management (Change Management)

Feature-flag rollouts and config changes can look small but have massive blast radius.

Letting an LLM decide “is it okay to change prod?” is terrifying.
So again:

  • LLM: summarize, scope impact, propose execution plan, draft announcements
  • Compiler-equivalent: deterministically enforce approvals, canary steps, SLO gates, rollback readiness, time windows
  • Executor: runs only verified plan (dry-run → approval → staged rollout → monitoring → auto-rollback if needed)

1) Input schema (grounds fixed)

```yaml
change_request:
  change_id: "CHG-2026-00112"
  requester_user_id: "u-1234"
  service: "billing-api"
  environment: "prod"
  change_type: "feature_flag_rollout"   # or "config_change"
  flag_key: "new_invoice_flow"
  from: { enabled: false, percent: 0 }
  to:   { enabled: true,  percent: 10 }
  window:
    start_at: "2026-02-16T01:00:00+09:00"
    end_at:   "2026-02-16T03:00:00+09:00"
  risk_level: "MEDIUM"
  rollback_plan_id: "rb-2026-0091"
  runbook_id: "rbk-billing-rollout"
  ticket_id: "INC-90121"               # incident link or change ticket
  created_at: "2026-02-15T22:30:00+09:00"

approvals:
  owner_approved: true
  sre_approved: false
  security_approved: false
  approver_user_ids: ["u-9001"]

guardrails:
  canary:
    step_percent: [10, 25, 50, 100]
    step_wait_minutes: 15
  slo_gates:
    - metric: "error_rate_5m"
      op: "<="
      threshold: 0.01
    - metric: "p95_latency_ms_5m"
      op: "<="
      threshold: 400
  rollback:
    auto_rollback_enabled: true
    rollback_on:
      - gate_violation
      - manual_trigger

observability:
  dashboard_id: "dash-billing-api"
  alert_policy_id: "alert-billing-slo"

policy:
  policy_id: "prod-change-policy"
  policy_version: "2026-01-10"
```

2) Typed Actions

  • change.plan.publish (dry-run preview)
  • feature_flag.set_percent
  • config.apply
  • slo_gate.check
  • rollback.execute
  • audit.append (store gate results + thresholds + timestamps)
  • change.request_more_info (DEGRADE)
  • change.block (REJECT)

3) Deterministic rules (examples)

  • Owner approval required: owner_approved=false → DEGRADE (request approval)
  • Prod window constraint: outside the allowed window → REJECT (re-propose timing)
  • Staged rollout required: a single 100% step → REJECT
  • SLO gates required: slo_gates empty → REJECT
  • Rollback required: missing rollback_plan_id → DEGRADE
  • Observability required: missing dashboard_id / alert_policy_id → DEGRADE
  • High risk requires extra approval: risk_level=HIGH and sre_approved=false → DEGRADE
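These gates, as one deterministic function over the schema above (names mirror the YAML; reason-code strings are invented for illustration):

```python
def check_change(change: dict, approvals: dict, guardrails: dict,
                 observability: dict) -> tuple:
    """Return (verdict, reason_codes) for a prod change request."""
    if not approvals.get("owner_approved"):
        return ("DEGRADE", ["MISSING_APPROVAL:owner"])
    steps = guardrails.get("canary", {}).get("step_percent", [])
    if not steps or steps == [100]:
        # A straight jump to 100% is never a plan; require canary steps.
        return ("REJECT", ["STAGED_ROLLOUT_REQUIRED"])
    if not guardrails.get("slo_gates"):
        return ("REJECT", ["SLO_GATES_REQUIRED"])
    if not change.get("rollback_plan_id"):
        return ("DEGRADE", ["MISSING:rollback_plan_id"])
    if not (observability.get("dashboard_id") and observability.get("alert_policy_id")):
        return ("DEGRADE", ["MISSING:observability"])
    if change.get("risk_level") == "HIGH" and not approvals.get("sre_approved"):
        return ("DEGRADE", ["MISSING_APPROVAL:sre"])
    return ("ACCEPT", [])
```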

4) LLM outputs a staged plan proposal only

```json
{
  "actions": [
    {"name": "change.plan.publish", "params": {"change_id": "CHG-2026-00112"}},
    {"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 10}},
    {"name": "slo_gate.check", "params": {"window_minutes": 15}},
    {"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 25}},
    {"name": "slo_gate.check", "params": {"window_minutes": 15}}
  ]
}
```

The verifier can stop on missing grounds (DEGRADE) or normalize the plan
(e.g., force-add rollback.execute, clip percent limits).
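The executor side of the staged rollout is also a small deterministic loop. Here the typed-action executors are injected as callables (a hypothetical wiring, not a real framework):

```python
def run_canary(set_percent, gate_ok, rollback, steps: list, flag_key: str) -> str:
    """Advance the flag one canary step at a time; roll back on any gate violation.

    set_percent(flag_key, pct) -> feature_flag.set_percent
    gate_ok() -> bool          -> slo_gate.check after the wait window
    rollback(flag_key)         -> rollback.execute
    """
    for pct in steps:
        set_percent(flag_key, pct)
        if not gate_ok():
            rollback(flag_key)   # auto-rollback on gate_violation
            return "ROLLED_BACK"
    return "COMPLETED"
```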


4) Thought experiment 3: personal data erasure (GDPR-like requests)

Erasure is not “just delete stuff.” The core is:

  • eligibility / exceptions
  • legal holds
  • evidence
  • irreversible deletion mechanics

You never want an LLM to decide deletion eligibility. So:

  • LLM: enumerate systems, propose steps, draft user-facing messages
  • Compiler-equivalent: deterministically enforce eligibility, verification, holds, retention rules, evidence bundling
  • Executor: runs only verified plan (dry-run → execute → bundle evidence)

1) Input schema (grounds)

```yaml
erasure_request:
  request_id: "DEL-2026-0042"
  subject_user_id: "C-9182"
  subject_email: "user@example.com"
  received_at: "2026-02-15T19:05:00+09:00"
  request_channel: "web_form"
  identity_verification:
    method: "email_otp"
    verified: true
    verified_at: "2026-02-15T19:10:00+09:00"

holds:
  legal_hold: false
  fraud_investigation_hold: false
  accounting_retention_required: true  # retention required by accounting / law

scope:
  systems:
    - "crm"
    - "billing"
    - "support_tickets"
    - "analytics"
  data_categories:
    - "profile"
    - "support_messages"
    - "payment_metadata"              # may be non-deletable due to retention

policy:
  policy_id: "privacy-erasure-policy"
  policy_version: "2026-01-05"

audit:
  ticket_id: "PRIV-2026-0018"
  operator_user_id: "u-privacy-01"
```

2) Typed Actions

  • privacy.assess.deletion_eligibility
  • privacy.redact (mask/anonymize instead of delete)
  • privacy.delete
  • privacy.tombstone.write (tombstone to prevent restoration/double-processing)
  • legal.hold.apply / legal.hold.check
  • audit.append
  • subject.notify
  • privacy.request_more_info (DEGRADE)

3) Deterministic rules (examples)

  • Identity verification required: verified=false → DEGRADE
  • Legal hold: legal_hold=true → REJECT (deletion forbidden)
  • Retention-required categories: auto-convert delete → redaction (normalization)
  • Tombstone caution: never store PII in tombstones; use irreversible hashes / pseudo IDs
  • Audit fields required: missing policy_version / ticket_id → DEGRADE
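Two of these rules, sketched. The retention table and the `category` param are illustrative assumptions (the schema above keys deletes by system + subject), and the salt handling is deliberately simplified:

```python
import hashlib

# Illustrative retention table: (system, data_category) pairs that must be kept.
RETENTION_REQUIRED = {("billing", "payment_metadata")}

def normalize_erasure(actions: list) -> list:
    """Auto-convert deletes on retention-required data into redactions."""
    out = []
    for a in actions:
        key = (a["params"].get("system"), a["params"].get("category"))
        if a["name"] == "privacy.delete" and key in RETENTION_REQUIRED:
            out.append({"name": "privacy.redact", "params": a["params"]})
        else:
            out.append(a)
    return out

def tombstone_key(subject_user_id: str, salt: str = "per-deployment-secret") -> str:
    """Irreversible pseudo-ID for the tombstone: no raw PII is ever stored."""
    return hashlib.sha256(f"{salt}:{subject_user_id}".encode()).hexdigest()
```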

4) LLM proposes a plan (delete + redact)

```json
{
  "actions": [
    {"name": "privacy.assess.deletion_eligibility", "params": {"subject_user_id": "C-9182"}},
    {"name": "privacy.delete", "params": {"system": "crm", "subject_user_id": "C-9182"}},
    {"name": "privacy.redact", "params": {"system": "billing", "subject_user_id": "C-9182", "fields": ["name", "email"]}},
    {"name": "privacy.tombstone.write", "params": {"subject_user_id": "C-9182", "request_id": "DEL-2026-0042"}},
    {"name": "audit.append", "params": {"ticket_id": "PRIV-2026-0018"}}
  ]
}
```

What matters is not the LLM’s prose, but having auditable records of:

  • which policy/version
  • what was deleted vs retained
  • what tombstone was created

5) Thought experiment 4: credit underwriting (loans, etc.)

Underwriting is the clearest “do not let an LLM decide” domain.

This section is not advocating automated approval/denial.
It’s about designing a process so the LLM cannot become the decision maker.

You have institutional constraints:

  • adverse action reasons (explainability obligations)
  • discrimination risk (protected attributes / proxies / disparate impact)
  • regulation, audits, model governance

So split even harder:

  • LLM: structure application data, detect missing documents, draft instructions, organize the “underwriting packet”
  • Compiler-equivalent: decisions only via deterministic policy (rulebook / scorecards / audit logic)
  • Executor: if missing → DEGRADE; decisions emitted with audit-grade reason codes

1) Input schema (underwriting packet)

```yaml
application:
  application_id: "APP-2026-00991"
  applicant_id: "A-5512"
  product: "personal_loan"
  requested_amount_jpy: 500000
  term_months: 24
  submitted_at: "2026-02-15T18:00:00+09:00"

documents:
  identity_verified: true
  income_proof:
    provided: true
    doc_id: "DOC-INC-7712"
  employment_proof:
    provided: false
  bank_statements:
    provided: true
    doc_id: "DOC-BANK-1142"

features:
  monthly_income_jpy: 320000
  employment_years: null
  existing_debt_jpy: 800000
  delinquency_flags:
    last_12m: 0

policy:
  policy_id: "credit-underwriting-policy"
  policy_version: "2026-01-01"

fairness_controls:
  prohibited_fields_present: false
  proxy_feature_flags:
    - "zip_code"         # potential proxy; usage must be deterministically governed
  audit_required: true

audit:
  case_id: "UW-2026-00440"
  operator_user_id: "u-uw-01"
```

Key point: do not let protected attributes (or inferred attributes) flow into the LLM decision path.
If they appear, block them at ingestion and make it auditable.
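A minimal ingestion filter for that blocking step. The field lists are placeholder examples, not a complete or legally vetted set:

```python
# Placeholder lists: real deployments derive these from policy, not code constants.
PROHIBITED_FIELDS = {"gender", "race", "religion", "nationality"}
PROXY_FIELDS = {"zip_code"}  # governed deterministically; never reaches the LLM

def sanitize_for_llm(features: dict) -> tuple:
    """Drop prohibited/proxy fields before anything reaches the LLM.

    Returns (clean_features, audit_event) so the block itself is auditable.
    """
    blocked = sorted(set(features) & (PROHIBITED_FIELDS | PROXY_FIELDS))
    clean = {k: v for k, v in features.items() if k not in blocked}
    audit_event = {"event": "ingestion_block", "blocked_fields": blocked}
    return clean, audit_event
```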

2) Typed Actions

  • uw.request_more_documents (DEGRADE)
  • uw.compute_scorecard (deterministic)
  • uw.apply_policy_rules (deterministic thresholds/exceptions)
  • uw.emit_decision (ACCEPT/REJECT/DEGRADE + reason codes)
  • audit.append
  • applicant.notify (LLM may draft, but reasons must come from reason codes)

3) Deterministic rules (examples)

  • Missing documents: employment_proof missing → DEGRADE
  • Prohibited/proxy misuse: prohibited/proxy features mixed into decision inputs → REJECT (process violation)
  • Decision requires reason codes: free-text-only reasons → REJECT
  • Explainability: denial must attach adverse action reason codes (deterministic)
  • Audit required: if audit_required=true, inability to produce an audit bundle (inputs/rules/outputs) → REJECT

4) LLM only does “missing detection” + guidance (no decision)

Example proposal (routes to DEGRADE):

```json
{
  "actions": [
    {"name": "uw.request_more_documents", "params": {"application_id": "APP-2026-00991", "missing": ["employment_proof"]}},
    {"name": "audit.append", "params": {"case_id": "UW-2026-00440", "note": "Employment proof missing; request additional documents."}}
  ]
}
```

Approval/denial must come only from deterministic uw.compute_scorecard + uw.apply_policy_rules.
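The shape of such a deterministic rulebook, as a toy. Every threshold and reason code here is invented for illustration; the point is the shape (reason codes, never free text), not the policy:

```python
def apply_policy_rules(features: dict, requested_amount_jpy: int) -> tuple:
    """Toy deterministic rulebook: decisions carry reason codes, never prose."""
    if features.get("employment_years") is None:
        # Missing grounds route to DEGRADE, not to a guess.
        return ("DEGRADE", ["MISSING:employment_years"])
    reasons = []
    if features["delinquency_flags"]["last_12m"] > 0:
        reasons.append("AA01_RECENT_DELINQUENCY")  # adverse-action reason code
    annual_income = max(features["monthly_income_jpy"] * 12, 1)
    dti = (features["existing_debt_jpy"] + requested_amount_jpy) / annual_income
    if dti > 0.5:  # invented threshold
        reasons.append("AA02_DEBT_TO_INCOME")
    return ("REJECT", reasons) if reasons else ("ACCEPT", [])
```

Because every denial carries deterministic reason codes, the adverse-action notice the LLM drafts can only reference them, never invent its own grounds.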


6) In real operations, DEGRADE matters more than REJECT

Across these examples, the most common failure mode is not rule violations—it’s missing grounds:

  • missing documents
  • missing data
  • uncertain state
  • missing policy prerequisites

So what you want the LLM to do is not “decide eligibility,” but:

  • identify missing information
  • draft evidence requests
  • propose next procedural steps

And what you want the verifier to do is:

detect missing grounds and stop safely (re-enterably).

If you can’t design DEGRADE first, the agent won’t be operable.
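What "stop safely, re-enterably" can look like concretely: a DEGRADE result that names the missing fields and carries the resume key, so the same request re-enters verification once the evidence arrives (shape and field names assumed):

```python
def degrade(missing_fields: list, request_id: str) -> dict:
    """Re-enterable DEGRADE: name what's missing and keep the resume key."""
    return {
        "verdict": "DEGRADE",
        "request_id": request_id,   # same request re-enters verification later
        "reason_codes": [f"MISSING:{f}" for f in missing_fields],
        "next_action": {"name": "access.request_more_info",
                        "params": {"fields": missing_fields}},
    }
```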


7) Common failure patterns (avoid these)

Avoiding these alone reduces accidents drastically:

  • executing free text (sending emails / creating tickets / updating DB rows)
  • fully delegating tool-call parameters to the LLM
  • treating unverified output as “success” (no durable logs)
  • only having REJECT paths, so humans must rescue everything via interpretation

Summary

  • LLMs are proposers. What businesses need is a deterministic verifier.
  • Close execution behind Typed Actions, and run only verified plans.
  • DEGRADE (deferral) design is as important as REJECT.
  • Even a minimal split—Propose → Verify → Execute—turns an “agent” into a reliable operational component.

Yes, I picked intentionally scary examples. But we’re also seeing more products trying to put AI at the core.

If you actually want probabilistic AI to touch the core of business operations, you’ll need “hard” domain design—responsibility boundaries, policies, and a serious rule-based domain compiler.

That’s a difficult and sometimes messy path—but also the kind of engineering that can be genuinely interesting.

Still… personally, I don’t want current LLMs anywhere near credit underwriting decisions.
