Agents don’t work. Verifiers do.
LLMs propose; deterministic systems decide what’s allowed to run.
Code agents succeed because software already has compilers, tests, linters, and CI: a "domain validator" you get for free.
(This is a personal analysis. I’m not trying to criticize any specific company or product.)
We’re seeing more examples of AI agents that “run well” in real work settings. But in most success stories, the secret isn’t a smarter model—it’s the surrounding guardrails.
In this article, guardrails means:
Not “trust the LLM output,” but a deterministic validator layer (and an operating process) that accepts/rejects proposed actions and makes execution auditable.
If you want agents to work outside software—legal, accounting, healthcare, ops, customer support—you need the same idea:
LLM + domain validator (policy engine / deterministic gate / “domain compiler equivalent”)
1) Why code-generation agents seem to work
Software is unusually friendly to agents because verifiability is built into the environment:
- A compiler tells you immediately if you’re wrong.
- Tests give you hard pass/fail feedback.
- Types, linters, and static analysis reject dangerous shapes.
- CI automates checks so humans don’t have to manually inspect everything.
- Diff reviews create a clean responsibility boundary at the end.
So even if the LLM output is merely “plausible,” deterministic systems keep pushing it toward something usable.
The model doesn’t have to be perfect. The environment is.
2) An LLM is a probabilistic proposer (not a correctness engine)
LLMs output a probability distribution. Outputs wobble because of sampling, tool calls, state, evolving prompts, model updates, and the world changing underneath you.
Even with deterministic decoding, the operational question is not:
“Will the model return the same text?”
It’s:
Can we decide—reproducibly—whether a proposed action is correct and safe, with auditable grounds?
A healthier mental model:
- LLM: generates proposals (plans, actions, explanations)
- Validator layer: deterministically enforces rules, invariants, and safety
- Executor: runs only approved, typed actions (preferably after dry-run)
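This split can be sketched as a small control loop. The sketch below is illustrative, not from any framework: `run_step`, `Action`, and the inline lambdas are hypothetical names, and the verdict set is simplified to accept/reject.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable

class Verdict(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"

@dataclass(frozen=True)
class Action:
    name: str               # e.g. "refund.create" -- must be a known, typed action
    params: dict[str, Any]

def run_step(
    propose: Callable[[], Action],          # LLM: probabilistic proposer
    validate: Callable[[Action], Verdict],  # deterministic rule layer
    execute: Callable[[Action], None],      # runs only approved actions
) -> Verdict:
    action = propose()
    verdict = validate(action)
    if verdict is Verdict.ACCEPT:
        execute(action)   # ideally dry-run first, then real execution
    return verdict

# Toy wiring: a proposer that suggests an oversized refund, and a
# validator that caps refunds at 5000 JPY.
proposal = Action("refund.create", {"amount_jpy": 9999})
verdict = run_step(
    propose=lambda: proposal,
    validate=lambda a: Verdict.ACCEPT
        if a.params.get("amount_jpy", 0) <= 5000 else Verdict.REJECT,
    execute=lambda a: None,
)
# The proposal is rejected deterministically; it never reaches the executor.
```

The point of the shape: the LLM only ever produces an `Action` value, and nothing executes unless the validator says so.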
3) Why agents break as soon as you leave software
Many business domains don’t have anything like a compile error.
- Legal: correctness depends on interpretation, assumptions, and risk tolerance.
- Healthcare: exceptions are common; the cost of errors is huge.
- Accounting: rules may be clear, but evidence and classification drift.
- Customer support: policies, PR risk, and PII constraints dominate.
When an LLM outputs something “reasonable,” you often can’t quickly determine:
- whether it’s wrong,
- where it’s wrong,
- which policy/evidence/state transition it violates.
So “review” turns into interpretation and responsibility negotiation—where cost and risk become hard to compute. That’s the moment agents stop being automation and become scaled guesswork.
4) What a “domain compiler equivalent” actually is
I don’t mean a literal compiler. I mean:
A deterministic layer that turns domain correctness and safety into checks.
A minimal “domain validator” usually has five parts:
1. Structured inputs (schema): reduce free text; decide what counts as admissible grounds.
2. Typed actions (not free-text execution): define allowed actions and parameter shapes.
3. Preconditions / postconditions: enforce state transitions and invariants.
4. Safe execution boundary (dry-run / sandbox / least privilege): separate proposal from execution.
5. Verification + audit trail: store reason codes, policy versions, evidence references, and replayable logs.
If you can’t do (2)–(4), you don’t have an agent—you have autocomplete with privileges.
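For part (5), one minimal sketch of an audit record is a JSONL line per verdict: reason codes, the policy version in force, and evidence references, so decisions can be replayed later. The `AuditRecord` shape here is an assumption, not a standard.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AuditRecord:
    ticket_id: str
    policy_version: str            # which rules were in force at decision time
    verdict: str                   # ACCEPT / REJECT / DEGRADE
    reason_codes: tuple[str, ...]  # machine-readable grounds
    evidence_refs: tuple[str, ...] # what the verdict was based on

def to_log_line(rec: AuditRecord) -> str:
    # One JSON object per line (JSONL) so verdicts can be grepped and replayed.
    return json.dumps(asdict(rec), sort_keys=True)

line = to_log_line(AuditRecord(
    ticket_id="T-2026-000123",
    policy_version="2026-01-15",
    verdict="REJECT",
    reason_codes=("refund_window_expired",),
    evidence_refs=("E-IMG-7812",),
))
```

Because the line carries the policy version, a later policy change doesn't make old verdicts unexplainable.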
5) A concrete example: refunds in customer support (proposal → verify → execute)
Refund handling is a classic task people want to “agentify.” The safe split looks like this:
- LLM: summarize, draft, propose a plan
- Validator: deterministically check eligibility, amount, policy, evidence
- Executor: run only approved actions (dry-run → approval → production)
5.1 Pin inputs into a schema (“what counts as grounds?”)
```yaml
ticket:
  ticket_id: "T-2026-000123"
  customer_id: "C-9182"
  order_id: "O-551923"
  requested_refund_amount_jpy: 3980
  reason_code: "DAMAGED_ON_ARRIVAL"
  received_at: "2026-02-09T10:12:00+09:00"
order:
  status: "DELIVERED"
  delivered_at: "2026-02-03T14:20:00+09:00"
  paid_amount_jpy: 3980
  already_refunded_amount_jpy: 0
  chargeback_flag: false
policy:
  policy_id: "cs-refund-policy"
  policy_version: "2026-01-15"
evidence:
  attachments:
    - kind: "photo"
      content_id: "E-IMG-7812"
```
5.2 Turn operations into typed actions
```yaml
actions:
  - name: "refund.create"
    params:
      order_id: "string"
      amount_jpy: "int"
      reason_code: "string"
      policy_version: "string"
      ticket_id: "string"
  - name: "email.send"
    params:
      to_customer_id: "string"
      template_id: "string"
      variables: "object"
      ticket_id: "string"
```
5.3 Deterministic validation (tiny core)
Return a verdict like:
- ACCEPT: safe to proceed (often after dry-run)
- REJECT: violates a rule (return reason codes)
- DEGRADE: missing evidence/state uncertainty → request more info
(The code below uses PEP 604 union types like `str | None`, so it requires Python 3.10+.)
Here’s a minimal Python “validator core” (stdlib only):
```python
from dataclasses import dataclass
from enum import Enum
from datetime import datetime

class Verdict(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    DEGRADE = "DEGRADE"

@dataclass(frozen=True)
class Result:
    verdict: Verdict
    reasons: tuple[str, ...] = ()

MAX_DAYS_FROM_DELIVERY = 14

def validate_refund(*, delivered_at: str | None, received_at: str,
                    remaining_jpy: int, amount_jpy: int,
                    chargeback_flag: bool, has_evidence: bool) -> Result:
    if chargeback_flag:
        return Result(Verdict.REJECT, ("chargeback_flagged",))
    if remaining_jpy <= 0:
        return Result(Verdict.REJECT, ("already_fully_refunded",))
    if delivered_at is None:
        return Result(Verdict.DEGRADE, ("delivery_date_unknown",))
    if not has_evidence:
        return Result(Verdict.DEGRADE, ("missing_evidence",))
    if amount_jpy <= 0:
        return Result(Verdict.REJECT, ("invalid_amount",))
    if amount_jpy > remaining_jpy:
        return Result(Verdict.REJECT, ("amount_exceeds_remaining",))
    d = datetime.fromisoformat(delivered_at)
    r = datetime.fromisoformat(received_at)
    # Check for contradictory timestamps before applying the refund window.
    if r < d:
        return Result(Verdict.DEGRADE, ("timestamp_contradiction",))
    if (r - d).days > MAX_DAYS_FROM_DELIVERY:
        return Result(Verdict.REJECT, ("refund_window_expired",))
    return Result(Verdict.ACCEPT)
```
Key operational point:
When the validator returns DEGRADE, the agent’s job is not “decide anyway.” It’s to ask for missing material and stop. That’s how you avoid automation-by-guessing.
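What "ask and stop" looks like in code: a sketch of the dispatch around the verdict, where the `REQUESTS` mapping and `next_step` are illustrative names, not a real API.

```python
from enum import Enum

class Verdict(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    DEGRADE = "DEGRADE"

# Map DEGRADE reason codes to the material the agent should request.
# (Illustrative mapping; real templates would live with the policy.)
REQUESTS = {
    "missing_evidence": "Please attach a photo of the damaged item.",
    "delivery_date_unknown": "Please confirm the delivery date.",
}

def next_step(verdict: Verdict, reasons: tuple[str, ...]) -> str:
    if verdict is Verdict.ACCEPT:
        return "execute_after_dry_run"
    if verdict is Verdict.REJECT:
        return "close_with_reason_codes"
    # DEGRADE: do not guess -- ask for what is missing, then stop.
    asks = [REQUESTS.get(r, f"clarify:{r}") for r in reasons]
    return "request_info: " + "; ".join(asks)

step = next_step(Verdict.DEGRADE, ("missing_evidence",))
# The agent's next move is an information request, not an execution.
```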
Full example: typed actions + normalization + reason codes

This is a longer, more complete version with action parsing/normalization and structured reject/degrade reasons.
```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum, unique
from typing import Any, Sequence

@unique
class VerdictLevel(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    DEGRADE = "DEGRADE"

@unique
class RejectReason(Enum):
    UNKNOWN_ACTION = "unknown_action"
    MALFORMED_ACTION_PARAMS = "malformed_action_params"
    CHARGEBACK_FLAGGED = "chargeback_flagged"
    ALREADY_FULLY_REFUNDED = "already_fully_refunded"
    REFUND_WINDOW_EXPIRED = "refund_window_expired"
    INVALID_REFUND_AMOUNT = "invalid_refund_amount"
    AMOUNT_EXCEEDS_REMAINING = "amount_exceeds_remaining"

@unique
class DegradeReason(Enum):
    MISSING_EVIDENCE = "missing_evidence"
    DELIVERY_DATE_UNKNOWN = "delivery_date_unknown"
    TIMESTAMP_CONTRADICTION = "timestamp_contradiction"

@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    customer_id: str
    order_id: str
    received_at: str

@dataclass(frozen=True)
class Order:
    delivered_at: str | None
    remaining_refundable_jpy: int
    chargeback_flag: bool

@dataclass(frozen=True)
class Policy:
    policy_version: str

@dataclass(frozen=True)
class Evidence:
    has_attachments: bool

@dataclass(frozen=True)
class RefundRequest:
    ticket: Ticket
    order: Order
    policy: Policy
    evidence: Evidence

@dataclass(frozen=True)
class RefundCreateAction:
    amount_jpy: int

@dataclass(frozen=True)
class EmailSendAction:
    template_id: str
    variables: dict[str, Any]

TypedAction = RefundCreateAction | EmailSendAction

@dataclass(frozen=True)
class Verdict:
    level: VerdictLevel
    reject_reasons: tuple[RejectReason, ...] = ()
    degrade_reasons: tuple[DegradeReason, ...] = ()
    normalized_plan: tuple[TypedAction, ...] = ()

MAX_DAYS_FROM_DELIVERY = 14

def _parse_iso8601(raw: str) -> datetime:
    return datetime.fromisoformat(raw)

def _parse_actions(
    raw_actions: Sequence[dict[str, Any]],
) -> tuple[list[TypedAction], list[RejectReason]]:
    parsed: list[TypedAction] = []
    rejects: list[RejectReason] = []
    for a in raw_actions:
        name = a.get("name")
        params = a.get("params")
        if not isinstance(params, dict):
            rejects.append(RejectReason.MALFORMED_ACTION_PARAMS)
            continue
        if name == "refund.create":
            try:
                parsed.append(RefundCreateAction(amount_jpy=int(params["amount_jpy"])))
            except Exception:
                rejects.append(RejectReason.MALFORMED_ACTION_PARAMS)
        elif name == "email.send":
            vars_ = params.get("variables", {})
            if not isinstance(vars_, dict):
                rejects.append(RejectReason.MALFORMED_ACTION_PARAMS)
                continue
            parsed.append(EmailSendAction(
                template_id=str(params.get("template_id", "refund_notice_default")),
                variables=vars_,
            ))
        else:
            rejects.append(RejectReason.UNKNOWN_ACTION)
    return parsed, rejects

def validate(doc: RefundRequest, proposed_actions: Sequence[dict[str, Any]]) -> Verdict:
    # Hard rejects (policy invariants)
    if doc.order.chargeback_flag:
        return Verdict(VerdictLevel.REJECT, reject_reasons=(RejectReason.CHARGEBACK_FLAGGED,))
    if doc.order.remaining_refundable_jpy <= 0:
        return Verdict(VerdictLevel.REJECT, reject_reasons=(RejectReason.ALREADY_FULLY_REFUNDED,))
    # Degrade if we can't justify
    if not doc.evidence.has_attachments:
        return Verdict(VerdictLevel.DEGRADE, degrade_reasons=(DegradeReason.MISSING_EVIDENCE,))
    if doc.order.delivered_at is None:
        return Verdict(VerdictLevel.DEGRADE, degrade_reasons=(DegradeReason.DELIVERY_DATE_UNKNOWN,))
    delivered_at = _parse_iso8601(doc.order.delivered_at)
    received_at = _parse_iso8601(doc.ticket.received_at)
    if received_at < delivered_at:
        return Verdict(VerdictLevel.DEGRADE, degrade_reasons=(DegradeReason.TIMESTAMP_CONTRADICTION,))
    if (received_at - delivered_at).days > MAX_DAYS_FROM_DELIVERY:
        return Verdict(VerdictLevel.REJECT, reject_reasons=(RejectReason.REFUND_WINDOW_EXPIRED,))
    # Parse + validate proposed plan
    parsed, parse_rejects = _parse_actions(proposed_actions)
    if parse_rejects:
        return Verdict(VerdictLevel.REJECT, reject_reasons=tuple(parse_rejects))
    normalized: list[TypedAction] = []
    rejects: list[RejectReason] = []
    for act in parsed:
        if isinstance(act, RefundCreateAction):
            if act.amount_jpy <= 0:
                rejects.append(RejectReason.INVALID_REFUND_AMOUNT)
                continue
            if act.amount_jpy > doc.order.remaining_refundable_jpy:
                rejects.append(RejectReason.AMOUNT_EXCEEDS_REMAINING)
                continue
            normalized.append(RefundCreateAction(amount_jpy=act.amount_jpy))
        else:
            normalized.append(act)
    if rejects:
        return Verdict(VerdictLevel.REJECT, reject_reasons=tuple(rejects))
    return Verdict(VerdictLevel.ACCEPT, normalized_plan=tuple(normalized))
```
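The last piece of the split is the executor boundary: dry-run first, production side effects only behind explicit approval. A minimal sketch, where `dry_run`, `execute`, and the `approved` flag are illustrative names rather than any framework's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RefundCreate:
    order_id: str
    amount_jpy: int

def dry_run(action: RefundCreate) -> str:
    # Show exactly what would happen; mutate nothing.
    return f"Would refund {action.amount_jpy} JPY on order {action.order_id}"

def execute(action: RefundCreate, *, approved: bool) -> str:
    # Production side effects run only behind an explicit approval flag.
    if not approved:
        raise PermissionError("refund.create requires explicit approval")
    return f"Refunded {action.amount_jpy} JPY on order {action.order_id}"

plan = RefundCreate(order_id="O-551923", amount_jpy=3980)
preview = dry_run(plan)                # shown to the approver first
result = execute(plan, approved=True)  # only after the preview is approved
```

Keeping proposal, preview, and execution as separate calls is what makes the approval step auditable instead of implicit.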
6) Will vendors provide “domain compilers”?
Vendors will absolutely provide useful building blocks (tooling, orchestration, guardrail frameworks). But a fully general “domain compiler” is structurally hard because:
- “Correctness” differs by company, industry, country, and time.
- The definition of correctness becomes the boundary of liability and accountability.
- Policies change often; policy changes themselves become audit targets.
- Exceptions and tacit rules are everywhere.
So in practice, you usually end up implementing policy-as-code + state checks + audit-grade traces locally.
7) How to build a minimum validator (before you “agentify” anything)
You don’t need a full formal system on day one. Start small:
1. Make prohibitions deterministic first: PII handling, permissions, forbidden actions, "never do X" rules.
2. Type the actions: replace "free-text execution" with Action(type, params) plus schemas.
3. Standardize dry-run: always show "what will happen" before execution; make approval explicit.
4. Create ~10 golden cases: 5 normal + 5 edge cases. Aim for reproducibility, not coverage.
This is the fastest path from “cool demo” to “operational component.”
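Golden cases can start as a plain table of (input, expected) pairs run as assertions. The allowlist rule and the cases below are illustrative; the point is a table you can rerun unchanged on every prompt, model, or policy update.

```python
# Step 1 + 4 combined: a deterministic prohibition (action allowlist)
# plus golden cases pinned as (input, expected) pairs.
ALLOWED_ACTIONS = {"email.send", "refund.create"}

def is_allowed(action_name: str) -> bool:
    # Unknown actions are rejected by default: allowlist, not blocklist.
    return action_name in ALLOWED_ACTIONS

GOLDEN_CASES = [
    ("email.send", True),         # normal
    ("refund.create", True),      # normal
    ("account.delete", False),    # prohibited
    ("refund.create_all", False), # edge: looks similar, not on the list
    ("", False),                  # edge: empty action name
]

failures = [(name, want) for name, want in GOLDEN_CASES
            if is_allowed(name) != want]
# An empty failures list on every run is the reproducibility signal.
```

When a golden case starts failing, either the policy changed (update the table, audit the change) or the agent regressed (block the rollout).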
Conclusion
Many “successful agent” stories are powered by something boring and powerful: deterministic verification.
- LLMs are great proposers.
- Agents become real when you add deterministic validators, typed actions, and audit trails.
- Code agents succeed because software already has compilers, tests, and CI.
- Outside software, if you can’t reject deterministically, you can’t automate safely.
Programming is a special environment: compilers, tests, and CI make proposals cheap to reject.
Agents don’t work. Verifiers do.