Dmitrii

Posted on Jun 7 • Originally published at quotyai.com

How to build AI agents in the next 6-12 months: determinism, schemas, interpreters, and rubrics

#agents #llm #ai #architecture

The models aren't the differentiator anymore. The runtime is.

I've spent the last year building an agentic AI platform. Voice calls, chatbots, sales agents, workflow automation — systems that run in production, talk to real customers, touch real data. A pattern keeps showing up that I don't see discussed much, probably because it isn't flattering to the usual narrative about AI progress.

The most reliable AI systems aren't the ones with the smartest models. They're the ones with the most deterministic runtimes underneath them.

§ 01 — What Coding Agents Actually Proved

Coding agents didn't take off because models crossed a capability threshold. LLMs were already capable of code generation in 2021. What changed was the runtime underneath: a compiler, a test runner, an interpreter that gives unambiguous feedback on every attempt.

The 2023 RAG wave had no equivalent. Retrieval + generation, no execution step, no correction signal. Every verification burden fell on the human. Coding agents moved that burden to the machine.

The insight: AI doesn't need to be certain. It needs a fast way to be wrong.

flowchart LR
  subgraph RAG["RAG · 2023 — dead end"]
    direction LR
    A[request] --> B[LLM] --> C[response] -.->|human must verify| X["❓ unknown"]
  end

  subgraph INT["INTERPRETER · 2025 — feedback loop"]
    direction LR
    D[request] --> E[LLM] --> F[code] --> G[runtime]
    G -->|"✓ pass"| H[done]
    G -->|error → retry| E
  end

When you put a deterministic execution layer under a probabilistic model, the model's uncertainty stops being the bottleneck. The runtime handles verification. The model keeps iterating.

This pattern generalises. Wherever you can attach a deterministic execution layer to an LLM, you convert guessing into a feedback loop. Coding was first because the execution layer already existed. The next wave is about building it deliberately for every other domain.
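A minimal sketch of that feedback loop, assuming the model is just a callable that receives the previous error (or nothing on the first try). The names `run_with_feedback`, `check_python`, and `fake_model` are illustrative, and `fake_model` is a hard-coded stand-in for a real LLM call.

```python
from typing import Callable, Optional, Tuple


def run_with_feedback(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Call generate until check passes, feeding each error into the next attempt."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(error)  # the model sees the previous error, if any
        error = check(candidate)     # deterministic verdict: None means pass
        if error is None:
            return candidate, attempt
    raise RuntimeError(f"no passing candidate after {max_attempts} attempts: {error}")


def check_python(source: str) -> Optional[str]:
    """Execute the candidate and run one assertion against it.

    exec() is fine for a sketch; real model output belongs in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def fake_model(previous_error: Optional[str]) -> str:
    """Stand-in for the LLM: buggy first, corrected once it has seen an error."""
    if previous_error is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


code, attempts = run_with_feedback(fake_model, check_python)
print(attempts)
```

The model never has to be right the first time. It only has to be told, cheaply and unambiguously, that it was wrong.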

§ 02 — The Interpreter Layer

LangChain's DeepAgents shipped QuickJS in-process: a JavaScript interpreter inside the agent harness. The AI now reasons through code as a first-class operation — not via round-trips to an external runner.

flowchart LR
  subgraph harness["AGENT HARNESS"]
    direction LR
    LLM["LLM\nprobabilistic reasoning"] <-->|generate / eval| QJS["QuickJS\ndeterministic execution"]
    QJS --> T1[tool composition]
    QJS --> T2[state management]
    QJS --> T3[context filtering]
  end

Tool composition, state management, context filtering, conditional orchestration: all deterministic, all inside the harness.

At QuotyAI, this is what moved our agent latency from ~2.5s average to 24ms fixed. Conditional logic (booking validation, conflict resolution, escalation rules) moved out of the LLM and into deterministic code.

Metric	Agent chain	Deterministic runtime
Average latency	~2,500ms	24ms
Variance	high	fixed
Debuggability	prompt hunting	stack traces

The LLM handles judgment. The runtime handles facts.
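To make that split concrete, here is a sketch of one rule that belongs in the runtime rather than the prompt: booking conflict detection. The `Booking` type and `find_conflict` function are hypothetical, not QuotyAI's actual code. The assumption is that the LLM has already turned the customer's message into a structured request.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Booking:
    start: datetime
    duration: timedelta

    @property
    def end(self) -> datetime:
        return self.start + self.duration


def find_conflict(requested: Booking, existing: List[Booking]) -> Optional[Booking]:
    """Return the first existing booking that overlaps the request, else None."""
    for booking in existing:
        # Half-open intervals [start, end) overlap iff each starts before the other ends.
        if requested.start < booking.end and booking.start < requested.end:
            return booking
    return None


calendar = [Booking(datetime(2025, 7, 1, 10, 0), timedelta(minutes=60))]

# Judgment (LLM): map "can I come in at half ten?" to a structured request.
# Facts (runtime): decide whether that request collides with the calendar.
overlapping = Booking(datetime(2025, 7, 1, 10, 30), timedelta(minutes=30))
back_to_back = Booking(datetime(2025, 7, 1, 11, 0), timedelta(minutes=30))

print(find_conflict(overlapping, calendar) is not None)
print(find_conflict(back_to_back, calendar) is None)
```

The same inputs always produce the same answer, and a wrong answer comes with a stack trace instead of a prompt to re-read.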

§ 03 — Agents Need to Know What "Done" Looks Like

Most agents have no concept of task completion. They generate a response and stop. Whether the task is actually finished is left to the caller.

Claude Code introduced /goal — you define a target upfront, the agent works toward it explicitly across iterations. LangChain went further with RubricMiddleware for DeepAgents.

The mechanic: a grader sub-agent (cheaper model, specific tools) evaluates output against a typed rubric before the run concludes. If any criterion fails, the grader injects per-criterion feedback — not "try again", but exactly which criterion failed and why — and the loop reruns.
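Stripped of any framework, the mechanic looks roughly like this. It is a sketch, not the RubricMiddleware implementation: `Criterion`, `grade`, and the two hard-coded attempts are illustrative stand-ins, and the checks are plain Python predicates instead of a grader model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Criterion:
    name: str
    check: Callable[[Callable], Optional[str]]  # failure reason, or None on pass


def grade(candidate: Callable, rubric: List[Criterion]) -> Dict[str, Optional[str]]:
    """Evaluate every criterion and return one verdict per criterion."""
    verdicts: Dict[str, Optional[str]] = {}
    for criterion in rubric:
        try:
            verdicts[criterion.name] = criterion.check(candidate)
        except Exception as exc:  # a crash fails the criterion, not the grader
            verdicts[criterion.name] = f"raised {type(exc).__name__}: {exc}"
    return verdicts


def keeps_order(fn: Callable) -> Optional[str]:
    return None if fn([3, 1, 3, 1, 2]) == [3, 1] else "not in first-appearance order"


def handles_unhashable(fn: Callable) -> Optional[str]:
    return None if fn([[1], [2], [1]]) == [[1]] else "wrong result for list elements"


RUBRIC = [Criterion("order", keeps_order), Criterion("unhashable", handles_unhashable)]


def attempt_1(lst):  # stand-in for the agent's first try: relies on hashing
    seen, dupes = set(), []
    for x in lst:
        if x in seen and x not in dupes:
            dupes.append(x)
        seen.add(x)
    return dupes


def attempt_2(lst):  # stand-in for the retry: equality only, no hashing
    seen, dupes = [], []
    for x in lst:
        if x in seen and x not in dupes:
            dupes.append(x)
        seen.append(x)
    return dupes


history = []
for attempt in (attempt_1, attempt_2):
    verdicts = grade(attempt, RUBRIC)
    failures = {name: why for name, why in verdicts.items() if why is not None}
    history.append(failures)  # in a real loop, failures go back into the agent's context
    if not failures:
        break

print(history)
```

The first attempt passes the ordering criterion and fails only the unhashable one, so the feedback names exactly one thing to fix.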

flowchart LR
  R[request] --> A

  subgraph middleware["RUBRIC MIDDLEWARE"]
    direction LR
    A["Agent\nclaude-sonnet-4-6"] --> G["Grader\nclaude-haiku-4-5"]
    G --> D{"all criteria\npassing?"}
    D -->|"no — per-criterion\nfeedback injected"| A
  end

  D -->|yes| Done["✓ done"]

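from deepagents import RubricMiddleware, create_deep_agent
from langchain_core.messages import HumanMessage

# run_test_suite and validate_schema are tools you define elsewhere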

rubric = RubricMiddleware(
    model="anthropic:claude-haiku-4-5",        # grader: fast + cheap
    system_prompt="Grade against rubric. Return per-criterion verdicts.",
    tools=[run_test_suite, validate_schema],    # grader can call tools
    max_iterations=5,
)

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",       # main agent: reasoning
    middleware=[rubric],
)

result = agent.invoke({
    "messages": [HumanMessage(content="Write find_duplicates(lst)")],
    "rubric": (
        "- All tests pass in run_test_suite\n"
        "- Handles unhashable types (lists, dicts)\n"
        "- Returns elements in order of first appearance\n"
    ),
})

Two things worth noting:

- The grader uses a different, cheaper model than the main agent. You're not paying for Sonnet to check if tests pass; Haiku does it with run_test_suite.
- Feedback is per-criterion, not generic. The agent doesn't get "try again". It gets "criterion 2 failed: crashes on unhashable input."

"Done" is now a schema, not a feeling. You write it once. The grader checks each criterion against it, and tool-backed criteria such as run_test_suite give a deterministic verdict rather than an opinion. When iteration 1 fails criterion 2, the agent retargets that criterion specifically.

§ 04 — Two-Phase Application Architecture

Most teams wire up an LLM, give it tools, add a system prompt, and ship. This works for demos. In production, the AI calls the wrong function, returns unexpected shapes, or makes decisions that were supposed to be deterministic.

Root cause: two distinct phases treated as one.

flowchart LR
  subgraph P1["PHASE 1 · DESIGN TIME (human)"]
    direction TB
    S1["typed inputs / outputs"]
    S2["tool contracts"]
    S3["allowed error codes"]
    S4["versioned, tested once"]
  end

  subgraph P2["PHASE 2 · RUNTIME (AI)"]
    direction TB
    E1["model reasons within contract"]
    E2["output validated against schema"]
    E3["mismatch → error + retry"]
    E4["deterministic fallback paths"]
  end

  P1 -->|"contract\n(immutable boundary)"| P2

The payoff is debuggability. When something breaks: was the contract wrong (Phase 1) or did the model make a bad decision within a correct contract (Phase 2)? Different bugs. Different fixes. Conflating the phases means debugging both at once.

At QuotyAI this is the workflow editor: trigger with typed JSON payload → condition block with explicit logic → action with defined output schema. The AI maps customer intent to that structure. It can't invent new structures. The contract is the boundary.
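A sketch of that boundary in code, under the assumption that the model returns a JSON action call. The action names, `ContractViolation`, and `parse_model_output` are hypothetical and are not QuotyAI's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet

# Phase 1 (design time, human): the closed set of actions and their exact fields.
ACTIONS: Dict[str, FrozenSet[str]] = {
    "send_sms": frozenset({"to", "body"}),
    "create_ticket": frozenset({"subject", "priority"}),
}


class ContractViolation(ValueError):
    """The model produced something outside the design-time contract."""


@dataclass(frozen=True)
class ActionCall:
    action: str
    params: Dict[str, Any]


def parse_model_output(raw: str) -> ActionCall:
    """Phase 2 (runtime, AI): accept model output only if it fits the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractViolation("expected a JSON object")
    action = data.get("action")
    if not isinstance(action, str) or action not in ACTIONS:
        raise ContractViolation(f"unknown action: {action!r}")
    params = data.get("params")
    if not isinstance(params, dict) or set(params) != ACTIONS[action]:
        raise ContractViolation(f"{action}: params must be exactly {sorted(ACTIONS[action])}")
    return ActionCall(action, params)


ok = parse_model_output('{"action": "send_sms", "params": {"to": "+15550100", "body": "Confirmed"}}')

try:
    parse_model_output('{"action": "refund_customer", "params": {"amount": 50}}')
    rejection = None
except ContractViolation as exc:
    rejection = str(exc)  # this message can be fed back to the model for a retry

print(ok.action, "|", rejection)
```

If `refund_customer` should have been allowed, that is a Phase 1 bug in `ACTIONS`. If it should not, the model made a Phase 2 mistake and the contract caught it.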

§ 05 — Contracts over Protocols

MCP is a reasonable answer to a real problem — no standard existed for connecting AI to external tools. For third-party tools (Slack, GitHub, Notion) it's useful.

For first-party application logic, it's the wrong layer.

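MCP standardises tool discovery and invocation. A tool does declare a JSON Schema for its input, but nothing requires that schema to be strict, and the protocol has no per-tool versioning or per-domain error codes. How tightly a tool's I/O is typed is left to whoever writes the server.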

A loosely typed MCP tool definition (schema fragment):

{
  "inputSchema": {
    "type": "object",
    "properties": {
      "data": { "type": "object" }
    }
  }
}

This is valid, and it enforces almost nothing: data can be any object. No error schema. No version. Anything goes in, anything comes out.

Typed skill contract:

{
  "name":    "create_booking",
  "version": "2.1.0",
  "input": {
    "customer_id": "string:uuid",
    "service_id":  "string:uuid",
    "slot":        "datetime:iso8601",
    "notes":       "string:optional"
  },
  "output": {
    "booking_id":    "string:uuid",
    "slot_confirmed": "datetime:iso8601"
  },
  "errors": ["SLOT_UNAVAILABLE", "CUSTOMER_NOT_FOUND", "SERVICE_DISABLED"]
}

The execution contract: AI emits structured output → code validates against schema → mismatch triggers error + retry → nothing ambiguous reaches business logic.
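A minimal sketch of the "code validates against schema" step for the `create_booking` input above, using only the standard library. The validator names and the error-message format are assumptions. A production system would more likely use a schema library.

```python
import uuid
from datetime import datetime
from typing import Any, Dict, List


def _is_uuid(value: Any) -> bool:
    if not isinstance(value, str):
        return False
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def _is_iso8601(value: Any) -> bool:
    # fromisoformat accepts only a subset of ISO 8601 before Python 3.11
    # (for example, a trailing "Z" is rejected on older versions).
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


# field name -> (validator, required)
CREATE_BOOKING_INPUT = {
    "customer_id": (_is_uuid, True),
    "service_id": (_is_uuid, True),
    "slot": (_is_iso8601, True),
    "notes": (lambda v: isinstance(v, str), False),
}


def validate_input(payload: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations; an empty list means the payload is valid."""
    errors: List[str] = []
    for field, (check, required) in CREATE_BOOKING_INPUT.items():
        if field not in payload:
            if required:
                errors.append(f"{field}: missing")
        elif not check(payload[field]):
            errors.append(f"{field}: invalid value {payload[field]!r}")
    for field in sorted(payload.keys() - CREATE_BOOKING_INPUT.keys()):
        errors.append(f"{field}: unexpected field")
    return errors


good = {
    "customer_id": "123e4567-e89b-12d3-a456-426614174000",
    "service_id": "123e4567-e89b-12d3-a456-426614174001",
    "slot": "2025-07-01T10:00:00+00:00",
}
bad = {"customer_id": "42", "slot": "tomorrow", "priority": "high"}

print(validate_input(good))
print(validate_input(bad))
```

The error list is what goes back to the model on retry. Business logic only ever sees payloads for which it is empty.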

MCP will grow for third-party connectivity. Custom contracts will win for first-party logic, because they're the thing that makes AI output auditable.

§ 06 — Model Routing Is Infrastructure

One model for every task is an accounting failure.

Task	Model	Latency	Cost/call
Intent classification	claude-haiku-4-5	~150ms	$0.001
Response generation	claude-sonnet-4-6	~1.5s	$0.015
Sensitive PII handling	llama-3-8b (self-hosted)	~400ms	$0

These are not the same constraint. Running Sonnet on every classification call is expensive and slow. Running Haiku on a complex reasoning task gives you confidently wrong answers.

MODEL_ROUTES = {
    "intent_classification": {
        "model": "claude-haiku-4-5",
        "max_tokens": 100,
        "latency_budget_ms": 200,
    },
    "response_generation": {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1000,
        "latency_budget_ms": 2000,
    },
    "sensitive_data": {
        "model": "llama-3-8b",     # self-hosted — no data leaves
        "max_tokens": 500,
        "private": True,
    },
}

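def call_agent(task_type: str, payload: dict):
    if task_type not in MODEL_ROUTES:
        raise ValueError(f"no route for task type: {task_type!r}")
    config = MODEL_ROUTES[task_type]
    # call_model is your own provider client wrapper (not shown here)
    return call_model(config["model"], payload, config)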

flowchart LR
  T[task] --> R["router\ntask_type → model config"]
  R --> M1["intent classification\nhaiku · 150ms · $0.001"]
  R --> M2["response generation\nsonnet · 1.5s · $0.015"]
  R --> M3["sensitive data\nself-hosted · 400ms · $0"]

This isn't a user-facing feature. It's a routing decision before any LLM call. Most teams don't do this because frameworks don't enforce it. The latency and cost gaps between tiers are too large to ignore at scale.
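A quick worked example of the cost gap, using the per-call prices from the table. The volume of one million classification calls is an assumed figure for illustration, not a measurement.

```python
from decimal import Decimal

# Per-call prices from the table above.
COST_PER_CALL = {
    "claude-haiku-4-5": Decimal("0.001"),
    "claude-sonnet-4-6": Decimal("0.015"),
}

CLASSIFICATION_CALLS = 1_000_000  # assumed volume

all_sonnet = COST_PER_CALL["claude-sonnet-4-6"] * CLASSIFICATION_CALLS
routed = COST_PER_CALL["claude-haiku-4-5"] * CLASSIFICATION_CALLS
savings = all_sonnet - routed

print(f"all Sonnet: ${int(all_sonnet):,}")
print(f"routed to Haiku: ${int(routed):,}")
print(f"saved: ${int(savings):,}")
```

At that volume the same classification workload costs $15,000 on Sonnet and $1,000 on Haiku, before counting the roughly tenfold latency difference per call.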

§ 07 — What Ships on This Foundation

The 2023 RAG wave built better search. The ceiling was "better lookup."

What gets built on a deterministic foundation is different in kind: the AI executes rather than just synthesising.

Schema-first agent frameworks. You'll define AI contracts the way you currently define database schemas — with validation, versioning, and migration tooling. The schema layer will be a CLI artifact, not a system prompt.

Rubric-based agent CI/CD. Before merging a change to your agent's tool set or model version, a rubric test suite runs. Green means the agent still satisfies its defined completion criteria. Same pattern as unit tests, applied to agent behavior.
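One way such a gate could look, sketched with plain functions. The two agent versions are hard-coded stand-ins for a real agent before and after a change, and `run_rubric_suite` is a hypothetical helper, not an existing tool.

```python
from typing import Callable, Dict, List

# Each case pairs an input with named predicates over the agent's output.
CASES: List[Dict] = [
    {"input": "I want a REFUND", "criteria": {"escalates": lambda out: out == "ESCALATE"}},
    {"input": "What are your hours?", "criteria": {"answers": lambda out: out == "ANSWER"}},
]


def run_rubric_suite(agent: Callable[[str], str], cases: List[Dict]) -> List[str]:
    """Run every case through the agent and collect the criteria that failed."""
    failures: List[str] = []
    for case in cases:
        output = agent(case["input"])
        for name, predicate in case["criteria"].items():
            if not predicate(output):
                failures.append(f"{case['input']!r}: failed {name}")
    return failures


def agent_v1(message: str) -> str:  # current behaviour
    return "ESCALATE" if "refund" in message.lower() else "ANSWER"


def agent_v2(message: str) -> str:  # candidate change that drops the lower() call
    return "ESCALATE" if "refund" in message else "ANSWER"


baseline_failures = run_rubric_suite(agent_v1, CASES)
candidate_failures = run_rubric_suite(agent_v2, CASES)
exit_code = 1 if candidate_failures else 0  # a CI job would exit with this

print(baseline_failures, candidate_failures, exit_code)
```

The candidate still answers most messages correctly, but the suite blocks the merge because one defined criterion no longer holds.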

Model routing as a managed layer. Task-type → model selection moves below the application layer entirely, the way DNS sits below HTTP. Your application won't know which model ran — only that the contract was satisfied.

The deeper point: the value isn't in the model. Models improve continuously and the best ones keep getting cheaper. The value is in the runtime that makes model output trustworthy enough to act on — typed contracts, deterministic execution, verifiable completion.

That runtime is being built right now. Mostly quietly. Mostly by people who got burned by the first wave.

I'm building QuotyAI — an agentic AI platform for voice, chat, and business automation. If you're working on deterministic agent infrastructure, I'd like to hear what you're seeing.