Stop doing improv theatre in production. Ship agents like software.
Agentic tooling is moving fast: CLIs that edit repositories, frameworks that orchestrate swarms, tool-calling APIs everywhere. And still, most teams that try to run “agents” in production hit the same wall:
- outputs drift between runs
- “structured output” breaks at the worst moment
- tool calls happen at the wrong time, with the wrong shape
- debugging turns into story-time (“it worked yesterday…”)
- trust collapses exactly when you need it most
The root cause isn’t that models aren’t smart enough.
It’s that we keep shipping non-contractual behavior.
This post argues a simple thesis:
Reliability in LLM systems doesn’t come from better prompts.
It comes from contracts and gates — with the system holding veto power.
FACET v2.0 is a compiler-grade, deterministic agent configuration language designed around that thesis: strict AST → type checking (FTS) → reactive compute (R-DAG) → deterministic context packing (Token Box Model) → canonical JSON render.
A short failure story: “theatre in production”
A team ships an “agentic PR bot”. It edits code, runs tests, and posts a confident summary.
One day the bot “fixes” an issue by adding a dependency. Tests pass locally. The PR merges.
In production, a transitive change triggers a locale/timezone edge case. A downstream service fails for a subset of users. Rollback takes hours because nobody can answer:
- Was the agent allowed to introduce new dependencies?
- Which tool calls did it run, with what arguments, in what order?
- Can we replay the run?
- What evidence exists beyond “agent said it’s fine”?
The bot didn’t “misbehave”. It acted exactly as designed: it operated without enforceable boundaries.
That’s the pattern: not “bad model”, but missing veto power.
Contracts and gates: the difference between a demo and a pipeline
Most agent stacks look like this:
Prompt + JSON hope → model writes → parse fails → retry culture → merge anyway
A contractual pipeline looks like this:
Contract → validate inputs + permissions → generate artifact → validate artifact → gates → commit (or reject)
Two key primitives make this real:
- Contracts: define what’s allowed and what “valid” means
- Gates: run reality checks (tests, security, perf) and block state changes
FACET makes both primitives first-class — not conventions, not best-effort prompts.
Part 1 — Contracts in FACET (real examples)
FACET v2.0 treats agent behavior as a compiled spec. That starts with strict structure and typing.
1) Tool contracts with @interface (typed tools, not “tool descriptions”)
In FACET, tools aren’t loose JSON blobs. They are typed interfaces that compile into provider tool schemas.
@interface WeatherAPI
  fn get_current(city: string) -> struct {
    temp: float
    condition: string
  }

@system
  tools: [$WeatherAPI]
This is a contract:
- the tool name exists
- args are typed (city: string)
- return shape is typed (struct { temp: float, condition: string })
- the compiler can emit canonical provider schemas during render
In practice, this eliminates a whole class of runtime failures: wrong arg names, wrong types, ambiguous “tool results”.
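To make the idea tangible outside FACET, here is a minimal Python sketch of what "a typed interface compiles into a tool schema" could look like. The `ToolFn` class, the `TYPE_MAP` table, and the emitted schema layout are illustrative assumptions, not the FACET compiler's actual output format.

```python
from dataclasses import dataclass

# Hypothetical mapping from FACET-style primitive types to JSON Schema types.
TYPE_MAP = {"string": "string", "float": "number", "int": "integer", "bool": "boolean"}
# Python types accepted for each declared type when checking a call.
PY_TYPES = {"string": str, "float": (int, float), "int": int, "bool": bool}


@dataclass(frozen=True)
class ToolFn:
    interface: str
    name: str
    params: dict[str, str]   # parameter name -> declared type
    returns: dict[str, str]  # field name -> declared type


def _object_schema(fields: dict[str, str]) -> dict:
    # Sort keys so the emitted schema is identical on every run.
    return {
        "type": "object",
        "properties": {k: {"type": TYPE_MAP[t]} for k, t in sorted(fields.items())},
        "required": sorted(fields),
        "additionalProperties": False,
    }


def to_tool_schema(fn: ToolFn) -> dict:
    """Emit a JSON-Schema-like tool description (assumed layout)."""
    return {
        "name": f"{fn.interface}.{fn.name}",
        "input_schema": _object_schema(fn.params),
        "output_schema": _object_schema(fn.returns),
    }


def check_call(fn: ToolFn, args: dict) -> None:
    """Reject wrong argument names or types before the call reaches the tool."""
    if set(args) != set(fn.params):
        raise TypeError(f"expected args {sorted(fn.params)}, got {sorted(args)}")
    for name, value in args.items():
        if not isinstance(value, PY_TYPES[fn.params[name]]):
            raise TypeError(f"{name}: expected {fn.params[name]}")


get_current = ToolFn(
    "WeatherAPI", "get_current",
    params={"city": "string"},
    returns={"temp": "float", "condition": "string"},
)
```

Calling `check_call(get_current, {"town": "Berlin"})` raises a `TypeError` instead of letting a misnamed argument reach the tool, which is the class of failure the contract is meant to remove.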
2) Inputs are explicit with @input (no hidden dependencies)
FACET forces you to declare runtime inputs in @vars via @input(...).
@vars
  user_query: @input(type="string")
  user_photo: @input(type="image", max_dim=1024)
This matters because:
- missing input is not “guess it” — it’s an error
- constraints (like image size) are enforced at runtime
- inputs become leaf nodes in the R-DAG (deterministic dependency graph)
This is fail-closed engineering: if data isn’t provided, the system does not hallucinate a substitute.
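A rough Python analogue of fail-closed input resolution might look like the sketch below. The `DECLARED` table mirrors the two `@input` declarations above; the error classes, the image descriptor with `width`/`height` keys, and the choice to reject (rather than resize) an oversized image are assumptions made for illustration.

```python
class MissingInputError(Exception):
    pass


class ConstraintError(Exception):
    pass


# Mirrors the @input declarations; the dict layout itself is hypothetical.
DECLARED = {
    "user_query": {"type": "string"},
    "user_photo": {"type": "image", "max_dim": 1024},
}


def resolve_inputs(declared: dict, provided: dict) -> dict:
    """Return validated inputs, or raise. Never substitutes a default."""
    undeclared = set(provided) - set(declared)
    if undeclared:
        raise ConstraintError(f"undeclared inputs: {sorted(undeclared)}")

    resolved = {}
    for name, spec in declared.items():
        if name not in provided:
            # Fail closed: a missing input is an error, not a cue to guess.
            raise MissingInputError(name)
        value = provided[name]
        if spec["type"] == "string" and not isinstance(value, str):
            raise ConstraintError(f"{name}: expected string")
        if spec["type"] == "image":
            # Assumed descriptor shape: {"width": int, "height": int, ...}
            if max(value["width"], value["height"]) > spec["max_dim"]:
                raise ConstraintError(f"{name}: exceeds max_dim={spec['max_dim']}")
        resolved[name] = value
    return resolved
```

The design choice worth noting is that there is no "else use a default" branch anywhere: every path either returns fully validated inputs or raises.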
3) Variables are reactive, deterministic, and immutable after compute (R-DAG)
FACET variables can depend on other variables. Evaluation happens via R-DAG in topological order; cycles and invalid orders are errors.
@vars
  raw_query: $user_query |> trim()
  query_lang: $raw_query |> detect_lang()
  normalized: $raw_query |> normalize(lang=$query_lang)
Key point: once computed, the variable map becomes immutable.
This makes runs reproducible and debuggable: the same inputs produce the same computed state (in Pure Mode).
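The evaluation model can be approximated in a few lines of Python using the standard library's `graphlib`. This is a sketch of the general technique (topological evaluation, cycle rejection, read-only result), not FACET's R-DAG implementation; the lambda bodies are toy stand-ins for the `detect_lang()` and `normalize()` lenses.

```python
from graphlib import CycleError, TopologicalSorter
from types import MappingProxyType


def evaluate(inputs: dict, definitions: dict):
    """definitions maps name -> (list of dependency names, function of those deps)."""
    graph = {name: set(deps) for name, (deps, _) in definitions.items()}
    values = dict(inputs)
    # static_order() raises CycleError if the dependency graph contains a cycle.
    for name in TopologicalSorter(graph).static_order():
        if name in values:
            continue  # a declared input: a leaf node of the graph
        if name not in definitions:
            raise KeyError(f"variable used but never defined: {name}")
        deps, fn = definitions[name]
        values[name] = fn(*(values[d] for d in deps))
    # Read-only view: any later assignment raises TypeError.
    return MappingProxyType(values)


definitions = {
    "raw_query": (["user_query"], str.strip),
    # Toy stand-ins for the detect_lang() and normalize() lenses.
    "query_lang": (["raw_query"], lambda q: "de" if "ß" in q else "en"),
    "normalized": (["raw_query", "query_lang"], lambda q, lang: q.casefold()),
}

state = evaluate({"user_query": "  What to wear today?  "}, definitions)
```

Because every function here is pure and the evaluation order is derived from the graph rather than from declaration order, re-running `evaluate` with the same inputs yields the same `state`.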
4) Lenses have trust levels (Pure / Bounded / Volatile)
FACET introduces trust levels for transformations (lenses):
- Level 0 — Pure: deterministic, no I/O
- Level 1 — Bounded external: allowed only with deterministic params, cacheable
- Level 2 — Volatile: nondeterministic, only in Execution Mode
A pipeline makes the contract explicit:
@vars
  summary: $normalized
    |> summarize(model="gpt-5.2", temperature=0) # Level 1 (bounded)
    |> to_markdown() # Level 0 (pure)
This is where “determinism is a property of the system” becomes concrete.
In Pure Mode, you simply cannot smuggle volatility in “because it felt right”.
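One way to picture the enforcement is a static check over the pipeline before anything runs. The sketch below is an assumption-heavy illustration: the `LENS_LEVELS` registry, the `web_search` lens name, and the rule that "deterministic params" means `temperature == 0` are all hypothetical, chosen only to show the shape of the check.

```python
from enum import IntEnum


class Trust(IntEnum):
    PURE = 0      # deterministic, no I/O
    BOUNDED = 1   # external, deterministic params, cacheable
    VOLATILE = 2  # nondeterministic


class TrustViolation(Exception):
    pass


# Hypothetical registry of lenses and their trust levels.
LENS_LEVELS = {
    "trim": Trust.PURE,
    "to_markdown": Trust.PURE,
    "summarize": Trust.BOUNDED,
    "web_search": Trust.VOLATILE,  # invented lens name, for illustration
}


def check_pipeline(steps, mode: str) -> None:
    """steps: list of (lens name, params dict). mode: 'pure' or 'execution'."""
    for name, params in steps:
        level = LENS_LEVELS[name]
        if level == Trust.VOLATILE and mode != "execution":
            raise TrustViolation(f"{name} is volatile and mode is {mode!r}")
        # Assumed rule: a bounded lens must not sample (temperature must be 0).
        if level == Trust.BOUNDED and params.get("temperature", 0) != 0:
            raise TrustViolation(f"{name} needs deterministic params")


# The pipeline from the example above passes in pure mode.
check_pipeline(
    [("summarize", {"model": "gpt-5.2", "temperature": 0}), ("to_markdown", {})],
    mode="pure",
)
```

The point of checking before execution is that a violation is a build-time rejection, not a surprise discovered in logs after the run.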
Part 2 — Gates in FACET (not vibes, executable checks)
A contract without gates is still fragile. Gates give the system the right to say: no.
FACET v2.0 includes a first-class testing system via @test.
5) Tests as executable gates with mocks and assertions (@test)
@test "basic greeting"
  vars:
    username: "TestUser"
  mock:
    WeatherAPI.get_current: { temp: 10, condition: "Rain" }
  assert:
    - output contains "umbrella"
    - cost < 0.01
This is CI thinking applied to agent specs:
- tests execute the full 5-phase pipeline
- tools can be mocked (deterministic runs)
- assertions can check output and telemetry
In other words: “agent done” is not a feeling — it’s passing checks.
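For readers who think in test harnesses, the same idea can be sketched in Python: inject a mocked tool, run the agent, and evaluate assertions over both output and telemetry. The `toy_agent` function and the result dictionary with `output` and `cost` keys are hypothetical stand-ins, not how `fct test` is implemented.

```python
def run_test(agent, mocks: dict, assertions) -> dict:
    """Run the agent against mocked tools and collect failed assertion labels."""
    result = agent(mocks)
    failures = [label for label, check in assertions if not check(result)]
    return {"pass": not failures, "failures": failures}


def toy_agent(tools: dict) -> dict:
    # Hypothetical agent: one tool call, one deterministic decision.
    weather = tools["WeatherAPI.get_current"]("Berlin")
    advice = "Take an umbrella." if weather["condition"] == "Rain" else "Sunglasses will do."
    return {"output": advice, "cost": 0.002}


# The mock replaces the real tool with a fixed, typed response.
mocks = {"WeatherAPI.get_current": lambda city: {"temp": 10, "condition": "Rain"}}

assertions = [
    ('output contains "umbrella"', lambda r: "umbrella" in r["output"]),
    ("cost < 0.01", lambda r: r["cost"] < 0.01),
]

report = run_test(toy_agent, mocks, assertions)
```

A CI job can then block the merge whenever `report["pass"]` is false, which is what turns the test from documentation into a gate.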
Part 3 — Deterministic context packing (Token Box Model) is a gate too
Even when contracts and tests exist, real systems fail because context is managed ad hoc. Prompts overflow, critical instructions get truncated, and the model “drifts” because the context layout changed.
FACET treats context like layout, not like concatenated strings.
6) Token Box Model: deterministic allocation + critical overflow as a hard failure
The model is simple:
- your prompt is a set of sections (@system, @user, history, docs, etc.)
- each section has min/grow/shrink/priority
- critical sections are those with shrink == 0 and must never be dropped or compressed
If critical sections can’t fit, FACET raises a hard error (critical overflow).
This is a gate: the system refuses to ship an invalid prompt.
That single decision kills an entire class of “mysterious agent regressions” caused by silent truncation.
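The following Python sketch shows one simplified way such an allocator could behave. It is not FACET's algorithm: it ignores `grow`, treats any `shrink > 0` section as fully compressible down to `min`, and assumes a larger `priority` number means "give up space first". Those rules are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


class CriticalOverflow(Exception):
    pass


@dataclass(frozen=True)
class Section:
    name: str
    size: int       # tokens the section wants
    min: int        # smallest useful size before it is dropped
    shrink: float   # 0 marks the section as critical
    priority: int   # assumed: larger number = shrunk first


def allocate(sections, budget: int) -> dict:
    critical = [s for s in sections if s.shrink == 0]
    flexible = [s for s in sections if s.shrink > 0]
    needed = sum(s.size for s in critical)
    if needed > budget:
        # Hard failure: never truncate a critical section silently.
        raise CriticalOverflow(f"critical sections need {needed}, budget is {budget}")

    alloc = {s.name: s.size for s in sections}
    overflow = sum(alloc.values()) - budget
    # Fixed order (priority, then name) keeps the result deterministic.
    order = sorted(flexible, key=lambda s: (-s.priority, s.name))
    for s in order:  # pass 1: compress down to min
        if overflow <= 0:
            break
        cut = min(overflow, s.size - s.min)
        alloc[s.name] -= cut
        overflow -= cut
    for s in order:  # pass 2: drop whole sections if still over budget
        if overflow <= 0:
            break
        overflow -= alloc[s.name]
        alloc[s.name] = 0
    return alloc


sections = [
    Section("system", size=300, min=300, shrink=0, priority=0),
    Section("user", size=200, min=200, shrink=0, priority=0),
    Section("docs", size=600, min=100, shrink=1.0, priority=1),
    Section("history", size=900, min=200, shrink=1.0, priority=2),
]
plan = allocate(sections, budget=1500)
```

With a budget of 1500, only `history` is compressed (from 900 to 400) and the critical sections are untouched; with a budget of 400, `allocate` raises `CriticalOverflow` instead of producing a prompt.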
Part 4 — What “enforced before generation” actually means (no magic)
This phrase can sound controversial, so here’s the precise version:
FACET enforces a double barrier:
- Before action (pre-check): validate inputs, tool interfaces, allowed operations, budgets, deterministic mode constraints
- Before state change (post-check): validate produced artifacts, run gates, reject if any invariant breaks
So the flow is:
validate → generate → validate → gate → commit
This is how compilers and CI pipelines behave.
Production agent systems should do the same.
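As a sketch of that flow in ordinary code, the function below wires the two barriers around a generation step and only calls `commit` when nothing has objected. All names (`run_with_veto`, `Rejected`, the `added_dependencies` field, the `POLICY` dictionary) are hypothetical; the example reuses the PR-bot story from earlier, where the missing rule was "no new dependencies".

```python
class Rejected(Exception):
    def __init__(self, stage: str, reasons: list):
        super().__init__(f"{stage}: {reasons}")
        self.stage, self.reasons = stage, reasons


def run_with_veto(request, pre_check, generate, post_check, gates, commit):
    """validate -> generate -> validate -> gate -> commit; any failure vetoes."""
    reasons = pre_check(request)
    if reasons:
        raise Rejected("pre-check", reasons)

    artifact = generate(request)

    reasons = post_check(request, artifact)
    if reasons:
        raise Rejected("post-check", reasons)

    failed = [name for name, gate in gates if not gate(artifact)]
    if failed:
        raise Rejected("gates", failed)

    commit(artifact)  # the only place state changes
    return artifact


# Hypothetical policy for the PR-bot scenario.
POLICY = {"allow_new_dependencies": False}


def pre_check(request):
    return [] if request.get("task") else ["missing task"]


def post_check(request, artifact):
    if artifact["added_dependencies"] and not POLICY["allow_new_dependencies"]:
        return ["new dependencies are not permitted"]
    return []
```

If the generated patch lists any `added_dependencies`, the run stops at the post-check and `commit` is never invoked, so the question "was the agent allowed to do that?" is answered by the pipeline rather than by a post-mortem.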
Part 5 — A small, concrete canonical output artifact
FACET’s final output is a canonical JSON structure (before provider-specific transformations). Here’s a simplified “what your orchestration layer can log and replay” shape:
{
  "meta": {"profile": "hypervisor", "mode": "pure"},
  "tools": [
    {"name": "WeatherAPI.get_current", "input_schema": {"city": "string"}}
  ],
  "sections_order": ["system", "tools", "history", "user"],
  "user": {"query": "what to wear today in Berlin?"},
  "gates": [
    {"gate": "tests_green", "pass": true},
    {"gate": "critical_overflow", "pass": true}
  ]
}
Notice the difference vs typical systems:
- there is an explicit mode
- tools are typed
- section order is deterministic
- gates and outcomes are visible
- this is loggable and replayable
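"Loggable and replayable" gets practical once the artifact has a stable byte representation. The sketch below serializes a document with sorted keys and fixed separators, then hashes it, so two runs can be compared by digest. This is a common approach with Python's standard library; FACET's own canonicalization rules may differ, and the `canonical_bytes` and `digest` helper names are made up for this example.

```python
import hashlib
import json


def canonical_bytes(doc: dict) -> bytes:
    # Sorted keys and fixed separators remove incidental formatting differences.
    # Array order is left alone, so "sections_order" keeps its meaning.
    text = json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def digest(doc: dict) -> str:
    return hashlib.sha256(canonical_bytes(doc)).hexdigest()


run_a = {
    "meta": {"profile": "hypervisor", "mode": "pure"},
    "sections_order": ["system", "tools", "history", "user"],
    "user": {"query": "what to wear today in Berlin?"},
}
# Same content, different key insertion order.
run_b = {
    "user": {"query": "what to wear today in Berlin?"},
    "sections_order": ["system", "tools", "history", "user"],
    "meta": {"mode": "pure", "profile": "hypervisor"},
}

same = digest(run_a) == digest(run_b)
```

Storing the digest next to each run gives a cheap drift detector: if the inputs are identical and the digest changes, something in the pipeline is no longer deterministic.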
Part 6 — Tooling matters: the reference CLI (fct) makes this operational
FACET isn’t only a philosophy; it specifies tooling expectations. A reference CLI (fct) is part of the standard:
- fct build file.facet: resolution + type checking
- fct run file.facet --input input.json: full 5-phase pipeline → canonical JSON
- fct test file.facet: execute @test blocks, report failures + telemetry
- fct inspect ...: introspect AST/R-DAG/context allocation (debuggability)
When the language includes these operations, teams stop inventing bespoke glue.
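A thin CI wrapper around those commands might look like the Python sketch below. It uses only the `fct build` and `fct test` invocations listed above and assumes, without confirmation from the standard, that `fct` signals failure through a non-zero exit code. The `run_cli_gates` helper and the injectable `run` parameter are illustrative.

```python
import subprocess


def run_cli_gates(spec_path: str, run=subprocess.run) -> list:
    """Run build then test; stop at the first failing gate.

    Assumption: fct exits non-zero when a step fails.
    `run` is injectable so the wrapper can be exercised without fct installed.
    """
    commands = [
        ["fct", "build", spec_path],
        ["fct", "test", spec_path],
    ]
    results = []
    for cmd in commands:
        completed = run(cmd, capture_output=True, text=True)
        passed = completed.returncode == 0
        results.append({"gate": " ".join(cmd[:2]), "pass": passed})
        if not passed:
            break  # later gates are meaningless once an earlier one fails
    return results
```

The returned list has the same shape as the `gates` array in the canonical output example, so the orchestration layer can log CLI gates and in-pipeline gates side by side.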
Closing: stop shipping theatre — ship standards
LLMs are powerful components — but without enforceable boundaries they introduce entropy at the exact moment correctness, security, and reliability matter most.
Contracts + gates aren’t bureaucracy.
They’re the difference between a cool demo and a shippable system.
FACET’s core bet is simple:
Treat agent behavior like compiled software:
parse, type-check, compute deterministically, pack context deterministically, render canonical JSON — and never commit state unless gates pass.
Repositories
- FACET Compiler: https://github.com/rokoss21/facet-compiler
- FACET Standard: https://github.com/rokoss21/facet-standard