Hector Flores

Posted on May 25 • Originally published at htek.dev

Per-Turn Evaluation: Dynamic Governance for AI Agents

#aiagents #agenticdevelopment #platformengineering #security

Static governance is fine right up until your agent changes modes mid-session. The same agent can spend turn 1 researching docs, turn 8 editing code, turn 14 fixing failed tests, and turn 20 preparing a production deploy. Pretending one startup-time config should govern all of that is the harness equivalent of hardcoding production policy into a shell alias.

That is why I'm increasingly convinced that per-turn evaluation needs to be a first-class primitive in agentic systems. If you're serious about governed autonomy, you need the harness to ask a fresh question at the start of every turn: given the current state, which rules should be active right now?

This is a core idea behind what I call Harness as Code. In AI Harness, per-turn artifact evaluation ships as a runtime feature in v0.4.0, following the design described in issue #7.

Why Static Rules Aren't Enough

Static governance overloads context and under-governs risk. Per-turn evaluation adapts rules to what the agent is actually doing right now.

A lot of agent stacks still treat governance as startup configuration: load the prompt, register the tools, inject the rules, and go. That works for short-lived demos. It gets shaky fast in long-running or multi-phase sessions.

There are three failure modes I keep seeing:

You over-load the context window. Every possible rule ships on every turn, even when most of them are irrelevant.
You under-govern risky phases. The same loose rules that were fine during research stay active during writes, approvals, or deployment.
You mix policy with prompt hacks. Instead of the harness making deterministic decisions, the model gets a giant wall of “if you're doing X, remember Y.”

That last part is the killer. Once governance lives primarily inside prompts, you lose clean separation between policy and reasoning. You also lose the discipline that policy systems like Open Policy Agent were built around: keep decisions declarative, versioned, and evaluated against current input.

The better analogy is feature toggles. Martin Fowler's framing still holds: you define behavior once, then let runtime context determine which path is active. AI agents need the same pattern for governance.

What Per-Turn Evaluation Actually Means

The evaluation loop runs at every turn boundary — gathering state, evaluating conditions, and composing only active governance artifacts.

Per-turn evaluation is simple in concept:

The agent loop starts a new turn.
The harness gathers live state for that turn.
Every conditional artifact is re-evaluated against that state.
Only the active artifacts participate in context composition.

The crucial shift is this: governance becomes a function of state, not a snapshot captured at startup.

That state can include things like:

turn: 14
mode: "implementation"
active_files: ["artifact/composer.go"]
error_count: 1
tools_called: ["read_file", "edit_file", "run_tests"]

Now your harness can make deterministic decisions such as:

enable stricter review guidance after repeated failures
load Go-specific conventions only when Go files are active
apply extra deployment guardrails only when the agent enters a production path
switch to more concise context after a long session to protect the token budget

That is dramatically more precise than one giant prompt trying to anticipate every future branch of execution.
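To make the loop and the four decisions above concrete, here is a minimal sketch that treats governance as a pure function of turn state. Everything in it is an assumption for illustration: the TurnState struct, the Rule type, the artifact names, and the thresholds (two failures, thirty turns) are hypothetical and are not AI Harness's actual types or defaults.

```go
package main

import (
	"fmt"
	"strings"
)

// TurnState mirrors the example state fields above.
// Hypothetical type, not AI Harness's real one.
type TurnState struct {
	Turn        int
	Mode        string
	ActiveFiles []string
	ErrorCount  int
}

// Rule pairs a hypothetical artifact name with a pure predicate over state.
type Rule struct {
	Name string
	When func(TurnState) bool
}

// anyHasSuffix reports whether any file path ends with ext.
func anyHasSuffix(files []string, ext string) bool {
	for _, f := range files {
		if strings.HasSuffix(f, ext) {
			return true
		}
	}
	return false
}

// rules encodes the four decisions listed above. Thresholds are assumptions.
var rules = []Rule{
	{"strict-review", func(s TurnState) bool { return s.ErrorCount >= 2 }},
	{"go-conventions", func(s TurnState) bool { return anyHasSuffix(s.ActiveFiles, ".go") }},
	{"production-guard", func(s TurnState) bool { return s.Mode == "production" }},
	{"concise-mode", func(s TurnState) bool { return s.Turn > 30 }},
}

// ActiveRules re-evaluates every rule against the current state and
// returns the names of the rules that apply, in declaration order.
func ActiveRules(s TurnState) []string {
	var active []string
	for _, r := range rules {
		if r.When(s) {
			active = append(active, r.Name)
		}
	}
	return active
}

func main() {
	s := TurnState{Turn: 14, Mode: "implementation",
		ActiveFiles: []string{"artifact/composer.go"}, ErrorCount: 1}
	fmt.Println(ActiveRules(s)) // [go-conventions]

	// A later turn: the session has moved to production after a second failure.
	s.Turn, s.Mode, s.ErrorCount = 20, "production", 2
	fmt.Println(ActiveRules(s)) // [strict-review go-conventions production-guard]
}
```

Because each predicate only reads the state it is given, the same state always yields the same active set, which is the property that makes the decision deterministic rather than prompt-dependent.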

Why Starlark Fits This Problem

For conditional governance, I want something declarative and constrained. Starlark's specification and Bazel's language overview show why it is a strong fit for this kind of work: it is Python-like enough to read quickly, intentionally restricted, deterministic, and designed for predictable evaluation with a strong bias toward immutability.

That matters. A governance condition language should not be a hidden side-effect engine. It should evaluate expressions against input and return a decision.

Here's the kind of artifact-level condition AI Harness supports:

---
name: production-guard
type: override
priority: 100
condition: 'ctx.get("mode") == "production" and ctx.get("turn", 0) > 5'
---
# Production Guard

Require post-write verification.
Block destructive shortcuts.
Confirm the target before deployment actions.

I like this model because the rule is local to the artifact. You don't have to open the runtime and add another hardcoded if statement. You define the condition where the behavior lives.
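As a rough illustration of how an artifact like the one above could be loaded, the sketch below splits the frontmatter from the body and pulls out the four fields using only the standard library. The Artifact struct and ParseArtifact function are hypothetical names, and the parser handles only flat key: value lines rather than full YAML; I am not claiming this is how AI Harness reads its artifacts.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Artifact holds the frontmatter fields from the example above.
// Hypothetical type for illustration only.
type Artifact struct {
	Name      string
	Type      string
	Priority  int
	Condition string
	Body      string
}

// ParseArtifact splits src into "---"-delimited frontmatter and a body.
// It understands flat "key: value" lines only, not full YAML.
func ParseArtifact(src string) (Artifact, error) {
	var a Artifact
	parts := strings.SplitN(src, "---\n", 3)
	if len(parts) != 3 || strings.TrimSpace(parts[0]) != "" {
		return a, fmt.Errorf("missing frontmatter delimiters")
	}
	for _, line := range strings.Split(parts[1], "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue // skip blank or malformed lines
		}
		// Drop surrounding whitespace and single quotes around the value.
		val = strings.Trim(strings.TrimSpace(val), "'")
		switch strings.TrimSpace(key) {
		case "name":
			a.Name = val
		case "type":
			a.Type = val
		case "priority":
			n, err := strconv.Atoi(val)
			if err != nil {
				return a, fmt.Errorf("priority: %w", err)
			}
			a.Priority = n
		case "condition":
			a.Condition = val
		}
	}
	a.Body = strings.TrimSpace(parts[2])
	return a, nil
}

const sample = `---
name: production-guard
type: override
priority: 100
condition: 'ctx.get("mode") == "production" and ctx.get("turn", 0) > 5'
---
# Production Guard

Require post-write verification.
`

func main() {
	a, err := ParseArtifact(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(a.Name, a.Type, a.Priority)
	fmt.Println(a.Condition)
}
```

The condition stays an opaque string at this stage. Evaluating it is a separate step, which keeps parsing and policy evaluation independently testable.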

How AI Harness Implements Dynamic Governance

In AI Harness's current v0.4.0 implementation, per-turn evaluation is not a blog concept. It's wired into the runtime.

At the start of Agent.Run, the loop creates a turn-scoped scratchpad, increments the turn counter, and records the current turn number in that scratchpad before the model does any work.

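// Advance the turn counter and publish it to the turn-scoped state.
a.turnNumber++
scripting.SetTurnState(turnCtx, "turn", a.turnNumber)

// Re-evaluate artifact conditions against the fresh state. A failure is
// logged as a warning rather than aborting the turn.
if a.composer != nil {
    if err := a.composer.EvaluateConditions(turnCtx); err != nil {
        a.logger.Printf("WARN condition re-evaluation failed: %v", err)
    }
}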

That call flows into Composer.EvaluateConditions, which reads the live values from the per-turn scratchpad via TurnStateValues and evaluates each artifact condition against the current turn context.

The registry then updates each artifact's Active field through Registry.UpdateConditions. That's an important implementation detail, because it makes activation status part of the artifact model itself rather than an external side table.

func (r *Registry) UpdateConditions(evalFn func(condition string) (bool, error)) error {
    // re-evaluates every artifact and updates Active in place
}

Even better, the failure mode is sane: per-artifact condition errors are non-fatal. If one expression is malformed, the whole session does not implode. The registry keeps evaluating the rest and preserves the prior active state for the broken artifact. That's the kind of degradation behavior you want in a production harness.
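Here is one way the non-fatal behavior described above could look, as a sketch rather than the actual AI Harness source. The method signature matches the one quoted earlier. The Artifact and Registry fields, the choice to skip artifacts with no condition, and the stubEval helper are my assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Artifact and Registry are simplified stand-ins, not the real types.
type Artifact struct {
	Name      string
	Condition string
	Active    bool
}

type Registry struct {
	artifacts []*Artifact
}

// UpdateConditions re-evaluates every conditional artifact in place.
// When evalFn fails for one artifact, that artifact keeps its prior
// Active value, the error is collected, and evaluation continues.
func (r *Registry) UpdateConditions(evalFn func(condition string) (bool, error)) error {
	var errs []error
	for _, a := range r.artifacts {
		if a.Condition == "" {
			continue // assumption: unconditional artifacts are left untouched
		}
		active, err := evalFn(a.Condition)
		if err != nil {
			errs = append(errs, fmt.Errorf("artifact %q: %w", a.Name, err))
			continue // preserve the previous Active state
		}
		a.Active = active
	}
	return errors.Join(errs...) // nil when no condition failed
}

// stubEval stands in for a real condition evaluator such as a Starlark
// interpreter. It only recognizes two literal strings.
func stubEval(condition string) (bool, error) {
	switch condition {
	case "always":
		return true, nil
	case "never":
		return false, nil
	default:
		return false, errors.New("unsupported expression")
	}
}

func main() {
	r := &Registry{artifacts: []*Artifact{
		{Name: "go-conventions", Condition: "always", Active: false},
		{Name: "legacy-rule", Condition: "mode ==", Active: true},
		{Name: "production-guard", Condition: "never", Active: true},
	}}
	if err := r.UpdateConditions(stubEval); err != nil {
		fmt.Println("WARN condition re-evaluation failed:", err)
	}
	for _, a := range r.artifacts {
		fmt.Printf("%s active=%t\n", a.Name, a.Active)
	}
}
```

Returning the joined error lets the caller log a single warning per turn, as in the Agent.Run snippet, while every healthy artifact still gets its updated state.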

This design also plays nicely with AI Harness's typed artifact model and composition order:

override (100) > harness (80) > builtin (60) > plugin (40) > model (20)

So the runtime isn't just deciding what is active each turn. It's also deciding how active artifacts compose when they conflict.
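A small sketch of that ordering, under the assumption that "composes" means active artifacts are sorted by their type's priority, highest first, with ties keeping their original order. The priority values come from the list above. The Artifact struct and Compose function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// typePriority uses the composition order quoted above.
var typePriority = map[string]int{
	"override": 100,
	"harness":  80,
	"builtin":  60,
	"plugin":   40,
	"model":    20,
}

// Artifact is a simplified stand-in for the real artifact model.
type Artifact struct {
	Name   string
	Type   string
	Active bool
}

// Compose drops inactive artifacts and orders the rest by type priority,
// highest first. The stable sort keeps input order within a type.
func Compose(all []Artifact) []Artifact {
	var active []Artifact
	for _, a := range all {
		if a.Active {
			active = append(active, a)
		}
	}
	sort.SliceStable(active, func(i, j int) bool {
		return typePriority[active[i].Type] > typePriority[active[j].Type]
	})
	return active
}

func main() {
	all := []Artifact{
		{Name: "model-defaults", Type: "model", Active: true},
		{Name: "go-conventions", Type: "plugin", Active: true},
		{Name: "production-guard", Type: "override", Active: true},
		{Name: "concise-mode", Type: "harness", Active: false},
	}
	for _, a := range Compose(all) {
		fmt.Printf("%-8s %s\n", a.Type, a.Name)
	}
}
```

Filtering before sorting matters here: an inactive override should have no influence on how the remaining artifacts are ordered.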

Patterns This Unlocks

Per-turn evaluation makes four powerful governance patterns practical: escalation after failures, lazy-loading conventions, risk-proportional controls, and automatic concise mode.

Once governance is evaluated per-turn, a bunch of useful patterns stop being awkward.

Progressive escalation

After repeated failures, activate a recovery artifact that tells the agent to stop retrying blindly and explain what changed.

Phase-aware context

Load language or workflow conventions only when the agent is actually operating in that phase. That is the governance equivalent of lazy loading.

Risk-proportional controls

Keep early research turns lightweight, then tighten verification and approval rules when the session crosses into write-heavy or deployment-heavy work.

Token-aware governance

Long sessions can activate concise-mode artifacts that reduce exploration and prioritize completion before the context window gets messy.

This is also why per-turn evaluation pairs so well with context observability. If you are going to make governance dynamic, you need to be able to inspect which artifacts were active, which were inactive, and why.
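One way to get that visibility is to record, for every conditional artifact on every turn, whether it was active and why. The sketch below is an assumption about what such a record could contain. The Conditional and Activation types, the Explain function, and the stubEval helper are all hypothetical and are not part of AI Harness's API.

```go
package main

import (
	"errors"
	"fmt"
)

// Conditional is a hypothetical artifact name plus its condition source.
type Conditional struct {
	Name      string
	Condition string
}

// Activation records the outcome of one evaluation and the reason for it.
type Activation struct {
	Artifact  string
	Condition string
	Active    bool
	Reason    string
}

// Explain evaluates each condition once and returns one record per
// artifact, including artifacts whose condition failed to evaluate.
func Explain(arts []Conditional, evalFn func(string) (bool, error)) []Activation {
	out := make([]Activation, 0, len(arts))
	for _, a := range arts {
		rec := Activation{Artifact: a.Name, Condition: a.Condition}
		ok, err := evalFn(a.Condition)
		switch {
		case err != nil:
			rec.Reason = "condition error: " + err.Error()
		case ok:
			rec.Active, rec.Reason = true, "condition matched"
		default:
			rec.Reason = "condition not matched"
		}
		out = append(out, rec)
	}
	return out
}

// stubEval stands in for a real condition evaluator. It only
// recognizes two literal strings.
func stubEval(condition string) (bool, error) {
	switch condition {
	case "always":
		return true, nil
	case "never":
		return false, nil
	default:
		return false, errors.New("unsupported expression")
	}
}

func main() {
	arts := []Conditional{
		{Name: "go-conventions", Condition: "always"},
		{Name: "production-guard", Condition: "never"},
		{Name: "legacy-rule", Condition: "mode =="},
	}
	turn := 14
	for _, rec := range Explain(arts, stubEval) {
		fmt.Printf("turn=%d artifact=%s active=%t reason=%q\n",
			turn, rec.Artifact, rec.Active, rec.Reason)
	}
}
```

Emitting one line per artifact per turn is verbose, but it makes "why was this rule on at turn 14?" a log query instead of a debugging session.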

Why This Is Better Than Prompt Conditionals

Could you write a giant system prompt that says, “if you are deploying, be more careful”? Sure.

I don't think that's governance.

That's advice.

In my view, real governance means the harness decides what the model is allowed to see and what policy surfaces are active before the next reasoning step begins. The model should not be responsible for remembering which governance branch applies. The harness should.

That's the same reason I'm more interested in governance architectures than in ever-bigger prompts. Prompts are necessary. But when they become your only control plane, you're still building too much of the system on vibes.

The Bigger Shift

Per-turn evaluation is one of those ideas that sounds small until you realize it changes the entire posture of the system.

Instead of asking, “What rules should this agent always have?” you start asking, “What rules should be active now?”

That is a much better question for long-running, stateful, tool-using agents.

It's also a cleaner path toward the broader discipline I care about: harness engineering. The same way DevOps normalized pipelines, policy, observability, and infrastructure definitions as real engineering surfaces, agent systems need their own equivalent control plane. Dynamic governance is part of that.

If you're building agents today, my recommendation is straightforward:

keep the static core small
move situational rules into conditional artifacts
re-evaluate those artifacts every turn
make activation observable
make failures degrade gracefully rather than crash the session

That's the practical lesson behind AI Harness so far. And it's why I think per-turn evaluation is going to look obvious in hindsight.

Not because it's flashy — but because for serious, stateful agent systems, static governance was rarely going to be enough.

Top comments (1)

Harjot Singh • May 31

Per-turn evaluation is the right granularity and most people get it wrong by evaluating only the final output - by then a bad turn three steps back has already compounded into the result, and you can't tell which step actually broke. Checking each turn (did this action match intent, is the state still valid, did the agent drift from the goal) is how you catch the failure at the point it happens instead of autopsying the corpse. "Dynamic governance" is the key word too: a static policy can't anticipate every state an agent reaches, so the governor has to evaluate in context, turn by turn.

This is squarely the thing I build around - it's the core of Moonshift, a multi-agent pipeline that takes a prompt to a deployed SaaS, where a verify layer gates each step rather than trusting the run to be correct at the end. Per-turn evaluation is exactly what keeps a long agent run from silently going off the rails and burning tokens on a wrong path. Multi-model routing keeps a build ~$3 flat, first run's free no card. Really aligned thinking. Two questions: what's doing the per-turn eval - a cheaper judge model, rules, or both? And when a turn fails the check, do you halt, retry that turn, or roll back to a known-good state? The recovery policy is where this gets genuinely hard, because a wrong correction can be worse than the original error.