Alex Delov

Posted on Jun 7

Stop Building Autonomous AI Agents. Build Governed Execution Runtimes Instead.

#architecture #llmops #infrastructure #opensource

We’ve all seen the standard AI agent architecture:

LLM → Tool → Reflection → Retry → More Tools → Chaos

It works well for demos.

It fails the moment you need:

auditability
replayability
deterministic boundaries
regulator-facing guarantees
operational observability

The core problem is simple:

Most AI systems use probabilistic orchestration.

The LLM controls:

execution flow
tool selection
branching semantics
retry topology

That means your runtime behavior changes dynamically based on latent model state.

For enterprise systems, especially FinTech, KYC/AML, DevSecOps, and LegalTech, this is operationally unacceptable.

So we built something different:

Governed Probabilistic Execution

Instead of treating the LLM as the subject of orchestration, we treat it as a constrained compute unit operating inside a deterministic runtime.

Traditional agents:
LLM decides → System adapts

Governed execution:
System decides → LLM computes

The project consists of three repositories:

llm-nano-vm
nano-vm-mcp
kyc-demo-streamlit

Together they implement this model explicitly.

The Runtime Model

The architecture is built around a deterministic Finite State Machine (FSM).

The LLM:

does not own control flow
does not mutate execution topology
does not dynamically create new execution semantics
cannot escape governance boundaries

Instead, every execution step is bounded and explicitly governed.

FSM Runtime
    ↓
Projection Layer
    ↓
Bounded LLM Step
    ↓
Typed Transition
    ↓
Execution Trace
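
To make the pipeline above concrete, here is a minimal sketch of a runtime-owned transition table with a bounded LLM step. It is not the llm-nano-vm API: the state names, the `Runtime` class, and the `llm_step` callable are all hypothetical and chosen only for illustration. The point it tries to show is that the model returns a verdict, while the runtime alone maps that verdict to a next state.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class State(Enum):
    COLLECT = "collect"
    REVIEW = "review"
    APPROVED = "approved"
    REJECTED = "rejected"


# The runtime, not the model, owns the transition table (hypothetical states).
# Each non-terminal state maps the allowed verdicts to a next state.
TRANSITIONS: dict[State, dict[str, State]] = {
    State.COLLECT: {"complete": State.REVIEW, "incomplete": State.REJECTED},
    State.REVIEW: {"pass": State.APPROVED, "fail": State.REJECTED},
}


@dataclass
class TraceEvent:
    state: str
    verdict: str
    next_state: str


@dataclass
class Runtime:
    # Stand-in for a bounded model call: it returns a verdict string, nothing else.
    llm_step: Callable[[State, dict[str, str]], str]
    trace: list[TraceEvent] = field(default_factory=list)

    def run(self, context: dict[str, str]) -> State:
        state = State.COLLECT
        while state in TRANSITIONS:  # terminal states have no outgoing edges
            allowed = TRANSITIONS[state]
            verdict = self.llm_step(state, context)
            # Any output outside the allowed vocabulary fails closed.
            next_state = allowed.get(verdict, State.REJECTED)
            self.trace.append(TraceEvent(state.value, verdict, next_state.value))
            state = next_state
        return state


if __name__ == "__main__":
    runtime = Runtime(llm_step=lambda s, ctx: "complete" if s is State.COLLECT else "pass")
    print(runtime.run({"task": "kyc"}), len(runtime.trace))
```

Because the table has no cycles and the model cannot add edges, every run terminates in a known state, and an unexpected model output becomes a recorded rejection rather than a new branch.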

ProjectionLayer: Evaluator Blindness

One of the most important architectural properties is evaluator blindness.

The model never receives full runtime context.

It only sees a target-specific projection:

ProjectionLayer(target=LLM)

This creates several important guarantees:

1. Reduced semantic contamination

The model cannot condition its output on:

governance prompts
rollback metrics
entropy alerts
audit metadata
unrelated historical state

2. Anti-Hawthorne behavior

Because monitoring signals never appear in its projection, the evaluator cannot adapt its behavior to the fact that it is being observed.

3. Capability isolation

The Projection Layer acts as:

a semantic firewall
a capability boundary
an information minimization layer

This architecture is closer to capability-security systems than to prompt engineering.
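
One simple way to picture a projection is a per-target allowlist over the runtime context. The sketch below assumes exactly that; the `PROJECTIONS` table, the `project` function, and every key name are hypothetical and not taken from the project's code.

```python
from typing import Any, Mapping

# Hypothetical allowlists: which context keys each target is permitted to see.
PROJECTIONS: dict[str, frozenset[str]] = {
    "LLM": frozenset({"task", "document_text"}),
    "AUDITOR": frozenset({"task", "document_text", "audit_metadata", "rollback_count"}),
}


def project(context: Mapping[str, Any], target: str) -> dict[str, Any]:
    """Return only the keys allowlisted for `target`; unknown targets see nothing."""
    allowed = PROJECTIONS.get(target, frozenset())
    return {key: value for key, value in context.items() if key in allowed}


if __name__ == "__main__":
    full_context = {
        "task": "classify_document",
        "document_text": "passport scan, page 1",
        "audit_metadata": {"run_id": "r-1"},
        "rollback_count": 2,
        "entropy_alert": True,
    }
    print(project(full_context, "LLM"))
```

An allowlist (rather than a blocklist) means a newly added context field stays invisible to the model until someone deliberately exposes it.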

ASTEngine Instead of eval()

The runtime never executes arbitrary Python.

There is:

no eval()
no exec()
no unrestricted expression execution

Conditions are evaluated through a constrained AST engine.

The important point is not just security.

The real goal is bounded semantic expressiveness.

The DSL intentionally forbids:

method calls
arbitrary arithmetic
dynamic execution
unrestricted Python semantics

Why?

Because unrestricted expressiveness destroys:

replayability
analyzability
deterministic guarantees
formal reasoning

This design philosophy is much closer to:

Terraform HCL
Open Policy Agent (Rego)
AWS IAM policy DSLs

than to traditional AI orchestration frameworks.
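
As an illustration of the idea, a condition evaluator can walk a parsed expression with Python's standard `ast` module and accept only a small set of node types. This is a sketch under assumptions, not the project's ASTEngine: the `evaluate` function and the exact allowed grammar (names, constants, comparisons, `and`/`or`/`not`) are choices made for this example.

```python
import ast
import operator
from typing import Any, Callable, Mapping

# Only these comparison operators are part of the sketch's grammar.
_COMPARE: dict[type, Callable[[Any, Any], bool]] = {
    ast.Eq: operator.eq,
    ast.NotEq: operator.ne,
    ast.Lt: operator.lt,
    ast.LtE: operator.le,
    ast.Gt: operator.gt,
    ast.GtE: operator.ge,
}


def _eval(node: ast.AST, variables: Mapping[str, Any]) -> Any:
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        if node.id not in variables:
            raise ValueError(f"unknown variable: {node.id}")
        return variables[node.id]
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not _eval(node.operand, variables)
    if isinstance(node, ast.BoolOp):
        values = [bool(_eval(v, variables)) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.Compare):
        left = _eval(node.left, variables)
        for op, comparator in zip(node.ops, node.comparators):
            compare = _COMPARE.get(type(op))
            if compare is None:
                raise ValueError(f"forbidden operator: {type(op).__name__}")
            right = _eval(comparator, variables)
            if not compare(left, right):
                return False
            left = right
        return True
    # Calls, attribute access, arithmetic, subscripts, lambdas, etc. all land here.
    raise ValueError(f"forbidden syntax: {type(node).__name__}")


def evaluate(expression: str, variables: Mapping[str, Any]) -> bool:
    """Evaluate a condition without eval()/exec(); reject anything off-grammar."""
    tree = ast.parse(expression, mode="eval")
    return bool(_eval(tree.body, variables))


if __name__ == "__main__":
    print(evaluate("risk_score >= 70 and country != 'US'", {"risk_score": 80, "country": "DE"}))
```

In this sketch, an expression such as `country.lower() == 'de'` or `risk_score + 1 > 2` is rejected with `ValueError` before anything runs, because method calls and arithmetic are simply not nodes the walker knows how to evaluate.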

Observability Beyond Tokens

Most AI observability tooling measures:

latency
token usage
cost
prompt traces

We wanted to measure something deeper:

Structural execution instability

The runtime tracks:

path variance
rollback density
transition sequence variance
transition entropy

Transition entropy is especially important.

If execution entropy exceeds an empirical threshold (2.5 bits), the runtime flags structural degradation.

This is not “AI monitoring”.

It is execution topology observability.
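
The article does not spell out how transition entropy is computed, so the following is one plausible reading: Shannon entropy, in bits, over the empirical distribution of observed (from_state, to_state) transitions. The function name and the input shape are assumptions; only the 2.5-bit threshold comes from the text above.

```python
import math
from collections import Counter
from typing import Iterable

ENTROPY_THRESHOLD_BITS = 2.5  # threshold quoted above


def transition_entropy(transitions: Iterable[tuple[str, str]]) -> float:
    """Shannon entropy (bits) of the observed (from_state, to_state) pairs."""
    counts = Counter(transitions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_degraded(transitions: Iterable[tuple[str, str]]) -> bool:
    """Flag runs whose transition distribution is more spread out than the threshold."""
    return transition_entropy(transitions) > ENTROPY_THRESHOLD_BITS


if __name__ == "__main__":
    stable = [("collect", "review"), ("review", "approved")] * 10
    print(transition_entropy(stable), is_degraded(stable))
```

Under this definition, a workload that always follows the same two edges sits at 1 bit, while one spread evenly over eight different edges reaches 3 bits and would be flagged.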

Failure Laboratory

The KYC Governance Simulator intentionally includes adversarial injectors:

tool_injection
policy_bypass
skip_step
reorder_steps
corrupt_receipt
gdpr_erase

The point is not to showcase a happy path.

The point is to demonstrate deterministic failure semantics under attack conditions.

Most AI demos try to hide instability.

We intentionally surface it.
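
A small sketch of what "deterministic failure semantics" can mean for two of the injectors named above. The injector names `skip_step` and `reorder_steps` come from the list; their signatures, the step names, and the `validate` function are hypothetical and only illustrate the pattern of mutating a run and checking that it is rejected the same way every time.

```python
# Hypothetical canonical step order for a KYC-style flow.
EXPECTED_STEPS: tuple[str, ...] = (
    "collect_documents",
    "verify_identity",
    "screen_sanctions",
    "decide",
)


def skip_step(steps: tuple[str, ...], index: int) -> tuple[str, ...]:
    """Injector: drop the step at `index`."""
    return steps[:index] + steps[index + 1:]


def reorder_steps(steps: tuple[str, ...], i: int, j: int) -> tuple[str, ...]:
    """Injector: swap the steps at positions `i` and `j`."""
    mutated = list(steps)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return tuple(mutated)


def validate(steps: tuple[str, ...]) -> str:
    """Return the same verdict for the same input, with no model involved."""
    missing = [step for step in EXPECTED_STEPS if step not in steps]
    if missing:
        return "rejected: missing " + ", ".join(missing)
    if steps != EXPECTED_STEPS:
        return "rejected: unexpected sequence"
    return "ok"


if __name__ == "__main__":
    print(validate(skip_step(EXPECTED_STEPS, 2)))
    print(validate(reorder_steps(EXPECTED_STEPS, 1, 2)))
```

The value of such a harness is that each attack has a fixed, testable outcome, so a regression in the governance layer shows up as a failing assertion rather than as a surprising demo.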

Trace ≠ Receipt

Another core architectural principle:

Execution → Trace → Analyzer → Receipt

Where:

Trace = source of truth
Receipt = deterministic projection
Analyzer = post-hoc interpretation layer

Receipts are:

recomputable
deterministic
derived artifacts

They are not mutable runtime state.

This is heavily inspired by event-sourcing philosophy.
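
To illustrate the Trace/Receipt split, a receipt can be written as a pure function of the trace. The sketch below assumes a trace is a list of JSON-serializable event dictionaries; the field names and `build_receipt` are hypothetical, and the SHA-256 digest is just one way to make tampering with the trace visible.

```python
import hashlib
import json
from typing import Any


def build_receipt(trace: list[dict[str, Any]]) -> dict[str, Any]:
    """Derive a receipt from the trace alone: same trace in, same receipt out."""
    # Canonical serialization: sorted keys and fixed separators keep the hash stable.
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return {
        "steps": len(trace),
        "final_state": trace[-1]["next_state"] if trace else None,
        "trace_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


if __name__ == "__main__":
    trace = [
        {"state": "collect", "verdict": "complete", "next_state": "review"},
        {"state": "review", "verdict": "pass", "next_state": "approved"},
    ]
    print(build_receipt(trace))
```

Since the receipt holds no state of its own, it can be thrown away and recomputed at audit time, and any mismatch with a stored copy points at a modified trace or a modified receipt.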

Transactional AI Code Mutation

We applied the same principles to repository mutation.

The companion nano-vm-dev-agent performs code changes transactionally:

stage_patch()
→ validate_staged_mypy(tmpdir)
→ pytest
→ commit OR rollback

The repository is never mutated before type validation succeeds.

This creates CI-grade mutation safety for AI-assisted development.

Most coding agents operate on best-effort mutation semantics.

This runtime applies transactional guarantees instead.
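
The stage, validate, commit-or-rollback flow can be sketched with a temporary copy of the repository. This is not the nano-vm-dev-agent implementation: `apply_transactionally`, the patch format (relative path to new file content), and the injected `checks` callables are assumptions made so the example runs without mypy or pytest installed.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Mapping, Sequence

# A check receives the staged directory and reports success or failure.
Check = Callable[[Path], bool]


def apply_transactionally(
    repo: Path, patch: Mapping[str, str], checks: Sequence[Check]
) -> bool:
    """Apply `patch` to a staged copy; touch `repo` only if every check passes."""
    with tempfile.TemporaryDirectory() as tmp:
        staged = Path(tmp) / "staged"
        shutil.copytree(repo, staged)

        # Stage: write the proposed changes into the copy, never into the repo.
        for rel_path, content in patch.items():
            target = staged / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content, encoding="utf-8")

        # Validate: all() stops at the first failing check.
        if not all(check(staged) for check in checks):
            return False  # rollback: the staged copy is discarded on exit

        # Commit: copy only the patched files back into the real repository.
        for rel_path in patch:
            destination = repo / rel_path
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(staged / rel_path, destination)
        return True
```

In a real setup, the checks would presumably shell out to the type checker and the test runner against the staged directory, for example via `subprocess.run(["mypy", "--strict", str(staged)])`. Note that the commit loop in this sketch is not atomic across multiple files; a stricter version would swap directories or rely on version control for the final step.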

Why Streamlit?

We intentionally skipped:

React
Vite
complex async frontend state systems

The UI is built entirely in Python using Streamlit.

Why?

Because the project optimizes for:

governance correctness
deterministic behavior
engineering simplicity
type safety
operational transparency

Not frontend maximalism.

Current Status

Current ecosystem status:

llm-nano-vm v0.8.4
nano-vm-mcp v0.4.3
kyc-demo-streamlit
nano-vm-dev-agent v0.2.0

Engineering discipline:

mypy --strict
pytest
deterministic constraints
no arbitrary runtime execution

The KYC demo currently passes:

51/51 tests
0 mypy errors

The Bigger Shift

The industry is saturated with autonomous agent hype.

But critical infrastructure does not need autonomous orchestration.

It needs:

bounded execution
deterministic governance
replayability
auditability
operational observability

The future may not belong to autonomous agents.

It may belong to governed execution runtimes for probabilistic systems.
