<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Delov</title>
    <description>The latest articles on DEV Community by Alex Delov (@ale007xd).</description>
    <link>https://dev.to/ale007xd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3943262%2Fead831e3-7141-4c6e-8903-282ea5a80e86.jpg</url>
      <title>DEV Community: Alex Delov</title>
      <link>https://dev.to/ale007xd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ale007xd"/>
    <language>en</language>
    <item>
      <title>Stop Building Autonomous AI Agents. Build Governed Execution Runtimes Instead.</title>
      <dc:creator>Alex Delov</dc:creator>
      <pubDate>Sun, 07 Jun 2026 05:09:32 +0000</pubDate>
      <link>https://dev.to/ale007xd/stop-building-autonomous-ai-agents-build-governed-execution-runtimes-instead-j77</link>
      <guid>https://dev.to/ale007xd/stop-building-autonomous-ai-agents-build-governed-execution-runtimes-instead-j77</guid>
      <description>&lt;h1&gt;
  
  
  Stop Building Autonomous AI Agents. Build Governed Execution Runtimes Instead.
&lt;/h1&gt;

&lt;p&gt;We’ve all seen the standard AI agent architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM → Tool → Reflection → Retry → More Tools → Chaos
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works well for demos.&lt;/p&gt;

&lt;p&gt;It fails the moment you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;auditability&lt;/li&gt;
&lt;li&gt;replayability&lt;/li&gt;
&lt;li&gt;deterministic boundaries&lt;/li&gt;
&lt;li&gt;regulator-facing guarantees&lt;/li&gt;
&lt;li&gt;operational observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core problem is simple:&lt;/p&gt;

&lt;p&gt;Most AI systems use &lt;strong&gt;probabilistic orchestration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The LLM controls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;execution flow&lt;/li&gt;
&lt;li&gt;tool selection&lt;/li&gt;
&lt;li&gt;branching semantics&lt;/li&gt;
&lt;li&gt;retry topology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means your runtime behavior changes dynamically based on latent model state.&lt;/p&gt;

&lt;p&gt;For enterprise systems — especially FinTech, KYC/AML, DevSecOps, LegalTech — this is operationally unacceptable.&lt;/p&gt;

&lt;p&gt;So we built something different:&lt;/p&gt;

&lt;h2&gt;
  
  
  Governed Probabilistic Execution
&lt;/h2&gt;

&lt;p&gt;Instead of letting the LLM drive orchestration, we treat it as a constrained compute unit operating inside a deterministic runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional agents:
LLM decides → System adapts

Governed execution:
System decides → LLM computes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project implements this model explicitly across three components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm-nano-vm&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nano-vm-mcp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kyc-demo-streamlit&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  The Runtime Model
&lt;/h1&gt;

&lt;p&gt;The architecture is built around a deterministic Finite State Machine (FSM).&lt;/p&gt;

&lt;p&gt;The LLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does not own control flow&lt;/li&gt;
&lt;li&gt;does not mutate execution topology&lt;/li&gt;
&lt;li&gt;does not dynamically create new execution semantics&lt;/li&gt;
&lt;li&gt;cannot escape governance boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, every execution step is bounded and explicitly governed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FSM Runtime
    ↓
Projection Layer
    ↓
Bounded LLM Step
    ↓
Typed Transition
    ↓
Execution Trace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
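&lt;p&gt;To make the layering concrete, here is a minimal sketch of a bounded step, assuming a fixed transition table. The names &lt;code&gt;TRANSITIONS&lt;/code&gt;, &lt;code&gt;Run&lt;/code&gt;, and &lt;code&gt;step&lt;/code&gt; are hypothetical and are not the &lt;code&gt;llm-nano-vm&lt;/code&gt; API; the model is represented by any callable that returns a label.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical names throughout: this is not the llm-nano-vm API.
# The transition table is fixed up front; the model only returns a label.
TRANSITIONS = {
    ("collect_docs", "ok"): "verify_identity",
    ("verify_identity", "approve"): "done",
    ("verify_identity", "reject"): "manual_review",
}


@dataclass
class Run:
    state: str = "collect_docs"
    trace: list = field(default_factory=list)


def step(run: Run, llm: Callable[[str], str]) -> Run:
    """Execute one bounded step: ask the model, then validate the transition."""
    label = llm(run.state)  # the model computes a label, nothing more
    nxt = TRANSITIONS.get((run.state, label))
    if nxt is None:
        # Anything outside the table is rejected; the state does not change.
        run.trace.append((run.state, label, "REJECTED"))
        raise ValueError(f"illegal transition {label!r} from {run.state!r}")
    run.trace.append((run.state, label, nxt))
    run.state = nxt
    return run
```

&lt;p&gt;The model can only pick among edges that already exist. A label outside the table raises an error and is still recorded in the trace.&lt;/p&gt;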






&lt;h1&gt;
  
  
  ProjectionLayer: Evaluator Blindness
&lt;/h1&gt;

&lt;p&gt;One of the most important architectural properties is evaluator blindness.&lt;/p&gt;

&lt;p&gt;The model never receives full runtime context.&lt;/p&gt;

&lt;p&gt;It only sees a target-specific projection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ProjectionLayer(target=LLM)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates several important guarantees:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Reduced semantic contamination
&lt;/h2&gt;

&lt;p&gt;The model cannot condition its output on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;governance prompts&lt;/li&gt;
&lt;li&gt;rollback metrics&lt;/li&gt;
&lt;li&gt;entropy alerts&lt;/li&gt;
&lt;li&gt;audit metadata&lt;/li&gt;
&lt;li&gt;unrelated historical state&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Anti-Hawthorne behavior
&lt;/h2&gt;

&lt;p&gt;The evaluator cannot adapt its behavior simply because it knows it is being monitored.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Capability isolation
&lt;/h2&gt;

&lt;p&gt;The Projection Layer acts as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a semantic firewall&lt;/li&gt;
&lt;li&gt;a capability boundary&lt;/li&gt;
&lt;li&gt;an information minimization layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture is closer to capability-security systems than to prompt engineering.&lt;/p&gt;
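&lt;p&gt;One way to picture a projection is as a deny-by-default allow-list per target. The sketch below is an assumption about the shape of such a layer, not the project's implementation; the field names and the &lt;code&gt;project&lt;/code&gt; function are invented for illustration.&lt;/p&gt;

```python
# Hypothetical allow-lists; the real ProjectionLayer fields are not shown in the article.
ALLOWED_FIELDS = {
    "LLM": {"step_name", "document_text"},
    "AUDITOR": {"step_name", "rollback_count", "entropy_bits", "audit_id"},
}


def project(state: dict, target: str) -> dict:
    """Return only the keys the target is allowed to see (deny by default)."""
    allowed = ALLOWED_FIELDS.get(target, set())
    return {k: v for k, v in state.items() if k in allowed}
```

&lt;p&gt;With this shape, governance metadata such as rollback counts never reaches the model, and an unknown target sees nothing at all.&lt;/p&gt;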




&lt;h1&gt;
  
  
  ASTEngine Instead of eval()
&lt;/h1&gt;

&lt;p&gt;The runtime never executes arbitrary Python.&lt;/p&gt;

&lt;p&gt;There is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no &lt;code&gt;eval()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;no &lt;code&gt;exec()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;no unrestricted expression execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conditions are evaluated through a constrained AST engine.&lt;/p&gt;

&lt;p&gt;The important point is not just security.&lt;/p&gt;

&lt;p&gt;The real goal is bounded semantic expressiveness.&lt;/p&gt;

&lt;p&gt;The DSL intentionally forbids:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;method calls&lt;/li&gt;
&lt;li&gt;arbitrary arithmetic&lt;/li&gt;
&lt;li&gt;dynamic execution&lt;/li&gt;
&lt;li&gt;unrestricted Python semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because unrestricted expressiveness destroys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;replayability&lt;/li&gt;
&lt;li&gt;analyzability&lt;/li&gt;
&lt;li&gt;deterministic guarantees&lt;/li&gt;
&lt;li&gt;formal reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design philosophy is much closer to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform HCL&lt;/li&gt;
&lt;li&gt;Open Policy Agent (Rego)&lt;/li&gt;
&lt;li&gt;AWS IAM policy DSLs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;than to traditional AI orchestration frameworks.&lt;/p&gt;
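&lt;p&gt;A constrained evaluator of this kind can be sketched with Python's standard &lt;code&gt;ast&lt;/code&gt; module. The grammar below (names, constants, single comparisons, and boolean operators) is an assumed subset chosen for illustration; the actual ASTEngine grammar is not shown in this article, and &lt;code&gt;evaluate&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
import ast
import operator

# Assumed grammar for illustration: names, constants, comparisons, and/or/not.
_COMPARE = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def evaluate(expr: str, ctx: dict):
    """Evaluate a condition by walking its AST; unknown node types are rejected."""
    return _eval(ast.parse(expr, mode="eval").body, ctx)


def _eval(node, ctx):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return ctx[node.id]  # plain lookup: no attributes, no calls
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        op = _COMPARE.get(type(node.ops[0]))
        if op is not None:
            return op(_eval(node.left, ctx), _eval(node.comparators[0], ctx))
    if isinstance(node, ast.BoolOp):
        # Operands are evaluated eagerly, so there is no short-circuiting.
        values = [_eval(v, ctx) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not _eval(node.operand, ctx)
    raise ValueError(f"forbidden syntax: {type(node).__name__}")
```

&lt;p&gt;Because every accepted node type is listed explicitly, arithmetic, method calls, and attribute access fail with an error instead of executing.&lt;/p&gt;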




&lt;h1&gt;
  
  
  Observability Beyond Tokens
&lt;/h1&gt;

&lt;p&gt;Most AI observability tooling measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;token usage&lt;/li&gt;
&lt;li&gt;cost&lt;/li&gt;
&lt;li&gt;prompt traces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We wanted to measure something deeper:&lt;/p&gt;

&lt;h2&gt;
  
  
  Structural execution instability
&lt;/h2&gt;

&lt;p&gt;The runtime tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;path variance&lt;/li&gt;
&lt;li&gt;rollback density&lt;/li&gt;
&lt;li&gt;transition sequence variance&lt;/li&gt;
&lt;li&gt;transition entropy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transition entropy is especially important.&lt;/p&gt;

&lt;p&gt;If execution entropy exceeds an empirical threshold (&lt;code&gt;2.5 bits&lt;/code&gt;), the runtime flags structural degradation.&lt;/p&gt;

&lt;p&gt;This is not “AI monitoring”.&lt;/p&gt;

&lt;p&gt;It is execution topology observability.&lt;/p&gt;
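&lt;p&gt;As a rough illustration, transition entropy can be computed as the Shannon entropy of the observed state-to-state pairs. This is an assumed definition; the article does not specify how the runtime derives the value, and the function names are hypothetical. Only the 2.5-bit threshold comes from the text above.&lt;/p&gt;

```python
import math
from collections import Counter

ENTROPY_THRESHOLD_BITS = 2.5  # the empirical threshold quoted in the text


def transition_entropy(path: list) -> float:
    """Shannon entropy, in bits, of the (from_state, to_state) pairs in a path."""
    pairs = list(zip(path, path[1:]))
    if not pairs:
        return 0.0
    counts = Counter(pairs)
    total = len(pairs)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_degraded(path: list) -> bool:
    """Flag a path whose transition entropy exceeds the threshold."""
    return transition_entropy(path) > ENTROPY_THRESHOLD_BITS
```

&lt;p&gt;Under this definition, a run that alternates between two states scores 1 bit, while a run that takes eight distinct transitions once each scores 3 bits and would be flagged.&lt;/p&gt;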




&lt;h1&gt;
  
  
  Failure Laboratory
&lt;/h1&gt;

&lt;p&gt;The KYC Governance Simulator intentionally includes adversarial injectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;tool_injection&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;policy_bypass&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;skip_step&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;reorder_steps&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;corrupt_receipt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gdpr_erase&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point is not to showcase a happy path.&lt;/p&gt;

&lt;p&gt;The point is to demonstrate deterministic failure semantics under attack conditions.&lt;/p&gt;

&lt;p&gt;Most AI demos try to hide instability.&lt;/p&gt;

&lt;p&gt;We intentionally surface it.&lt;/p&gt;




&lt;h1&gt;
  
  
  Trace ≠ Receipt
&lt;/h1&gt;

&lt;p&gt;Another core architectural principle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Execution → Trace → Analyzer → Receipt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Trace&lt;/code&gt; = source of truth&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Receipt&lt;/code&gt; = deterministic projection&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Analyzer&lt;/code&gt; = post-hoc interpretation layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Receipts are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recomputable&lt;/li&gt;
&lt;li&gt;deterministic&lt;/li&gt;
&lt;li&gt;derived artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are not mutable runtime state.&lt;/p&gt;

&lt;p&gt;This is heavily inspired by event-sourcing philosophy.&lt;/p&gt;
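&lt;p&gt;The relationship between a trace and its receipt can be sketched as a pure function over the trace. The receipt fields and the &lt;code&gt;make_receipt&lt;/code&gt; name below are assumptions for illustration, not the project's schema.&lt;/p&gt;

```python
import hashlib
import json


def make_receipt(trace: list) -> dict:
    """Derive a receipt from a trace; the same trace always yields the same receipt."""
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable.
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return {
        "steps": len(trace),
        "final_state": trace[-1]["to"] if trace else None,
        "trace_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
```

&lt;p&gt;Nothing is stored that cannot be recomputed: if a receipt is lost or disputed, it can be rebuilt from the trace and compared.&lt;/p&gt;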




&lt;h1&gt;
  
  
  Transactional AI Code Mutation
&lt;/h1&gt;

&lt;p&gt;We applied the same principles to repository mutation.&lt;/p&gt;

&lt;p&gt;The companion &lt;code&gt;nano-vm-dev-agent&lt;/code&gt; performs code changes transactionally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stage_patch()
→ validate_staged_mypy(tmpdir)
→ pytest
→ commit OR rollback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repository is never mutated before type validation succeeds.&lt;/p&gt;

&lt;p&gt;This creates CI-grade mutation safety for AI-assisted development.&lt;/p&gt;

&lt;p&gt;Most coding agents operate on best-effort mutation semantics.&lt;/p&gt;

&lt;p&gt;This runtime applies transactional guarantees instead.&lt;/p&gt;
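&lt;p&gt;The stage, validate, then commit-or-rollback flow can be sketched with the standard library alone. This is not the &lt;code&gt;nano-vm-dev-agent&lt;/code&gt; code: &lt;code&gt;apply_transactionally&lt;/code&gt; is a hypothetical helper, and the checks are plain callables standing in for the mypy and pytest runs.&lt;/p&gt;

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def apply_transactionally(repo: Path, rel_path: str, new_text: str,
                          checks: list[Callable[[Path], bool]]) -> bool:
    """Stage a change in a temp copy, run checks there, then commit or discard."""
    with tempfile.TemporaryDirectory() as tmp:
        staged = Path(tmp) / "staged"
        shutil.copytree(repo, staged)
        (staged / rel_path).write_text(new_text, encoding="utf-8")
        # Each check takes the staged directory and returns True on success.
        # In the article's flow these would be the mypy and pytest runs.
        if not all(check(staged) for check in checks):
            return False  # rollback: the temp copy is simply thrown away
        shutil.copy2(staged / rel_path, repo / rel_path)  # commit
        return True
```

&lt;p&gt;The working tree is only written to on the final line, so a failed check leaves the repository exactly as it was.&lt;/p&gt;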




&lt;h1&gt;
  
  
  Why Streamlit?
&lt;/h1&gt;

&lt;p&gt;We intentionally skipped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React&lt;/li&gt;
&lt;li&gt;Vite&lt;/li&gt;
&lt;li&gt;complex async frontend state systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The UI is built entirely in Python using Streamlit.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because the project optimizes for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;governance correctness&lt;/li&gt;
&lt;li&gt;deterministic behavior&lt;/li&gt;
&lt;li&gt;engineering simplicity&lt;/li&gt;
&lt;li&gt;type safety&lt;/li&gt;
&lt;li&gt;operational transparency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not frontend maximalism.&lt;/p&gt;




&lt;h1&gt;
  
  
  Current Status
&lt;/h1&gt;

&lt;p&gt;Current ecosystem status:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llm-nano-vm&lt;/code&gt; v0.8.4&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nano-vm-mcp&lt;/code&gt; v0.4.3&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kyc-demo-streamlit&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nano-vm-dev-agent&lt;/code&gt; v0.2.0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Engineering discipline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;mypy --strict&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pytest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;deterministic constraints&lt;/li&gt;
&lt;li&gt;no arbitrary runtime execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The KYC demo currently passes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;51/51&lt;/code&gt; tests&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0&lt;/code&gt; mypy errors&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  The Bigger Shift
&lt;/h1&gt;

&lt;p&gt;The industry is saturated with autonomous agent hype.&lt;/p&gt;

&lt;p&gt;But critical infrastructure does not need autonomous orchestration.&lt;/p&gt;

&lt;p&gt;It needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bounded execution&lt;/li&gt;
&lt;li&gt;deterministic governance&lt;/li&gt;
&lt;li&gt;replayability&lt;/li&gt;
&lt;li&gt;auditability&lt;/li&gt;
&lt;li&gt;operational observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future may not belong to autonomous agents.&lt;/p&gt;

&lt;p&gt;It may belong to governed execution runtimes for probabilistic systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repositories
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://kyc.nanovm.space/" rel="noopener noreferrer"&gt;kyc-demo-streamlit&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Ale007XD/nano_vm" rel="noopener noreferrer"&gt;llm-nano-vm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Ale007XD/nano-vm-mcp" rel="noopener noreferrer"&gt;nano-vm-mcp&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>llmops</category>
      <category>infrastructure</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Hermes Agent Needs a Flight Recorder - So I Built One</title>
      <dc:creator>Alex Delov</dc:creator>
      <pubDate>Fri, 29 May 2026 11:01:26 +0000</pubDate>
      <link>https://dev.to/ale007xd/hermes-agent-needs-a-flight-recorder-so-i-built-one-3gea</link>
      <guid>https://dev.to/ale007xd/hermes-agent-needs-a-flight-recorder-so-i-built-one-3gea</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/hermes-agent-2026-05-15"&gt;Hermes Agent Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Autonomous agents can now write code, call tools, browse the web, mutate files, and delegate to subagents. But when they fail, they fail invisibly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"An agent ran overnight, caught an unhandled exception loop, and burned $50 in tokens while corrupting our staging database."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you've spent more than a week building production systems with autonomous agents, you've lived some version of this nightmare.&lt;/p&gt;

&lt;p&gt;Most agent runtimes don't crash cleanly. They slide into retry storms, silently ignore failed tool calls, or recurse through delegation loops until budgets evaporate.&lt;/p&gt;

&lt;p&gt;Airplanes have flight recorders. Distributed systems have OpenTelemetry. &lt;strong&gt;Autonomous agents need TraceGuard.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TraceGuard&lt;/strong&gt; is a lightweight Python library and CLI that acts as an isolated, non-invasive execution flight recorder for autonomous agent runtimes.&lt;/p&gt;

&lt;p&gt;It consumes append-only JSONL execution traces and detects the three silent killers of agentic workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry Storms&lt;/li&gt;
&lt;li&gt;Silent Failures&lt;/li&gt;
&lt;li&gt;Recursive Delegation Cycles
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;traceguard traces/my_agent_run.jsonl &lt;span class="nt"&gt;--strict&lt;/span&gt;
&lt;span class="c"&gt;# exit 0 = clean · exit 1 = WARN · exit 2 = CRITICAL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of scraping human-readable terminal logs, TraceGuard turns runtime execution into a structured, replayable execution event contract.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Ale007XD/traceguard" rel="noopener noreferrer"&gt;https://github.com/Ale007XD/traceguard&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Modern agent frameworks can browse the web, write files, execute shell commands, and coordinate sub-agents. But when something goes wrong, you're usually left with a giant wall of terminal output and one impossible question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually happened?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not what the LLM said. Not the final output. The actual execution state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What tool calls executed?&lt;/li&gt;
&lt;li&gt;Which failures were silently ignored?&lt;/li&gt;
&lt;li&gt;Where did the retry loop begin?&lt;/li&gt;
&lt;li&gt;Which sub-agent delegated back into itself?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Distributed systems engineers solved these problems decades ago using structured traces, append-only logs, and replayable execution histories. Agent runtimes are now complex enough to require the same discipline.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Model
&lt;/h2&gt;

&lt;p&gt;Autonomous agents are stochastic distributed runtimes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Distributed System Failure&lt;/th&gt;
&lt;th&gt;Agent Equivalent&lt;/th&gt;
&lt;th&gt;Observability Primitive&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retry storm&lt;/td&gt;
&lt;td&gt;Same tool called repeatedly without progress&lt;/td&gt;
&lt;td&gt;Sliding window counter over event stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Silent failure&lt;/td&gt;
&lt;td&gt;Tool fails, agent continues anyway&lt;/td&gt;
&lt;td&gt;Error propagation trace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Circular dependency&lt;/td&gt;
&lt;td&gt;Agent A delegates to B which delegates back to A&lt;/td&gt;
&lt;td&gt;Delegation cycle detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State divergence&lt;/td&gt;
&lt;td&gt;Agent acts on corrupted or stale state&lt;/td&gt;
&lt;td&gt;Replayable transition history&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  δ(S, E) → S'
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent Runtime
      │
      ▼
Append-Only Event Stream
      │
      ▼
  TraceGuard
      │
  ┌───┴───────┬──────────────┐
  ▼           ▼              ▼
Retry      Silent       Recursive
Storms    Failures     Delegation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every execution step becomes a formal state transition. The runtime stops being an opaque, ephemeral process and becomes a replayable execution artifact.&lt;/p&gt;
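&lt;p&gt;Read as code, the transition function is a pure function from a state and an event to a new state, and replay is a fold over the event stream. The sketch below assumes a toy state that counts steps and tool calls; &lt;code&gt;delta&lt;/code&gt; and &lt;code&gt;replay&lt;/code&gt; are illustrative names, not TraceGuard functions.&lt;/p&gt;

```python
from functools import reduce


def delta(state: dict, event: dict) -> dict:
    """Pure transition: returns a new state and never mutates the old one."""
    calls = dict(state["calls"])
    if event["type"] == "tool_call":
        calls[event["tool_name"]] = calls.get(event["tool_name"], 0) + 1
    return {"steps": state["steps"] + 1, "calls": calls}


def replay(events: list) -> dict:
    """Rebuild the final state offline by folding delta over the event stream."""
    return reduce(delta, events, {"steps": 0, "calls": {}})
```

&lt;p&gt;Because &lt;code&gt;delta&lt;/code&gt; has no side effects, replaying the same events always reconstructs the same state.&lt;/p&gt;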




&lt;h2&gt;
  
  
  The Missing Primitive
&lt;/h2&gt;

&lt;p&gt;Hermes Agent currently exposes beautiful terminal output optimized for humans. Production observability requires something fundamentally different: machine-readable execution semantics.&lt;/p&gt;

&lt;p&gt;Example event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3f8a1c2d-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hermes-session-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-29T10:00:00.050Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schema_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"git status --porcelain"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each event is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Immutable&lt;/strong&gt; — append-only after creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-describing&lt;/strong&gt; — schema versioned and typed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replayable&lt;/strong&gt; — execution can be reconstructed offline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composable&lt;/strong&gt; — detectors operate over the same event stream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The missing primitive is not another dashboard. It is a structured execution event stream.&lt;/p&gt;
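&lt;p&gt;A minimal sketch of such an event stream follows. It uses a frozen dataclass as a standard-library stand-in for the Pydantic v2 model the project describes. The field names mirror the example event, while &lt;code&gt;append_event&lt;/code&gt; and &lt;code&gt;read_events&lt;/code&gt; are hypothetical helpers rather than TraceGuard's API.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class TraceEvent:
    # Field names follow the example event; the frozen dataclass is a
    # stdlib stand-in for the Pydantic v2 model the project describes.
    event_id: str
    session_id: str
    timestamp: str
    type: str
    tool_name: str
    tool_args: dict = field(default_factory=dict)
    schema_version: str = "1.0"


def append_event(path: str, event: TraceEvent) -> None:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(event), sort_keys=True) + "\n")


def read_events(path: str) -> list:
    """Load every non-empty line of the JSONL file back into TraceEvent objects."""
    with open(path, encoding="utf-8") as fh:
        return [TraceEvent(**json.loads(line)) for line in fh if line.strip()]
```

&lt;p&gt;Opening the file in append mode and freezing the event class covers the first two properties in the list; replay and composition then only need &lt;code&gt;read_events&lt;/code&gt;.&lt;/p&gt;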




&lt;h2&gt;
  
  
  Three Detectors. One Governance Layer.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Retry Storm Detector
&lt;/h3&gt;

&lt;p&gt;Detects identical tool invocations repeating without successful progress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;code&gt;bash → fail&lt;/code&gt; → &lt;code&gt;bash → fail&lt;/code&gt; → &lt;code&gt;bash → fail&lt;/code&gt; (retry storm)&lt;/p&gt;

&lt;h3&gt;
  
  
  Silent Failure Detector
&lt;/h3&gt;

&lt;p&gt;Detects agents continuing execution after failed or empty tool outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;code&gt;read_file → empty&lt;/code&gt; → &lt;code&gt;continue execution&lt;/code&gt; (silent corruption)&lt;/p&gt;

&lt;h3&gt;
  
  
  Recursive Delegation Detector
&lt;/h3&gt;

&lt;p&gt;Detects sub-agent delegation cycles and self-recursion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;code&gt;planner → coder → coder → planner&lt;/code&gt; (recursive loop)&lt;/p&gt;

&lt;p&gt;Each detector operates independently over the same append-only event stream. Multiple detectors can fire simultaneously on the same execution trace.&lt;/p&gt;
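&lt;p&gt;Two of the detectors can be sketched as plain functions over a list of event dictionaries. The event shape is assumed for illustration (&lt;code&gt;tool_result&lt;/code&gt; events with a &lt;code&gt;status&lt;/code&gt; field, &lt;code&gt;delegate&lt;/code&gt; events with &lt;code&gt;from_agent&lt;/code&gt; and &lt;code&gt;to_agent&lt;/code&gt;), and the retry check counts consecutive failures, which is a simplification and not necessarily how TraceGuard's detectors work.&lt;/p&gt;

```python
def detect_retry_storm(events: list, threshold: int = 3) -> bool:
    """True if one tool fails `threshold` or more times in a row."""
    streak, last_tool = 0, None
    for ev in events:
        if ev["type"] != "tool_result":
            continue
        if ev["status"] == "error" and ev["tool_name"] == last_tool:
            streak += 1
        elif ev["status"] == "error":
            streak, last_tool = 1, ev["tool_name"]
        else:
            streak, last_tool = 0, None  # any success resets the streak
        if streak >= threshold:
            return True
    return False


def detect_delegation_cycle(events: list):
    """Return the first cycle in a linear delegation chain, or None."""
    chain = []
    for ev in events:
        if ev["type"] != "delegate":
            continue
        if not chain:
            chain.append(ev["from_agent"])
        target = ev["to_agent"]
        if target in chain:
            # Slice from the first visit of the target to show the loop.
            return chain[chain.index(target):] + [target]
        chain.append(target)
    return None
```

&lt;p&gt;Both functions read the same event list and keep no shared state, which is what lets detectors run independently over one trace.&lt;/p&gt;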




&lt;h2&gt;
  
  
  Execution Governance
&lt;/h2&gt;

&lt;p&gt;TraceGuard is intentionally designed as an external execution observer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No monkey-patching&lt;/li&gt;
&lt;li&gt;No framework lock-in&lt;/li&gt;
&lt;li&gt;No invasive runtime hooks&lt;/li&gt;
&lt;li&gt;No dependency on Hermes internals
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM proposes
      │
      ▼
Runtime executes
      │
      ▼
TraceGuard observes
      │
      ▼
Governance layer enforces invariants
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the critical distinction. Prompt engineering cannot reliably solve retry storms, hidden execution corruption, or delegation cycles, because they arise in the runtime rather than in the prompt. &lt;strong&gt;Execution-layer governance is required.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TraceEvent (&lt;code&gt;schema.py&lt;/code&gt;)&lt;/strong&gt; — Immutable Pydantic v2 execution events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TraceRecorder (&lt;code&gt;recorder.py&lt;/code&gt;)&lt;/strong&gt; — Append-only JSONL persistence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detectors (&lt;code&gt;detectors.py&lt;/code&gt;)&lt;/strong&gt; — Streaming anomaly detectors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TraceGuard (&lt;code&gt;guard.py&lt;/code&gt;)&lt;/strong&gt; — Batch + real-time governance pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core invariant is simple: &lt;strong&gt;Record every transition. Analyze the record.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once execution becomes replayable, agent runtimes stop behaving like black boxes.&lt;/p&gt;




&lt;h2&gt;
  
  
  How This Connects to Hermes
&lt;/h2&gt;

&lt;p&gt;Hermes Agent currently produces terminal output optimized for human inspection. TraceGuard proposes a complementary execution event contract — a machine-readable stream of typed, versioned, append-only events emitted alongside the human-readable output.&lt;/p&gt;

&lt;p&gt;This aligns with the discussion in &lt;a href="https://github.com/NousResearch/hermes-agent/issues/169" rel="noopener noreferrer"&gt;issue #169&lt;/a&gt; on structured execution semantics.&lt;/p&gt;

&lt;p&gt;The integration path is additive: TraceGuard requires no changes to Hermes internals. Emit events to a JSONL file; TraceGuard reads them externally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;traceguard traces/retry_storm.jsonl
&lt;span class="o"&gt;[&lt;/span&gt;WARN] RetryStormDetector: tool &lt;span class="s1"&gt;'bash'&lt;/span&gt; called 4 &lt;span class="nb"&gt;times &lt;/span&gt;without success &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;WARN] SilentFailureDetector: step 2 failed, execution continued without error handling
&lt;span class="o"&gt;[&lt;/span&gt;WARN] SilentFailureDetector: step 4 failed, execution continued without error handling
&lt;span class="o"&gt;[&lt;/span&gt;WARN] SilentFailureDetector: step 6 failed, execution continued without error handling
&lt;span class="o"&gt;[&lt;/span&gt;WARN] SilentFailureDetector: step 7 failed, execution continued without error handling

&lt;span class="nv"&gt;$ &lt;/span&gt;traceguard traces/recursive_delegation.jsonl
&lt;span class="o"&gt;[&lt;/span&gt;CRITICAL] RecursiveDelegationDetector: delegation cycle detected — planner → coder → planner

&lt;span class="nv"&gt;$ &lt;/span&gt;traceguard traces/clean.jsonl
✓ No anomalies detected.

&lt;span class="nv"&gt;$ &lt;/span&gt;traceguard traces/retry_storm.jsonl &lt;span class="nt"&gt;--strict&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"exit: &lt;/span&gt;&lt;span class="nv"&gt;$?&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;exit&lt;/span&gt;: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;traceguard&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceGuard&lt;/span&gt;

&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceGuard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traces/my_agent_run.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;anomaly&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;anomalies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;anomaly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;anomaly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;detector&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;anomaly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_clean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✓ No anomalies detected.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  My Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.10+&lt;/strong&gt; — minimum target, tested on 3.14&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic v2&lt;/strong&gt; — immutable &lt;code&gt;frozen=True&lt;/code&gt; event models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typer + Rich&lt;/strong&gt; — CLI with structured terminal output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSONL&lt;/strong&gt; — append-only trace persistence format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pytest&lt;/strong&gt; — 13/13 tests passing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hatchling&lt;/strong&gt; — packaging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond the libraries listed above, there are no external runtime dependencies and no framework lock-in.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I Used Hermes
&lt;/h2&gt;

&lt;p&gt;TraceGuard was developed and iterated with Hermes Agent as the primary development environment — reading files, applying patches, running tests, and diagnosing failures through FSM-structured execution loops.&lt;/p&gt;

&lt;p&gt;The irony is deliberate: a tool for governing agent execution traces was built by an agent whose execution was governed by the same FSM principles.&lt;/p&gt;

&lt;p&gt;Hermes drove: reading source files → generating search-and-replace (S&amp;amp;R) patches → applying changes → running pytest → diagnosing failures → iterating.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Most failures in autonomous systems are not model failures. They are execution failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infinite retries&lt;/li&gt;
&lt;li&gt;Ignored exceptions&lt;/li&gt;
&lt;li&gt;Corrupted state propagation&lt;/li&gt;
&lt;li&gt;Delegation recursion&lt;/li&gt;
&lt;li&gt;Unbounded token burn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is usually doing exactly what it was asked to do. The runtime simply lacks governance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"LLMs propose. Runtimes govern."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replay Engine&lt;/strong&gt; — Re-execute traces against patched tool implementations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral Regression Testing&lt;/strong&gt; — Compare execution traces across models and versions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Export&lt;/strong&gt; — Emit OTLP spans for Grafana, Datadog, and distributed tracing platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TraceGuard aims to be for autonomous agents what OpenTelemetry became for distributed systems.&lt;/p&gt;




&lt;p&gt;Built for the &lt;strong&gt;Hermes Agent Challenge 2026&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/Ale007XD/traceguard" rel="noopener noreferrer"&gt;https://github.com/Ale007XD/traceguard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built on &lt;a href="https://github.com/Ale007XD/nano_vm" rel="noopener noreferrer"&gt;llm-nano-vm&lt;/a&gt; — deterministic FSM execution infrastructure.&lt;/p&gt;

</description>
      <category>hermesagentchallenge</category>
      <category>devchallenge</category>
      <category>agents</category>
    </item>
    <item>
      <title>llm-nano-vm v0.8.0 — deterministic FSM runtime for LLM pipelines, now with output validation and per-step timeouts</title>
      <dc:creator>Alex Delov</dc:creator>
      <pubDate>Sat, 23 May 2026 04:36:37 +0000</pubDate>
      <link>https://dev.to/ale007xd/llm-nano-vm-v080-deterministic-fsm-runtime-for-llm-pipelines-now-with-output-validation-and-57ch</link>
      <guid>https://dev.to/ale007xd/llm-nano-vm-v080-deterministic-fsm-runtime-for-llm-pipelines-now-with-output-validation-and-57ch</guid>
      <description>&lt;p&gt;PyPI: &lt;code&gt;pip install llm-nano-vm&lt;/code&gt;&lt;br&gt;&lt;br&gt;
GitHub: &lt;a href="http://github.com/Ale007XD/nano_vm" rel="noopener noreferrer"&gt;http://github.com/Ale007XD/nano_vm&lt;/a&gt;&lt;br&gt;&lt;br&gt;
MCP gateway: &lt;a href="http://github.com/Ale007XD/nano-vm-mcp" rel="noopener noreferrer"&gt;http://github.com/Ale007XD/nano-vm-mcp&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;I've been building a deterministic FSM execution kernel for LLM workflows. v0.8.0 just shipped to PyPI. Here's what it is, what's new, and where it's going.&lt;/p&gt;


&lt;h2&gt;
  
  
  What it is
&lt;/h2&gt;

&lt;p&gt;Most LLM frameworks treat the model as the orchestrator. nano-vm flips that: the runtime is the orchestrator, the model is just one step in a deterministic graph.&lt;/p&gt;

&lt;p&gt;δ(S, E) → S'&lt;br&gt;&lt;br&gt;
Current state + validated event = next state. The model cannot skip steps, reorder them, or escape guardrails. The FSM is the source of truth.&lt;/p&gt;

&lt;p&gt;Four step types: &lt;code&gt;llm&lt;/code&gt;, &lt;code&gt;tool&lt;/code&gt;, &lt;code&gt;condition&lt;/code&gt;, &lt;code&gt;parallel&lt;/code&gt;. Programs are plain Python dicts. No DSL parser, no heavy framework magic, and zero dependency overhead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;program&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Program&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Valid refund? Reply &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Request: $user_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowed_outputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# ← v0.8.0
&lt;/span&gt;        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guardrail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;condition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;condition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="s"&gt;yes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; in &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$decision&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;then&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;otherwise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_terminal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;send_rejection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_terminal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The guardrail step cannot be bypassed regardless of what the model returns.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's new in v0.8.0
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;allowed_outputs&lt;/code&gt;: LLM enum guard&lt;/strong&gt;&lt;br&gt;
Validates the model's raw output against an explicit list before the value touches anything downstream.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"classify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Classify. Reply ONLY with: refund / query / other"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allowed_outputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"refund"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"other"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"on_error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"skip"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;falls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;back&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"refund"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(first&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;element)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mismatch&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



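&lt;p&gt;Three policies on mismatch: &lt;code&gt;fail&lt;/code&gt; (default; the trace ends in FAILED), &lt;code&gt;skip&lt;/code&gt; (substitute &lt;code&gt;allowed_outputs[0]&lt;/code&gt;, the first element), and &lt;code&gt;retry&lt;/code&gt; (retry up to &lt;code&gt;max_retries&lt;/code&gt; times, then FAILED).&lt;/p&gt;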
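
&lt;p&gt;To make the three mismatch policies concrete, here is a minimal sketch of an enum guard. It is not the nano-vm implementation: &lt;code&gt;guard_output&lt;/code&gt;, &lt;code&gt;OutputRejected&lt;/code&gt;, and the &lt;code&gt;call_model&lt;/code&gt; callable are hypothetical names, and it assumes that outputs are whitespace-stripped before comparison and that &lt;code&gt;max_retries&lt;/code&gt; counts extra calls after the first one.&lt;/p&gt;

```python
from typing import Callable, Sequence


class OutputRejected(Exception):
    """Raised when the model output never matches the allowed list."""


def guard_output(
    call_model: Callable[[], str],
    allowed_outputs: Sequence[str],
    on_error: str = "fail",
    max_retries: int = 2,
) -> str:
    """Return a value that is guaranteed to be in allowed_outputs, or raise."""
    # Assumption: "retry" makes up to max_retries extra calls; other policies call once.
    attempts = 1 + (max_retries if on_error == "retry" else 0)
    for _ in range(attempts):
        raw = call_model().strip()
        if raw in allowed_outputs:
            return raw
        if on_error == "skip":
            # Substitute the first allowed value instead of failing.
            return allowed_outputs[0]
    raise OutputRejected(f"output not in {list(allowed_outputs)}")
```

&lt;p&gt;The point of the design is that downstream steps only ever see a value from the declared list or a hard failure, never free-form model text.&lt;/p&gt;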

&lt;p&gt;&lt;strong&gt;&lt;code&gt;timeout_seconds&lt;/code&gt; + &lt;code&gt;on_timeout&lt;/code&gt;: per-step LLM timeout&lt;/strong&gt;&lt;br&gt;
Prevents a hung API call from stalling the entire FSM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"analyze"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timeout_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"on_timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fallback"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;falls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;back&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;allowed_outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;''&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two policies: fail (default) and fallback. Both features are independent and composable — you can use either or both on any llm step.&lt;/p&gt;
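
&lt;p&gt;A per-step timeout with a fallback policy can be sketched with the standard library's &lt;code&gt;asyncio.wait_for&lt;/code&gt;. The names &lt;code&gt;call_with_timeout&lt;/code&gt; and &lt;code&gt;slow_model&lt;/code&gt; are hypothetical and only illustrate the policy described above; the runtime's own mechanism may differ.&lt;/p&gt;

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def call_with_timeout(
    call_model: Callable[[], Awaitable[str]],
    timeout_seconds: float,
    on_timeout: str = "fail",
    allowed_outputs: Sequence[str] = (),
) -> str:
    """Await the model call, applying the on_timeout policy if it hangs."""
    try:
        return await asyncio.wait_for(call_model(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        if on_timeout == "fallback":
            # Fall back to the first allowed output, or "" when none is declared.
            return allowed_outputs[0] if allowed_outputs else ""
        raise


async def slow_model() -> str:
    await asyncio.sleep(1.0)  # simulates a hung API call
    return "yes"


result = asyncio.run(
    call_with_timeout(slow_model, 0.05, on_timeout="fallback", allowed_outputs=["no", "yes"])
)
print(result)  # prints "no"
```

&lt;p&gt;With the default &lt;code&gt;fail&lt;/code&gt; policy the timeout error propagates, which in an FSM would map to a failed step rather than a stalled one.&lt;/p&gt;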

&lt;h2&gt;
  
  
  What it can do right now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Suspend / resume. Return "PENDING" from any tool and the FSM moves to SUSPENDED with its cursor persisted. Resume on any external event (webhook, approval, settlement). The state path is RUNNING → SUSPENDED → RUNNING → SUCCESS.&lt;/li&gt;
&lt;li&gt;Condition branching with ASTEngine. eval() is gone. Conditions are parsed into a validated JSON AST and evaluated by a sandboxed interpreter. No Python builtins are accessible. Method calls (.lower() etc.) raise ASTEvalError at parse time instead of silently returning False.&lt;/li&gt;
&lt;li&gt;GDPR tombstoning. Sensitive values stored as CapabilityRef tokens (vault://secret/). On erasure event: ref tombstoned, all projections return [REDACTED_TOMBSTONE], hash chain stays valid.&lt;/li&gt;
&lt;li&gt;GovernanceEnvelope. Every successful step produces an immutable, append-only audit record: execution_id, step_id, policy_hash, canonical_snapshot_hash, sanitized payload.&lt;/li&gt;
&lt;li&gt;MCP gateway (nano-vm-mcp). Exposes run_program, get_trace, list_programs etc. over stdio or SSE transport with bearer auth and SQLite WAL persistence. Works with Claude Desktop and any MCP client.&lt;/li&gt;
&lt;li&gt;Budget guardrails. max_steps, max_tokens, max_stalled_steps — FSM halts with BUDGET_EXCEEDED or STALLED before the next step, not after.&lt;/li&gt;
&lt;/ul&gt;
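
&lt;p&gt;The budget guardrail item above hinges on checking limits before a step runs, not after. A minimal sketch of that ordering follows. &lt;code&gt;Budget&lt;/code&gt;, &lt;code&gt;Usage&lt;/code&gt;, &lt;code&gt;check_budget&lt;/code&gt;, and &lt;code&gt;run&lt;/code&gt; are hypothetical names, and modelling each step as a &lt;code&gt;(tokens, made_progress)&lt;/code&gt; pair is an assumption made only for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    max_stalled_steps: int


@dataclass
class Usage:
    steps: int = 0
    tokens: int = 0
    stalled_steps: int = 0


def check_budget(budget: Budget, usage: Usage) -> Optional[str]:
    """Return a halt status if the next step must not run, else None."""
    if usage.steps >= budget.max_steps or usage.tokens >= budget.max_tokens:
        return "BUDGET_EXCEEDED"
    if usage.stalled_steps >= budget.max_stalled_steps:
        return "STALLED"
    return None


def run(steps: list[tuple[int, bool]], budget: Budget) -> tuple[str, Usage]:
    """Run (tokens, made_progress) steps, checking the budget before each one."""
    usage = Usage()
    for tokens, made_progress in steps:
        status = check_budget(budget, usage)
        if status is not None:
            return status, usage  # halted before executing the step
        usage.steps += 1
        usage.tokens += tokens
        # Consecutive steps without progress count toward the stall limit.
        usage.stalled_steps = 0 if made_progress else usage.stalled_steps + 1
    return "SUCCESS", usage
```

&lt;p&gt;Because the check sits at the top of the loop, a step that would start over budget is never executed, so its side effects never happen.&lt;/p&gt;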

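&lt;p&gt;Benchmark: v0.8.0 (WSL2 · Python 3.12 · MockAdapter · 3×5×10k)&lt;br&gt;
10/10 PASS · 1,096,500 ops · 0 violations&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Mean TPS&lt;/th&gt;
&lt;th&gt;p95 latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Refund pipeline&lt;/td&gt;
&lt;td&gt;2,200/s&lt;/td&gt;
&lt;td&gt;123 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Double-execution guard&lt;/td&gt;
&lt;td&gt;2,800/s&lt;/td&gt;
&lt;td&gt;69 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget enforcement&lt;/td&gt;
&lt;td&gt;2,400/s&lt;/td&gt;
&lt;td&gt;97 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel throughput&lt;/td&gt;
&lt;td&gt;1,000/s&lt;/td&gt;
&lt;td&gt;196 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP store round-trip&lt;/td&gt;
&lt;td&gt;11,000/s&lt;/td&gt;
&lt;td&gt;0.13 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GovernanceEnvelope&lt;/td&gt;
&lt;td&gt;2,100/s&lt;/td&gt;
&lt;td&gt;108 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crash consistency&lt;/td&gt;
&lt;td&gt;11/s&lt;/td&gt;
&lt;td&gt;115 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay equivalence&lt;/td&gt;
&lt;td&gt;1,300/s&lt;/td&gt;
&lt;td&gt;164 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adversarial retries&lt;/td&gt;
&lt;td&gt;2,600/s&lt;/td&gt;
&lt;td&gt;87 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-horizon (1k steps)&lt;/td&gt;
&lt;td&gt;95/s&lt;/td&gt;
&lt;td&gt;11,887 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;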

&lt;p&gt;BM-INT-07 (Crash consistency): crash_rate=100% hash_match=100% — replay after simulated crash produces identical trace hash every time.&lt;br&gt;&lt;br&gt;
BM-INT-10 (Memory footprint): peak RSS 76.5 MB, alloc 3.62 MB for 1,000-step programs — no memory leaks detected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Validated on real payment APIs
&lt;/h2&gt;

&lt;p&gt;Two PoCs, both 9/9 tests passing with mock adapters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MoMo Payment API v4 — 3-way condition branch, HMAC-SHA256 IPN verification, polling loop with retry, next_step/is_terminal DSL.&lt;/li&gt;
&lt;li&gt;Stripe Payment API v1 — 3DS flow (REQUIRES_ACTION sentinel), refund pipeline with LLM classifier, webhook verification. Found and fixed two bugs in the process: "PENDING" sentinel collision (Stripe was returning it as a domain status, triggering FSM suspend), and silent ASTEvalError for .lower() in condition expressions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's coming next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 0 (Immediate):&lt;/strong&gt; ProgramValidator, static analysis at Program build time. It catches missing then/otherwise/next_step targets, unreachable steps, and cycles. Currently these fail at runtime; for LLM-generated workflows, static analysis is a must.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1 (Gateway Correctness):&lt;/strong&gt; StateContext persistence between MCP calls in SQLite WAL. Right now, if the gateway process restarts after /create but before polling completes, you get a new requestId, which is a real financial duplication risk. This will be closed with an execution_contexts table and an upsert on every step. Up next: TRACE projection to SQLite, GovernedToolExecutor (policy-level tool capability enforcement), idempotency_store, and native vm.step() MCP wiring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2 (Dev Agent):&lt;/strong&gt; nano-vm-dev-agent, the FSM runtime managing its own development stack (read_repo_files → generate_patch(llm) → run_mypy → run_pytest → write_repo_files). The DA-1 milestone is done (12/12 tests). DA-2 will be the first live run against a real sprint task (StateContext persistence). The search_code and reproduce_bug tool functions are still in progress and need to land before the live run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3 (Observability):&lt;/strong&gt; One OpenTelemetry span per FSM step, plus incremental counters in Trace (llm_calls, tool_calls, retries_total).&lt;/li&gt;
&lt;/ul&gt;
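
&lt;p&gt;The kind of build-time check planned for ProgramValidator can be sketched as a reachability pass over the step list. This is not the planned implementation: &lt;code&gt;validate_program&lt;/code&gt; is a hypothetical name, the sketch assumes that a non-terminal step without explicit targets falls through to the next step in the list, and it covers only missing targets and unreachable steps, not cycle detection.&lt;/p&gt;

```python
from collections import deque


def validate_program(steps: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    ids = [step["id"] for step in steps]
    known = set(ids)
    errors: list[str] = []
    edges: dict[str, list[str]] = {}

    for index, step in enumerate(steps):
        targets = [step[key] for key in ("then", "otherwise", "next_step") if key in step]
        for target in targets:
            if target not in known:
                errors.append(f"{step['id']}: unknown target {target!r}")
        # Assumption: a step without explicit targets falls through to the next one.
        if not targets and not step.get("is_terminal") and index + 1 != len(steps):
            targets = [ids[index + 1]]
        edges[step["id"]] = [t for t in targets if t in known]

    # Breadth-first search from the first step finds everything reachable.
    seen = set(ids[:1])
    queue = deque(ids[:1])
    while queue:
        for target in edges[queue.popleft()]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    errors.extend(f"{step_id}: unreachable" for step_id in ids if step_id not in seen)
    return errors
```

&lt;p&gt;Run against a program dict before execution, a check like this turns a runtime failure on a misspelled branch target into a build-time error.&lt;/p&gt;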

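&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pip install llm-nano-vm==0.8.0&lt;/code&gt;&lt;br&gt;
&lt;code&gt;pip install "llm-nano-vm[litellm]==0.8.0"&lt;/code&gt; (LiteLLM provider support; the quotes keep shells such as zsh from expanding the brackets)&lt;br&gt;
&lt;code&gt;pip install nano-vm-mcp&lt;/code&gt; (MCP gateway)&lt;/p&gt;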

&lt;p&gt;LLMs are completely optional. The runtime works perfectly fine as a pure, lightweight deterministic workflow engine.&lt;/p&gt;

&lt;p&gt;Questions / feedback welcome!&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>backend</category>
      <category>opensource</category>
      <category>fintech</category>
    </item>
    <item>
      <title>Models shouldn't have execution authority. Why we built a deterministic FSM runtime for AI agents.</title>
      <dc:creator>Alex Delov</dc:creator>
      <pubDate>Thu, 21 May 2026 04:49:39 +0000</pubDate>
      <link>https://dev.to/ale007xd/models-shouldnt-have-execution-authority-why-we-built-a-deterministic-fsm-runtime-for-ai-agents-1op5</link>
      <guid>https://dev.to/ale007xd/models-shouldnt-have-execution-authority-why-we-built-a-deterministic-fsm-runtime-for-ai-agents-1op5</guid>
      <description>&lt;p&gt;Modern agent frameworks implicitly treat a probabilistic model as an execution authority. That is acceptable for read-only tasks (e.g., summarizing logs or searching the web). But once an agent can mutate external state — payments, databases, infrastructure, PII — the architecture becomes fundamentally unsafe.&lt;/p&gt;

&lt;p&gt;When preparing our internal agents (PlanBot, SkillBot) for white-label distribution, we realized we needed to change the control plane. &lt;strong&gt;nano-vm&lt;/strong&gt; does not attempt to make the model trustworthy. Instead, it assumes model output is untrusted intent and constrains its blast radius through strict deterministic execution semantics.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Runtime Guarantees (Not just another wrapper)
&lt;/h3&gt;

&lt;p&gt;We built &lt;strong&gt;nano-vm&lt;/strong&gt; — a deterministic FSM runtime for stateful AI systems. The value isn't just in having an FSM; the value is that the execution graph is finite, verifiable, and known ahead of time.&lt;/p&gt;

&lt;p&gt;The runtime enforces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic transition graph:&lt;/strong&gt; Execution graph cannot self-modify at runtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compile-time ordering:&lt;/strong&gt; Step order is fixed when the program is compiled, so a &lt;code&gt;reorder_steps&lt;/code&gt; attack is structurally impossible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability gating:&lt;/strong&gt; Strictly bounded side-effects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay resistance:&lt;/strong&gt; Idempotency boundaries built into the state transitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable auditability:&lt;/strong&gt; Cryptographic history of every action.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ASTEngine: Limitation as a Security Property
&lt;/h3&gt;

&lt;p&gt;In most agent runtimes, the execution loop is essentially: &lt;code&gt;prompt -&amp;gt; JSON -&amp;gt; dynamic dispatch -&amp;gt; side-effect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We completely removed &lt;code&gt;eval()&lt;/code&gt;. Conditions and side-effects are evaluated by a sandboxed &lt;code&gt;DeterministicSanitizer&lt;/code&gt; using an isolated &lt;code&gt;ASTEngine&lt;/code&gt;. It supports basic operators (&lt;code&gt;==&lt;/code&gt;, &lt;code&gt;contains&lt;/code&gt;, &lt;code&gt;$var.field&lt;/code&gt;) but completely lacks loops or system calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The policy layer is intentionally less expressive than Python.&lt;/strong&gt; That limitation is a security property, not a missing feature. Loop exhaustion and ReDoS attacks are structurally impossible.&lt;/p&gt;
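
&lt;p&gt;The idea of a deliberately limited expression language can be illustrated with a whitelist evaluator. The sketch below is an assumption-heavy stand-in, not the ASTEngine: it borrows Python's own &lt;code&gt;ast&lt;/code&gt; module and Python operator syntax (&lt;code&gt;in&lt;/code&gt; instead of &lt;code&gt;contains&lt;/code&gt;), and &lt;code&gt;evaluate&lt;/code&gt; and &lt;code&gt;ConditionError&lt;/code&gt; are hypothetical names. Anything outside the whitelist is rejected rather than executed.&lt;/p&gt;

```python
import ast
from typing import Any


class ConditionError(Exception):
    """Raised when an expression uses anything outside the whitelist."""


def evaluate(expression: str, variables: dict[str, Any]) -> bool:
    """Evaluate a tiny comparison language: literals, names, ==, !=, in, and/or."""
    tree = ast.parse(expression, mode="eval")

    def visit(node: ast.AST) -> Any:
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in variables:
                raise ConditionError(f"unknown variable {node.id!r}")
            return variables[node.id]
        if isinstance(node, ast.BoolOp):
            values = [bool(visit(value)) for value in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.Compare) and len(node.ops) == 1:
            left, right = visit(node.left), visit(node.comparators[0])
            op = node.ops[0]
            if isinstance(op, ast.Eq):
                return left == right
            if isinstance(op, ast.NotEq):
                return left != right
            if isinstance(op, ast.In):
                return left in right
        # Calls, attribute access, lambdas, comprehensions, etc. all end up here.
        raise ConditionError(f"disallowed syntax: {type(node).__name__}")

    return bool(visit(tree))
```

&lt;p&gt;Because the interpreter only walks node types it explicitly recognizes, there is no path from a condition string to a function call, a loop, or a builtin.&lt;/p&gt;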

&lt;h3&gt;
  
  
  Sabotage Mode: Demonstrating Failure Semantics
&lt;/h3&gt;

&lt;p&gt;To demonstrate the runtime under adversarial conditions, we built a 7-step fintech pipeline (PDF invoice -&amp;gt; Stripe test-mode adapter) with an integrated &lt;strong&gt;Sabotage Mode&lt;/strong&gt;. Instead of a happy-path demo, we built 5 injectors directly into the UI to demonstrate adversarial failure semantics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;tool_injection&lt;/code&gt; (Capability boundary violation)&lt;/strong&gt;&lt;br&gt;
Proposed tool invocations are treated as untrusted intent. If the LLM attempts to initiate an unauthorized &lt;code&gt;wire_transfer($50,000)&lt;/code&gt;, the &lt;code&gt;ExecutionVM&lt;/code&gt; resolves the request against a compile-time capability snapshot. The transition is rejected before any external side-effect layer becomes reachable. Zero side effects reach the network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbsbhyv16cp57d8bw34j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbsbhyv16cp57d8bw34j.png" alt="(The ExecutionVM blocking an unauthorized tool injection at the capability boundary)." width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;
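
&lt;p&gt;A capability snapshot of this kind can be sketched as a read-only tool registry that every proposed call is resolved against. The names below (&lt;code&gt;freeze_capabilities&lt;/code&gt;, &lt;code&gt;dispatch&lt;/code&gt;, &lt;code&gt;CapabilityViolation&lt;/code&gt;) and the shape of the proposal dict are hypothetical; the real &lt;code&gt;ExecutionVM&lt;/code&gt; may work differently.&lt;/p&gt;

```python
from types import MappingProxyType
from typing import Any, Callable, Mapping


class CapabilityViolation(Exception):
    """Raised when a proposed tool is not in the capability snapshot."""


def freeze_capabilities(tools: Mapping[str, Callable[..., Any]]) -> Mapping[str, Callable[..., Any]]:
    """Take a read-only snapshot of the tools a program may call."""
    return MappingProxyType(dict(tools))


def dispatch(snapshot: Mapping[str, Callable[..., Any]], proposal: dict) -> Any:
    """Execute a model-proposed tool call only if the snapshot allows it."""
    name = proposal.get("tool")
    if name not in snapshot:
        # Rejected before any tool code runs, so no side effect is reachable.
        raise CapabilityViolation(f"tool {name!r} is not permitted")
    return snapshot[name](**proposal.get("args", {}))
```

&lt;p&gt;The snapshot is copied and wrapped in &lt;code&gt;MappingProxyType&lt;/code&gt;, so neither the model output nor later code can add a tool to it after the program is built.&lt;/p&gt;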

&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;double_exec&lt;/code&gt; (Replay &amp;amp; Idempotency)&lt;/strong&gt;&lt;br&gt;
External side-effects are executed through idempotent adapters keyed by &lt;code&gt;execution_id&lt;/code&gt;, so internal state can be replayed deterministically during recovery without duplicating external mutations. Once the FSM reaches a terminal state (&lt;code&gt;SUCCESS&lt;/code&gt; or &lt;code&gt;FAILED&lt;/code&gt;), that state is absorbing (&lt;code&gt;δ(SUCCESS|FAILED, *) = NOP&lt;/code&gt;). Replays are silently dropped.&lt;/p&gt;
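
&lt;p&gt;Both halves of this guarantee, absorbing terminal states and adapters keyed by &lt;code&gt;execution_id&lt;/code&gt;, fit in a few lines. The classes &lt;code&gt;Machine&lt;/code&gt; and &lt;code&gt;IdempotentAdapter&lt;/code&gt; and the event names are hypothetical, and the in-memory dict stands in for whatever durable store a real adapter would need.&lt;/p&gt;

```python
TERMINAL = {"SUCCESS", "FAILED"}


class Machine:
    """Toy FSM whose terminal states ignore every further event."""

    def __init__(self) -> None:
        self.state = "RUNNING"

    def apply(self, event: str) -> str:
        if self.state in TERMINAL:
            return self.state  # absorbing state: the event is a no-op
        if event in ("succeed", "fail"):
            self.state = "SUCCESS" if event == "succeed" else "FAILED"
        return self.state


class IdempotentAdapter:
    """Runs a side effect at most once per execution_id and caches the result."""

    def __init__(self, effect) -> None:
        self._effect = effect
        self._results: dict[str, object] = {}

    def execute(self, execution_id: str, payload: dict) -> object:
        if execution_id not in self._results:
            self._results[execution_id] = self._effect(payload)
        return self._results[execution_id]
```

&lt;p&gt;A replayed call with the same &lt;code&gt;execution_id&lt;/code&gt; returns the cached result, so recovery can re-run the trace without charging a customer twice.&lt;/p&gt;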

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;corrupt_hash&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
Tampering with the validation hash instantly throws the FSM into a &lt;code&gt;FAILED&lt;/code&gt; state, resulting in a zeroed envelope chain. The audit trail cannot be silently broken.&lt;/p&gt;
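
&lt;p&gt;Tamper detection of this kind typically rests on a hash chain, where each record's hash covers the previous hash. The sketch below shows the general technique with SHA-256 from the standard library. The record layout, the &lt;code&gt;GENESIS&lt;/code&gt; value, and the function names are assumptions and are not taken from nano-vm.&lt;/p&gt;

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first link


def link_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the previous link."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(records: list[dict]) -> list[dict]:
    """Append records one by one, chaining each hash to the previous one."""
    chain, prev = [], GENESIS
    for record in records:
        prev = link_hash(prev, record)
        chain.append({"record": record, "hash": prev})
    return chain


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited record or hash breaks verification."""
    prev = GENESIS
    for entry in chain:
        if link_hash(prev, entry["record"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

&lt;p&gt;Serializing with &lt;code&gt;sort_keys=True&lt;/code&gt; gives a canonical form, so two logically equal records always hash to the same value.&lt;/p&gt;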

&lt;h3&gt;
  
  
  GDPR Art.17 vs. Immutable Audit Trails
&lt;/h3&gt;

&lt;p&gt;Handling the "Right to Erasure" without breaking cryptographic audit chains is a major headache in fintech.&lt;/p&gt;

&lt;p&gt;We implemented a &lt;code&gt;GDPR-erase&lt;/code&gt; mechanism that targets specific &lt;code&gt;vault://secret/ref&lt;/code&gt; pointers and replaces the PII with a &lt;code&gt;[REDACTED_TOMBSTONE]&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The PII becomes completely inaccessible.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;hash_chain&lt;/code&gt; and &lt;code&gt;canonical_hash&lt;/code&gt; survive.&lt;/li&gt;
&lt;li&gt;Cryptographic continuity is maintained.&lt;/li&gt;
&lt;li&gt;Referential integrity is preserved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You delete the data, but you do not destroy the mathematical proof that the operation occurred safely.&lt;/p&gt;
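
&lt;p&gt;One way to get this property is to hash only the reference token, never the sensitive value itself, and to resolve the reference at read time. The sketch below follows that idea under stated assumptions: &lt;code&gt;Vault&lt;/code&gt;, &lt;code&gt;record_hash&lt;/code&gt;, and &lt;code&gt;project&lt;/code&gt; are hypothetical names, and the &lt;code&gt;vault://secret/&lt;/code&gt; naming scheme is simplified for illustration.&lt;/p&gt;

```python
import hashlib
import json

TOMBSTONE = "[REDACTED_TOMBSTONE]"


class Vault:
    """Holds sensitive values outside the audit log, addressed by reference."""

    def __init__(self) -> None:
        self._values: dict[str, str] = {}

    def store(self, name: str, value: str) -> str:
        ref = f"vault://secret/{name}"
        self._values[ref] = value
        return ref

    def erase(self, ref: str) -> None:
        self._values[ref] = TOMBSTONE  # the value is gone, the ref stays valid

    def resolve(self, ref: str) -> str:
        return self._values[ref]


def record_hash(record: dict) -> str:
    """Hash the audit record, which contains only the reference token."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()


def project(record: dict, vault: Vault) -> dict:
    """Build a readable view by resolving vault references at read time."""
    return {
        key: vault.resolve(value) if str(value).startswith("vault://") else value
        for key, value in record.items()
    }
```

&lt;p&gt;Erasure changes what the vault returns, not what the audit record contains, so the record hash before and after erasure is identical.&lt;/p&gt;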

&lt;h3&gt;
  
  
  Execution Authority vs. Model Quality
&lt;/h3&gt;

&lt;p&gt;LLMs are excellent planners. They are terrible sources of execution truth.&lt;/p&gt;

&lt;p&gt;The core design question for stateful AI systems may not be model quality.&lt;br&gt;
It may be execution authority.&lt;/p&gt;

&lt;p&gt;Should a probabilistic model be allowed to mutate state directly?&lt;br&gt;
Or should execution pass through a deterministic control layer first?&lt;/p&gt;

&lt;p&gt;If you want to try breaking the FSM yourself, the Sabotage Mode is live, and the core is open-source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core runtime:&lt;/strong&gt; &lt;a href="https://github.com/Ale007XD/nano_vm" rel="noopener noreferrer"&gt;github.com/Ale007XD/nano_vm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP gateway layer:&lt;/strong&gt; &lt;a href="https://github.com/Ale007XD/nano-vm-mcp" rel="noopener noreferrer"&gt;github.com/Ale007XD/nano-vm-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live Sabotage Demo:&lt;/strong&gt; &lt;a href="http://demo.bannerbot.ru:8843" rel="noopener noreferrer"&gt;demo.bannerbot.ru:8843&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Curious how others here are approaching capability boundaries, replay resistance, and auditability in agent runtimes.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>security</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
