2026-05-04 | OpenClaw Runtime Control Plane V37.9.24 | Stage 2 Position Article #5
TL;DR
Over 11 days, I built into a production Agent Runtime one capability that most control plane frameworks lack: automatic synchronization from declared state to runtime state.
Declared State                          Runtime State
(jobs_registry.yaml)                    (macOS crontab -l)
         │                                      │
         │ ───────[ verify_convergence ]─────── │
         │                                      │
         └────[ machine_sync_via_helper ]───────┘
                  (V37.9.24 Plan B dry-run)
11 days ago, this sync chain depended on "Claude Code remembering to run crontab_safe.sh add after each commit." Today, the framework automatically detects drift on every governance audit cron, automatically generates 36 cron lines, and automatically syncs them into crontab via crontab_safe.sh add.
Memory is the weakest reliability primitive. This article explains why a "declare → decide" policy engine isn't enough and why a control plane must be upgraded to a convergence engine, with OpenClaw's engineering proof from walking this path across six versions (V37.9.19 → V37.9.24).
If you're building an Agent Runtime, internal platform, or tool governance system, this should save you several months of iteration.
First Illusion: Control Plane = Policy Engine
The mainstream "control plane" narrative is roughly:
Declare your policy → System evaluates at request time → Allow or deny.
OPA (Open Policy Agent) / Cedar / Casbin / Kyverno all follow this paradigm. So do Kubernetes admission controllers. They solve:
input (request) ──[policy]──→ decision (allow / deny / mutate)
Elegant. But they don't solve one thing: what happens when your declared state diverges from the system's actual runtime state?
Example: you declare 36 cron jobs, each with entry / interval / log. But the macOS crontab might be missing one, have an extra, or have drifted to the wrong interval. OPA helps you "judge whether the current state is compliant," but after the judgment, who does the syncing? The answer is always: someone remembers to run a command.
OPA Style                      OpenClaw Pre-V37.9.18
──────────────                 ──────────────────────────
Declare → Eval → Decide        Declare → Eval → Alert → Wait
                                                          ↑
                               Memory = Weakest Reliability Primitive
Second Illusion: More Audit Rules Make Systems Stable
In an earlier article, I quantified this: across 45 days with 53 governance invariants + 15 meta-rules, audit's prevention rate for unknown dimensions was 0%.
The numbers are brutal, but the meaning is clear: audit can't prevent failures that haven't happened yet—it can only ensure failures that have already happened don't recur.
V37.9.18 demonstrated this principle the hard way:
The kb_deep_dive job launched in V37.9.16 with enabled=true declared in jobs_registry, but nobody manually ran crontab_safe.sh add. Two expected 22:30 triggers never fired; the user only noticed 48 hours later.
After root-causing this, I established MR-17:
declared-state-must-converge-to-runtime-via-machine-not-memory
Every declared resource (yaml/registry/config) must have a corresponding runtime fact (cron/process/http/filesystem). Drift detection must be upgraded from "humans remembering to run commands after commits" to "machines periodically detecting + syncing automatically."
This rule rewrote the boundary of what a control plane is: a control plane is no longer just a policy engine. It must include a convergence engine — the actual sync mechanism for declared → runtime, not just an evaluation mechanism.
Three Engineering Proofs: Convergence Framework's 11-Day Evolution
V37.9.19 → V37.9.24 spans six versions, each doing one thing:
V37.9.19 — Framework Bootstrap + First Spec
ontology/convergence.py introduces the ConvergenceResult namedtuple + verify_convergence(spec_id) top-level API + named-dispatch tables (extractors / observers / parsers). Decoupled from ONTOLOGY_MODE: convergence is governance-layer observability, not request-path enforcement.
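To make that shape concrete, here is a minimal sketch of such a named-dispatch core. The table wiring, spec field names, and ConvergenceResult fields are assumptions based on the description above, not OpenClaw's actual code:

# Minimal sketch, NOT the real convergence.py: field names and table layout
# are illustrative assumptions.
from collections import namedtuple

ConvergenceResult = namedtuple(
    "ConvergenceResult",
    ["spec_id", "declared", "observed", "missing", "drift_action"],
)

_DECLARED_EXTRACTORS = {}  # name -> fn(spec) -> set of declared identifiers
_RUNTIME_OBSERVERS = {}    # name -> fn(spec) -> raw runtime output (str)
_IDENTIFIER_PARSERS = {}   # name -> fn(raw, declared) -> set of observed ids

def verify_convergence(spec_id, specs):
    """Compare one spec's declared identifiers against a runtime observation."""
    spec = specs[spec_id]
    declared = _DECLARED_EXTRACTORS[spec["declaration"]["extractor"]](spec)
    raw = _RUNTIME_OBSERVERS[spec["runtime_observable"]["method"]](spec)
    observed = _IDENTIFIER_PARSERS[spec["runtime_observable"]["parser"]](raw, declared)
    return ConvergenceResult(
        spec_id, len(declared), len(observed),
        sorted(declared - observed), spec["drift_action"],
    )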
The first spec: jobs_to_crontab (drift_action: alert_only — cautious start due to high blast radius).
- id: jobs_to_crontab
  declaration:
    source: jobs_registry.yaml
    extractor: registry_enabled_system_jobs
  runtime_observable:
    method: shell_command
    command: "crontab -l"
    parser: line_contains_identifier
  drift_action: alert_only   # V37.9.19 — alert-only during one-week observation

V37.9.20 — Extensibility Proof (First Named-Dispatch Proof)
Added a providers_to_adapter spec — providers.py ProviderRegistry.list_names() vs the adapter :5001/health fallback_chain. Core framework changes = 0 lines. All extensions went through new entries in the named-dispatch tables:
_DECLARED_EXTRACTORS["providers_from_registry"] = _extract_providers_from_registry
_RUNTIME_OBSERVERS["http_endpoint"] = _observe_http_endpoint
_IDENTIFIER_PARSERS["json_set_union"] = _parse_json_set_union
This proved that the framework's promise of "adding new spec types requires zero framework changes" wasn't a hollow claim.
V37.9.22 — Cross-Granularity Extensions + Integration into Main Audit
Third spec: openclaw_config_to_runtime (mid-extension path: extracted _walk_json_paths_to_set shared helper). Fourth spec: kb_sources_to_index (minimal extension: only one new extractor, reusing V37.9.19's observer + parser).
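For flavor, here is a sketch of what a shared path-walking helper like that could look like; the dotted-path syntax and return shape are my assumptions, not the project's actual signature:

# Hypothetical helper: walk dotted JSON paths and collect leaf values as a set.
def _walk_json_paths_to_set(obj, paths):
    out = set()
    for path in paths:
        node = obj
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = None
                break
        if node is not None:
            out.add(str(node))
    return out

# Example: collect declared gateway settings from a parsed config dict.
# _walk_json_paths_to_set(config, ["gateway.port", "gateway.mode"])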
The final step: integrate the framework into the main governance audit flow:
# governance_checker.py main flow
results = run_invariants()
discovery = run_meta_discovery()
convergence = run_convergence_specs()   # ← Added in V37.9.22
The framework was upgraded from "indirectly invoked by INV runtime checks" to "actively consumed on every audit cron."
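A plausible shape for that entry point, reusing the verify_convergence sketch from above and assuming the spec file layout shown in V37.9.19 (the loader details are guesses, not the audited code):

import yaml  # PyYAML

def run_convergence_specs(path="ontology/convergence_ontology.yaml"):
    # Consumed on every governance audit cron since V37.9.22.
    with open(path) as f:
        specs = {s["id"]: s for s in yaml.safe_load(f)}
    return [verify_convergence(spec_id, specs) for spec_id in specs]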
V37.9.23 — Plan B Gradual Dry-Run + Real Sync Path
The May 3rd decision window arrived (V37.9.19 baseline + 7 days of observation). One week of production data: declared=36, observed=36, zero drift, zero false positives. jobs_to_crontab was upgraded from drift_action: alert_only to machine_sync.
Three pieces were introduced: _format_cron_line(job), a pure function that emits cron lines matching the V37.9.18 INV-CRON-003 pattern and rejects shell metacharacters as defense-in-depth; the _apply_machine_sync(spec, missing, dry_run) orchestrator, which calls crontab_safe.sh add for the real sync; and the _is_dry_run() env reader.
drift_action: machine_sync              # V37.9.23 escalation
convergence_method:
  implemented: machine_sync_via_helper  # Replaces V37.9.19's planned
  helper: "bash $HOME/crontab_safe.sh add"
  dry_run_env_var: CONVERGENCE_DRY_RUN
  dry_run_default: true                 # Safety net: V37.9.24+ flips it off
The key to Plan B (gradual dry-run): drift_action upgrade + default dry-run env control. Operators see the literal apply[dry-run]=36 in governance audit output to verify cron line construction is correct, then in V37.9.24+ flip the env to actually activate it. This mirrors the "shadow → on" pattern from V37.9.13's P2 context evaluator, applied at the convergence layer.
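A minimal sketch of those two safety pieces, assuming the CONVERGENCE_DRY_RUN variable from the spec above; the job field names and the metacharacter set are illustrative guesses, not OpenClaw's actual code:

import os
import re

def _is_dry_run(default=True):
    # dry_run_default stays true until an operator flips CONVERGENCE_DRY_RUN off.
    raw = os.environ.get("CONVERGENCE_DRY_RUN")
    if raw is None:
        return default
    return raw.strip().lower() not in ("0", "false", "no")

_SHELL_META = re.compile(r"[;&|`$<>]")

def _format_cron_line(job):
    # Pure function: reject shell metacharacters before emitting a cron line
    # (defense-in-depth against a poisoned registry entry).
    if _SHELL_META.search(job["entry"]):
        raise ValueError(f"shell metacharacter in entry: {job['entry']!r}")
    return f'{job["interval"]} {job["entry"]} >> {job["log"]} 2>&1'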
V37.9.24 — Named-Dispatch for Apply Functions + Second machine_sync Spec
We observed that kb_sources_to_index had a fundamentally different apply pattern from jobs_to_crontab:
Dimension          jobs_to_crontab     kb_sources_to_index
Helper             crontab_safe.sh     kb_embed.py
Pattern            per-entry call      one-shot incremental
Startup overhead   <100ms              ~3s (load embedding model)
Input              single cron line    entire KB (mtime diff)
Making V37.9.23's _apply_machine_sync support both patterns simultaneously would have meant if-else dispatch and hardcoded spec_ids, violating V37.9.20's named-dispatch design principle.
V37.9.24 refactored _apply_machine_sync into a top-level dispatcher that routes by the spec yaml's convergence_method.apply_function field:
_APPLY_FUNCTIONS = {
    "jobs_to_crontab_per_entry": _apply_jobs_to_crontab_per_entry,
    "kb_embed_incremental": _apply_kb_embed_incremental,
}

def _apply_machine_sync(spec, missing_entries, dry_run=None):
    # Route by the spec yaml's convergence_method.apply_function field.
    method = spec.get("convergence_method") or {}
    fn_name = method.get("apply_function") or ""
    fn = _APPLY_FUNCTIONS.get(fn_name)
    if fn is None:
        raise KeyError(f"unknown apply_function: {fn_name!r}")
    return fn(spec, missing_entries, dry_run)
Adding machine_sync for kb_sources_to_index requires only three steps:
1. Implement _apply_kb_embed_incremental (a one-shot single subprocess call).
2. Register it in the _APPLY_FUNCTIONS dict.
3. Add apply_function: kb_embed_incremental to the spec yaml.
The _apply_machine_sync top-level dispatcher: zero changes.
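As a sketch of what the one-shot pattern could look like, reusing the _is_dry_run sketch from earlier (the kb_embed.py flag and the return shape here are assumptions, not the real helper interface):

import os
import subprocess

def _apply_kb_embed_incremental(spec, missing_entries, dry_run=None):
    dry_run = _is_dry_run() if dry_run is None else dry_run
    if dry_run:
        # One summary line, not per-entry counts: shows up as apply[dry-run]=1.
        return {"applied": 0, "dry_run": 1, "errors": 0}
    # A single subprocess call re-embeds whatever the mtime diff flags; the ~3s
    # model-load cost is paid once, not once per missing entry.
    rc = subprocess.run(
        ["python3", os.path.expanduser("~/kb_embed.py"), "--incremental"]
    ).returncode
    return {"applied": 1, "dry_run": 0, "errors": int(rc != 0)}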
Production Evidence: governance audit Output
Running python3 ontology/governance_checker.py on the production Mac Mini, the convergence section shows:
──────────────────────────────────────────────────────────────────────
CONVERGENCE FRAMEWORK (Phase 4 Layer 5) — 4 spec(s)
──────────────────────────────────────────────────────────────────────
✅ [jobs_to_crontab] — declared=36 observed=36 (no drift)
⚠️ [providers_to_adapter] — declared=7 observed=2 missing=5 (drift_action=alert_only)
⚠️ [openclaw_config_to_runtime] — declared=1 observed=1 (no drift)
⚠️ [kb_sources_to_index] — declared=14 observed=11 missing=3 (drift_action=machine_sync) apply[dry-run]=1 apply_errors=0
Four specs, three drift_action variants:
- jobs_to_crontab (machine_sync, real sync) — zero drift, no apply needed
- kb_sources_to_index (machine_sync, real sync) — 3 missing, one dry-run one-shot summary line
- providers_to_adapter (alert_only_permanent) — 5 providers missing API keys; the framework can't magically provision keys; this is an operator decision
- openclaw_config_to_runtime (alert_only_permanent) — Gateway runtime state changes are intentional operator actions
The framework knows each spec's apply path is different → routes via named-dispatch → emits observable logs.
Third Insight: drift_action Is 4-Tier, Not 1-Tier
Mainstream policy engines have only "allow/deny" or "warn"-tier behaviors. OpenClaw's convergence framework explicitly splits drift_action into 4 tiers:
drift_action          Meaning                                                   Typical spec
alert_only            Emits alert only; operator decides how to fix             cautious bootstrap mode for new specs
alert_only_permanent  Structural decision — framework can never magically       providers_to_adapter / openclaw_config_to_runtime
                      fix API keys / Gateway state
machine_sync          Framework auto-syncs declared → runtime                   jobs_to_crontab / kb_sources_to_index
block_until_human     Drift blocks subsequent audits until human confirmation   security-sensitive specs
Each tier corresponds to a different engineering commitment. Seeing a spec marked alert_only_permanent, an operator knows: "I shouldn't wait for the framework to fix this — it's a permanent dashboard signal I monitor." Seeing machine_sync + dry_run_default: true, an operator knows: "I should flip dry-run off in a week, otherwise the framework won't actually do anything."
The existence of drift_action turns declared → runtime sync from a binary decision into a gradient.
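A hedged sketch of what consuming that gradient could look like in the audit flow; emit_alert, AuditBlocked, and the tier routing are illustrative, built on the ConvergenceResult and _apply_machine_sync sketches above, not OpenClaw's code:

class AuditBlocked(RuntimeError):
    """Raised when a block_until_human spec drifts: halts subsequent audits."""

def emit_alert(result):
    print(f"⚠️ [{result.spec_id}] missing={result.missing}")

def handle_drift(result, spec):
    if not result.missing:
        return  # converged: nothing to do for any tier
    action = spec["drift_action"]
    if action in ("alert_only", "alert_only_permanent"):
        emit_alert(result)  # permanent dashboard signal; no auto-fix
    elif action == "machine_sync":
        emit_alert(result)
        _apply_machine_sync(spec, result.missing)  # dry-run gated
    elif action == "block_until_human":
        emit_alert(result)
        raise AuditBlocked(result.spec_id)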
How This Differs from OPA / Kyverno
Dimension OPA / Kyverno OpenClaw Convergence Framework
Subject "Is the request compliant?" "Does declared state actually exist at runtime?"
Input request body declared spec + runtime observation
Output allow/deny/mutate 4-tier drift_action signal + auto-sync
Deployment sidecar / admission webhook governance audit cron + helper subprocess
Risk rejecting wrong requests wrong syncs can corrupt runtime state
Safety net rule simulation / shadow mode drift_action 4 tiers + dry-run env (Plan B gradient)
OPA is a gatekeeper on the request path. Convergence Framework is a sync engine for declared state. They aren't substitutes — they're two complementary pillars of a control plane. A complete control plane should have both.
V3 Roadmap: pip install ontology-engine
V37.9.19 → V37.9.24 worked internally for OpenClaw. The next step is upgrading this from "governance code for this project" to "a generic framework anyone can adopt":
pip install ontology-engine

from ontology_engine.convergence import verify_convergence, ConvergenceResult
from ontology_engine.governance import run_invariants

# Users write their own yaml
result = verify_convergence("my_custom_spec",
                            path="my_project/convergence_ontology.yaml")
This is the core deliverable of the V3 roadmap's "let others extend it" goal. OpenClaw's 11-day evolution is the engineering evidence: framework extensibility has been validated by 4 specs, 2 apply patterns, and multiple extension granularities (full triplet / mid-extension shared helper / minimal single extractor / named-dispatch refactor).
Five Actionable Principles
If you're building a similar control plane:
"Declare → Decide" isn't enough — you must have a declare → runtime fact sync mechanism.
drift_action needs at least 4 tiers — alert_only / alert_only_permanent / machine_sync / block_until_human. Each tier corresponds to a different engineering commitment.
machine_sync requires a dry-run safety net — env-var controlled, default safe. The Plan B gradient lets operators verify cron line construction before activating it for real.
named-dispatch is more extensible than if-else — new spec types / new apply patterns only need new dict entries, no framework changes.
The framework must integrate into the main audit flow — being called only in tests ≠ production consumption. Every audit cron must actively run verify_convergence.
One-Sentence Summary
Your control plane isn't just a policy engine — it's a convergence engine. The gap between declared state and runtime state should be closed by machines, not by human memory.
V37.9.18 lesson: memory is the weakest reliability primitive. V37.9.24 reply: replace memory with a framework.
References
ontology/convergence.py — Convergence Framework V37.9.19 ~ V37.9.24
ontology/convergence_ontology.yaml — 4 spec declarations
ontology/governance_ontology.yaml — INV-CONVERGENCE-* 5 invariants + MR-17
ontology/docs/cases/kb_deep_dive_cron_unregistered_case.md — V37.9.18 incident
audit_is_regression_not_prevention.md — companion position article
why_control_plane.md — project-level control plane narrative