2026-05-04 | OpenClaw Runtime Control Plane V37.9.24 | Stage 2 Position Article #5
TL;DR
Over 11 days, I built into a production Agent Runtime one capability that most control plane frameworks lack: automatic synchronization from declared state to runtime state.
Declared State                          Runtime State
(jobs_registry.yaml)                    (macOS crontab -l)
         │                                      │
         │ ───────[ verify_convergence ]─────── │
         │                                      │
         └────[ machine_sync_via_helper ]───────┘
                  (V37.9.24 Plan B dry-run)
11 days ago, this sync chain depended on "Claude Code remembering to run crontab_safe.sh add after each commit." Today, the framework automatically detects drift on every governance audit cron, automatically generates 36 cron lines, and automatically syncs them into crontab via crontab_safe.sh add.
Memory is the weakest reliability primitive. This article explains why a "declare → decide" policy engine isn't enough and why a control plane must be upgraded to a convergence engine, with OpenClaw's engineering proof from walking this path across six versions (V37.9.19 → V37.9.24).
If you're building an Agent Runtime, internal platform, or tool governance system, this should save you several months of iteration.
First Illusion: Control Plane = Policy Engine
The mainstream "control plane" narrative is roughly:
Declare your policy → System evaluates at request time → Allow or deny.
OPA (Open Policy Agent) / Cedar / Casbin / Kyverno all follow this paradigm. So do Kubernetes admission controllers. They solve:
input (request) ──[policy]──→ decision (allow / deny / mutate)
Elegant. But they don't solve one thing: what happens when your declared state diverges from the system's actual runtime state?
Example: you declare 36 cron jobs, each with entry / interval / log. But the macOS crontab might be missing one, have an extra, or have drifted to the wrong interval. OPA helps you "judge whether the current state is compliant," but after the judgment, who does the syncing? The answer is always: someone remembers to run a command.
OPA Style                      OpenClaw Pre-V37.9.18
──────────────                 ──────────────────────────
Declare → Eval → Decide        Declare → Eval → Alert → Wait
                                                          ↑
                               Memory = Weakest Reliability Primitive
Second Illusion: More Audit Rules Make Systems Stable
In an earlier article, I quantified this: across 45 days with 53 governance invariants + 15 meta-rules, audit's prevention rate for unknown dimensions was 0%.
The numbers are brutal, but the meaning is clear: audit can't prevent failures that haven't happened yet—it can only ensure failures that have already happened don't recur.
V37.9.18 demonstrated this principle the hard way:
The kb_deep_dive job launched in V37.9.16 with enabled=true declared in jobs_registry, but nobody manually ran crontab_safe.sh add. Two expected 22:30 triggers never fired; the user only noticed 48 hours later.
After root-causing this, I established MR-17:
declared-state-must-converge-to-runtime-via-machine-not-memory
Every declared resource (yaml/registry/config) must have a corresponding runtime fact (cron/process/http/filesystem). Drift detection must be upgraded from "humans remembering to run commands after commits" to "machines periodically detecting + syncing automatically."
This rule rewrote the boundary of what a control plane is: a control plane is no longer just a policy engine. It must include a convergence engine — the actual sync mechanism for declared → runtime, not just an evaluation mechanism.
Three Engineering Proofs: Convergence Framework's 11-Day Evolution
V37.9.19 → V37.9.24 spans six versions, each doing one thing:
V37.9.19 — Framework Bootstrap + First Spec
ontology/convergence.py introduces the ConvergenceResult namedtuple + verify_convergence(spec_id) top-level API + named-dispatch tables (extractors / observers / parsers). Decoupled from ONTOLOGY_MODE: convergence is governance-layer observability, not request-path enforcement.
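To make that shape concrete, here is a minimal sketch of such a named-dispatch core. The table wiring, spec field names, and ConvergenceResult fields are assumptions based on the description above, not OpenClaw's actual code:

# Minimal sketch, NOT the real convergence.py: field names and table layout
# are illustrative assumptions.
from collections import namedtuple

ConvergenceResult = namedtuple(
    "ConvergenceResult",
    ["spec_id", "declared", "observed", "missing", "drift_action"],
)

_DECLARED_EXTRACTORS = {}  # name -> fn(spec) -> set of declared identifiers
_RUNTIME_OBSERVERS = {}    # name -> fn(spec) -> raw runtime output (str)
_IDENTIFIER_PARSERS = {}   # name -> fn(raw, declared) -> set of observed ids

def verify_convergence(spec_id, specs):
    """Compare one spec's declared identifiers against a runtime observation."""
    spec = specs[spec_id]
    declared = _DECLARED_EXTRACTORS[spec["declaration"]["extractor"]](spec)
    raw = _RUNTIME_OBSERVERS[spec["runtime_observable"]["method"]](spec)
    observed = _IDENTIFIER_PARSERS[spec["runtime_observable"]["parser"]](raw, declared)
    return ConvergenceResult(
        spec_id, len(declared), len(observed),
        sorted(declared - observed), spec["drift_action"],
    )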
The first spec: jobs_to_crontab (drift_action: alert_only — cautious start due to high blast radius).
- id: jobs_to_crontab
  declaration:
    source: jobs_registry.yaml
    extractor: registry_enabled_system_jobs
  runtime_observable:
    method: shell_command
    command: "crontab -l"
    parser: line_contains_identifier
  drift_action: alert_only   # V37.9.19 — alert-only during one-week observation

V37.9.20 — Extensibility Proof (First Named-Dispatch Proof)
Added a providers_to_adapter spec — providers.py ProviderRegistry.list_names() vs the adapter :5001/health fallback_chain. Core framework changes = 0 lines. All extensions went through new entries in the named-dispatch tables:
_DECLARED_EXTRACTORS["providers_from_registry"] = _extract_providers_from_registry
_RUNTIME_OBSERVERS["http_endpoint"] = _observe_http_endpoint
_IDENTIFIER_PARSERS["json_set_union"] = _parse_json_set_union
This proved that the framework's promise of "adding new spec types requires zero framework changes" wasn't a hollow claim.
V37.9.22 — Cross-Granularity Extensions + Integration into Main Audit
Third spec: openclaw_config_to_runtime (mid-extension path: extracted _walk_json_paths_to_set shared helper). Fourth spec: kb_sources_to_index (minimal extension: only one new extractor, reusing V37.9.19's observer + parser).
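For flavor, here is a sketch of what a shared path-walking helper like that could look like; the dotted-path syntax and return shape are my assumptions, not the project's actual signature:

# Hypothetical helper: walk dotted JSON paths and collect leaf values as a set.
def _walk_json_paths_to_set(obj, paths):
    out = set()
    for path in paths:
        node = obj
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = None
                break
        if node is not None:
            out.add(str(node))
    return out

# Example: collect declared gateway settings from a parsed config dict.
# _walk_json_paths_to_set(config, ["gateway.port", "gateway.mode"])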
The final step: integrate the framework into the main governance audit flow:
# governance_checker.py main flow
results = run_invariants()
discovery = run_meta_discovery()
convergence = run_convergence_specs()   # ← Added in V37.9.22
The framework was upgraded from "indirectly invoked by INV runtime checks" to "actively consumed on every audit cron."
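A plausible shape for that entry point, reusing the verify_convergence sketch from above and assuming the spec file layout shown in V37.9.19 (the loader details are guesses, not the audited code):

import yaml  # PyYAML

def run_convergence_specs(path="ontology/convergence_ontology.yaml"):
    # Consumed on every governance audit cron since V37.9.22.
    with open(path) as f:
        specs = {s["id"]: s for s in yaml.safe_load(f)}
    return [verify_convergence(spec_id, specs) for spec_id in specs]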
V37.9.23 — Plan B Gradual Dry-Run + Real Sync Path
The May 3rd decision window arrived (V37.9.19 baseline + 7 days of observation). One week of production data: declared=36, observed=36, zero drift, zero false positives. jobs_to_crontab was upgraded from drift_action: alert_only to machine_sync.
Three pieces were introduced: _format_cron_line(job), a pure function that emits cron lines matching the V37.9.18 INV-CRON-003 pattern and rejects shell metacharacters as defense-in-depth; the _apply_machine_sync(spec, missing, dry_run) orchestrator, which calls crontab_safe.sh add for the real sync; and the _is_dry_run() env reader.
drift_action: machine_sync              # V37.9.23 escalation
convergence_method:
  implemented: machine_sync_via_helper  # Replaces V37.9.19's planned
  helper: "bash $HOME/crontab_safe.sh add"
  dry_run_env_var: CONVERGENCE_DRY_RUN
  dry_run_default: true                 # Safety net: V37.9.24+ flips it off
The key to Plan B (gradual dry-run): drift_action upgrade + default dry-run env control. Operators see the literal apply[dry-run]=36 in governance audit output to verify cron line construction is correct, then in V37.9.24+ flip the env to actually activate it. This mirrors the "shadow → on" pattern from V37.9.13's P2 context evaluator, applied at the convergence layer.
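A minimal sketch of those two safety pieces, assuming the CONVERGENCE_DRY_RUN variable from the spec above; the job field names and the metacharacter set are illustrative guesses, not OpenClaw's actual code:

import os
import re

def _is_dry_run(default=True):
    # dry_run_default stays true until an operator flips CONVERGENCE_DRY_RUN off.
    raw = os.environ.get("CONVERGENCE_DRY_RUN")
    if raw is None:
        return default
    return raw.strip().lower() not in ("0", "false", "no")

_SHELL_META = re.compile(r"[;&|`$<>]")

def _format_cron_line(job):
    # Pure function: reject shell metacharacters before emitting a cron line
    # (defense-in-depth against a poisoned registry entry).
    if _SHELL_META.search(job["entry"]):
        raise ValueError(f"shell metacharacter in entry: {job['entry']!r}")
    return f'{job["interval"]} {job["entry"]} >> {job["log"]} 2>&1'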
V37.9.24 — Named-Dispatch for Apply Functions + Second machine_sync Spec
We observed that kb_sources_to_index had a fundamentally different apply pattern from jobs_to_crontab:
Dimension          jobs_to_crontab     kb_sources_to_index
Helper             crontab_safe.sh     kb_embed.py
Pattern            per-entry call      one-shot incremental
Startup overhead   <100ms              ~3s (load embedding model)
Input              single cron line    entire KB (mtime diff)
Making V37.9.23's _apply_machine_sync support both patterns simultaneously would have meant if-else dispatch and hardcoded spec_ids, violating V37.9.20's named-dispatch design principle.
V37.9.24 refactored _apply_machine_sync into a top-level dispatcher that routes by the spec yaml's convergence_method.apply_function field:
_APPLY_FUNCTIONS = {
    "jobs_to_crontab_per_entry": _apply_jobs_to_crontab_per_entry,
    "kb_embed_incremental": _apply_kb_embed_incremental,
}

def _apply_machine_sync(spec, missing_entries, dry_run=None):
    # Route by the spec yaml's convergence_method.apply_function field.
    method = spec.get("convergence_method") or {}
    fn_name = method.get("apply_function") or ""
    fn = _APPLY_FUNCTIONS.get(fn_name)
    if fn is None:
        raise KeyError(f"unknown apply_function: {fn_name!r}")
    return fn(spec, missing_entries, dry_run)
Adding machine_sync for kb_sources_to_index requires only three steps:
1. Implement _apply_kb_embed_incremental (a one-shot single subprocess call).
2. Register it in the _APPLY_FUNCTIONS dict.
3. Add apply_function: kb_embed_incremental to the spec yaml.
The _apply_machine_sync top-level dispatcher: zero changes.
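As a sketch of what the one-shot pattern could look like, reusing the _is_dry_run sketch from earlier (the kb_embed.py flag and the return shape here are assumptions, not the real helper interface):

import os
import subprocess

def _apply_kb_embed_incremental(spec, missing_entries, dry_run=None):
    dry_run = _is_dry_run() if dry_run is None else dry_run
    if dry_run:
        # One summary line, not per-entry counts: shows up as apply[dry-run]=1.
        return {"applied": 0, "dry_run": 1, "errors": 0}
    # A single subprocess call re-embeds whatever the mtime diff flags; the ~3s
    # model-load cost is paid once, not once per missing entry.
    rc = subprocess.run(
        ["python3", os.path.expanduser("~/kb_embed.py"), "--incremental"]
    ).returncode
    return {"applied": 1, "dry_run": 0, "errors": int(rc != 0)}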
Production Evidence: governance audit Output
Running python3 ontology/governance_checker.py on the production Mac Mini, the convergence section shows:
──────────────────────────────────────────────────────────────────────
CONVERGENCE FRAMEWORK (Phase 4 Layer 5) — 4 spec(s)
──────────────────────────────────────────────────────────────────────
✅ [jobs_to_crontab] — declared=36 observed=36 (no drift)
⚠️ [providers_to_adapter] — declared=7 observed=2 missing=5 (drift_action=alert_only)
⚠️ [openclaw_config_to_runtime] — declared=1 observed=1 (no drift)
⚠️ [kb_sources_to_index] — declared=14 observed=11 missing=3 (drift_action=machine_sync) apply[dry-run]=1 apply_errors=0
Four specs, three drift_action variants:
- jobs_to_crontab (machine_sync, real sync) — zero drift, no apply needed
- kb_sources_to_index (machine_sync, real sync) — 3 missing, one dry-run one-shot summary line
- providers_to_adapter (alert_only_permanent) — 5 providers missing API keys; the framework can't magically provision keys; this is an operator decision
- openclaw_config_to_runtime (alert_only_permanent) — Gateway runtime state changes are intentional operator actions
The framework knows each spec's apply path is different → routes via named-dispatch → emits observable logs.
Third Insight: drift_action Is 4-Tier, Not 1-Tier
Mainstream policy engines have only "allow/deny" or "warn"-tier behaviors. OpenClaw's convergence framework explicitly splits drift_action into 4 tiers:
drift_action          Meaning                                                   Typical spec
alert_only            Emits alert only; operator decides how to fix             cautious bootstrap mode for new specs
alert_only_permanent  Structural decision — framework can never magically       providers_to_adapter / openclaw_config_to_runtime
                      fix API keys / Gateway state
machine_sync          Framework auto-syncs declared → runtime                   jobs_to_crontab / kb_sources_to_index
block_until_human     Drift blocks subsequent audits until human confirmation   security-sensitive specs
Each tier corresponds to a different engineering commitment. Seeing a spec marked alert_only_permanent, an operator knows: "I shouldn't wait for the framework to fix this — it's a permanent dashboard signal I monitor." Seeing machine_sync + dry_run_default: true, an operator knows: "I should flip dry-run off in a week, otherwise the framework won't actually do anything."
The existence of drift_action turns declared → runtime sync from a binary decision into a gradient.
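A hedged sketch of what consuming that gradient could look like in the audit flow; emit_alert, AuditBlocked, and the tier routing are illustrative, built on the ConvergenceResult and _apply_machine_sync sketches above, not OpenClaw's code:

class AuditBlocked(RuntimeError):
    """Raised when a block_until_human spec drifts: halts subsequent audits."""

def emit_alert(result):
    print(f"⚠️ [{result.spec_id}] missing={result.missing}")

def handle_drift(result, spec):
    if not result.missing:
        return  # converged: nothing to do for any tier
    action = spec["drift_action"]
    if action in ("alert_only", "alert_only_permanent"):
        emit_alert(result)  # permanent dashboard signal; no auto-fix
    elif action == "machine_sync":
        emit_alert(result)
        _apply_machine_sync(spec, result.missing)  # dry-run gated
    elif action == "block_until_human":
        emit_alert(result)
        raise AuditBlocked(result.spec_id)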
How This Differs from OPA / Kyverno
Dimension OPA / Kyverno OpenClaw Convergence Framework
Subject "Is the request compliant?" "Does declared state actually exist at runtime?"
Input request body declared spec + runtime observation
Output allow/deny/mutate 4-tier drift_action signal + auto-sync
Deployment sidecar / admission webhook governance audit cron + helper subprocess
Risk rejecting wrong requests wrong syncs can corrupt runtime state
Safety net rule simulation / shadow mode drift_action 4 tiers + dry-run env (Plan B gradient)
OPA is a gatekeeper on the request path. Convergence Framework is a sync engine for declared state. They aren't substitutes — they're two complementary pillars of a control plane. A complete control plane should have both.
V3 Roadmap: pip install ontology-engine
V37.9.19 → V37.9.24 worked internally for OpenClaw. The next step is upgrading this from "governance code for this project" to "a generic framework anyone can adopt":
pip install ontology-engine

from ontology_engine.convergence import verify_convergence, ConvergenceResult
from ontology_engine.governance import run_invariants

# Users write their own yaml
result = verify_convergence("my_custom_spec",
                            path="my_project/convergence_ontology.yaml")
This is the core deliverable of the V3 roadmap's "let others extend it" goal. OpenClaw's 11-day evolution is the engineering evidence: framework extensibility has been validated by 4 specs, 2 apply patterns, and multiple extension granularities (full triplet / mid-extension shared helper / minimal single extractor / named-dispatch refactor).
Five Actionable Principles
If you're building a similar control plane:
"Declare → Decide" isn't enough — you must have a declare → runtime fact sync mechanism.
drift_action needs at least 4 tiers — alert_only / alert_only_permanent / machine_sync / block_until_human. Each tier corresponds to a different engineering commitment.
machine_sync requires a dry-run safety net — env-var controlled, default safe. The Plan B gradient lets operators verify cron line construction before activating it for real.
named-dispatch is more extensible than if-else — new spec types / new apply patterns only need new dict entries, no framework changes.
The framework must integrate into the main audit flow — being called only in tests ≠ production consumption. Every audit cron must actively run verify_convergence.
One-Sentence Summary
Your control plane isn't just a policy engine — it's a convergence engine. The gap between declared state and runtime state should be closed by machines, not by human memory.
V37.9.18 lesson: memory is the weakest reliability primitive. V37.9.24 reply: replace memory with a framework.
References
ontology/convergence.py — Convergence Framework V37.9.19 ~ V37.9.24
ontology/convergence_ontology.yaml — 4 spec declarations
ontology/governance_ontology.yaml — INV-CONVERGENCE-* 5 invariants + MR-17
ontology/docs/cases/kb_deep_dive_cron_unregistered_case.md — V37.9.18 incident
audit_is_regression_not_prevention.md — companion position article
why_control_plane.md — project-level control plane narrative