Most multi-agent systems are nondeterministic by default. Agents negotiate their own workflows, spawn each other ad hoc, and pass free-text reasoning chains around. After running a fleet of AI agents in production — and watching the same PR diff produce three different fixes in three runs — I started designing the orchestration layer I wish I'd had from day one. This article proposes an architecture designed to make every workflow run replayable, every routing decision auditable, and every agent loop explicitly bounded. It's a design I'm actively evolving — not a finished product.
The Problem: LLM-Driven Control Flow
The default story is more nuanced than "everything is chaos." LangGraph defines static graphs in code — routing is explicit Python functions, with a configurable recursion limit (25 in older versions, 10,000 in LangGraph 1.x). CrewAI runs tasks sequentially with a 25-iteration cap per agent. AutoGen defaults to round-robin, though with no loop bound by default (the real footgun).
But look at what happens in practice: tutorials showcase SelectorGroupChat (AutoGen), Process.hierarchical (CrewAI), and ReAct tool loops (LangGraph) — patterns where the LLM decides what happens next. The defaults may be safe, but the encouraged usage patterns are not. And even with bounded loops, within each agent turn the LLM still autonomously decides which tools to call, when to stop, and what to pass along. The result:
- Non-reproducible — same inputs, different execution paths. The orchestration structure might be fixed, but the LLM-driven inner loops make each run unique. Hard to debug, impossible to regression-test.
- Opaque routing — even when routing is code-defined, the LLM's tool-calling decisions inside each node create stochastic side effects that propagate through the graph.
- Unbounded by default — AutoGen has no loop cap unless you set one. CrewAI caps at 25 iterations per agent, and LangGraph's recursion limit (now 10,000 in 1.x) is generous enough to produce surprise bills.
- No inter-agent validation by default — agents pass messages to each other without schema enforcement. One agent's hallucination becomes another's input.
The fix isn't removing agents. It's removing nondeterminism from the orchestration layer while keeping it where it belongs — inside each agent's reasoning.
Core Thesis
The core design principle: the orchestration graph is code, the agents are LLMs. Keep them separate.
In this design, the orchestrator is a state machine with explicit transitions, bounded loops, and typed message contracts. Routing is intended to be purely deterministic code — no LLM deciding which agent runs next. Quality gates can optionally use LLM judges (e.g., "is this code review good enough?"), but they're agents like any other — isolated containers with typed inputs and outputs. The orchestrator only sees their boolean verdict, never their reasoning. Agents don't know the graph exists.
Important distinction: I'm not claiming LLM outputs are deterministic — they're stochastic by nature. What's deterministic is the control flow: given the same agent outputs, the orchestrator would make the same routing decisions every time. The goal is that you can replay any workflow run from the Kafka log and verify the exact same path was taken.
Design Goals
- Replayable — every workflow run can be replayed from recorded messages
- Auditable — every routing decision is a pure predicate you can inspect
- Bounded — loops have convergence detection, quality thresholds, and hard ceilings
- Testable — routing logic is unit-testable, schemas are contract-testable, full runs are replay-testable
- Provider-agnostic — swap LLM providers per agent without touching orchestration
- Zero-trust — agents have no credentials, no network, no knowledge of each other
Architecture Overview
┌──────────────────────────────────────────────────┐
│ Git Repository │
│ workflows/*.yaml agents/*.yaml schemas/ │
└──────────────────────┬───────────────────────────┘
│ deploy
┌──────────────────────▼───────────────────────────┐
│ Kafka-Based Orchestrator │
│ │
│ ┌─────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Graph │ │ State │ │ Budget / │ │
│ │ Engine │ │ Store │ │ Loop Guard │ │
│ └─────────┘ └──────────┘ └──────────────┘ │
│ │
│ Reads from / writes to Kafka topics │
└──────────────────────┬───────────────────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Agent A │ │ Agent B │ │ Agent C │
│┌───────┐ │ │┌───────┐ │ │┌───────┐ │
││Sidecar│ │ ││Sidecar│ │ ││Sidecar│ │
│└───────┘ │ │└───────┘ │ │└───────┘ │
│ Docker │ │ Docker │ │ Docker │
│net: none │ │net: none │ │net: none │
└──────────┘ └──────────┘ └──────────┘
Layer 1: Git-Stored Workflow Definitions
Every workflow is a YAML file in a Git repository. No UI, no database — Git is the source of truth. Versioning, diffs, PRs for workflow changes, and audit trail come from Git itself.
Inspired by GitAgent's SkillsFlow format, but with explicit loop semantics:
# workflows/code-review.yaml
name: code-review
version: "1.0"
description: Automated code review with iterative feedback
inputs:
pr_diff:
type: string
required: true
repo_url:
type: string
required: true
agents:
- id: analyzer
image: agents/code-analyzer:latest
runtime: openai-api-compatible # provider-agnostic
input_schema: schemas/analyzer-input.json
output_schema: schemas/analyzer-output.json
- id: security-checker
image: agents/security-check:latest
runtime: openai-api-compatible
input_schema: schemas/security-input.json
output_schema: schemas/security-output.json
- id: code-fixer
image: agents/code-fixer:latest
runtime: claude-agent-sdk # needs file access
input_schema: schemas/fixer-input.json
output_schema: schemas/fixer-output.json
- id: quality-gate
image: agents/quality-validator:latest
runtime: deterministic # no LLM — pure code
input_schema: schemas/quality-input.json
output_schema: schemas/quality-output.json
edges:
- from: analyzer
to: security-checker
- from: security-checker
to: code-fixer
- from: code-fixer
to: quality-gate
# The loop: quality gate can send back to code-fixer
- from: quality-gate
to: code-fixer
condition:
field: output.passed
equals: false
loop:
exit_conditions:
- field: output.quality_score
convergence:
delta: 0.05 # exit if |score[n] - score[n-1]| < 0.05
window: 1 # compare last 1 iteration (use 2+ for moving average)
- field: output.quality_score
gte: 0.9 # exit if score crosses threshold
max_iterations: 5 # hard ceiling — the last resort, not the strategy
on_exhaustion: escalate # or: fail, skip, human-review
# $output is a reserved sink — the workflow's final result
- from: quality-gate
to: $output
condition:
field: output.passed
equals: true
budget:
max_total_tokens: 500000
max_cost_usd: 5.00
max_wall_time: 600s
Why YAML in Git?
- Diffable — `git diff` shows exactly what changed in a workflow
- Reviewable — workflow changes go through PRs, just like code
- Branchable — test workflow changes in a branch before deploying
- Rollbackable — `git revert` undoes a broken workflow change
- No vendor lock-in — it's files in a repo, not entries in a SaaS database
What About Loops?
This is a directed graph with cycles — not a DAG. The key constraint: every cycle must have an explicit exit condition and a maximum iteration count. The graph definition is static; only the traversal path depends on runtime data.
Loops have multiple exit conditions — convergence detection, quality thresholds, budget ceilings — and a hard max_iterations as the last resort, not the primary strategy. (I wrote a whole article about why a simple iteration counter is not enough.) Think of it as:
while (!converged && !qualityMet && iteration < maxIterations && !budgetExceeded) {
  run(codeFixer);
  run(qualityGate);
}
The orchestrator enforces all of these. The agents don't even know they're in a loop.
Layer 2: Kafka as the Validation Bus
The orchestrator reads the YAML graph and creates Kafka topic topologies — one input/output pair per agent. From this point, the YAML is compiled into a running system.
Agents never communicate directly. Every message flows through Kafka and the orchestrator, which validates schema compliance, strips reasoning chain contamination, and routes based on typed output fields — not free-text. This is the orchestrator-as-translator pattern baked into the infrastructure.
Why this matters: when agents pass raw reasoning chains to each other, hallucinations propagate and compound. Agent B trusts Agent A's confident-sounding nonsense and builds on it. The orchestrator breaks this chain by enforcing structured summary packets — typed schemas with explicit fields, not prose.
Topic Topology
Each agent gets an input topic and an output topic:
workflow.code-review.analyzer.input
workflow.code-review.analyzer.output
workflow.code-review.security-checker.input
workflow.code-review.security-checker.output
...
The orchestrator consumes output topics, validates every message against its registered schema, and only then produces to the next agent's input topic. Invalid messages are rejected and routed to a dead letter topic — they never reach downstream agents.
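As a sketch, the consume-validate-produce hop reduces to two pure helpers. The names and shapes here are assumptions for illustration; the actual KafkaJS consumer/producer wiring is omitted:

```typescript
// Derive the topic name from the workflow/agent naming convention above.
function topicFor(workflow: string, agent: string, dir: "input" | "output"): string {
  return `workflow.${workflow}.${agent}.${dir}`;
}

// Decide where a validated (or rejected) message goes next.
function routeValidated(
  workflow: string,
  nextAgent: string,
  message: unknown,
  isValid: (m: unknown) => boolean,
): { topic: string; payload: unknown } {
  // Invalid messages never reach downstream agents: they land on a dead letter topic.
  if (!isValid(message)) {
    return { topic: `workflow.${workflow}.deadletter`, payload: message };
  }
  return { topic: topicFor(workflow, nextAgent, "input"), payload: message };
}
```

The dead-letter topic name is a hypothetical convention; the point is that the forwarding decision is a pure function of the validation result.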
Schema Registry as the Trust Boundary
Every message has a schema. At the simplest level, this is JSON Schema validated with Zod at the orchestrator — fast to iterate, familiar to TypeScript developers. But for production at scale, you can level up to Avro or Protobuf schemas in a Schema Registry (or Apicurio for fully open-source). The registry gives you schema evolution rules (backward/forward compatibility), binary serialization (smaller messages), and compile-time type generation — things JSON Schema can't do.
This solves two problems at once:
- Schema drift — if an agent's output structure changes, the registry catches it before downstream agents see garbage
- Reasoning chain contamination — agents can't smuggle free-text reasoning into typed fields. The schema enforces structured summary packets: explicit findings, scores, and decisions — not "here's my thought process"
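A minimal validator makes the contamination guard concrete. This is a hand-rolled sketch standing in for Zod or a registry client; the type and field names are illustrative, not a fixed contract:

```typescript
// Hypothetical quality-gate output contract.
type GateOutput = { passed: boolean; quality_score: number; findings: string[] };

const ALLOWED_KEYS = new Set(["passed", "quality_score", "findings"]);

function validateGateOutput(
  raw: unknown,
): { ok: true; value: GateOutput } | { ok: false; reason: string } {
  if (typeof raw !== "object" || raw === null) return { ok: false, reason: "not an object" };
  const msg = raw as Record<string, unknown>;
  // Reject unknown fields: this is what blocks reasoning-chain contamination.
  // An agent can't smuggle a free-text "thoughts" field past the schema.
  for (const key of Object.keys(msg)) {
    if (!ALLOWED_KEYS.has(key)) return { ok: false, reason: `unexpected field: ${key}` };
  }
  if (typeof msg.passed !== "boolean") return { ok: false, reason: "passed must be boolean" };
  if (typeof msg.quality_score !== "number" || msg.quality_score < 0 || msg.quality_score > 1)
    return { ok: false, reason: "quality_score must be in [0,1]" };
  if (!Array.isArray(msg.findings) || !msg.findings.every((f) => typeof f === "string"))
    return { ok: false, reason: "findings must be string[]" };
  return {
    ok: true,
    value: { passed: msg.passed, quality_score: msg.quality_score, findings: msg.findings as string[] },
  };
}
```

With Zod, the equivalent would be a `.strict()` object schema; the registry approach adds evolution rules on top of the same gate.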
Why Kafka and Not HTTP/gRPC?
In my earlier architecture, agents communicated via HTTP through a sidecar proxy — request-response to downstream services. That works for service queries, but for workflow orchestration you need replay, ordering, and backpressure. Kafka gives you all three as infrastructure primitives:
| Feature | HTTP/gRPC | Kafka |
|---|---|---|
| Replay | Build it yourself | Built-in (consumer offsets) |
| Audit log | Build it yourself | The log IS the audit |
| Backpressure | Build it yourself | Consumer pause/resume, broker quotas |
| Validation | Per-handler | Centralized — orchestrator validates every message |
| Decoupling | Tight | Total — agents don't know each other exist |
| Ordering | Per-request | Per-partition guarantee |
| Persistence | Ephemeral | Configurable retention |
The feature I'm most excited about: deterministic replay. Given a workflow run ID, you could replay every recorded message from the Kafka log and verify the orchestrator made the same routing decisions. Replay would work from stored outputs, not by re-invoking the LLMs — the log is the source of truth. Note: replay uses event-time from the log, not wall-clock. Runtime-only guards like max_wall_time would be enforced during live execution but excluded from replay verification.
And if the orchestrator crashes mid-workflow? Kafka doesn't care. Consumer offsets track where each agent left off. On restart, the orchestrator would resume from the last committed offset — no lost messages, no duplicate processing (with idempotent producers and transactional offset commits), no expensive LLM re-calls.
Layer 3: The Kafka-Based Orchestrator
This is the novel piece — a TypeScript application built on KafkaJS that executes workflow graphs. It implements the state-management patterns from Kafka Streams (changelog-backed state stores, partition-local state) in userland — you don't get Kafka Streams' built-in exactly-once state/offset atomicity for free, but you get the architectural benefits with a stack that stays in the TypeScript ecosystem.
State Store
The orchestrator maintains state per workflow run in a changelog-backed state store (conceptually similar to Kafka Streams state stores, implemented as a local store with a Kafka changelog topic for recovery):
{
"run_id": "abc-123",
"workflow": "code-review",
"status": "running",
"current_node": "quality-gate",
"loop_counters": {
"quality-gate→code-fixer": 2
},
"budget": {
"tokens_used": 142000,
"cost_usd": 1.87,
"started_at": "2026-03-26T10:00:00Z"
},
"node_outputs": {
"analyzer": { "ref": "topic:offset:42" },
"security-checker": { "ref": "topic:offset:43" }
}
}
Event-Sourced State Machine
Borrowing from the ESAA paper (Event Sourcing for Autonomous Agents — a pattern for separating agent intentions from state mutations): agents don't mutate state directly. They emit structured intentions — the orchestrator validates and applies effects.
Agent output: { "action": "approve", "findings": [...], "confidence": 0.92 }
Orchestrator: validates schema → checks budget → evaluates edge conditions → routes to next node
The agent never says "send this to agent X" or "run this again." It produces typed output. The orchestrator validates the schema, strips any free-text reasoning that leaked outside designated fields, and routes to the next node. Agents are completely blind to the graph topology — they don't know who consumed their output or who produced their input. Intention/effect separation, enforced at the infrastructure level.
Bounded Loop Execution
// The orchestrator's routing logic — zero LLM calls
function route(runState: RunState, nodeOutput: NodeOutput): RoutingDecision {
  // Budget first: an over-budget run terminates even if no edge matches
  if (runState.budget.exceeded()) {
    return terminate(runState, "budget_exceeded");
  }
  for (const edge of workflow.edgesFrom(runState.currentNode)) {
    if (!evaluateCondition(edge.condition, nodeOutput)) continue;
    if (edge.loop) {
      const counter = runState.loopCounters[edge.id] ?? 0;
      const prevOutput = runState.previousOutputs[edge.id];
      // Smart exit conditions first — max_iterations is the last resort
      if (edge.loop.exitConditions) {
        if (checkConvergence(prevOutput, nodeOutput, edge.loop.exitConditions)) {
          return routeTo(edge.loop.convergenceTarget ?? "$output", nodeOutput);
        }
        if (checkThreshold(nodeOutput, edge.loop.exitConditions)) {
          return routeTo(edge.loop.convergenceTarget ?? "$output", nodeOutput);
        }
      }
      // Hard ceiling — the safety net, not the strategy
      if (counter >= edge.loop.maxIterations) {
        return handleExhaustion(edge.loop.onExhaustion);
      }
      runState.loopCounters[edge.id] = counter + 1;
      runState.previousOutputs[edge.id] = nodeOutput;
    }
    return routeTo(edge.target, nodeOutput);
  }
  return terminate(runState, "no_matching_edge");
}
Deterministic given the same inputs. The routing function evaluates convergence, thresholds, iteration counts, and budgets — all pure predicates, no AI.
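The predicates themselves are small. Here is a sketch of `evaluateCondition` and a convergence check matching the YAML's `field`/`equals`/`gte` and `delta` semantics; the condition shape is an assumption, not a finalized schema:

```typescript
// Edge conditions as declared in the workflow YAML.
type Condition = { field: string; equals?: unknown; gte?: number };

// Resolve "output.quality_score" against a node output object.
function getField(output: Record<string, unknown>, path: string): unknown {
  return path
    .split(".")
    .slice(1) // drop the leading "output" prefix
    .reduce<any>((o, k) => (o == null ? undefined : o[k]), output);
}

function evaluateCondition(
  cond: Condition | undefined,
  output: Record<string, unknown>,
): boolean {
  if (!cond) return true; // unconditional edge
  const value = getField(output, cond.field);
  if (cond.equals !== undefined) return value === cond.equals;
  if (cond.gte !== undefined) return typeof value === "number" && value >= cond.gte;
  return false;
}

// |score[n] - score[n-1]| < delta, as in the YAML's convergence block
function converged(prev: number | undefined, curr: number, delta: number): boolean {
  return prev !== undefined && Math.abs(curr - prev) < delta;
}
```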
Layer 4: LLM-Provider-Agnostic Agent Runtime
Each agent is a Docker container with a standard interface: consume from Kafka topic, produce to Kafka topic. What happens inside the container is the agent's business.
Two Runtime Types
1. OpenAI-API-compatible runtime — for analysis, classification, summarization:
Agent container
├── Kafka consumer (input topic)
├── OpenAI SDK client → sidecar proxy → any LLM provider
└── Kafka producer (output topic)
The OpenAI Node SDK talks to any provider that implements the OpenAI API format — just swap the baseURL. No wrapper libraries, no abstraction layers. The agent code doesn't know if it's talking to GPT-4, Groq, Together, or a local Ollama instance. Provider choice is a URL, not code:
const client = new OpenAI({
baseURL: process.env.LLM_BASE_URL, // set per agent container
apiKey: "PROXY:default", // sidecar injects real key
});
2. Claude Agent SDK runtime — for tasks that need file access:
Agent container
├── Kafka consumer (input topic)
├── Claude Agent SDK → sidecar proxy → Anthropic API
├── Workspace volume mount (read/write)
└── Kafka producer (output topic)
The Claude Agent SDK gives you a full autonomous agent with built-in file and shell operations — the same tool suite that powers Claude Code (file read/write/edit, shell execution, codebase search). It's Claude-only, but that lock-in is contained — it's one agent type in one container, not a system-wide dependency.
Why Two Runtimes?
File-access agents need tool use — browsing directories, editing code, running tests. The OpenAI function-calling API can technically do this, but you'd be reimplementing Claude Code's entire tool loop (file discovery, edit application, error recovery). The Agent SDK gives you that for free. The pragmatic choice: use the best tool for the job, isolate the dependency.
The key insight: the orchestrator doesn't care which runtime an agent uses. It only sees Kafka messages with typed schemas going in and out. Runtime choice is an implementation detail of each agent container — you could add a third runtime (Gemini, local Llama, a shell script) without changing the orchestration layer.
Escalation and Human-in-the-Loop
When a loop hits on_exhaustion: escalate, the orchestrator publishes to a special workflow.{name}.escalation topic. What happens next depends on your setup:
- GitHub/GitLab issue — a deterministic agent creates a ticket with the full run context (inputs, iteration history, why convergence failed)
- Slack/webhook notification — alert a human who can inspect the Kafka log and decide
- Human-as-agent — the human provides a typed decision event (approve/reject/override) via a simple UI that publishes back to Kafka. The orchestrator treats it like any other agent output — schema-validated, logged, replayable
The human doesn't break determinism because their decision is recorded as an event in the Kafka log. On replay, you see exactly what the human chose and when.
Layer 5: Zero-Trust Agent Sandboxing
Agents run with zero credentials and zero network access. Design as if the sandbox were the only layer: if it fails, nothing else stops the agent from exfiltrating your code.
The Sidecar Proxy Pattern
┌─────────────────────────────────────┐
│ Agent Pod / Compose │
│ │
│ ┌───────────────┐ ┌────────────┐ │
│ │ Agent │ │ Sidecar │ │
│ │ ─┼──┤ Proxy │ │
│ │ net: none │ │ │──┼──→ LLM APIs
│ │ no API keys │ │ holds │ │
│ │ ─┼──┤ secrets │──┼──→ Kafka
│ │ /workspace │ │ │ │
│ │ only │ │ allowlist │ │
│ └───────────────┘ └────────────┘ │
│ ↕ Unix socket │
└─────────────────────────────────────┘
How it works:
- The agent container starts with `--network none` — this creates an isolated network namespace with only a loopback interface. No external interfaces, no routes, no DNS resolution. (In my earlier architecture, agents had loopback access to a localhost proxy — that still leaves a TCP endpoint for potential exploits to target. `--network none` combined with the Unix socket pattern eliminates that attack surface: the agent has no TCP listener to connect to.)
- The only communication channel is a Unix domain socket (shared volume mount). The agent can talk to exactly one thing — the sidecar proxy. On Linux, Unix sockets support `SO_PEERCRED`, so the proxy can verify the UID/PID of the connecting process (agents here always run in Linux containers).
- When a workflow run starts, the orchestrator mints a short-lived JWT scoped to exactly the services this agent needs — RBAC-based, expiring in minutes. The token is injected into the sidecar's config, never into the agent's environment.
- The agent makes API calls with placeholder tokens (`Authorization: Bearer PROXY:anthropic`). The sidecar validates the JWT scope, substitutes real credentials, and forwards to the allowed domain.
- When the workflow run completes, the JWT expires. No standing credentials anywhere.
- Response masking strips any echoed credentials before returning to the agent.
This is the JIT credential model — no agent ever holds a real API key, and credentials exist only for the duration of a single workflow run.
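The substitution step itself is a few lines. This sketch uses the placeholder syntax from above; the key value and scope names are hypothetical stand-ins:

```typescript
// Sidecar-side credential substitution. The real key never reaches the agent;
// the placeholder is swapped only if the run's JWT is scoped for that provider.
// The key below is an illustrative fake, not a real secret.
const REAL_KEYS: Record<string, string> = { anthropic: "sk-fake-for-illustration" };

function substituteAuth(header: string, jwtScopes: string[]): string | null {
  const match = /^Bearer PROXY:(\w+)$/.exec(header);
  if (!match) return null; // not a placeholder: reject outright
  const provider = match[1];
  if (!jwtScopes.includes(provider) || !(provider in REAL_KEYS)) return null;
  return `Bearer ${REAL_KEYS[provider]}`; // forward with the real credential
}
```

Returning `null` here would translate to the proxy refusing the request; anything that isn't a scoped placeholder never leaves the pod.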
File system isolation:
- The agent sees only `/workspace` (a bind-mounted project directory)
- No access to `~/.ssh`, `~/.aws`, `/etc/passwd`, or the host filesystem
- Optional gVisor kernel-level isolation for defense against container escape
This pattern is already production-proven: Anthropic's own sandbox-runtime uses it for Claude Code, and the Kubernetes agent-sandbox SIG (1,500+ stars) standardizes it as a CRD.
For a deeper dive on agent security, see my earlier article on zero-trust security for AI agents.
Layer 6: Observability and Deterministic Replay
Correlation Through Kafka
Every message carries a run_id. Since all inter-agent communication flows through Kafka topics, you get a complete execution trace by reading the topics filtered by run_id.
run_id: abc-123
→ analyzer.input (offset 100, t=0ms)
→ analyzer.output (offset 101, t=3200ms, tokens=1200)
→ security.input (offset 50, t=3210ms)
→ security.output (offset 51, t=5100ms, tokens=800)
→ fixer.input (offset 30, t=5110ms)
→ fixer.output (offset 31, t=12000ms, tokens=3500)
→ gate.input (offset 20, t=12010ms)
→ gate.output (offset 21, t=12050ms, passed=false) ← loop back
→ fixer.input (offset 32, t=12060ms, iteration=2)
→ fixer.output (offset 33, t=19000ms, tokens=2800)
→ gate.input (offset 22, t=19010ms)
→ gate.output (offset 23, t=19040ms, passed=true) ← exit loop
Deterministic Replay
The orchestrator's routing decisions are deterministic given the same agent outputs. To verify:
- Read the original Kafka messages for a run
- Feed the agent outputs through the orchestrator's routing logic
- Verify the same routing decisions were made
Ordering caveat: this only works if all events for a given run are causally ordered. The simplest approach: partition all agent topics by run_id, so every message for a run lands on the same partition and Kafka's per-partition ordering guarantee does the rest. Without this, replay across partitions requires explicit sequence numbers or vector clocks — complexity you don't want.
The agent outputs themselves won't be identical (LLMs are stochastic), but the orchestration path is reproducible. This is the key distinction: we're not trying to make LLMs deterministic — we're making the system around them deterministic.
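Concretely, verification is a pure diff over the recorded log. The types and routing table below are hypothetical, mirroring the code-review workflow from Layer 1:

```typescript
// One recorded hop from the Kafka log for a single run.
type RecordedStep = { node: string; output: { passed: boolean }; routedTo: string };

// Stand-in for the orchestrator's deterministic routing predicate,
// transcribing the code-review graph's edges.
function routeFor(node: string, output: { passed: boolean }): string {
  if (node === "quality-gate") return output.passed ? "$output" : "code-fixer";
  const next: Record<string, string> = {
    analyzer: "security-checker",
    "security-checker": "code-fixer",
    "code-fixer": "quality-gate",
  };
  return next[node];
}

// Same outputs in, same routing decisions out — or the run is flagged.
function verifyReplay(log: RecordedStep[]): boolean {
  return log.every((step) => routeFor(step.node, step.output) === step.routedTo);
}
```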
Why Replayability Changes Everything
Replay isn't just for debugging — it unlocks capabilities you can't get from a non-reproducible system:
- Model comparison — replay the same workflow with GPT-4o vs. Claude Sonnet vs. Llama 3 in each agent slot. Same inputs, same graph, different models. Compare quality gate pass rates, token usage, and cost. Find the best quality/cost ratio per agent, not per system.
- Isolated agent testing — record real production messages from the Kafka log, use them as test fixtures. Swap out one agent, replay the run, compare outputs. You're testing agents against real data without running the whole pipeline live.
- Regression detection — after a model update or prompt change, replay the last 100 runs and diff the quality gate outcomes. Did pass rates change? Did convergence speed up or slow down?
- Cost optimization — replay with cheaper models and measure which agents can tolerate a downgrade without quality loss. The optimizer meta-workflow does this automatically.
- Root cause analysis — when a run produces bad output, replay it step by step. Inspect every inter-agent message. Find exactly where the quality degraded — which agent, which iteration, which input caused it.
- Compliance auditing — prove to stakeholders that a specific run followed the declared workflow, hit the quality gate, and stayed within budget. The Kafka log is the receipt.
None of this is possible when your orchestration is "the LLM decided."
Cost Tracking
The sidecar proxy logs token usage per request. The orchestrator aggregates:
Run abc-123:
analyzer: 1,200 tokens $0.02
security-checker: 800 tokens $0.01
code-fixer (×2): 6,300 tokens $0.12 ← ran twice (loop)
quality-gate: 0 tokens $0.00 ← deterministic, no LLM
Total: 8,300 tokens $0.15
Wall time: 19.04s
Loop iterations: 2/5 (quality-gate→code-fixer)
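Aggregating the sidecar's per-request usage records into a run report is a plain fold. The record shape here is an assumption about what the sidecar logs:

```typescript
// One usage record per proxied LLM request, as logged by the sidecar.
type Usage = { runId: string; agent: string; tokens: number; costUsd: number };

function aggregate(records: Usage[], runId: string) {
  const perAgent = new Map<string, { tokens: number; costUsd: number; calls: number }>();
  for (const r of records.filter((r) => r.runId === runId)) {
    const cur = perAgent.get(r.agent) ?? { tokens: 0, costUsd: 0, calls: 0 };
    perAgent.set(r.agent, {
      tokens: cur.tokens + r.tokens,
      costUsd: cur.costUsd + r.costUsd,
      calls: cur.calls + 1, // calls > 1 reveals loop iterations, e.g. code-fixer ×2
    });
  }
  return perAgent;
}
```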
Layer 7: Meta-Workflows — The System That Watches Itself
Here's where the design gets interesting. In my earlier architecture, I had a single meta-workflow that analyzed logs and staged PRs. That was a good start, but it was doing too many things at once. The natural evolution: split it into four specialized meta-workflows, each with a single responsibility — and run them on the exact same infrastructure as regular workflows: YAML definitions in Git, Kafka topics, bounded loops, sandboxed agents.
The orchestrator doesn't distinguish between a "regular" workflow processing code and a "meta" workflow analyzing execution logs. It's the same graph engine. The only difference is the input: agent outputs vs. system telemetry.
The Watchdog
A real-time anomaly detector that subscribes to execution log topics:
# workflows/meta/watchdog.yaml
name: watchdog
type: meta
trigger: continuous # always running, consuming the log stream
agents:
- id: anomaly-detector
runtime: openai-api-compatible
subscribe: workflow.*.*.output # all agent outputs, all workflows
- id: kill-switch
runtime: deterministic # no LLM — pure code
capabilities:
- publish: workflow.*.control # can send halt signals
edges:
- from: anomaly-detector
to: kill-switch
condition:
field: output.severity
in: [critical, emergency]
rules:
- name: token-spike
description: "Agent using >10x its rolling average tokens"
window: 5m
threshold: 10x
action: alert
- name: loop-divergence
description: "Quality score decreasing across loop iterations"
condition: "iteration[n].score < iteration[n-1].score for 2 consecutive iterations"
action: kill
- name: budget-runaway
description: "Run exceeding 80% of budget with <50% of graph completed"
action: pause_and_alert
- name: latency-outlier
description: "Agent response time >5x p95 baseline"
window: 1h
action: alert
The kill switch is deterministic — a pure-code agent with no LLM. It receives a structured alert from the anomaly detector and publishes a halt message to the target workflow's control topic. No AI deciding whether to pull the plug.
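The rules themselves are small pure functions over telemetry. Two sketches, transcribing the token-spike and loop-divergence rules above (window handling and baselines are simplified assumptions):

```typescript
// token-spike: current usage more than factor× the rolling average of recent calls
function isTokenSpike(recent: number[], current: number, factor = 10): boolean {
  if (recent.length === 0) return false; // no baseline yet
  const avg = recent.reduce((a, b) => a + b, 0) / recent.length;
  return current > factor * avg;
}

// loop-divergence: quality score decreasing for 2 consecutive iterations
function isDiverging(scores: number[]): boolean {
  const n = scores.length;
  return n >= 3 && scores[n - 1] < scores[n - 2] && scores[n - 2] < scores[n - 3];
}
```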
The Optimizer
Runs asynchronously over completed workflow runs. Analyzes historical data and proposes improvements:
# workflows/meta/optimizer.yaml
name: optimizer
type: meta
trigger:
schedule: "0 */6 * * *" # every 6 hours
min_completed_runs: 50 # need enough data
agents:
- id: bottleneck-analyzer
runtime: openai-api-compatible
input: "Last 50 runs of each workflow — per-node latency, token usage, loop counts"
- id: recommendation-engine
runtime: openai-api-compatible
- id: pr-creator
runtime: claude-agent-sdk # needs Git access
capabilities:
- write: workflows/*.yaml # can modify workflow definitions
edges:
- from: bottleneck-analyzer
to: recommendation-engine
- from: recommendation-engine
to: pr-creator
condition:
field: output.confidence
gte: 0.85
What the optimizer looks for:
- Bottleneck agents — consistently the slowest node in the graph. Suggest parallelization or model upgrade.
- Over-provisioned loops — max_iterations=5 but historically converges in 1.2. Suggest lowering the bound.
- Model waste — agent using GPT-4 but output quality identical to GPT-4o-mini based on downstream quality gate pass rates. Suggest downgrade.
- Parallelization opportunities — two sequential agents with no data dependency. Suggest fan-out.
The output is a pull request to the workflow Git repo — not a direct change. A human reviews and merges. The optimizer proposes, it doesn't deploy.
The Auditor
Compliance and governance. Verifies that every workflow run followed the rules:
# workflows/meta/auditor.yaml
name: auditor
type: meta
trigger:
on: workflow.*.completed
agents:
- id: trace-verifier
runtime: deterministic
checks:
- graph-compliance: "Every node executed matches the declared workflow graph"
- budget-compliance: "No run exceeded its declared budget"
- loop-compliance: "No loop exceeded its declared max_iterations"
- schema-compliance: "Every message matched its registered schema"
- id: report-generator
runtime: openai-api-compatible
output_schema: schemas/audit-report.json
The trace verifier is deterministic — it replays the Kafka log for a run and verifies the orchestrator's routing decisions match the workflow definition. If a run somehow deviated from the graph (bug in the orchestrator, race condition, corrupted state), the auditor catches it.
The Canary
Safe deployment of workflow changes:
# workflows/meta/canary.yaml
name: canary
type: meta
trigger:
on: workflow.*.version.deployed
agents:
- id: traffic-splitter
runtime: deterministic
config:
canary_percentage: 10
promotion_threshold: "p95_quality >= baseline AND p95_cost <= baseline * 1.2"
rollback_threshold: "p95_quality < baseline * 0.8 OR error_rate > 5%"
observation_window: 2h
- id: metrics-comparator
runtime: deterministic
edges:
- from: traffic-splitter
to: metrics-comparator
loop:
max_iterations: 12 # check every 10min for 2h
interval: 10m
exit_conditions:
- field: output.decision
in: [promote, rollback]
When a new workflow version is merged, the canary routes 10% of runs to the new version and compares quality/cost/latency metrics against the baseline. After 2 hours (or sooner if thresholds are hit), it either promotes the new version or rolls back. All deterministic — no LLM deciding whether the new version is "good enough."
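The comparator's decision is three inequalities, directly transcribing the thresholds in the YAML above; the metric shape is an assumption:

```typescript
type Metrics = { p95Quality: number; p95Cost: number; errorRate: number };

// Deterministic canary decision: rollback checks run first, promotion second,
// otherwise keep observing until the window closes.
function canaryDecision(
  canary: Metrics,
  baseline: Metrics,
): "promote" | "rollback" | "continue" {
  if (canary.p95Quality < baseline.p95Quality * 0.8 || canary.errorRate > 0.05) {
    return "rollback";
  }
  if (canary.p95Quality >= baseline.p95Quality && canary.p95Cost <= baseline.p95Cost * 1.2) {
    return "promote";
  }
  return "continue";
}
```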
Why Meta-Workflows Matter
Without them, your orchestration system is open-loop — it runs workflows but doesn't learn from them. Meta-workflows close the loop:
Workflows produce execution logs
→ Watchdog monitors in real-time (safety)
→ Auditor verifies after completion (compliance)
→ Optimizer analyzes trends (efficiency)
→ Canary tests changes (safe deployment)
→ Improvements become PRs to workflow definitions
→ Merged changes deploy through the canary
→ Cycle repeats
And because meta-workflows are just workflows, they're subject to the same guarantees: bounded loops, typed schemas, sandboxed agents, deterministic routing, full audit trail. It's self-similar all the way down — but every layer has explicit bounds, so it can't recurse infinitely.
What This Is NOT (and What It Costs)
- Not a framework — it's a proposed architecture. You'd implement it with Kafka, Docker, and your language of choice.
- Not for every use case — if you need agents to creatively collaborate, negotiate, or explore, use CrewAI/AutoGen. This design targets repeatable production workflows.
- Not anti-LLM — LLMs do all the heavy lifting inside each agent. The orchestration layer just doesn't use them.
- Not battle-tested at scale — I'm sharing the design as it evolves. Some of these ideas are informed by production experience (sidecar proxies, bounded loops, schema enforcement); others (the Kafka orchestrator, meta-workflows) are closer to design proposals that I believe would address problems I've seen.
Tradeoffs you'd be accepting:
- Operational complexity — Kafka + Schema Registry + Docker + sidecars is a lot of moving parts. This is not a weekend project.
- Schema friction — defining typed contracts for every agent interaction slows down prototyping. You'll hate it during exploration; you'll love it in production.
- Rigidity — deterministic routing means you can't "let the agent figure it out." If you need a new path, you edit YAML and deploy. That's the point — but it's slower than emergent behavior for novel tasks.
The Stack (All Open-Source)
| Component | Technology | License |
|---|---|---|
| Message bus | Apache Kafka | Apache 2.0 |
| Schema enforcement | Apicurio Registry | Apache 2.0 |
| Orchestrator | KafkaJS + custom TypeScript | MIT |
| Agent containers | Docker | Apache 2.0 |
| Kernel isolation | gVisor | Apache 2.0 |
| LLM client | OpenAI Node SDK | Apache 2.0 |
| Sidecar proxy | Envoy + credential injector | Apache 2.0 |
| Workflow storage | Git | GPL v2 |
| State persistence | Kafka changelog topics + local store | Apache 2.0 |
Every component is replaceable — swap Kafka for NATS or Redis Streams, swap Envoy for a custom proxy, swap Docker for Firecracker microVMs. The architecture is the idea; the stack is one implementation. (Note: KafkaJS is no longer actively maintained; for production, consider confluent-kafka-javascript (librdkafka-based) or a JVM Kafka Streams implementation.)