Most multi-agent systems are nondeterministic by default. Agents negotiate their own workflows, spawn each other ad hoc, and pass free-text reasoning chains around. After running a fleet of AI agents in production — and watching the same PR diff produce three different fixes in three runs — I started designing the orchestration layer I wish I'd had from day one. This article proposes an architecture designed to make every workflow run replayable, every routing decision auditable, and every agent loop explicitly bounded. It's a design I'm actively evolving — not a finished product.
The Problem: LLM-Driven Control Flow
The default story is more nuanced than "everything is chaos." LangGraph defines static graphs in code — routing is explicit Python functions, with a configurable recursion limit (25 in older versions, 10,000 in LangGraph 1.x). CrewAI runs tasks sequentially with a 25-iteration cap per agent. AutoGen defaults to round-robin, though with no loop bound by default (the real footgun).
But look at what happens in practice: tutorials showcase SelectorGroupChat (AutoGen), Process.hierarchical (CrewAI), and ReAct tool loops (LangGraph) — patterns where the LLM decides what happens next. The defaults may be safe, but the encouraged usage patterns are not. And even with bounded loops, within each agent turn the LLM still autonomously decides which tools to call, when to stop, and what to pass along. The result:
- Non-reproducible — same inputs, different execution paths. The orchestration structure might be fixed, but the LLM-driven inner loops make each run unique. Hard to debug, impossible to regression-test.
- Opaque routing — even when routing is code-defined, the LLM's tool-calling decisions inside each node create stochastic side effects that propagate through the graph.
- Unbounded by default — AutoGen has no loop cap unless you set one. CrewAI caps at 25 iterations per agent, and LangGraph's recursion limit (now 10,000 in 1.x) is generous enough to produce surprise bills.
- No inter-agent validation by default — agents pass messages to each other without schema enforcement. One agent's hallucination becomes another's input.
The fix isn't removing agents. It's removing nondeterminism from the orchestration layer while keeping it where it belongs — inside each agent's reasoning.
Core Thesis
The core design principle: the orchestration graph is code, the agents are LLMs. Keep them separate.
In this design, the orchestrator is a state machine with explicit transitions, bounded loops, and typed message contracts. Routing is intended to be purely deterministic code — no LLM deciding which agent runs next. Quality gates can optionally use LLM judges (e.g., "is this code review good enough?"), but they're agents like any other — isolated containers with typed inputs and outputs. The orchestrator only sees their boolean verdict, never their reasoning. Agents don't know the graph exists.
Important distinction: I'm not claiming LLM outputs are deterministic — they're stochastic by nature. What's deterministic is the control flow: given the same agent outputs, the orchestrator would make the same routing decisions every time. The goal is that you can replay any workflow run from the Kafka log and verify the exact same path was taken.
Design Goals
- Replayable — every workflow run can be replayed from recorded messages
- Auditable — every routing decision is a pure predicate you can inspect
- Bounded — loops have convergence detection, quality thresholds, and hard ceilings
- Testable — routing logic is unit-testable, schemas are contract-testable, full runs are replay-testable
- Provider-agnostic — swap LLM providers per agent without touching orchestration
- Zero-trust — agents have no credentials, no network, no knowledge of each other
Architecture Overview
┌──────────────────────────────────────────────────┐
│ Git Repository │
│ workflows/*.yaml agents/*.yaml schemas/ │
└──────────────────────┬───────────────────────────┘
│ deploy
┌──────────────────────▼───────────────────────────┐
│ Kafka-Based Orchestrator │
│ │
│ ┌─────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Graph │ │ State │ │ Budget / │ │
│ │ Engine │ │ Store │ │ Loop Guard │ │
│ └─────────┘ └──────────┘ └──────────────┘ │
│ │
│ Reads from / writes to Kafka topics │
└──────────────────────┬───────────────────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Agent A │ │ Agent B │ │ Agent C │
│┌───────┐ │ │┌───────┐ │ │┌───────┐ │
││Sidecar│ │ ││Sidecar│ │ ││Sidecar│ │
│└───────┘ │ │└───────┘ │ │└───────┘ │
│ Docker │ │ Docker │ │ Docker │
│net: none │ │net: none │ │net: none │
└──────────┘ └──────────┘ └──────────┘
Layer 1: Git-Stored Workflow Definitions
Every workflow is a YAML file in a Git repository. No UI, no database — Git is the source of truth. Versioning, diffs, PRs for workflow changes, and audit trail come from Git itself.
Inspired by GitAgent's SkillsFlow format, but with explicit loop semantics:
# workflows/code-review.yaml
name: code-review
version: "1.0"
description: Automated code review with iterative feedback
inputs:
pr_diff:
type: string
required: true
repo_url:
type: string
required: true
agents:
- id: analyzer
image: agents/code-analyzer:latest
runtime: openai-api-compatible # provider-agnostic
input_schema: schemas/analyzer-input.json
output_schema: schemas/analyzer-output.json
- id: security-checker
image: agents/security-check:latest
runtime: openai-api-compatible
input_schema: schemas/security-input.json
output_schema: schemas/security-output.json
- id: code-fixer
image: agents/code-fixer:latest
runtime: claude-agent-sdk # needs file access
input_schema: schemas/fixer-input.json
output_schema: schemas/fixer-output.json
- id: quality-gate
image: agents/quality-validator:latest
runtime: deterministic # no LLM — pure code
input_schema: schemas/quality-input.json
output_schema: schemas/quality-output.json
edges:
- from: analyzer
to: security-checker
- from: security-checker
to: code-fixer
- from: code-fixer
to: quality-gate
# The loop: quality gate can send back to code-fixer
- from: quality-gate
to: code-fixer
condition:
field: output.passed
equals: false
loop:
exit_conditions:
- field: output.quality_score
convergence:
delta: 0.05 # exit if |score[n] - score[n-1]| < 0.05
window: 1 # compare last 1 iteration (use 2+ for moving average)
- field: output.quality_score
gte: 0.9 # exit if score crosses threshold
max_iterations: 5 # hard ceiling — the last resort, not the strategy
on_exhaustion: escalate # or: fail, skip, human-review
# $output is a reserved sink — the workflow's final result
- from: quality-gate
to: $output
condition:
field: output.passed
equals: true
budget:
max_total_tokens: 500000
max_cost_usd: 5.00
max_wall_time: 600s
Why YAML in Git?
- Diffable — `git diff` shows exactly what changed in a workflow
- Reviewable — workflow changes go through PRs, just like code
- Branchable — test workflow changes in a branch before deploying
- Rollbackable — `git revert` undoes a broken workflow change
- No vendor lock-in — it's files in a repo, not entries in a SaaS database
What About Loops?
This is a directed graph with cycles — not a DAG. The key constraint: every cycle must have an explicit exit condition and a maximum iteration count. The graph definition is static; only the traversal path depends on runtime data.
Loops have multiple exit conditions — convergence detection, quality thresholds, budget ceilings — and a hard max_iterations as the last resort, not the primary strategy. (I wrote a whole article about why a simple iteration counter is not enough.) Think of it as:
while (!converged && !qualityMet && iteration < maxIterations && !budgetExceeded) {
  run(codeFixer);
  run(qualityGate);
}
The orchestrator enforces all of these. The agents don't even know they're in a loop.
Layer 2: Kafka as the Validation Bus
The orchestrator reads the YAML graph and creates Kafka topic topologies — one input/output pair per agent. From this point, the YAML is compiled into a running system.
Agents never communicate directly. Every message flows through Kafka and the orchestrator, which validates schema compliance, strips reasoning chain contamination, and routes based on typed output fields — not free-text. This is the orchestrator-as-translator pattern baked into the infrastructure.
Why this matters: when agents pass raw reasoning chains to each other, hallucinations propagate and compound. Agent B trusts Agent A's confident-sounding nonsense and builds on it. The orchestrator breaks this chain by enforcing structured summary packets — typed schemas with explicit fields, not prose.
Topic Topology
Each agent gets an input topic and an output topic:
workflow.code-review.analyzer.input
workflow.code-review.analyzer.output
workflow.code-review.security-checker.input
workflow.code-review.security-checker.output
...
The orchestrator consumes output topics, validates every message against its registered schema, and only then produces to the next agent's input topic. Invalid messages are rejected and routed to a dead letter topic — they never reach downstream agents.
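As a sketch, the consume-validate-produce hop reduces to two pure helpers. The names and shapes here are assumptions for illustration; the actual KafkaJS consumer/producer wiring is omitted:

```typescript
// Derive the topic name from the workflow/agent naming convention above.
function topicFor(workflow: string, agent: string, dir: "input" | "output"): string {
  return `workflow.${workflow}.${agent}.${dir}`;
}

// Decide where a validated (or rejected) message goes next.
function routeValidated(
  workflow: string,
  nextAgent: string,
  message: unknown,
  isValid: (m: unknown) => boolean,
): { topic: string; payload: unknown } {
  // Invalid messages never reach downstream agents: they land on a dead letter topic.
  if (!isValid(message)) {
    return { topic: `workflow.${workflow}.deadletter`, payload: message };
  }
  return { topic: topicFor(workflow, nextAgent, "input"), payload: message };
}
```

The dead-letter topic name is a hypothetical convention; the point is that the forwarding decision is a pure function of the validation result.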
Schema Registry as the Trust Boundary
Every message has a schema. At the simplest level, this is JSON Schema validated with Zod at the orchestrator — fast to iterate, familiar to TypeScript developers. But for production at scale, you can level up to Avro or Protobuf schemas in a Schema Registry (or Apicurio for fully open-source). The registry gives you schema evolution rules (backward/forward compatibility), binary serialization (smaller messages), and compile-time type generation — things JSON Schema can't do.
This solves two problems at once:
- Schema drift — if an agent's output structure changes, the registry catches it before downstream agents see garbage
- Reasoning chain contamination — agents can't smuggle free-text reasoning into typed fields. The schema enforces structured summary packets: explicit findings, scores, and decisions — not "here's my thought process"
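A minimal validator makes the contamination guard concrete. This is a hand-rolled sketch standing in for Zod or a registry client; the type and field names are illustrative, not a fixed contract:

```typescript
// Hypothetical quality-gate output contract.
type GateOutput = { passed: boolean; quality_score: number; findings: string[] };

const ALLOWED_KEYS = new Set(["passed", "quality_score", "findings"]);

function validateGateOutput(
  raw: unknown,
): { ok: true; value: GateOutput } | { ok: false; reason: string } {
  if (typeof raw !== "object" || raw === null) return { ok: false, reason: "not an object" };
  const msg = raw as Record<string, unknown>;
  // Reject unknown fields: this is what blocks reasoning-chain contamination.
  // An agent can't smuggle a free-text "thoughts" field past the schema.
  for (const key of Object.keys(msg)) {
    if (!ALLOWED_KEYS.has(key)) return { ok: false, reason: `unexpected field: ${key}` };
  }
  if (typeof msg.passed !== "boolean") return { ok: false, reason: "passed must be boolean" };
  if (typeof msg.quality_score !== "number" || msg.quality_score < 0 || msg.quality_score > 1)
    return { ok: false, reason: "quality_score must be in [0,1]" };
  if (!Array.isArray(msg.findings) || !msg.findings.every((f) => typeof f === "string"))
    return { ok: false, reason: "findings must be string[]" };
  return {
    ok: true,
    value: { passed: msg.passed, quality_score: msg.quality_score, findings: msg.findings as string[] },
  };
}
```

With Zod, the equivalent would be a `.strict()` object schema; the registry approach adds evolution rules on top of the same gate.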
Why Kafka and Not HTTP/gRPC?
In my earlier architecture, agents communicated via HTTP through a sidecar proxy — request-response to downstream services. That works for service queries, but for workflow orchestration you need replay, ordering, and backpressure. Kafka gives you all three as infrastructure primitives:
| Feature | HTTP/gRPC | Kafka |
|---|---|---|
| Replay | Build it yourself | Built-in (consumer offsets) |
| Audit log | Build it yourself | The log IS the audit |
| Backpressure | Build it yourself | Consumer pause/resume, broker quotas |
| Validation | Per-handler | Centralized — orchestrator validates every message |
| Decoupling | Tight | Total — agents don't know each other exist |
| Ordering | Per-request | Per-partition guarantee |
| Persistence | Ephemeral | Configurable retention |
The feature I'm most excited about: deterministic replay. Given a workflow run ID, you could replay every recorded message from the Kafka log and verify the orchestrator made the same routing decisions. Replay would work from stored outputs, not by re-invoking the LLMs — the log is the source of truth. Note: replay uses event-time from the log, not wall-clock. Runtime-only guards like max_wall_time would be enforced during live execution but excluded from replay verification.
And if the orchestrator crashes mid-workflow? Kafka doesn't care. Consumer offsets track where each agent left off. On restart, the orchestrator would resume from the last committed offset — no lost messages, no duplicate processing (with idempotent producers and transactional offset commits), no expensive LLM re-calls.
Layer 3: The Kafka-Based Orchestrator
This is the novel piece — a TypeScript application built on KafkaJS that executes workflow graphs. It implements the state-management patterns from Kafka Streams (changelog-backed state stores, partition-local state) in userland — you don't get Kafka Streams' built-in exactly-once state/offset atomicity for free, but you get the architectural benefits with a stack that stays in the TypeScript ecosystem.
State Store
The orchestrator maintains state per workflow run in a changelog-backed state store (conceptually similar to Kafka Streams state stores, implemented as a local store with a Kafka changelog topic for recovery):
{
"run_id": "abc-123",
"workflow": "code-review",
"status": "running",
"current_node": "quality-gate",
"loop_counters": {
"quality-gate→code-fixer": 2
},
"budget": {
"tokens_used": 142000,
"cost_usd": 1.87,
"started_at": "2026-03-26T10:00:00Z"
},
"node_outputs": {
"analyzer": { "ref": "topic:offset:42" },
"security-checker": { "ref": "topic:offset:43" }
}
}
Event-Sourced State Machine
Borrowing from the ESAA paper (Event Sourcing for Autonomous Agents — a pattern for separating agent intentions from state mutations): agents don't mutate state directly. They emit structured intentions — the orchestrator validates and applies effects.
Agent output: { "action": "approve", "findings": [...], "confidence": 0.92 }
Orchestrator: validates schema → checks budget → evaluates edge conditions → routes to next node
The agent never says "send this to agent X" or "run this again." It produces typed output. The orchestrator validates the schema, strips any free-text reasoning that leaked outside designated fields, and routes to the next node. Agents are completely blind to the graph topology — they don't know who consumed their output or who produced their input. Intention/effect separation, enforced at the infrastructure level.
Bounded Loop Execution
// The orchestrator's routing logic — zero LLM calls
function route(runState: RunState, nodeOutput: NodeOutput): RoutingDecision {
  // Budget first: an over-budget run terminates even if no edge matches
  if (runState.budget.exceeded()) {
    return terminate(runState, "budget_exceeded");
  }
  for (const edge of workflow.edgesFrom(runState.currentNode)) {
    if (!evaluateCondition(edge.condition, nodeOutput)) continue;
    if (edge.loop) {
      const counter = runState.loopCounters[edge.id] ?? 0;
      const prevOutput = runState.previousOutputs[edge.id];
      // Smart exit conditions first — max_iterations is the last resort
      if (edge.loop.exitConditions) {
        if (checkConvergence(prevOutput, nodeOutput, edge.loop.exitConditions)) {
          return routeTo(edge.loop.convergenceTarget ?? "$output", nodeOutput);
        }
        if (checkThreshold(nodeOutput, edge.loop.exitConditions)) {
          return routeTo(edge.loop.convergenceTarget ?? "$output", nodeOutput);
        }
      }
      // Hard ceiling — the safety net, not the strategy
      if (counter >= edge.loop.maxIterations) {
        return handleExhaustion(edge.loop.onExhaustion);
      }
      runState.loopCounters[edge.id] = counter + 1;
      runState.previousOutputs[edge.id] = nodeOutput;
    }
    return routeTo(edge.target, nodeOutput);
  }
  return terminate(runState, "no_matching_edge");
}
Deterministic given the same inputs. The routing function evaluates convergence, thresholds, iteration counts, and budgets — all pure predicates, no AI.
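The predicates themselves are small. Here is a sketch of `evaluateCondition` and a convergence check matching the YAML's `field`/`equals`/`gte` and `delta` semantics; the condition shape is an assumption, not a finalized schema:

```typescript
// Edge conditions as declared in the workflow YAML.
type Condition = { field: string; equals?: unknown; gte?: number };

// Resolve "output.quality_score" against a node output object.
function getField(output: Record<string, unknown>, path: string): unknown {
  return path
    .split(".")
    .slice(1) // drop the leading "output" prefix
    .reduce<any>((o, k) => (o == null ? undefined : o[k]), output);
}

function evaluateCondition(
  cond: Condition | undefined,
  output: Record<string, unknown>,
): boolean {
  if (!cond) return true; // unconditional edge
  const value = getField(output, cond.field);
  if (cond.equals !== undefined) return value === cond.equals;
  if (cond.gte !== undefined) return typeof value === "number" && value >= cond.gte;
  return false;
}

// |score[n] - score[n-1]| < delta, as in the YAML's convergence block
function converged(prev: number | undefined, curr: number, delta: number): boolean {
  return prev !== undefined && Math.abs(curr - prev) < delta;
}
```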
Layer 4: LLM-Provider-Agnostic Agent Runtime
Each agent is a Docker container with a standard interface: consume from Kafka topic, produce to Kafka topic. What happens inside the container is the agent's business.
Two Runtime Types
1. OpenAI-API-compatible runtime — for analysis, classification, summarization:
Agent container
├── Kafka consumer (input topic)
├── OpenAI SDK client → sidecar proxy → any LLM provider
└── Kafka producer (output topic)
The OpenAI Node SDK talks to any provider that implements the OpenAI API format — just swap the baseURL. No wrapper libraries, no abstraction layers. The agent code doesn't know if it's talking to GPT-4, Groq, Together, or a local Ollama instance. Provider choice is a URL, not code:
const client = new OpenAI({
baseURL: process.env.LLM_BASE_URL, // set per agent container
apiKey: "PROXY:default", // sidecar injects real key
});
2. Claude Agent SDK runtime — for tasks that need file access:
Agent container
├── Kafka consumer (input topic)
├── Claude Agent SDK → sidecar proxy → Anthropic API
├── Workspace volume mount (read/write)
└── Kafka producer (output topic)
The Claude Agent SDK gives you a full autonomous agent with built-in file and shell operations — the same tool suite that powers Claude Code (file read/write/edit, shell execution, codebase search). It's Claude-only, but that lock-in is contained — it's one agent type in one container, not a system-wide dependency.
Why Two Runtimes?
File-access agents need tool use — browsing directories, editing code, running tests. The OpenAI function-calling API can technically do this, but you'd be reimplementing Claude Code's entire tool loop (file discovery, edit application, error recovery). The Agent SDK gives you that for free. The pragmatic choice: use the best tool for the job, isolate the dependency.
The key insight: the orchestrator doesn't care which runtime an agent uses. It only sees Kafka messages with typed schemas going in and out. Runtime choice is an implementation detail of each agent container — you could add a third runtime (Gemini, local Llama, a shell script) without changing the orchestration layer.
Escalation and Human-in-the-Loop
When a loop hits on_exhaustion: escalate, the orchestrator publishes to a special workflow.{name}.escalation topic. What happens next depends on your setup:
- GitHub/GitLab issue — a deterministic agent creates a ticket with the full run context (inputs, iteration history, why convergence failed)
- Slack/webhook notification — alert a human who can inspect the Kafka log and decide
- Human-as-agent — the human provides a typed decision event (approve/reject/override) via a simple UI that publishes back to Kafka. The orchestrator treats it like any other agent output — schema-validated, logged, replayable
The human doesn't break determinism because their decision is recorded as an event in the Kafka log. On replay, you see exactly what the human chose and when.
Layer 5: Zero-Trust Agent Sandboxing
Agents run with zero credentials and zero network access. Design as if the sandbox were the only layer: if it fails, nothing else stops the agent from exfiltrating your code.
The Sidecar Proxy Pattern
┌─────────────────────────────────────┐
│ Agent Pod / Compose │
│ │
│ ┌───────────────┐ ┌────────────┐ │
│ │ Agent │ │ Sidecar │ │
│ │ ─┼──┤ Proxy │ │
│ │ net: none │ │ │──┼──→ LLM APIs
│ │ no API keys │ │ holds │ │
│ │ ─┼──┤ secrets │──┼──→ Kafka
│ │ /workspace │ │ │ │
│ │ only │ │ allowlist │ │
│ └───────────────┘ └────────────┘ │
│ ↕ Unix socket │
└─────────────────────────────────────┘
How it works:
- The agent container starts with `--network none` — this creates an isolated network namespace with only a loopback interface. No external interfaces, no routes, no DNS resolution. (In my earlier architecture, agents had loopback access to a localhost proxy — that still leaves a TCP endpoint for potential exploits to target. `--network none` combined with the Unix socket pattern eliminates that attack surface: the agent has no TCP listener to connect to.)
- The only communication channel is a Unix domain socket (shared volume mount). The agent can talk to exactly one thing — the sidecar proxy. On Linux, Unix sockets support `SO_PEERCRED`, so the proxy can verify the UID/PID of the connecting process (agents here always run in Linux containers).
- When a workflow run starts, the orchestrator mints a short-lived JWT scoped to exactly the services this agent needs — RBAC-based, expiring in minutes. The token is injected into the sidecar's config, never into the agent's environment.
- The agent makes API calls with placeholder tokens (`Authorization: Bearer PROXY:anthropic`). The sidecar validates the JWT scope, substitutes real credentials, and forwards to the allowed domain.
- When the workflow run completes, the JWT expires. No standing credentials anywhere.
- Response masking strips any echoed credentials before returning to the agent.
This is the JIT credential model — no agent ever holds a real API key, and credentials exist only for the duration of a single workflow run.
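The substitution step itself is a few lines. This sketch uses the placeholder syntax from above; the key value and scope names are hypothetical stand-ins:

```typescript
// Sidecar-side credential substitution. The real key never reaches the agent;
// the placeholder is swapped only if the run's JWT is scoped for that provider.
// The key below is an illustrative fake, not a real secret.
const REAL_KEYS: Record<string, string> = { anthropic: "sk-fake-for-illustration" };

function substituteAuth(header: string, jwtScopes: string[]): string | null {
  const match = /^Bearer PROXY:(\w+)$/.exec(header);
  if (!match) return null; // not a placeholder: reject outright
  const provider = match[1];
  if (!jwtScopes.includes(provider) || !(provider in REAL_KEYS)) return null;
  return `Bearer ${REAL_KEYS[provider]}`; // forward with the real credential
}
```

Returning `null` here would translate to the proxy refusing the request; anything that isn't a scoped placeholder never leaves the pod.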
File system isolation:
- The agent sees only `/workspace` (a bind-mounted project directory)
- No access to `~/.ssh`, `~/.aws`, `/etc/passwd`, or the host filesystem
- Optional gVisor kernel-level isolation for defense against container escape
This pattern is already production-proven: Anthropic's own sandbox-runtime uses it for Claude Code, and the Kubernetes agent-sandbox SIG (1,500+ stars) standardizes it as a CRD.
For a deeper dive on agent security, see my earlier article on zero-trust security for AI agents.
Layer 6: Observability and Deterministic Replay
Correlation Through Kafka
Every message carries a run_id. Since all inter-agent communication flows through Kafka topics, you get a complete execution trace by reading the topics filtered by run_id.
run_id: abc-123
→ analyzer.input (offset 100, t=0ms)
→ analyzer.output (offset 101, t=3200ms, tokens=1200)
→ security.input (offset 50, t=3210ms)
→ security.output (offset 51, t=5100ms, tokens=800)
→ fixer.input (offset 30, t=5110ms)
→ fixer.output (offset 31, t=12000ms, tokens=3500)
→ gate.input (offset 20, t=12010ms)
→ gate.output (offset 21, t=12050ms, passed=false) ← loop back
→ fixer.input (offset 32, t=12060ms, iteration=2)
→ fixer.output (offset 33, t=19000ms, tokens=2800)
→ gate.input (offset 22, t=19010ms)
→ gate.output (offset 23, t=19040ms, passed=true) ← exit loop
Deterministic Replay
The orchestrator's routing decisions are deterministic given the same agent outputs. To verify:
- Read the original Kafka messages for a run
- Feed the agent outputs through the orchestrator's routing logic
- Verify the same routing decisions were made
Ordering caveat: this only works if all events for a given run are causally ordered. The simplest approach: partition all agent topics by run_id, so every message for a run lands on the same partition and Kafka's per-partition ordering guarantee does the rest. Without this, replay across partitions requires explicit sequence numbers or vector clocks — complexity you don't want.
The agent outputs themselves won't be identical (LLMs are stochastic), but the orchestration path is reproducible. This is the key distinction: we're not trying to make LLMs deterministic — we're making the system around them deterministic.
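Concretely, verification is a pure diff over the recorded log. The types and routing table below are hypothetical, mirroring the code-review workflow from Layer 1:

```typescript
// One recorded hop from the Kafka log for a single run.
type RecordedStep = { node: string; output: { passed: boolean }; routedTo: string };

// Stand-in for the orchestrator's deterministic routing predicate,
// transcribing the code-review graph's edges.
function routeFor(node: string, output: { passed: boolean }): string {
  if (node === "quality-gate") return output.passed ? "$output" : "code-fixer";
  const next: Record<string, string> = {
    analyzer: "security-checker",
    "security-checker": "code-fixer",
    "code-fixer": "quality-gate",
  };
  return next[node];
}

// Same outputs in, same routing decisions out — or the run is flagged.
function verifyReplay(log: RecordedStep[]): boolean {
  return log.every((step) => routeFor(step.node, step.output) === step.routedTo);
}
```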
Why Replayability Changes Everything
Replay isn't just for debugging — it unlocks capabilities you can't get from a non-reproducible system:
- Model comparison — replay the same workflow with GPT-4o vs. Claude Sonnet vs. Llama 3 in each agent slot. Same inputs, same graph, different models. Compare quality gate pass rates, token usage, and cost. Find the best quality/cost ratio per agent, not per system.
- Isolated agent testing — record real production messages from the Kafka log, use them as test fixtures. Swap out one agent, replay the run, compare outputs. You're testing agents against real data without running the whole pipeline live.
- Regression detection — after a model update or prompt change, replay the last 100 runs and diff the quality gate outcomes. Did pass rates change? Did convergence speed up or slow down?
- Cost optimization — replay with cheaper models and measure which agents can tolerate a downgrade without quality loss. The optimizer meta-workflow does this automatically.
- Root cause analysis — when a run produces bad output, replay it step by step. Inspect every inter-agent message. Find exactly where the quality degraded — which agent, which iteration, which input caused it.
- Compliance auditing — prove to stakeholders that a specific run followed the declared workflow, hit the quality gate, and stayed within budget. The Kafka log is the receipt.
None of this is possible when your orchestration is "the LLM decided."
Cost Tracking
The sidecar proxy logs token usage per request. The orchestrator aggregates:
Run abc-123:
analyzer: 1,200 tokens $0.02
security-checker: 800 tokens $0.01
code-fixer (×2): 6,300 tokens $0.12 ← ran twice (loop)
quality-gate: 0 tokens $0.00 ← deterministic, no LLM
Total: 8,300 tokens $0.15
Wall time: 19.04s
Loop iterations: 2/5 (quality-gate→code-fixer)
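Aggregating the sidecar's per-request usage records into a run report is a plain fold. The record shape here is an assumption about what the sidecar logs:

```typescript
// One usage record per proxied LLM request, as logged by the sidecar.
type Usage = { runId: string; agent: string; tokens: number; costUsd: number };

function aggregate(records: Usage[], runId: string) {
  const perAgent = new Map<string, { tokens: number; costUsd: number; calls: number }>();
  for (const r of records.filter((r) => r.runId === runId)) {
    const cur = perAgent.get(r.agent) ?? { tokens: 0, costUsd: 0, calls: 0 };
    perAgent.set(r.agent, {
      tokens: cur.tokens + r.tokens,
      costUsd: cur.costUsd + r.costUsd,
      calls: cur.calls + 1, // calls > 1 reveals loop iterations, e.g. code-fixer ×2
    });
  }
  return perAgent;
}
```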
Layer 7: Meta-Workflows — The System That Watches Itself
Here's where the design gets interesting. In my earlier architecture, I had a single meta-workflow that analyzed logs and staged PRs. That was a good start, but it was doing too many things at once. The natural evolution: split it into four specialized meta-workflows, each with a single responsibility — and run them on the exact same infrastructure as regular workflows: YAML definitions in Git, Kafka topics, bounded loops, sandboxed agents.
The orchestrator doesn't distinguish between a "regular" workflow processing code and a "meta" workflow analyzing execution logs. It's the same graph engine. The only difference is the input: agent outputs vs. system telemetry.
The Watchdog
A real-time anomaly detector that subscribes to execution log topics:
# workflows/meta/watchdog.yaml
name: watchdog
type: meta
trigger: continuous # always running, consuming the log stream
agents:
- id: anomaly-detector
runtime: openai-api-compatible
subscribe: workflow.*.*.output # all agent outputs, all workflows
- id: kill-switch
runtime: deterministic # no LLM — pure code
capabilities:
- publish: workflow.*.control # can send halt signals
edges:
- from: anomaly-detector
to: kill-switch
condition:
field: output.severity
in: [critical, emergency]
rules:
- name: token-spike
description: "Agent using >10x its rolling average tokens"
window: 5m
threshold: 10x
action: alert
- name: loop-divergence
description: "Quality score decreasing across loop iterations"
condition: "iteration[n].score < iteration[n-1].score for 2 consecutive iterations"
action: kill
- name: budget-runaway
description: "Run exceeding 80% of budget with <50% of graph completed"
action: pause_and_alert
- name: latency-outlier
description: "Agent response time >5x p95 baseline"
window: 1h
action: alert
The kill switch is deterministic — a pure-code agent with no LLM. It receives a structured alert from the anomaly detector and publishes a halt message to the target workflow's control topic. No AI deciding whether to pull the plug.
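The rules themselves are small pure functions over telemetry. Two sketches, transcribing the token-spike and loop-divergence rules above (window handling and baselines are simplified assumptions):

```typescript
// token-spike: current usage more than factor× the rolling average of recent calls
function isTokenSpike(recent: number[], current: number, factor = 10): boolean {
  if (recent.length === 0) return false; // no baseline yet
  const avg = recent.reduce((a, b) => a + b, 0) / recent.length;
  return current > factor * avg;
}

// loop-divergence: quality score decreasing for 2 consecutive iterations
function isDiverging(scores: number[]): boolean {
  const n = scores.length;
  return n >= 3 && scores[n - 1] < scores[n - 2] && scores[n - 2] < scores[n - 3];
}
```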
The Optimizer
Runs asynchronously over completed workflow runs. Analyzes historical data and proposes improvements:
# workflows/meta/optimizer.yaml
name: optimizer
type: meta
trigger:
schedule: "0 */6 * * *" # every 6 hours
min_completed_runs: 50 # need enough data
agents:
- id: bottleneck-analyzer
runtime: openai-api-compatible
input: "Last 50 runs of each workflow — per-node latency, token usage, loop counts"
- id: recommendation-engine
runtime: openai-api-compatible
- id: pr-creator
runtime: claude-agent-sdk # needs Git access
capabilities:
- write: workflows/*.yaml # can modify workflow definitions
edges:
- from: bottleneck-analyzer
to: recommendation-engine
- from: recommendation-engine
to: pr-creator
condition:
field: output.confidence
gte: 0.85
What the optimizer looks for:
- Bottleneck agents — consistently the slowest node in the graph. Suggest parallelization or model upgrade.
- Over-provisioned loops — max_iterations=5 but historically converges in 1.2. Suggest lowering the bound.
- Model waste — agent using GPT-4 but output quality identical to GPT-4o-mini based on downstream quality gate pass rates. Suggest downgrade.
- Parallelization opportunities — two sequential agents with no data dependency. Suggest fan-out.
The output is a pull request to the workflow Git repo — not a direct change. A human reviews and merges. The optimizer proposes, it doesn't deploy.
The Auditor
Compliance and governance. Verifies that every workflow run followed the rules:
# workflows/meta/auditor.yaml
name: auditor
type: meta
trigger:
on: workflow.*.completed
agents:
- id: trace-verifier
runtime: deterministic
checks:
- graph-compliance: "Every node executed matches the declared workflow graph"
- budget-compliance: "No run exceeded its declared budget"
- loop-compliance: "No loop exceeded its declared max_iterations"
- schema-compliance: "Every message matched its registered schema"
- id: report-generator
runtime: openai-api-compatible
output_schema: schemas/audit-report.json
The trace verifier is deterministic — it replays the Kafka log for a run and verifies the orchestrator's routing decisions match the workflow definition. If a run somehow deviated from the graph (bug in the orchestrator, race condition, corrupted state), the auditor catches it.
The Canary
Safe deployment of workflow changes:
# workflows/meta/canary.yaml
name: canary
type: meta
trigger:
on: workflow.*.version.deployed
agents:
- id: traffic-splitter
runtime: deterministic
config:
canary_percentage: 10
promotion_threshold: "p95_quality >= baseline AND p95_cost <= baseline * 1.2"
rollback_threshold: "p95_quality < baseline * 0.8 OR error_rate > 5%"
observation_window: 2h
- id: metrics-comparator
runtime: deterministic
edges:
- from: traffic-splitter
to: metrics-comparator
loop:
max_iterations: 12 # check every 10min for 2h
interval: 10m
exit_conditions:
- field: output.decision
in: [promote, rollback]
When a new workflow version is merged, the canary routes 10% of runs to the new version and compares quality/cost/latency metrics against the baseline. After 2 hours (or sooner if thresholds are hit), it either promotes the new version or rolls back. All deterministic — no LLM deciding whether the new version is "good enough."
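The comparator's decision is three inequalities, directly transcribing the thresholds in the YAML above; the metric shape is an assumption:

```typescript
type Metrics = { p95Quality: number; p95Cost: number; errorRate: number };

// Deterministic canary decision: rollback checks run first, promotion second,
// otherwise keep observing until the window closes.
function canaryDecision(
  canary: Metrics,
  baseline: Metrics,
): "promote" | "rollback" | "continue" {
  if (canary.p95Quality < baseline.p95Quality * 0.8 || canary.errorRate > 0.05) {
    return "rollback";
  }
  if (canary.p95Quality >= baseline.p95Quality && canary.p95Cost <= baseline.p95Cost * 1.2) {
    return "promote";
  }
  return "continue";
}
```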
Why Meta-Workflows Matter
Without them, your orchestration system is open-loop — it runs workflows but doesn't learn from them. Meta-workflows close the loop:
Workflows produce execution logs
→ Watchdog monitors in real-time (safety)
→ Auditor verifies after completion (compliance)
→ Optimizer analyzes trends (efficiency)
→ Canary tests changes (safe deployment)
→ Improvements become PRs to workflow definitions
→ Merged changes deploy through the canary
→ Cycle repeats
And because meta-workflows are just workflows, they're subject to the same guarantees: bounded loops, typed schemas, sandboxed agents, deterministic routing, full audit trail. It's self-similar all the way down — but every layer has explicit bounds, so it can't recurse infinitely.
What This Is NOT (and What It Costs)
- Not a framework — it's a proposed architecture. You'd implement it with Kafka, Docker, and your language of choice.
- Not for every use case — if you need agents to creatively collaborate, negotiate, or explore, use CrewAI/AutoGen. This design targets repeatable production workflows.
- Not anti-LLM — LLMs do all the heavy lifting inside each agent. The orchestration layer just doesn't use them.
- Not battle-tested at scale — I'm sharing the design as it evolves. Some of these ideas are informed by production experience (sidecar proxies, bounded loops, schema enforcement); others (the Kafka orchestrator, meta-workflows) are closer to design proposals that I believe would address problems I've seen.
Tradeoffs you'd be accepting:
- Operational complexity — Kafka + Schema Registry + Docker + sidecars is a lot of moving parts. This is not a weekend project.
- Schema friction — defining typed contracts for every agent interaction slows down prototyping. You'll hate it during exploration; you'll love it in production.
- Rigidity — deterministic routing means you can't "let the agent figure it out." If you need a new path, you edit YAML and deploy. That's the point — but it's slower than emergent behavior for novel tasks.
The Stack (All Open-Source)
| Component | Technology | License |
|---|---|---|
| Message bus | Apache Kafka | Apache 2.0 |
| Schema enforcement | Apicurio Registry | Apache 2.0 |
| Orchestrator | KafkaJS + custom TypeScript | MIT |
| Agent containers | Docker | Apache 2.0 |
| Kernel isolation | gVisor | Apache 2.0 |
| LLM client | OpenAI Node SDK | Apache 2.0 |
| Sidecar proxy | Envoy + credential injector | Apache 2.0 |
| Workflow storage | Git | GPL v2 |
| State persistence | Kafka changelog topics + local store | Apache 2.0 |
Every component is replaceable — swap Kafka for NATS or Redis Streams, swap Envoy for a custom proxy, swap Docker for Firecracker microVMs. The architecture is the idea; the stack is one implementation. (Note: KafkaJS is no longer actively maintained; for production, consider confluent-kafka-javascript (librdkafka-based) or a JVM Kafka Streams implementation.)