DEV Community: Philipp Enderle

Your Multi-Agent Framework Handles Operations. What About the Other Five?

Philipp Enderle — Mon, 02 Mar 2026 23:13:19 +0000

In Part 1, I introduced the Viable System Model (VSM) and how it maps to multi-agent AI systems. The response was great — but the most common question was: "OK, the theory makes sense. But how is this actually different from what CrewAI/LangGraph/AutoGen already do?"

Fair question. Let me answer it properly.

The One Thing Every Framework Gets Right

Every multi-agent framework gives you System 1 — Operations. The agents that do actual work. Define a role, give it tools, let it run. CrewAI calls them "agents." LangGraph calls them "nodes." AutoGen calls them "agents" too. This part works.

The problem is that operations is 1 of 6 necessary control functions. The other five — coordination, optimization, audit, intelligence, and identity — are either missing entirely or left as an exercise for the developer.

Here's what that looks like:

              S1    S2    S3    S3*   S4    S5
             Ops  Coord Optim Audit Intel Ident
CrewAI        ✅    ❌    ⚠️    ❌    ❌    ❌
LangGraph     ✅    ❌    ⚠️    ❌    ❌    ❌
OpenAI Agents ✅    ❌    ❌    ❌    ❌    ❌
AutoGen       ⚠️    ⚠️    ❌    ❌    ❌    ❌
ViableOS      ✅    ✅    ✅    ✅    ✅    ✅

This isn't a knock on those frameworks. They're excellent infrastructure — they give you the building blocks to run agents. But infrastructure isn't organization. It's like having Kubernetes without knowing what services to deploy, how they should communicate, and who watches them.

Why VSM Is Different: It's About the Channels, Not the Agents

Stafford Beer's Viable System Model isn't a framework you bolt onto agents. It's a structural theory about what ANY viable system needs to survive — whether it's a cell, a company, or a swarm of AI agents. He published it in 1972. It's been validated on governments, corporations, and cooperatives. And it maps 1:1 to multi-agent AI.

The key insight: viability requires specific communication channels, not just capable components.

In a flat multi-agent system with 5 agents, every agent might talk to every other agent, so you potentially need 20 one-way communication channels (10 agent pairs, two directions each). That's n×(n-1) channels, which grows quadratically. It doesn't scale. More importantly, it doesn't differentiate: a resource conflict looks the same as a strategic concern, which looks the same as an emergency.
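The channel count is easy to check. Here is a minimal sketch (the function names are illustrative only). It counts one-way channels, where A→B and B→A are separate, and agent pairs, where they count once.

```python
def directed_channels(n: int) -> int:
    """One-way channels when every agent can message every other agent."""
    return n * (n - 1)


def agent_pairs(n: int) -> int:
    """Distinct agent pairs, counting A<->B once."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(f"{n} agents: {directed_channels(n)} one-way channels, "
              f"{agent_pairs(n)} pairs")
```

Going from 5 to 10 agents doubles the team but more than quadruples the one-way channels.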

The VSM replaces this with structured channels, each with a specific purpose:

S5 (Identity/Policy)
    ↕ balance channel
S4 (Intelligence)     S3 (Optimization)
    ↕ strategy bridge      ↕ command channel
                      S2 (Coordination)
                           ↕ coordination rules
              S1a ←→ S1b ←→ S1c (Operations)
                    ↕
              S3* (Audit — independent, different provider)

S2 coordination rules prevent conflicts between S1 units. Not by managing them, but by establishing traffic rules. "If you deploy, notify ops. If you claim a feature, verify with dev first."

S3 command channel gives optimization authority over operations. "Shift 20% of your token budget to the high-priority task." This is top-down resource allocation with teeth.

S3* audit bypass goes directly from the auditor into S1 operations — read-only, independent, different LLM provider. "I checked the last 5 commits. Tests didn't actually pass." (More on why "different provider" matters below.)

S4→S3 strategy bridge injects external intelligence into operational planning. "Competitor just launched feature X. Here's a briefing."

S5 balance channel ensures the system doesn't drift too far toward internal optimization (S3) or external scanning (S4). Too much S3 = navel-gazing. Too much S4 = strategy tourism.

Algedonic channel — the emergency bypass. Any agent can signal existential issues directly to S5 and the human, skipping the entire hierarchy. Named after the Greek words for pain (algos) and pleasure (hedone). This is your system's fire alarm.

These channels aren't nice-to-haves. Each one prevents a specific failure mode:

Without this channel...   You get...
S2 coordination           Agents contradicting each other
S3 command                No resource control, token budgets explode
S3* audit                 Hallucinations go undetected
S4→S3 bridge              System optimizes for yesterday's world
S5 balance                Either navel-gazing or strategy tourism
Algedonic                 Critical issues buried in status reports

That's the difference between "a list of agents with a router" and "a viable system." The agents are the same. The organization makes them work.
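One way to picture purpose-specific channels in code is a small routing table. This is a sketch under stated assumptions, not ViableOS code: the `Channel` enum, the `ROUTES` table, and the role labels (`s1`, `s3*`, `human`) are hypothetical names that mirror the channel list above.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    COORDINATION = "coordination"
    COMMAND = "command"
    AUDIT = "audit"
    STRATEGY_BRIDGE = "strategy_bridge"
    BALANCE = "balance"
    ALGEDONIC = "algedonic"


# Hypothetical table: channel -> (allowed senders, receivers).
# "*" means any agent may send on that channel.
ROUTES = {
    Channel.COORDINATION: ({"s2"}, ("s1",)),
    Channel.COMMAND: ({"s3"}, ("s1",)),
    Channel.AUDIT: ({"s3*"}, ("s1",)),
    Channel.STRATEGY_BRIDGE: ({"s4"}, ("s3",)),
    Channel.BALANCE: ({"s5"}, ("s3", "s4")),
    Channel.ALGEDONIC: ({"*"}, ("s5", "human")),
}


@dataclass(frozen=True)
class Message:
    sender: str
    channel: Channel
    body: str


def receivers(msg: Message) -> tuple[str, ...]:
    """Return who receives the message, or refuse if the sender lacks the channel."""
    allowed, targets = ROUTES[msg.channel]
    if "*" not in allowed and msg.sender not in allowed:
        raise PermissionError(f"{msg.sender} may not send on {msg.channel.value}")
    return targets
```

The point of the table is that the channel, not the sender's wording, decides where a message goes. An S1 worker cannot issue commands, but it can always pull the algedonic alarm.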

Deep Dive: Why Your Agents Forget Their Orders

Let me zoom in on one problem that every multi-agent system has but almost nobody talks about: context window amnesia.

LLMs don't have persistent memory. Everything lives in the context window — a buffer of recent messages that eventually overflows. When S3 (Optimization) sends a directive to an S1 worker — say, "switch to a cheaper model for routine tasks to stay within budget" — that directive enters the context window. For maybe 20-40 turns, the agent remembers. Then newer messages push it out.

The agent doesn't refuse the directive. It doesn't disagree. It simply forgets it existed.

In a human organization, this is the memo that nobody read. The policy that got announced but never enforced. The quarterly goal that was abandoned by February. Stafford Beer saw this problem 50 years ago and his solution had a name: Vollzug.

Vollzug is German for the confirmed execution of a directive. Not "I heard you" — but "I heard you, I did it, and here's proof." Beer was a British cyberneticist, but he borrowed the German term because English doesn't have a single word for this concept. Three steps, each with a hard timeout:

vollzug_protocol:
  enabled: true
  timeout_quittung: 30min    # Must acknowledge within 30 min
  timeout_vollzug: 48h       # Must execute within 48 hours
  on_timeout: escalate       # Auto-escalate if missed

Step 1 — Quittung (Acknowledgment). The receiving agent has 30 minutes to confirm receipt. No confirmation → auto-escalate. This catches the case where a directive is sent but never enters the agent's active context.

Step 2 — Vollzug (Execution). The agent has 48 hours to carry out the directive. The timeout scales with team size — a 2-person org gets 12 hours, a 10-person org gets a full week.

Step 3 — Report. Confirm completion with evidence. Not "done" — but "done, and here's what changed."
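The three steps can be sketched as a small state check. This is a minimal illustration, not the ViableOS runtime (which doesn't exist yet): the `Directive` class and its status strings are hypothetical, and measuring the 48-hour window from the time the directive was issued is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

TIMEOUT_QUITTUNG = timedelta(minutes=30)  # from the config above
TIMEOUT_VOLLZUG = timedelta(hours=48)     # from the config above


@dataclass
class Directive:
    text: str
    issued_at: datetime
    acknowledged_at: Optional[datetime] = None
    evidence: Optional[str] = None  # set once the agent reports completion

    def report(self, evidence: str) -> None:
        """Step 3: a completion report must carry evidence, not just 'done'."""
        if not evidence.strip():
            raise ValueError("a completion report needs evidence")
        self.evidence = evidence

    def status(self, now: datetime) -> str:
        """Classify the directive at time `now`."""
        if self.evidence is not None:
            return "completed"
        if self.acknowledged_at is None:
            if now - self.issued_at > TIMEOUT_QUITTUNG:
                return "escalate: no acknowledgment"
            return "awaiting acknowledgment"
        # Assumption: the execution window starts when the directive is issued.
        if now - self.issued_at > TIMEOUT_VOLLZUG:
            return "escalate: not executed"
        return "awaiting execution"
```

A scheduler would call `status()` periodically. A directive that has fallen out of the agent's context never gets acknowledged or reported, so it surfaces as an escalation instead of disappearing silently.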

If any step times out, the system escalates automatically. But not everything goes through the same path:

escalation_chains:
  operational:
    path: [s2-coordination, s3-optimization, human]
    timeout_per_step: 2h
  quality:
    path: [s3-optimization, human]
    timeout_per_step: 2h
  strategic:
    path: [s4-intelligence, s5-policy, human]
    timeout_per_step: 4h
  algedonic:
    path: [s5-policy, human]
    timeout_per_step: 15min

An operational timeout goes through coordination first. A quality issue goes straight to optimization. A strategic concern routes through intelligence and policy. And an existential threat, raised on the algedonic channel, reaches the human within 15 minutes, no matter what.
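To make the routing concrete, here is a sketch that answers "who holds this issue right now?" for the chains above. It assumes each step gets its full timeout before the issue moves on, and that the last step (the human) keeps it indefinitely. The `current_handler` function is a hypothetical helper, not part of ViableOS.

```python
from datetime import timedelta

# Paths and per-step timeouts copied from the escalation_chains config above.
ESCALATION_CHAINS = {
    "operational": (["s2-coordination", "s3-optimization", "human"], timedelta(hours=2)),
    "quality": (["s3-optimization", "human"], timedelta(hours=2)),
    "strategic": (["s4-intelligence", "s5-policy", "human"], timedelta(hours=4)),
    "algedonic": (["s5-policy", "human"], timedelta(minutes=15)),
}


def current_handler(kind: str, elapsed: timedelta) -> str:
    """Return who holds an unresolved issue of type `kind` after `elapsed`."""
    path, per_step = ESCALATION_CHAINS[kind]
    expired_steps = elapsed // per_step  # whole timeouts that have passed
    return path[min(expired_steps, len(path) - 1)]


if __name__ == "__main__":
    print(current_handler("operational", timedelta(hours=3)))
    print(current_handler("algedonic", timedelta(minutes=16)))
```

Under these assumptions, an unresolved operational issue reaches the human after 4 hours, and an algedonic signal after 15 minutes.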

This is what "from topology to behavior" means. It's not enough to define which agents exist. You need to define how they behave when things go wrong. When context is lost. When directives are ignored. When the whole system is on fire. That's the gap between a diagram and an operating system.

And here's why this matters specifically for LLM-based agents: LLMs are optimized to produce coherent, confident outputs. An agent that hallucinated a completion sounds exactly like one that actually completed the task: both report "task completed." Without Vollzug, S3* audit, and escalation chains, you have no way to tell the difference.

What We've Built

ViableOS takes all of this and turns it into working software. You describe your organization — or let an AI-powered assessment interview figure it out — and it generates the full VSM package: every agent, every channel, every behavioral spec.

What works today:

AI-guided assessment interview — Chat with a VSM expert that asks the right questions and auto-generates a complete config
6-step web wizard with 12 organization templates (SaaS, E-Commerce, Agency, Consulting, Law Firm, Education, and more)
Budget calculator mapping monthly USD to per-agent model allocations across 23 models and 7 providers
Assessment transformer that auto-derives all 9 behavioral spec areas from your assessment data: inputs such as team size, external forces, success criteria, and dependencies are mapped to operational modes, escalation chains, vollzug protocol, autonomy matrix, provider constraints, and the remaining spec areas
Package generator producing SOUL.md, SKILL.md, HEARTBEAT.md per agent, plus coordination rules, permission matrices, and fallback chains
LangGraph export for direct integration
Viability checker with VSM completeness checks and behavioral spec validation
245 passing tests across schema, transformer, generator, and checker

What's auto-derived, not hand-configured:

Small team (1-2 people) → shorter timeouts, more human approval, daily reporting. Large team (10+) → more agent autonomy, longer execution windows, weekly reporting. Regulatory external forces → monthly premise checks, elevated mode triggers. You can override everything, but the defaults are designed to be sensible based on 50 years of organizational theory.
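As a rough illustration of that derivation, here is a sketch that uses only the numbers mentioned in this article: 12 hours for a 2-person org, one week for a 10-person org, and 48 hours from the example config. The `derive_defaults` function is hypothetical and is not the actual transformer. The middle tier in particular is an assumption, since the text only specifies the two ends.

```python
from datetime import timedelta


def derive_defaults(team_size: int) -> dict:
    """Pick vollzug timeout and reporting rhythm from team size (illustrative)."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    if team_size <= 2:
        # Small team: shorter timeouts, daily reporting.
        return {"timeout_vollzug": timedelta(hours=12), "reporting_rhythm": "daily"}
    if team_size >= 10:
        # Large team: longer execution windows, weekly reporting.
        return {"timeout_vollzug": timedelta(weeks=1), "reporting_rhythm": "weekly"}
    # Assumption: mid-size teams get the 48h value from the example config;
    # the weekly rhythm here is a guess, not something the article states.
    return {"timeout_vollzug": timedelta(hours=48), "reporting_rhythm": "weekly"}
```

The override path matters more than the exact thresholds: a generated default like this is a starting point you edit, not a rule you are stuck with.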

What we haven't built yet:

The runtime engine. ViableOS currently generates the configuration for a viable agent organization. It doesn't yet execute it. There's no live enforcement of vollzug timeouts, no real-time escalation routing, no Operations Room. That's v0.3 — and it's where I need help.

First Test: My Own Healthcare Software Company

Theory is worth nothing without practice. So the first real test of ViableOS will be my own company — a small medical care software firm in Germany.

It's a good test case for three reasons:

The domain is regulated. GDPR, healthcare data laws, documentation requirements. This forces the system to take identity and values seriously — "patient privacy above everything" isn't a nice-to-have, it's legally required. S5 (Identity) earns its keep here. And S3* audit with a different LLM provider isn't theoretical elegance — it's practical necessity when agents touch patient-adjacent workflows.

The stakes are real. When agents handle scheduling, documentation, or billing, hallucinations aren't just annoying — they're potentially harmful. The Vollzug Protocol isn't academic neatness. It's "did you actually update that patient record, or did you just tell me you did?"

It's small enough to be honest about. Solo founder, small team. If ViableOS generates reasonable defaults for an organization this size, and if those defaults actually change agent behavior in practice, that's validation. If they don't — that's equally valuable information. I'll document the entire process publicly.

Try It Yourself

But one test case doesn't validate a theory. If you're running multi-agent systems — on CrewAI, LangGraph, AutoGen, or your own framework — I'd genuinely love you to try ViableOS on your setup and tell us:

Does the assessment capture your organization correctly? Chat with the VSM expert or use the wizard. Does the output match your reality?
Are the behavioral specs sensible? Look at the generated SOUL.md files. Do the escalation chains, autonomy levels, and operational modes match your intuition about how your agents should work?
What's missing? Which failure patterns have you seen in production that our nine behavioral specs don't cover?

# Run from the root of a local clone of the GitHub repository linked below
pip install -e ".[dev]"
viableos api
# Then open http://localhost:5173

GitHub: github.com/philipp-lm/ViableOS

Open an issue, start a discussion, or drop a comment here. Every behavioral spec in ViableOS started from theory — now we need practice to validate it. The more diverse the test cases, the better the system gets.

This is part 2 of a series on applying organizational design to AI agent systems. Part 1: Your AI Agents Need an Org Chart. Next: building the runtime that actually enforces these specs — the Operations Room.

Philipp Enderle — Engineer (KIT, TU Munich, UC Berkeley). 9 years strategy consulting at Deloitte and Berylls by AlixPartners, designing org transformations for DAX automotive companies. Now applying the same organizational theory to AI agent teams.

LinkedIn · GitHub · ViableOS

Your AI Agents Need an Org Chart — But Not the Kind You Think

Philipp Enderle — Fri, 20 Feb 2026 13:39:44 +0000

I run a small healthcare SaaS. Solo founder, Python/FastAPI backend, React frontend, German-hosted VPS. The usual indie stack.

A few weeks ago I decided to go all-in on AI agents. Not one — a whole team. A coding agent for bug fixes and feature work. An ops agent watching my server. A go-to-market agent drafting landing page copy and doing competitor research.

Each one worked beautifully in isolation. Then I let them loose together.

Within a week, my go-to-market agent had published a feature comparison table on the website that included two features we hadn't built yet. My coding agent, meanwhile, was happily refactoring auth middleware — important work, but nobody told the ops agent, which then flagged the deployment as a breaking change and sent me a 2am alert on WhatsApp. Three agents, all competent, all actively making my life worse.

I spent that weekend not debugging code, but thinking about something I hadn't touched since my consulting days: organizational design.

The realization that hit me

Here's the thing — I spent 9 years at Deloitte and Berylls (now AlixPartners) doing exactly this for DAX automotive companies. Designing org structures. Building target operating models. Figuring out how 800 people should coordinate without stepping on each other.

And the problems I was seeing with my three AI agents? I'd seen them before. In billion-dollar organizations. The exact same pathologies:

Marketing promises something Engineering can't deliver → coordination failure
One team absorbs all the budget while others starve → resource dominance
Everyone's optimizing locally but nobody's looking at the big picture → missing oversight
Reports say everything's great, but the actual output is garbage → no independent verification
The company is heads-down executing, not noticing the market has shifted → strategic blindness

These aren't random problems. They're structural. And there's a theory from the 1970s that explains exactly why they happen and how to fix them.

Stafford Beer's cheat code

In 1972, a British cyberneticist named Stafford Beer published Brain of the Firm. The core idea: every system that survives — a cell, an organism, a company, an economy — has exactly five control functions. Miss any one of them and the system develops pathologies. Not might. Will.

He called it the Viable System Model (VSM). I'd used it in consulting to diagnose organizational dysfunction. But sitting there at 2am, staring at my WhatsApp alert from an overzealous ops agent, I realized: this applies directly to multi-agent AI systems.

Let me show you what I mean. Here are the six functions (five systems, one with a crucial sub-function), translated to agents:

System 1 — Operations. The agents doing actual work. Your coders, your researchers, your support bots. Every framework gets this right. It's the easy part.

System 2 — Coordination. Rules that prevent agents from conflicting. Not a "coordinator agent" — think of it more like traffic lights. When my coding agent deploys, System 2 automatically notifies the ops agent and the go-to-market agent. No manager bottleneck, no n×(n-1)/2 direct channels. Just rules.

This is what was missing when my go-to-market agent published features we hadn't built. There was no rule saying "check with the dev agent before publishing feature claims."

System 3 — Optimization. Something that watches the whole system. Are tokens being spent wisely? Is one agent idle while another is overwhelmed? This is your operations manager, except for agents it's concrete: monitor API costs, detect redundant work, reallocate compute.

System 3* — Audit. This is the one that matters most for AI. System 3 gets its information from the agents' own reports. But agents hallucinate. They confabulate. They'll tell you they ran the tests when they didn't. System 3* bypasses the reporting chain and checks directly: read the last 5 commits — did the tests actually pass? Check the website — does it match what the go-to-market agent claims? Verify the database — are backups actually fresh?

In a company, this is internal audit. For AI agents, it's your hallucination firewall. You don't trust the agent's summary — you verify the artifact.
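"Verify the artifact" can be sketched as a comparison between what an agent reported and what an independent source shows. The dictionary keys and the `audit` function below are hypothetical. In a real setup the artifact data would come from your CI system and the live website, not from the agent.

```python
def audit(report: dict, artifacts: dict) -> list[str]:
    """Compare an agent's self-report against independently collected artifacts."""
    findings = []

    # Did the tests pass according to the CI record, not the agent's summary?
    if report.get("tests_passed") and not artifacts.get("tests_passed"):
        findings.append("agent reported passing tests, CI record disagrees")

    # Does the website claim features that were never shipped?
    claimed = set(report.get("features_live", []))
    shipped = set(artifacts.get("features_shipped", []))
    for feature in sorted(claimed - shipped):
        findings.append(f"website claims unshipped feature: {feature}")

    return findings


if __name__ == "__main__":
    agent_report = {"tests_passed": True, "features_live": ["billing", "e-signature"]}
    ground_truth = {"tests_passed": False, "features_shipped": ["billing"]}
    for finding in audit(agent_report, ground_truth):
        print(finding)
```

An empty list means the report and the artifacts agree. Any finding is a reason to escalate, because the agent's own summary can no longer be taken at face value.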

System 4 — Intelligence. The function that looks outside and forward. Is there a new LLM that could cut your costs? Did your competitor just launch the feature you're building? Is a library you depend on being deprecated?

I don't know of a single multi-agent framework that has this concept. Every agent is stuck in the present.

System 5 — Identity. The shared purpose. When your coding agent faces a trade-off — ship fast or write tests? — what should it choose? The answer depends on who you are. My system's identity is "privacy and reliability above everything" because I'm in healthcare. A fintech startup might say "move fast." Without this, agents optimize for different things and you get incoherent behavior.

Here's how the complete model looks when you put all six functions together:

[Diagram: the complete Viable System Model with all six functions]

The key insight from Beer: these aren't optional features you add later. They're structural requirements for any system that needs to remain viable. Remove any one and the system develops predictable pathologies — whether it's a Fortune 500 company or three AI agents on a VPS.

Diagnosing my own system

I took my three-agent setup and scored it honestly:

                    Have it?
Operations (S1)     ✅  Yes — three specialized agents
Coordination (S2)   ❌  Nope — agents had no rules about each other
Optimization (S3)   ❌  Nope — nobody watching the whole
Audit (S3*)         ❌  Nope — I trusted agent reports at face value
Intelligence (S4)   ❌  Nope — nobody looking outside
Identity (S5)       ❌  Nope — no shared purpose document

One out of six. No wonder it was a mess.

So I redesigned the whole thing from scratch. Here's what the configuration looks like:

viable_system:
  name: "German Healthcare SaaS"

  identity:
    purpose: "Help therapists focus on patients, not paperwork"
    values:
      - "Patient privacy above everything"
      - "Simplicity over feature bloat"
      - "Reliability over speed"
    decisions_requiring_human:
      - "Pricing changes"
      - "Anything touching patient data"
      - "Publishing to production"

  system_1:
    - name: "Product Development"
      purpose: "Build and stabilize the software"
      autonomy: "Can fix bugs independently. Features need approval."
      tools: [github, testing, code-review]

    - name: "Operations"
      purpose: "Keep the platform running"
      autonomy: "Can monitor and alert independently. Deployments need approval."
      tools: [ssh, docker, log-analysis]

    - name: "Go-to-Market"
      purpose: "Get first customers"
      autonomy: "Can draft content independently. Publishing needs approval."
      tools: [website-editing, seo-analysis, copywriting]

  system_2:
    coordination_rules:
      - trigger: "Go-to-Market mentions a feature on the website"
        action: "Validate with Product Dev that the feature exists"

      - trigger: "Product Dev deploys a new feature"
        action: "Notify Go-to-Market for website update, Ops for monitoring"

      - trigger: "Ops detects a performance issue after deployment"
        action: "Notify Product Dev with logs and priority tag"

  system_3:
    reporting_rhythm: "weekly"
    resource_allocation: "Dev: 60%, Ops: 20%, Go-to-Market: 20%"

  system_3_star:
    schedule: "bi-weekly"
    checks:
      - name: "Code Quality"
        target: "Product Development"
        method: "Read last 5 commits. Did tests actually pass? Check coverage."
      - name: "Content Accuracy"
        target: "Go-to-Market"
        method: "Compare website claims against actual shipped features."
      - name: "GDPR Compliance"
        target: "All"
        method: "Check if patient data leaked into agent prompts or logs."
    on_failure: "Escalate to human immediately"

  system_4:
    monitoring:
      competitors: ["TherapieApp", "PraxisPro", "SimplePractice"]
      technology: ["New LLM releases", "Python/React breaking changes"]
      regulation: ["GDPR updates", "Healthcare data laws"]

The coordination rules in system_2 are the ones that would have prevented my go-to-market agent from publishing phantom features. The decisions_requiring_human in identity is what keeps me in the loop for anything that matters.
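Here is a minimal sketch of how those two pieces could be evaluated at runtime. The event names (`gtm.feature_mentioned` and so on), the decision identifiers, and both functions are hypothetical stand-ins for the natural-language triggers in the config. Nothing here is an existing ViableOS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    trigger: str              # hypothetical event name
    notify: tuple[str, ...]   # System 1 units that must be informed


# The three system_2 rules from the config, keyed by made-up event names.
RULES = [
    Rule("gtm.feature_mentioned", ("Product Development",)),
    Rule("dev.feature_deployed", ("Go-to-Market", "Operations")),
    Rule("ops.performance_issue", ("Product Development",)),
]

# Mirrors identity.decisions_requiring_human, again with made-up identifiers.
DECISIONS_REQUIRING_HUMAN = {"pricing_change", "patient_data", "publish_to_production"}


def units_to_notify(event: str) -> list[str]:
    """Return every unit that a coordination rule says must hear about `event`."""
    return [unit for rule in RULES if rule.trigger == event for unit in rule.notify]


def needs_human(decision: str) -> bool:
    """True if the identity section reserves this decision for the human."""
    return decision in DECISIONS_REQUIRING_HUMAN
```

With rules like these in place, the go-to-market agent mentioning a feature triggers a check with Product Development, and publishing to production stops at `needs_human` until I approve it.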

Why this matters beyond my little SaaS

Deloitte's 2025 AI report says 40% of multi-agent projects will be scaled back or abandoned by 2027. Not because the agents are dumb. Because the organization is missing.

Look at every major framework right now:

                    S1    S2    S3    S3*   S4    S5
                   Ops  Coord Optim Audit Intel Ident
CrewAI              ✅    ❌    ⚠️    ❌    ❌    ❌
LangGraph           ✅    ❌    ⚠️    ❌    ❌    ❌
OpenAI Agents SDK   ✅    ❌    ❌    ❌    ❌    ❌
AutoGen             ⚠️    ⚠️    ❌    ❌    ❌    ❌

These frameworks are great at what they do — giving you powerful building blocks for multi-agent systems. But they all stop at the operational layer. Nobody has an audit function. Nobody independently verifies that agents did what they claim. In a world where LLMs hallucinate by design, that's a gap.

It's like building a company with employees and a CEO, but no internal audit, no strategy department, no quality control, and no mission statement. It works at 3 people. It breaks at 10.

Look at the diagram again — the pink System 3 command channel running down the center, the orange System 2 coordination bars on the right, the S3* audit triangle probing directly into operations. These aren't theoretical niceties. They're the difference between agents that collaborate and agents that collide.

What I'm building

I'm turning this into a tool called ViableOS. You describe your business, it generates the control functions — including the audit layer — and deploys them to your agent framework of choice. Think of it as Terraform, but for agent organizations instead of cloud infrastructure.

It's early. The config format above is how I've designed my own setup. The tooling to actually deploy it is what I'm building now. If you're running multi-agent systems and hitting coordination problems, I'd genuinely love to hear about it — it helps me figure out whether this is a problem only I have or something broader.

Open an issue on GitHub if you've got a war story, or subscribe to the newsletter if you want to follow along as I build this in public.

Next in this series: "The Dominance Problem," on what happens when one agent eats all your tokens, and how a System 3 resource monitor fixes it. Subscribe to get notified.
