Sahajmeet Kaur

Posted on Jun 16 • Edited on Jun 24

Why Multi-Agent Orchestration Is Harder Than It Looks

#ai #agents #mlops #llm

One AI agent answering a question is useful. Five agents that divide a complex task, pass state to each other, and act on live enterprise systems form a meaningfully different category of system, one that carries a meaningfully different category of operational problems.

Multi-agent orchestration is the architectural pattern that makes the second case coherent. But a lot of teams prototype multi-agent systems in a weekend and then spend months figuring out why production is unpredictable, expensive, and impossible to audit.

Here's how it actually works, what the frameworks solve, and what they leave on the floor.

What multi-agent orchestration is

A single AI agent handles a task from start to finish, sequentially. Multi-agent orchestration distributes that work: each agent owns a defined role, capability, or subtask. An orchestration layer above them decides who runs when, what context each agent receives, what they're allowed to access, and how the system behaves when something fails.

The shift matters because complex tasks have natural decomposition boundaries. A research task has a retrieval step and a synthesis step. A code review has a static analysis step, a logic review, and a security scan. Running those through a single general-purpose agent loses the benefit of specialisation. Orchestrating specialist agents over those steps, with clean handoffs and persistent state, produces a system that handles genuinely complex goals that would defeat a single-agent approach.

It also produces a system that fails in more interesting ways.

How the runtime works

Task decomposition

An orchestrator agent receives the high-level objective and breaks it into discrete subtasks suited to specialist agents. It manages sequencing, dependencies, and what context each downstream agent needs. When results come back, the orchestrator scores each step and decides whether it passed, needs a retry, or should trigger an alternative path before moving on.
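The pass / retry / alternative-path decision described above can be sketched in plain Python. This is a minimal illustration, not any framework's API: `Subtask`, `score`, and `orchestrate` are hypothetical names, and the scorer here is a placeholder for whatever validator or judge model a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Subtask:
    name: str
    run: Callable[[dict], str]                        # primary specialist agent
    fallback: Optional[Callable[[dict], str]] = None  # alternative path
    max_retries: int = 1


def score(output: str) -> float:
    """Placeholder scorer (assumption): non-empty output passes."""
    return 1.0 if output.strip() else 0.0


def orchestrate(subtasks: list[Subtask], threshold: float = 0.5) -> dict:
    """Run subtasks in order, passing accumulated results as context."""
    context: dict = {}
    for task in subtasks:
        result = None
        # First attempt plus up to max_retries retries of the primary agent.
        for _ in range(task.max_retries + 1):
            output = task.run(context)
            if score(output) >= threshold:
                result = output
                break
        # Primary agent never passed: try the alternative path once.
        if result is None and task.fallback is not None:
            output = task.fallback(context)
            if score(output) >= threshold:
                result = output
        if result is None:
            raise RuntimeError(f"subtask {task.name!r} failed after retries and fallback")
        context[task.name] = result
    return context


plan = [
    Subtask("retrieve", run=lambda ctx: "three sources found"),
    Subtask(
        "synthesise",
        run=lambda ctx: "",  # simulates an agent that keeps returning nothing
        fallback=lambda ctx: f"summary of: {ctx['retrieve']}",
    ),
]
result = orchestrate(plan)
```

The point of the sketch is that the retry count and the fallback are declared per subtask, so the orchestrator's behaviour on failure is a property of the plan rather than something improvised at runtime.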

Scoped execution

Sub-agents run inside a defined scope. Access is limited to the tools, data sources, and model capabilities their specific role requires. That scoping is one of the pattern's main security properties — when it's actually enforced, which frameworks alone don't guarantee. More on that below.
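As a rough illustration of what enforced scoping means, here is a sketch of a per-agent tool allow-list. `ScopedToolbox`, `ToolNotAllowed`, and the tool names are all hypothetical; the idea is only that the check happens in one wrapper the agent cannot bypass, rather than in each agent's prompt or code.

```python
from typing import Callable


class ToolNotAllowed(PermissionError):
    """Raised when an agent calls a tool outside its scope."""


class ScopedToolbox:
    """Wraps a shared tool registry so an agent only reaches its allow-list."""

    def __init__(self, registry: dict[str, Callable[..., object]], allowed: set[str]):
        self._registry = registry
        self._allowed = frozenset(allowed)

    def call(self, tool_name: str, *args, **kwargs):
        if tool_name not in self._allowed:
            raise ToolNotAllowed(f"{tool_name!r} is outside this agent's scope")
        return self._registry[tool_name](*args, **kwargs)


# Hypothetical tools shared by the whole workflow.
registry = {
    "search_docs": lambda query: f"results for {query}",
    "delete_record": lambda record_id: f"deleted {record_id}",
}

# The retrieval agent's role only needs read access.
retriever_tools = ScopedToolbox(registry, allowed={"search_docs"})
search_result = retriever_tools.call("search_docs", "q3 revenue")
```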

State persistence

Outputs from earlier steps feed later decisions. This is what distinguishes a multi-agent workflow from chained API calls — shared state accumulates across the full execution rather than starting fresh at each step.

This is also where things get fragile. A wrong fact written to shared state at step two can corrupt every downstream step. Debugging a multi-agent failure is significantly harder than debugging a single-agent one because cause and symptom are often several steps apart, and the final output can look plausible even when it isn't.
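One way to shorten the distance between cause and symptom is to record who wrote each value and at which step. The sketch below is an assumption about how that could look, not a feature of any particular framework; `SharedState` and `StateEntry` are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StateEntry:
    key: str
    value: object
    written_by: str  # name of the agent that wrote the value
    step: int


@dataclass
class SharedState:
    """Append-only state: reads return the latest value, history keeps every write."""

    entries: list = field(default_factory=list)

    def write(self, key: str, value: object, agent: str, step: int) -> None:
        self.entries.append(StateEntry(key, value, agent, step))

    def read(self, key: str) -> object:
        for entry in reversed(self.entries):
            if entry.key == key:
                return entry.value
        raise KeyError(key)

    def history(self, key: str) -> list:
        return [e for e in self.entries if e.key == key]


state = SharedState()
state.write("revenue_musd", 4.2, agent="retriever", step=2)
state.write("revenue_musd", 4.5, agent="verifier", step=4)

latest = state.read("revenue_musd")
first_writer = state.history("revenue_musd")[0].written_by
```

When a final output looks wrong, `history` answers "which agent introduced this value, and when" without replaying the whole workflow.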

Error handling - the part most prototypes skip

What happens when a sub-agent times out? When it returns malformed output? When the orchestrator receives a result that contradicts an earlier assumption?

Without explicit retry policies and escalation paths defined at the orchestration layer, you get two failure modes: the entire workflow stalls on a single point of failure, or the orchestrator pushes ahead with incomplete information and produces output that looks finished but is wrong in ways nobody catches until the damage shows up downstream.

Most tutorials spend two paragraphs on error handling. Production systems spend two months on it.
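A minimal sketch of an explicit retry policy with an escalation path, covering the timeout and malformed-output cases raised above. It assumes agents return JSON strings with an `answer` field and signal timeouts by raising `TimeoutError`; `call_with_policy` and `EscalationRequired` are hypothetical names.

```python
import json
import time


class EscalationRequired(Exception):
    """Retries are exhausted; a human or a fallback path has to decide."""


def call_with_policy(agent, payload, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Call an agent, retrying on timeouts and malformed output with backoff."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            raw = agent(payload)      # may raise TimeoutError
            parsed = json.loads(raw)  # malformed JSON raises a ValueError subclass
            if not isinstance(parsed, dict) or "answer" not in parsed:
                raise ValueError("missing 'answer' field")
            return parsed
        except (TimeoutError, ValueError) as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise EscalationRequired(
        f"gave up after {max_attempts} attempts: {last_error}"
    ) from last_error


# Simulated sub-agent: times out, then returns garbage, then succeeds.
responses = iter([TimeoutError("slow"), "not json", '{"answer": 42}'])


def flaky_agent(payload):
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


result = call_with_policy(flaky_agent, {"question": "example"})
```

The important part is the last line of `call_with_policy`: when retries run out, the workflow neither stalls silently nor pushes ahead with a bad result. It raises something the orchestration layer has to handle.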

The current framework landscape

Four frameworks lead the space in 2026. They solve the coordination problem in meaningfully different ways.

LangGraph

Workflows in LangGraph take the shape of directed graphs. Nodes represent agent steps; edges represent transitions, including conditional ones. Every path through the workflow is under explicit developer control. Time-travel debugging across agent steps is built in, which matters considerably more in production than it sounds during a demo — being able to replay from an intermediate state when something goes wrong changes the debugging experience substantially.

The explicitness is the main advantage and the main cost. Complex workflows require upfront graph design. The tradeoff is usually worth it for teams with hard auditability requirements or workflows with significant conditional branching.
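To make the graph idea concrete without leaning on LangGraph's actual API, here is a plain-Python sketch of nodes, conditional edges, and replay from a checkpoint. Every name in it (`run_graph`, the node names, the state keys) is invented for illustration, and the real library's interfaces differ.

```python
from typing import Callable, Optional

State = dict
Node = Callable[[State], State]
Router = Callable[[State], Optional[str]]  # next node name, or None to stop


def run_graph(nodes: dict, routers: dict, start: str, state: State):
    """Walk the graph from `start`, snapshotting state after every node."""
    checkpoints = []
    current = start
    while current is not None:
        state = nodes[current](dict(state))       # nodes receive a shallow copy
        checkpoints.append((current, dict(state)))
        current = routers[current](state)         # conditional edge
    return state, checkpoints


nodes = {
    "static_analysis": lambda s: {**s, "warnings": s["diff"].count("eval(")},
    "security_scan": lambda s: {**s, "verdict": "blocked"},
    "logic_review": lambda s: {**s, "verdict": "approved"},
}
routers = {
    "static_analysis": lambda s: "security_scan" if s["warnings"] else "logic_review",
    "security_scan": lambda s: None,
    "logic_review": lambda s: None,
}

final, trail = run_graph(nodes, routers, "static_analysis", {"diff": "x = eval(user_input)"})

# Replay: take the snapshot after static_analysis, correct it, resume mid-graph.
patched = {**trail[0][1], "warnings": 0}
replayed, _ = run_graph(nodes, routers, "logic_review", patched)
```

The upfront cost is visible even in this toy: every transition has to be written down. The payoff is that every path is enumerable and any intermediate snapshot can be re-run.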

Microsoft Agent Framework

In October 2025, Microsoft converged AutoGen and Semantic Kernel into a single SDK. AutoGen had pioneered multi-agent conversation patterns - group chat, debate loops, reflection. Semantic Kernel contributed enterprise plumbing: telemetry, Azure integration, plugin architecture. Both predecessor frameworks entered maintenance mode as part of the consolidation.

One naming note worth making explicitly: AG2 is frequently confused with Microsoft Agent Framework. They're different things. AG2 is a community fork of the original AutoGen 0.2, maintained outside Microsoft by some of the framework's original contributors. If you're reading older documentation, pay attention to which one it's actually describing.

CrewAI

CrewAI uses a crew-of-roles abstraction that maps onto how teams already think about task division. Agents have named roles, goals, and tool sets. That mental model accelerates initial development — the onboarding story is fast and the initial prototype comes together quickly.

The limitation shows at scale. Fine-grained state management and complex branching require workarounds, and teams running sustained production workflows often find those workarounds limiting enough to migrate away. CrewAI fits best with domain-specific workflows where the role decomposition is stable and well understood.

Google Agent Development Kit (ADK)

Google ADK organises agents into hierarchical trees: a central orchestrator delegates to sub-agents, which may themselves have sub-agents. Native support for the A2A (Agent-to-Agent) protocol enables cross-framework communication. That matters if you're building on agents from multiple providers that may use different frameworks underneath — increasingly common as the ecosystem fragments into specialised agents from many sources.

When to use each

Framework	Strongest fit
LangGraph	Branching workflows, auditability requirements, teams comfortable with graph design
Microsoft Agent Framework	Azure-native teams, enterprise integration requirements
CrewAI	Domain-specific crew-style workflows, fast initial development
Google ADK	Hierarchical delegation, cross-framework agent interop, A2A protocol needs

None of these is a complete answer on its own. They solve coordination. They don't solve governance.

The governance gap and where we hit it

This is the part that surprises teams moving from demo to production. We hit all four of these walls, in roughly this order.

Access control. No major framework enforces which agents or users can access which tools or models at the framework layer. That policy lives in application code — which scales inconsistently across teams and drifts over time. The specific thing that caught us: a sub-agent spawned by our orchestrator was inheriting the parent's full API access. When that sub-agent went rogue in a retry loop, it had the same blast radius as the service account that kicked off the original workflow. The framework had no concept of delegated permissions - it was all or nothing.

Cost visibility. A five-agent workflow with three model calls per agent per step generates 15 inference calls per step, and more per request once retries or additional steps are involved. LangGraph's execution logs told us what ran. They didn't tell us what it cost, per agent, per team, per day. When our weekly LLM bill jumped 40% in one sprint, we spent two days narrowing down which workflow was responsible by cross-referencing timestamps between the framework logs and the OpenAI billing dashboard. Manual, slow, and the kind of thing that shouldn't require detective work.

Compliance audit trails. Our security team asked a question we should have been able to answer: "Which agents accessed our internal data API in the last 30 days, and on whose authority?" Framework logs give you execution traces — what steps ran, in what order. They don't give you a structured audit trail that maps each tool invocation to an authenticated user identity in a format a compliance team can actually use. We had the data, but it was in free-form JSON logs that required custom parsing to produce anything readable.

Model portability. We had agents built on LangGraph and a newer team experimenting with CrewAI. Getting consistent governance across both (same access policies, same cost limits, same audit trail format) meant duplicating config in two places. Every time we added a new model to the approved list, we updated it in two systems and inevitably missed one.
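The cost visibility gap above comes down to attribution: each inference call needs to carry the workflow and agent that made it. A minimal sketch, assuming hypothetical prices expressed in micro-USD per token (substitute your provider's real rates) and invented names such as `CallRecord` and `cost_by`:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in micro-USD per token, kept as integers to avoid float drift.
INPUT_PRICE = 1
OUTPUT_PRICE = 2


@dataclass(frozen=True)
class CallRecord:
    workflow: str
    agent: str
    input_tokens: int
    output_tokens: int


def cost_micro_usd(rec: CallRecord) -> int:
    return rec.input_tokens * INPUT_PRICE + rec.output_tokens * OUTPUT_PRICE


def cost_by(records, key: str) -> dict:
    """Total cost grouped by a CallRecord attribute such as 'agent' or 'workflow'."""
    totals = defaultdict(int)
    for rec in records:
        totals[getattr(rec, key)] += cost_micro_usd(rec)
    return dict(totals)


# One step of a five-agent workflow with three model calls per agent: 15 calls.
records = [
    CallRecord("invoice_triage", f"agent_{a}", input_tokens=1000, output_tokens=500)
    for a in range(5)
    for _ in range(3)
]

per_agent = cost_by(records, "agent")
per_workflow = cost_by(records, "workflow")
```

With records shaped like this, "which workflow caused the bill to jump" becomes a group-by instead of a timestamp-matching exercise.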

What actually solved it for us

The pattern that worked: a governance layer that sits above the orchestration framework rather than inside it.

The framework handles coordination — which agents run, in what order, what state gets passed. The governance layer handles everything the framework deliberately doesn't: who can invoke which agents, what tools each agent can access, what the cost ceiling is per workflow, and what the audit trail looks like.

We ended up on TrueFoundry's Agent Gateway for this. The specific things that fixed our four walls:

Per-agent identity with OAuth 2.0 injection. Every agent action is tied to the authenticated user who originated the workflow. When a sub-agent is spawned, it inherits only the delegated permissions its parent was authorised to pass - not the full service account access. The over-privileged sub-agent problem is closed by design, at the infrastructure layer, without any change to the LangGraph or CrewAI code.

Workflow-level budget enforcement. Token budgets apply per workflow, not just per team. When a workflow hits its limit, it gets a rate-limit error before more spend accumulates. We also set a max-steps circuit breaker - if an agent workflow exceeds a configurable step count, the gateway stops it. The runaway retry loop that caught us in staging would have been caught at step 12 rather than running until the budget alert fired.

Structured audit trail per tool invocation. Every tool call logs: which agent, which step, which user identity, which tool, what the parameters were, what the result was, what it cost. This is the format the compliance team actually asked for — not execution traces, but a per-action record tied to user identity. The 30-days question that took us two days to answer manually now takes a query.

Framework-agnostic coverage. The same governance policies apply to our LangGraph agents and our CrewAI agents. One place to update the approved model list, one place to see cost across both. The gateway doesn't care what framework built the agent. It applies policies at the request level, on the calls agents make to models and tools, independent of the framework.
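Two of the ideas above, delegated permissions and a per-invocation audit record tied to user identity, can be sketched together. This is an illustration of the pattern only. It is not TrueFoundry's API and not an OAuth implementation; `AgentIdentity`, `Gateway`, the scope strings, and the fixed timestamp are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AgentIdentity:
    name: str
    on_behalf_of: str   # authenticated user who started the workflow
    scopes: frozenset

    def delegate(self, child_name, requested):
        """Child holds only the requested scopes the parent also holds."""
        return AgentIdentity(child_name, self.on_behalf_of, self.scopes & frozenset(requested))


@dataclass(frozen=True)
class AuditRecord:
    timestamp: datetime
    agent: str
    step: int
    user: str
    tool: str
    params: dict
    result: str
    cost_usd: float


class Gateway:
    """Hypothetical choke point: every tool call passes through invoke()."""

    def __init__(self, tools):
        self.tools = tools      # tool name -> (required scope, callable)
        self.audit_log = []     # this sketch records successful calls only

    def invoke(self, identity, step, tool, params, now, cost_usd=0.0):
        scope, fn = self.tools[tool]
        if scope not in identity.scopes:
            raise PermissionError(f"{identity.name} lacks scope {scope!r}")
        result = fn(**params)
        self.audit_log.append(AuditRecord(
            now, identity.name, step, identity.on_behalf_of, tool, params, str(result), cost_usd))
        return result

    def who_accessed(self, tool, since):
        """Distinct (agent, user) pairs that called `tool` at or after `since`."""
        return sorted({(r.agent, r.user) for r in self.audit_log
                       if r.tool == tool and r.timestamp >= since})


now = datetime(2026, 6, 16, tzinfo=timezone.utc)  # arbitrary fixed time for the example
gateway = Gateway({
    "internal_data_api": ("data:read", lambda table: f"rows from {table}"),
    "delete_rows": ("data:write", lambda table: f"deleted {table}"),
})
orchestrator = AgentIdentity("orchestrator", "alice@example.com",
                             frozenset({"data:read", "data:write"}))
retriever = orchestrator.delegate("retriever", {"data:read"})

gateway.invoke(retriever, step=1, tool="internal_data_api",
               params={"table": "invoices"}, now=now - timedelta(days=5))
recent = gateway.who_accessed("internal_data_api", since=now - timedelta(days=30))
```

Because the check and the log both live in the gateway, neither depends on which framework produced the call. A production version would also log denied attempts, which this sketch omits.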

The tradeoff worth naming: TrueFoundry is Kubernetes-native, so there's real setup overhead if you're not already on K8s. And if your agents are simple - one team, one framework, no compliance requirements — it's more infrastructure than you need. The inflection point for us was four agents with overlapping tool access and a compliance requirement that needed a real audit trail. Before that, the application-layer workarounds were manageable.

The failure mode nobody demos

Multi-agent systems fail in compound ways that single-agent systems don't. A single-agent failure is usually obvious: bad output, developer sees it, developer fixes it. A multi-agent failure often involves a plausible-looking error introduced at step two that propagates through steps three, four, and five, producing a finished-looking output that's wrong in ways that surface only downstream.

That failure mode raises the bar for production reliability. Explicit error handling, defined retry policies, and circuit breakers matter significantly more than most framework tutorials suggest. Without them, a runaway workflow can run to completion on bad state and leave an inflated inference bill behind.
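A circuit breaker of this kind can be very small. The sketch below combines a token budget with a step limit. `WorkflowGuard` and both exception names are hypothetical, and the limit of 12 steps is just an example value.

```python
class BudgetExceeded(Exception):
    """The workflow would spend more tokens than it has been allotted."""


class StepLimitExceeded(Exception):
    """The workflow has taken more steps than its configured maximum."""


class WorkflowGuard:
    """Checked before every agent step; refuses the step instead of alerting after it."""

    def __init__(self, token_budget: int, max_steps: int):
        self.token_budget = token_budget
        self.max_steps = max_steps
        self.tokens_used = 0
        self.steps = 0

    def before_step(self, estimated_tokens: int) -> None:
        if self.steps >= self.max_steps:
            raise StepLimitExceeded(f"stopped after {self.steps} steps")
        if self.tokens_used + estimated_tokens > self.token_budget:
            raise BudgetExceeded(
                f"{self.tokens_used} used, {estimated_tokens} requested, "
                f"budget {self.token_budget}"
            )
        self.steps += 1
        self.tokens_used += estimated_tokens


# Simulated runaway retry loop: cheap steps that would never trip a budget alert soon.
guard = WorkflowGuard(token_budget=1_000_000, max_steps=12)
stopped_by = None
for _ in range(100):
    try:
        guard.before_step(estimated_tokens=500)
    except (BudgetExceeded, StepLimitExceeded) as exc:
        stopped_by = type(exc).__name__
        break
```

The two limits catch different failures: the budget stops a few very expensive calls, while the step count stops many cheap ones that would otherwise loop for a long time before spend becomes noticeable.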

The consistent lesson: the coordination problem is largely solved by the frameworks. The governance, cost, and auditability problems are still early and underestimated, and they're the ones that determine whether multi-agent systems stay in staging or make it to production.

The honest tradeoff summary

Multi-agent orchestration genuinely extends what AI systems can do. The ability to compose specialist agents into workflows that handle complex, multi-step goals is real leverage. The frameworks for doing it have matured quickly.

What hasn't kept up is the tooling for running those systems reliably at scale with enforced access boundaries, controlled costs, and audit trails that a compliance team can use. Teams that treat governance as a core concern from the start spend considerably less time retrofitting it later.

If you're building multi-agent systems in production, what does your governance setup look like, and what problems have you actually hit, particularly around the access control and audit trail gaps? The framework documentation is optimistic on this. Drop it in the comments.

DEV Community