WonderLab

Posted on Jul 2

Workflow Series (04): Multi-Agent Coordination — Orchestrator Boundaries, Concurrency Control, and Context Isolation

#ai #workflow #multiagent #productivity

Orchestrator Responsibility Boundaries

Unclear division between the Orchestrator (main Agent) and Subagents is the most common design problem in multi-agent workflows.

The Orchestrator does three things:

1. Decide: read state, determine the next step
2. Dispatch: spawn subagents, pass task prompts
3. Collect: read subagent output files, update state

It doesn't execute business logic (analyze bugs, write code, query logs), read raw files, or modify business data. Those belong to subagents.

The main Agent receives only structured conclusions (JSON). Subagents report through output.json, not message streams.

import json
from pathlib import Path

# ✅ Correct: main Agent reads structured conclusion
result = json.loads(Path("phase3/analysis_final.json").read_text())
if result["confidence"] >= 0.95:  # confidence is a 0-1 float in the output schema
    proceed_to_phase4()

# ❌ Wrong: main Agent reads raw logs
log_content = Path("crash.log").read_text()  # 100k lines into the main Agent's context
decision = llm.analyze(log_content)           # this is subagent work

This boundary produces two benefits: the main Agent's context stays manageable (state and conclusions, no raw data); subagents' business logic can be tested independently, without the main Agent's session history.
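The decide / dispatch / collect split can be sketched as a single orchestration step. This is a minimal sketch, not the workflow's actual implementation: the `state.json` layout (`next_phase`, `prompts`, `results`), the `spawn_subagent` callable, and the `human_gate` phase name are all assumptions made for illustration, and dispatch is treated as blocking to keep the example short.

```python
import json
from pathlib import Path
from typing import Callable

def orchestrate_step(state_file: Path, spawn_subagent: Callable[[str, Path], None]) -> dict:
    """Run one decide -> dispatch -> collect cycle and persist the new state."""
    # Decide: read state, determine the next step.
    state = json.loads(state_file.read_text())
    phase = state["next_phase"]
    output_file = state_file.parent / f"{phase}.json"

    # Dispatch: hand the task prompt to a subagent (hypothetical callable, blocking here).
    spawn_subagent(state["prompts"][phase], output_file)

    # Collect: read only the structured conclusion, then update state.
    result = json.loads(output_file.read_text())
    state["results"][phase] = {"passed": result["passed"]}
    state["next_phase"] = "done" if result["passed"] else "human_gate"
    state_file.write_text(json.dumps(state, indent=2))
    return state
```

Note that the function never opens a log or source file: everything it knows comes from the state file and the subagent's JSON output.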

Subagent Design Principles

Principle 1: Input Completeness

The task prompt must contain everything the subagent needs to complete its task.

# ❌ Incomplete task prompt
Analyze the root cause of this bug. Refer to previous analysis results.

# ✅ Complete task prompt
## Task
Analyze the root cause of the following bug.

## Input
Bug info:
{{ bug_info.summary }}
{{ bug_info.stack_trace }}

Log directory: {{ log_dir }}

## Output requirements
Write to analysis_final.json, format:
{"passed": bool, "confidence": float, "root_cause": str, "evidence": [str], "error": str | null}

"Refer to previous analysis results" requires the subagent to access the main Agent's context history — which doesn't exist in an isolated session. Each subagent knows only what's in its task prompt.

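One way to enforce input completeness is to refuse to dispatch when a prompt field is missing or empty. The sketch below assumes the prompt is rendered with Python's standard `string.Template`; the `$`-placeholders stand in for the `{{ ... }}` fields in the complete prompt above, and `TASK_TEMPLATE` and `render_task_prompt` are hypothetical names.

```python
from string import Template

# Hypothetical template; $-placeholders stand in for the {{ ... }} fields.
TASK_TEMPLATE = Template(
    "## Task\n"
    "Analyze the root cause of the following bug.\n\n"
    "## Input\n"
    "Bug info:\n$summary\n$stack_trace\n\n"
    "Log directory: $log_dir\n\n"
    "## Output requirements\n"
    "Write to $output_file as JSON.\n"
)

def render_task_prompt(fields: dict[str, str]) -> str:
    """Render the prompt; refuse to dispatch if any input is missing or empty."""
    empty = [name for name, value in fields.items() if not value.strip()]
    if empty:
        raise ValueError(f"empty prompt fields: {empty}")
    try:
        return TASK_TEMPLATE.substitute(fields)  # raises KeyError on a missing field
    except KeyError as missing:
        raise ValueError(f"missing prompt field: {missing}") from None
```

Failing at render time surfaces an incomplete prompt in the main Agent, before a subagent is spawned with a task it cannot finish.
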
Principle 2: Output Contract Strictness

Subagents must write their output files in the declared JSON Schema. The main Agent's routing logic depends on this schema; missing fields or wrong types break the decision logic.

# Subagent output schema (defined in templates/)
OUTPUT_SCHEMA = {
    "passed": bool,             # required; main Agent routing depends on this
    "confidence": float,        # required; range 0-1
    "root_cause": str | None,   # required; null when the subagent fails
    "evidence": list[str],      # required; may be empty
    "error": str | None         # required on failure
}
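The main Agent can check the contract before routing on it. The following is a minimal sketch that uses only the standard library; `REQUIRED_FIELDS` and `validate_output` are hypothetical names, and a null `root_cause` is accepted on the assumption that failed subagents report it that way.

```python
# Field name -> accepted Python types after json.loads (JSON integers count as floats).
REQUIRED_FIELDS = {
    "passed": (bool,),
    "confidence": (int, float),
    "root_cause": (str, type(None)),  # assumption: null is allowed on failure
    "evidence": (list,),
}

def validate_output(data: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the output is usable."""
    problems = []
    for name, types in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], types) or (
            isinstance(data[name], bool) and bool not in types  # bool is a subclass of int
        ):
            problems.append(f"wrong type for {name}: {type(data[name]).__name__}")
    if not problems:
        if not 0 <= data["confidence"] <= 1:
            problems.append("confidence outside 0-1")
        if not all(isinstance(item, str) for item in data["evidence"]):
            problems.append("evidence must be a list of strings")
        if not data["passed"] and not isinstance(data.get("error"), str):
            problems.append("error message required when passed is false")
    return problems
```

An output that fails validation can be routed the same way as a failed subagent, so the decision logic never runs on a malformed file.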

Principle 3: Structured Error Output on Failure

On failure, subagents must still write an output file with passed=false.

{
  "passed": false,
  "error": "Log file not found: /workspace/logs/crash_20260601.log",
  "confidence": 0,
  "root_cause": null,
  "evidence": []
}

A missing output file looks like a timeout to the main Agent. Structured error output lets the main Agent distinguish "subagent failed" from "subagent timed out" and respond differently.
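On the subagent side, one way to guarantee a report is to wrap the task so that any exception is turned into a `passed=false` file. This is a sketch under assumptions: `run_and_report` and the `task` callable are hypothetical, and the write goes through a temporary file plus `os.replace` so that a polling reader does not see a half-written file.

```python
import json
import os
from pathlib import Path
from typing import Callable

def run_and_report(task: Callable[[], dict], output_file: Path) -> dict:
    """Run the task and always leave a JSON report at output_file, even on failure."""
    try:
        # task() is a hypothetical callable returning confidence, root_cause, evidence.
        report = {"passed": True, "error": None, **task()}
    except Exception as exc:  # report any failure instead of exiting silently
        report = {
            "passed": False,
            "error": f"{type(exc).__name__}: {exc}",
            "confidence": 0,
            "root_cause": None,
            "evidence": [],
        }
    # Write to a temporary file first, then rename it into place.
    output_file.parent.mkdir(parents=True, exist_ok=True)
    tmp_file = output_file.with_suffix(".tmp")
    tmp_file.write_text(json.dumps(report, indent=2))
    os.replace(tmp_file, output_file)
    return report
```

With this wrapper, the only remaining reason for a missing output file is that the subagent process never finished, which is exactly the timeout case.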

Fan-out / Fan-in Concurrency Control

Fan-out Design

Fan-out means one trigger point spawns N concurrent subagents. Two hard constraints:

Constraint 1: Each subagent writes a different output file

# ✅ Correct: each candidate writes to its own file
candidates = ["candidate_a", "candidate_b", "candidate_c"]
for c in candidates:
    spawn_subagent(
        task_prompt=build_prompt(c, bug_info),
        output_file=f"phase4/{c}.json"  # unique filename
    )

# ❌ Wrong: all candidates write to the same file (concurrent write conflict)
spawn_subagent(task_prompt=..., output_file="phase4/result.json")
spawn_subagent(task_prompt=..., output_file="phase4/result.json")  # conflict!

Constraint 2: The main Agent waits for all subagents to complete

After fan-out, the main Agent enters a waiting state and doesn't proceed. When no async runtime is available, polling works:

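import json
import time
from pathlib import Path

def wait_all_candidates(candidates: list[str], timeout: int = 300) -> dict:
    """Poll phase4/ until every candidate has reported or the timeout expires."""
    results = {}
    deadline = time.time() + timeout

    while True:
        for c in candidates:
            output_file = Path(f"phase4/{c}.json")
            if c not in results and output_file.exists():
                try:
                    results[c] = json.loads(output_file.read_text())
                except json.JSONDecodeError:
                    pass  # file is still being written; retry on the next pass
        if len(results) == len(candidates) or time.time() >= deadline:
            return results  # candidates missing from results timed out
        time.sleep(5)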
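The dictionary returned by the polling function only contains candidates that wrote an output file, so the main Agent still has to separate failures from timeouts. A minimal sketch of that step follows; `classify_results` is a hypothetical helper, and it assumes every report carries a boolean `passed` field.

```python
def classify_results(candidates: list[str], results: dict[str, dict]) -> dict[str, list[str]]:
    """Split candidates into passed, failed (reported passed=false), and timed out (no file)."""
    buckets: dict[str, list[str]] = {"passed": [], "failed": [], "timed_out": []}
    for c in candidates:
        if c not in results:
            buckets["timed_out"].append(c)   # no output file appeared before the deadline
        elif results[c].get("passed") is True:
            buckets["passed"].append(c)
        else:
            buckets["failed"].append(c)      # structured failure report
    return buckets
```

The three buckets map directly onto different responses: retry or escalate a timeout, read the `error` field of a failure, and pass successful results on to fan-in.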

Fan-in: Failure Strategy

When some subagents fail, fan-in can follow one of two strategies:

fail-fast (abort on any failure)

# For: all branch results are required; one failure makes the whole batch meaningless
phase_parallel_analysis:
  fan_in_strategy: fail-fast
  on_any_failure: trigger_gate_A

When to use: Three subagents each retrieve data from different sources. Missing any one source makes further analysis impossible.
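In code, fail-fast can be sketched as a check that raises on the first missing or failed branch. `FanInError` and `fan_in_fail_fast` are hypothetical names; the caller is assumed to catch the exception and trigger the gate named in the config.

```python
class FanInError(Exception):
    """Raised when a fail-fast fan-in sees a failed or missing branch."""

def fan_in_fail_fast(branches: list[str], results: dict[str, dict]) -> dict[str, dict]:
    """Return all branch reports, or raise as soon as one branch is unusable."""
    for name in branches:
        report = results.get(name)
        if report is None:
            raise FanInError(f"{name}: no output file (timed out)")
        if not report.get("passed"):
            reason = report.get("error") or "reported passed=false"
            raise FanInError(f"{name}: {reason}")
    return {name: results[name] for name in branches}
```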

collect-all (aggregate everything, including failures)

# For: partial success is enough; select the best from passing results
phase_4_fix:
  fan_in_strategy: collect-all
  selection_criteria:
    require_any_passed: true
    select_by: max_test_coverage
  on_all_failed: trigger_gate_B

When to use: Three code-fix candidates run concurrently. One passing tests is sufficient. Failed candidates are discarded. Only if all three fail does the human gate trigger.
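The collect-all selection can be sketched as a filter followed by a maximum. This assumes each candidate's report includes a numeric `test_coverage` field, which is not part of the output schema shown earlier; `fan_in_collect_all` is a hypothetical name.

```python
from typing import Optional

def fan_in_collect_all(results: dict[str, dict]) -> Optional[str]:
    """Return the passing candidate with the highest test_coverage, or None if all failed."""
    # require_any_passed: keep only candidates that reported passed=true.
    passing = {name: r for name, r in results.items() if r.get("passed")}
    if not passing:
        return None  # caller triggers the human gate (on_all_failed)
    # select_by: max_test_coverage (test_coverage is an assumed report field).
    return max(passing, key=lambda name: passing[name].get("test_coverage", 0.0))
```

Returning `None` rather than raising keeps "all candidates failed" as an ordinary routing outcome that the main Agent maps to the human gate.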

Selection Principle

All branch results are required          → fail-fast
Partial success is sufficient            → collect-all (code fix, candidate generation)
Comparing multiple results for quality   → collect-all (select best)

The Phase 4 fix in a bug-fix workflow uses collect-all: three candidates run concurrently, the passing candidate with the highest test coverage is selected, and the human gate triggers only when all three fail.

Context Isolation

Subagents must run in isolated sessions with no access to the main Agent's conversation history.

The main Agent's context holds the full workflow history: all file contents, all subagent raw outputs, all intermediate decisions. Passing all of this to a subagent that only writes one patch inflates its context from a few thousand tokens to tens of thousands, raises token cost in proportion, and lets irrelevant history dilute the subagent's focus.

Information flows in two directions:

Main Agent
  │
  │ task prompt (only the fields the subagent needs)
  ▼
Subagent (isolated session, no history)
  │
  │ output_file (JSON at agreed path)
  ▼
Main Agent (reads file, not conversation history)

The subagent knows what's in its task prompt and the agreed output path. It doesn't know what the main Agent did, how far the workflow has progressed, or what other subagents produced.

If a subagent needs to "understand the background" to complete its task, the task prompt is incomplete. Put that background in explicitly — don't count on the subagent accessing history it can't see.
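One way to keep the task prompt explicit is to build the subagent's input from a whitelist of state fields rather than from the whole state. In this sketch, `select_prompt_fields` and the state keys are hypothetical.

```python
def select_prompt_fields(state: dict, needed: list[str]) -> dict:
    """Copy only the whitelisted fields from workflow state into the task input."""
    missing = [key for key in needed if key not in state]
    if missing:
        # The background the subagent needs is not available: fix the state, not the subagent.
        raise KeyError(f"state lacks fields the subagent needs: {missing}")
    return {key: state[key] for key in needed}

# Usage: history and other subagents' results stay with the main Agent.
state = {
    "bug_info": {"summary": "crash on startup"},
    "log_dir": "/workspace/logs",
    "history": ["phase1 done", "phase2 done"],       # not passed on
    "phase3_result": {"passed": True},               # not passed on
}
task_input = select_prompt_fields(state, ["bug_info", "log_dir"])
```

Because the whitelist is written down per subagent, adding background means editing that list, which keeps every context dependency visible in one place.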

Design Checklist

Orchestrator responsibilities

[ ] Main Agent reads only structured JSON output — no raw logs or long text
[ ] Main Agent doesn't execute business logic (analysis, writing code, queries)
[ ] Routing decisions depend on the state file and subagent output, not conversation history

Subagent design

[ ] Task prompt contains all fields the subagent needs (no implicit context dependencies)
[ ] Output schema is declared in templates/ and includes a passed field
[ ] On failure, the subagent still writes {"passed": false, "error": "..."} to the output file

Concurrency control

[ ] Each subagent in a fan-out writes to a unique output file path
[ ] Fan-in strategy is explicitly labeled fail-fast or collect-all
[ ] collect-all has defined selection_criteria and on_all_failed behavior

Context isolation

[ ] Subagents run in isolated sessions with no access to the main Agent's history
[ ] All background information the subagent needs is explicitly in the task prompt

Summary

The Orchestrator only decides and dispatches: it reads JSON conclusions, spawns subagents, and collects JSON conclusions. Keeping context manageable is the core objective, not "being smart".
Fan-in strategy determines workflow resilience: solution-space problems (code fix) use collect-all; all-or-nothing problems (data collection) use fail-fast. Choosing the wrong one either blocks the workflow on a single failure or wastes time continuing a task that can no longer succeed.
Context isolation is quality assurance: extra context is noise, not help. If a subagent needs background to do its job, that background belongs explicitly in the task prompt.
