Programming Central

Posted on Jun 6

How to Orchestrate Autonomous Sub-Agents Without Blowing Your LLM Context Window

#hermesagent #ai #python

We have all hit the "monolithic LLM wall."

You design an incredibly capable AI agent, arm it with a suite of tools, and give it a complex, multi-step task—like writing a comprehensive technical paper complete with data analysis, web research, and code verification. At first, it works beautifully. But as the steps accumulate, the context window fills up. The agent begins to experience "attention drift." It forgets its original instructions, hallucinates tool outputs, and eventually spins out of control, burning through millions of tokens and your API budget.

The problem isn't the LLM's reasoning capacity; it’s the architecture. Trying to solve a complex, multi-domain problem within a single agent’s context window is the modern software equivalent of writing an entire enterprise application inside a single, monolithic main() function.

To build AI systems that can scale to handle real-world complexity, we must transition from monolithic agents to hierarchical multi-agent orchestration.

By decomposing complex goals into isolated, specialized sub-agents, each operating within its own bounded context and resource budget, we can build resilient, self-improving AI systems that keep working as tasks grow larger and more complex.

In this post, we will dive deep into the architectural patterns of multi-agent orchestration, explore how to manage agent lifecycles, and write production-grade Python code to spawn and supervise sub-agents.

(The concepts and code demonstrated here are drawn from my ebook Hermes Agent, The Self-Evolving AI Workforce)

1. The Core Concept: Hierarchical Decomposition and Supervisory Control

Multi-agent orchestration is not just a design convenience; it is an architectural necessity. The theoretical foundation of this approach rests on two pillars: task decomposition and supervisory control. Together, they transform a monolithic agent into a scalable, resilient hierarchy of specialized workers.

The Master Carpenter Analogy

Think of a master carpenter building a custom cabinet. The master does not personally cut every dovetail, sand every surface, or install every hinge. Instead, she decomposes the project into distinct sub-tasks: joinery, finishing, and hardware installation.

For each sub-task, she assigns an apprentice with the right tools and expertise. She monitors their progress, checks their quality, and integrates their individual outputs into the final product. If an apprentice hits a snag, she intervenes, provides guidance, or reassigns resources.

In this scenario, the parent agent is the master carpenter, and the sub-agents are the apprentices. Each apprentice operates with their own focused toolset and an independent iteration budget.

                   +------------------+
                   |   Parent Agent   |  <-- Master Carpenter (Supervisor)
                   +--------+---------+
                            |
         +------------------+------------------+
         |                  |                  |
+--------v-------+ +--------v-------+ +--------v-------+
|  Sub-Agent A   | |  Sub-Agent B   | |  Sub-Agent C   |  <-- Apprentices (Workers)
| (Web Searcher) | | (Code Builder) | |  (Doc Writer)  |
+----------------+ +----------------+ +----------------+

The Software Engineering Parallel: Microservices and OS Processes

In software engineering, this pattern is everywhere:

Microservices: A microservices architecture decomposes a monolithic application into independently deployable services, each with its own database and communication protocol. An orchestrator (like Kubernetes) manages the lifecycle of these services, ensuring they are spawned, scaled, and terminated correctly.
Operating Systems: A modern operating system uses processes. Each process has its own virtual address space, preventing any single runaway process from exhausting system memory or crashing the entire OS.

Multi-agent orchestration applies these exact principles to AI. The parent agent acts as the Kubernetes orchestrator or OS kernel, sub-agents act as independent processes or microservices, and persistent memory serves as the shared state store.

2. The Parent-Agent Supervisor Pattern

The parent-agent supervisor pattern is the architectural heart of multi-agent systems. The parent agent (the primary orchestrator instance) is responsible for managing the entire lifecycle of the operation:

Task Decomposition: Breaking the user’s high-level request into sub-tasks that can be executed independently or sequentially.
Sub-Agent Spawning: Instantiating new sub-agent processes with tailored system prompts, restricted toolsets, and capped budgets.
Delegation: Assigning each sub-task to the appropriate sub-agent, along with the necessary context.
Monitoring: Tracking the state, progress, and iteration consumption of each sub-agent via persistent memory.
Synchronization: Collecting results, resolving dependencies, and merging outputs.
Termination: Cleaning up sub-agents when their work is done, freeing up system resources (e.g., closing browser instances or terminating virtual environments).

This pattern closely mirrors the supervisor-worker model in Erlang/OTP, where supervisor processes monitor worker processes and handle failures gracefully. If a sub-agent fails or gets stuck in an infinite loop, the parent agent can catch the failure, reclaim the resources, and either spawn a replacement or adapt its plan.
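The failure-handling half of this pattern can be sketched with plain asyncio. This is a minimal illustration under stated assumptions, not the Hermes implementation: `supervise`, `flaky_worker`, `max_restarts`, and `timeout_s` are hypothetical names invented for the example, and the "sub-agent" is a stand-in coroutine.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def supervise(
    worker: Callable[[int], Awaitable[str]],
    max_restarts: int = 2,
    timeout_s: float = 5.0,
) -> str:
    """Run a worker, restarting it on failure or timeout (hypothetical helper)."""
    last_error: Optional[Exception] = None
    for attempt in range(max_restarts + 1):
        try:
            # wait_for cancels the worker if it exceeds the timeout,
            # which is how a stuck sub-agent gets reclaimed in this sketch.
            return await asyncio.wait_for(worker(attempt), timeout=timeout_s)
        except Exception as exc:  # also covers the timeout error
            last_error = exc
    raise RuntimeError(
        f"worker failed after {max_restarts + 1} attempts"
    ) from last_error


async def flaky_worker(attempt: int) -> str:
    """Stand-in for a sub-agent: fails on the first attempt, then succeeds."""
    if attempt == 0:
        raise ValueError("simulated tool failure")
    return f"done on attempt {attempt}"


result = asyncio.run(supervise(flaky_worker))
print(result)  # done on attempt 1
```

A real supervisor would probably change something between attempts (a revised prompt, a different toolset) rather than rerun the same worker unchanged; the attempt number is passed in to leave room for that.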

3. Resource Management and the Iteration Budget

One of the biggest risks in autonomous agent systems is the "infinite loop" bug, where an agent repeatedly calls a failing tool or gets stuck in a reasoning loop, draining your API budget. When agents start spawning other agents, the risk compounds: every spawned agent is another loop that can run away, and a runaway agent may itself spawn more.

To solve this, we implement a thread-safe, per-agent Iteration Budget.

class IterationBudget:
    """Thread-safe iteration counter for an agent.

    Each agent (parent or subagent) gets its own IterationBudget.
    The parent's budget is capped at max_iterations (default 90).
    Each subagent gets an independent budget capped at
    delegation.max_iterations (default 50) — this means total
    iterations across parent + subagents can exceed the parent's cap.
    """

The Reasoning vs. Acting Budget

An elegant design pattern here is the concept of budget refunds for programmatic execution.

If a sub-agent calls a tool to run a Python script (execute_code) that takes several steps to execute, those purely computational steps should not consume the agent's reasoning budget. The agent’s "thinking" budget (deciding what to do) should be strictly separated from its "acting" budget (running computations).

By refunding iterations spent on raw code execution, we ensure that complex computational tasks do not penalize the agent's cognitive allocation.
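One way to picture the refund mechanism is a counter guarded by a lock, with a `refund` method next to `consume`. The sketch below is an assumption about how such a budget could look, not the actual Hermes `IterationBudget`: the class name `RefundableBudget`, the `BudgetExceeded` exception, and the method names are all hypothetical.

```python
import threading


class BudgetExceeded(RuntimeError):
    """Raised when an agent tries to spend more iterations than it has."""


class RefundableBudget:
    """Thread-safe iteration counter with refunds (illustrative sketch)."""

    def __init__(self, max_total: int) -> None:
        self.max_total = max_total
        self._used = 0
        self._lock = threading.Lock()

    def consume(self, amount: int = 1) -> None:
        # Check before spending so a rejected call leaves the counter unchanged.
        with self._lock:
            if self._used + amount > self.max_total:
                raise BudgetExceeded(f"{self._used}/{self.max_total} already used")
            self._used += amount

    def refund(self, amount: int = 1) -> None:
        # Give iterations back, e.g. for steps spent purely on code execution.
        with self._lock:
            self._used = max(0, self._used - amount)

    @property
    def remaining(self) -> int:
        with self._lock:
            return self.max_total - self._used


budget = RefundableBudget(max_total=50)
budget.consume(3)   # three reasoning steps
budget.consume(1)   # one step that only dispatched execute_code
budget.refund(1)    # refund the purely computational step
print(budget.remaining)  # 47
```

Refunds remove a safety net for the refunded steps, so it seems prudent to pair them with a separate limit on the execution side, such as a wall-clock timeout on the code runner.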

4. State Management and Persistent Memory

Sub-agents must operate in isolated contexts to keep prompt sizes small, but they still need a way to share state with the parent and their sibling agents. This is achieved through persistent memory—a file-based storage system that survives agent restarts.

This architecture is based on the classical AI Blackboard Pattern:

+-------------------------------------------------------+
|                  PERSISTENT BLACKBOARD                |
|               (Shared File-Based Memory)              |
+---------------------------^---------------------------+
                            |
         +------------------+------------------+
         |                  |                  |
+--------v-------+ +--------v-------+ +--------v-------+
|  Sub-Agent A   | |  Sub-Agent B   | |  Sub-Agent C   |
| Writes Search  | | Reads Search   | | Reads Code     |
| Results        | | Writes Code    | | Writes Final   |
|                | | Artifacts      | | Report         |
+----------------+ +----------------+ +----------------+

The Blackboard: A shared, structured memory space (stored in a local directory like ~/.hermes/).
The Write Phase: A sub-agent completes its task and writes its structured output (e.g., JSON, files, or code patches) to a designated path in the persistent memory.
The Read Phase: The parent agent reads this memory and injects a compressed, sanitized summary of these results into the next sub-agent's system prompt using a context builder.

To prevent memory bloat, a Streaming Context Scrubber is used to compress and summarize large sub-agent outputs before they are passed back up to the parent, keeping the parent's context window clean and focused on high-level strategy.
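The write and read phases can be sketched with nothing more than JSON files in a directory. Everything here is illustrative: `write_result` and `read_summary` are hypothetical helpers, a temporary directory stands in for the real memory location, and plain truncation stands in for the much smarter compression a context scrubber would perform.

```python
import json
import tempfile
from pathlib import Path


def write_result(board: Path, agent: str, payload: dict) -> Path:
    """Write phase: persist one sub-agent's structured output as JSON."""
    path = board / f"{agent}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path


def read_summary(board: Path, agent: str, max_chars: int = 80) -> str:
    """Read phase: load a result and shrink it before prompt injection.

    Truncation is a naive stand-in for real summarization.
    """
    payload = json.loads((board / f"{agent}.json").read_text(encoding="utf-8"))
    text = payload.get("output", "")
    return text if len(text) <= max_chars else text[:max_chars] + "..."


with tempfile.TemporaryDirectory() as tmp:
    board = Path(tmp)
    write_result(board, "research", {"output": "x" * 500, "sources": 12})
    summary = read_summary(board, "research")
    next_prompt = f"Context from research agent: {summary}\nTask: analyze it."

print(len(summary))  # 83
```

The full 500-character output stays on disk for any agent that needs it, while only the 83-character summary enters the next prompt.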

5. Closed Learning Loops: Recursive Self-Improvement

The true power of this architecture emerges when we apply closed learning loops recursively.

In a multi-agent system, optimization occurs at two distinct layers:

The Sub-Agent Level (The Specialist): Each sub-agent uses optimization frameworks (like DSPy or GEPA) to refine its own tool-calling patterns. For example, a web search sub-agent learns over time which search queries yield the highest-quality results for a given domain.
The Parent Level (The Strategist): The parent agent analyzes the execution trajectories of its sub-agents. If a parent observes that a certain type of sub-task consistently fails or runs out of budget, it dynamically rewrites its decomposition strategy, alters the sub-agent's system prompt, or provisions a different set of tools for the next run.

This is a form of meta-learning: the system doesn't just get better at doing tasks; it gets better at delegating them.
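The parent-level loop starts with bookkeeping: record how each type of sub-task turned out, then flag the types that keep failing. The sketch below is a simple rule-based stand-in under stated assumptions. It does not use DSPy or GEPA, and `TrajectoryLog`, `failure_threshold`, and `min_runs` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List


class TrajectoryLog:
    """Records sub-task outcomes and flags task types that keep failing."""

    def __init__(self, failure_threshold: float = 0.5, min_runs: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.min_runs = min_runs
        self._outcomes: Dict[str, List[bool]] = defaultdict(list)

    def record(self, task_type: str, succeeded: bool) -> None:
        self._outcomes[task_type].append(succeeded)

    def needs_replanning(self, task_type: str) -> bool:
        runs = self._outcomes.get(task_type, [])
        if len(runs) < self.min_runs:
            return False  # not enough evidence yet
        failure_rate = runs.count(False) / len(runs)
        return failure_rate > self.failure_threshold


log = TrajectoryLog()
for ok in (False, False, True):
    log.record("research", ok)
for ok in (True, True, True):
    log.record("analysis", ok)

print(log.needs_replanning("research"))  # True
print(log.needs_replanning("analysis"))  # False
```

What the parent does with a flagged task type (rewriting the prompt, changing the toolset, splitting the task further) is the hard part and is left out here.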

6. Step-by-Step Implementation: Spawning and Managing Sub-Agents

Let’s translate these theoretical foundations into Python code.

Below is a runnable skeleton of a parent agent supervisor that initializes a persistent session store, builds a specialized sub-agent configuration, and manages sub-agent execution. The framework classes (IterationBudget, AIAgent, SessionDB) are mocked so the script runs standalone; in a real deployment you would import them from your agent library.

#!/usr/bin/env python3
"""
Production-Grade Parent-Agent Supervisor and Sub-Agent Spawner.
"""
import logging
import asyncio
import json
from typing import Dict, List, Any
from pathlib import Path

# Mocking the imports from the Hermes framework for demonstration
# In a real environment, these are imported from your agent library
class IterationBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def consume(self, amount: int = 1):
        self.used += amount
        if self.used > self.limit:
            raise TimeoutError("Iteration budget exceeded!")

class AIAgent:
    def __init__(self, **kwargs):
        self.config = kwargs
        self.session_id = kwargs.get("session_id")
        self.budget = IterationBudget(kwargs.get("max_iterations", 50))

    async def run_conversation(self, prompt: str) -> Dict[str, Any]:
        # Simulate agent execution and tool calling
        await asyncio.sleep(1)
        self.budget.consume(5) # Simulate consuming 5 iterations of reasoning
        return {
            "status": "success",
            "output": f"Processed prompt: '{prompt}' using model {self.config.get('model')}",
            "iterations_used": self.budget.used
        }

class SessionDB:
    def __init__(self, db_path: Path):
        self.db_path = db_path
        self.db_path.mkdir(parents=True, exist_ok=True)
        self.sessions_file = self.db_path / "sessions.json"
        if not self.sessions_file.exists():
            self.sessions_file.write_text("{}")

    def ensure_tables(self):
        # In a real SQL database, this would execute CREATE TABLE statements
        pass

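    def upsert_session(self, session_id: str, metadata: Dict[str, Any]):
        data = json.loads(self.sessions_file.read_text())
        # Merge into the existing record so later status updates keep earlier
        # fields (role, parent_session_id, ...) instead of overwriting them.
        data[session_id] = {**data.get(session_id, {}), **metadata}
        self.sessions_file.write_text(json.dumps(data, indent=4))
        print(f"💾 Session '{session_id}' persisted to database.")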

def get_hermes_home() -> Path:
    home = Path.home() / ".hermes"
    home.mkdir(exist_ok=True)
    return home

# Setup Logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("MultiAgentOrchestrator")

# ---------------------------------------------------------------------------
# Step 1: Parent Agent Supervisor Configuration
# ---------------------------------------------------------------------------

parent_config = {
    "base_url": "https://api.openai.com/v1",
    "api_key": "sk-mock-key",
    "model": "gpt-4o",
    "provider": "openai",
    "api_mode": "chat",
    "max_iterations": 90,              # Parent gets a generous budget
    "tool_delay": 1.0,                 # Rate-limiting safety delay
    "enabled_toolsets": ["filesystem", "web", "terminal", "code_execution"],
    "save_trajectories": True,
    "session_id": "supervisor_session_101",
}

# Initialize Parent Agent
parent_agent = AIAgent(
    base_url=parent_config["base_url"],
    api_key=parent_config["api_key"],
    model=parent_config["model"],
    provider=parent_config["provider"],
    api_mode=parent_config["api_mode"],
    max_iterations=parent_config["max_iterations"],
    tool_delay=parent_config["tool_delay"],
    enabled_toolsets=parent_config["enabled_toolsets"],
    save_trajectories=parent_config["save_trajectories"],
    session_id=parent_config["session_id"],
)

logger.info(f"Supervisor Agent Initialized. Model: {parent_config['model']} | Session: {parent_config['session_id']}")

# ---------------------------------------------------------------------------
# Step 2: Initialize Persistent Session Storage
# ---------------------------------------------------------------------------
hermes_home = get_hermes_home()
session_db = SessionDB(db_path=hermes_home / "sessions")
session_db.ensure_tables()

# Register parent session in DB
session_db.upsert_session(
    session_id=parent_config["session_id"],
    metadata={
        "role": "supervisor",
        "model": parent_config["model"],
        "max_iterations": parent_config["max_iterations"],
        "status": "active"
    }
)

# ---------------------------------------------------------------------------
# Step 3: Sub-Agent Spawner Configuration & Lifecycle Management
# ---------------------------------------------------------------------------
SUB_AGENT_MODEL = "gpt-4o-mini"  # Using a faster, cheaper model for sub-agents
SUB_AGENT_MAX_ITERATIONS = 50   # Capped iteration budget for safety

def build_sub_agent_config(task_slug: str, specialized_tools: List[str]) -> dict:
    """
    Generates a tailored configuration for a specialized sub-agent.
    """
    sub_session_id = f"{parent_config['session_id']}_sub_{task_slug}"

    return {
        "base_url": parent_config["base_url"],
        "api_key": parent_config["api_key"],
        "model": SUB_AGENT_MODEL,
        "provider": parent_config["provider"],
        "api_mode": "chat",
        "max_iterations": SUB_AGENT_MAX_ITERATIONS,
        "tool_delay": 0.5,
        "enabled_toolsets": specialized_tools,  # Restrict tools to only what is needed!
        "save_trajectories": True,
        "session_id": sub_session_id,
    }

async def orchestrate_sub_task(task_name: str, prompt: str, tools: List[str]) -> Dict[str, Any]:
    """
    Spawns, executes, tracks, and terminates a sub-agent.
    """
    logger.info(f"🚀 Spawning sub-agent for task: [{task_name}]")

    # Generate configuration
    sub_config = build_sub_agent_config(task_name, tools)

    # Persist sub-agent creation to database
    session_db.upsert_session(
        session_id=sub_config["session_id"],
        metadata={
            "role": f"worker_{task_name}",
            "parent_session_id": parent_config["session_id"],
            "model": sub_config["model"],
            "max_iterations": sub_config["max_iterations"],
            "status": "spawned"
        }
    )

    # Instantiate Sub-Agent
    sub_agent = AIAgent(**sub_config)

    try:
        # Execute Task (Delegation Phase)
        logger.info(f"Delegating task to sub-agent [{sub_config['session_id']}]...")
        result = await sub_agent.run_conversation(prompt)

        # Update Status to Success
        session_db.upsert_session(
            session_id=sub_config["session_id"],
            metadata={"status": "completed", "iterations_used": result["iterations_used"]}
        )
        logger.info(f"✅ Sub-agent [{task_name}] completed successfully.")
        return result

    except Exception as e:
        logger.error(f"❌ Sub-agent [{task_name}] failed: {e}")
        session_db.upsert_session(
            session_id=sub_config["session_id"],
            metadata={"status": "failed", "error": str(e)}
        )
        raise  # Re-raise with the original traceback so the parent can react

    finally:
        # Resource Cleanup Phase
        logger.info(f"🧹 Terminating sub-agent [{sub_config['session_id']}] and cleaning up resources.")
        # In a production system, you would call:
        # sub_agent.cleanup_browser()
        # sub_agent.cleanup_vm()

# ---------------------------------------------------------------------------
# Step 4: Run Orchestration Loop
# ---------------------------------------------------------------------------
async def main():
    print("\n--- Starting Multi-Agent Orchestration Demo ---\n")

    # Define specialized sub-tasks
    tasks = [
        {
            "name": "research",
            "prompt": "Search the web for the latest advancements in solid-state batteries.",
            "tools": ["web"]
        },
        {
            "name": "analysis",
            "prompt": "Analyze the research data and generate a Python script to model efficiency curves.",
            "tools": ["filesystem", "code_execution"]
        }
    ]

    # Execute sub-agents sequentially (can be parallelized using asyncio.gather)
    for task in tasks:
        try:
            result = await orchestrate_sub_task(
                task_name=task["name"],
                prompt=task["prompt"],
                tools=task["tools"]
            )
            print(f"Result Output: {result['output']}\n")
        except Exception:
            print(f"Skipping downstream tasks due to failure in task: {task['name']}")
            break  # Later tasks depend on this output, so stop the pipeline

if __name__ == "__main__":
    asyncio.run(main())
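The comment in the listing notes that sub-tasks can be parallelized with asyncio.gather. That only makes sense for sub-tasks that do not depend on each other (the research and analysis tasks above do, so they stay sequential). The sketch below uses a hypothetical `run_task` coroutine as a stand-in for orchestrate_sub_task, so it runs on its own.

```python
import asyncio
from typing import Any, Dict, List


async def run_task(name: str, delay: float, fail: bool = False) -> Dict[str, Any]:
    """Hypothetical stand-in for orchestrate_sub_task."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return {"task": name, "status": "success"}


async def run_parallel() -> List[Any]:
    # With return_exceptions=True, failures come back as exception objects
    # in the result list instead of being raised, so every outcome is visible.
    return await asyncio.gather(
        run_task("research_a", 0.02),
        run_task("research_b", 0.01, fail=True),
        run_task("research_c", 0.01),
        return_exceptions=True,
    )


results = asyncio.run(run_parallel())
succeeded = [r["task"] for r in results if not isinstance(r, Exception)]
failed = [str(r) for r in results if isinstance(r, Exception)]
print(succeeded)  # ['research_a', 'research_c']
print(failed)     # ['research_b failed']
```

Results come back in the order the coroutines were passed in, not in completion order, which makes it easy to match each result to its task definition.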

7. Key Architectural Takeaways

If you are designing a multi-agent system, keep these core architectural principles in mind:

Strict Tool Isolation: Never give a sub-agent more tools than it needs. A web-searching agent does not need write access to your terminal; a code-execution agent does not need access to your browser. Limiting tools dramatically reduces security risks and prompt confusion.
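Independent Budgets: Give every sub-agent its own capped iteration budget, set below the parent's. If a parent has 90 iterations, its sub-agents should be capped at 30 or 50. Because the budgets are independent, a runaway sub-agent can burn only its own allowance, and the parent keeps its full budget to handle failures and synthesize the final results. Keep in mind that the worst-case total is the parent's cap plus the sum of all sub-agent caps.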
Persistent State vs. Ephemeral Context: Keep your LLM context windows ephemeral. Use a persistent, file-based database or shared folder to write intermediate data, and only pass highly compressed summaries back into the active context.
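Tool isolation is easiest to trust when it is enforced at dispatch time rather than only described in the prompt. The following is a minimal sketch of that idea; `TOOL_REGISTRY`, `call_tool`, and `ToolNotAllowed` are hypothetical names, and the two lambda "tools" are placeholders.

```python
from typing import Callable, Dict, FrozenSet


class ToolNotAllowed(PermissionError):
    """Raised when an agent calls a tool outside its allowlist."""


# Hypothetical registry mapping tool names to callables.
TOOL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "web_search": lambda q: f"results for {q}",
    "run_shell": lambda cmd: f"ran {cmd}",
}


def call_tool(allowed: FrozenSet[str], tool: str, arg: str) -> str:
    """Dispatch a tool call only if the tool is on the agent's allowlist."""
    if tool not in allowed:
        raise ToolNotAllowed(f"tool '{tool}' is not enabled for this agent")
    return TOOL_REGISTRY[tool](arg)


web_agent_tools = frozenset({"web_search"})
print(call_tool(web_agent_tools, "web_search", "solid-state batteries"))

try:
    call_tool(web_agent_tools, "run_shell", "ls")
except ToolNotAllowed as exc:
    print(exc)  # tool 'run_shell' is not enabled for this agent
```

Because the check lives in the dispatcher, a sub-agent that hallucinates a call to a tool it was never given gets a clear error instead of silent access.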

Let's Discuss

How do you handle error recovery in your multi-agent systems? If a critical sub-agent fails or runs out of budget, do you prefer to have the parent agent retry with a modified prompt, or do you escalate the failure directly to the human-in-the-loop?
What are your thoughts on budget refunds for programmatic tools? Do you agree that pure code execution shouldn't count against an agent's reasoning budget, or does that open the door to unmonitored resource consumption?

Leave a comment below with your experiences, and let’s build more resilient AI systems together!

The concepts and code demonstrated here are drawn directly from the roadmap laid out in the ebook Hermes Agent, The Self-Evolving AI Workforce. You can also find my other programming and AI ebooks here: Programming & AI eBooks.
