One agent handles simple tasks. Complex workflows need a team. Bedrock's multi-agent collaboration lets a supervisor agent break down problems, delegate to specialists, and combine results. Here's how to build it with Terraform.
In the previous posts, we deployed a single Bedrock agent with action groups. That works for focused tasks. But real workflows are multi-domain: a customer support request might need to check an order, look up a policy, and escalate to a specialist. A single agent with 15 action groups performs poorly because the model struggles to select the right tool from too many options.
Multi-agent collaboration solves this. You create specialized agents, each focused on one domain, and a supervisor agent that routes requests, delegates tasks, and combines results. The supervisor reads the user's intent and decides which specialist to call. Each specialist stays focused and performs better than one overloaded generalist. Multi-agent collaboration is now generally available on Amazon Bedrock. 🎯
🏗️ How Multi-Agent Collaboration Works
```
User: "Cancel my order #1234 and refund to my original payment method"
          │
          ▼
Supervisor Agent (analyzes intent, plans execution)
          │ delegates to
┌───────────────────────┬─────────────────────┐
│ Order Agent           │ Payments Agent      │
│ (cancels order #1234) │ (processes refund)  │
└───────────────────────┴─────────────────────┘
          │ results flow back to
          ▼
Supervisor Agent (combines results)
          │
          ▼
Response: "Order #1234 has been cancelled. A refund of $89.99
           will be processed to your Visa ending in 4242
           within 3-5 business days."
```
The supervisor doesn't do the work itself. It plans, delegates, and synthesizes.
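The plan/delegate/synthesize loop above can be sketched as a toy Python routine. This is purely illustrative: the specialist functions, canned replies, and keyword routing are invented for the example — Bedrock's supervisor does this with an LLM, not keyword matching.

```python
# Toy sketch of the supervisor pattern: plan, delegate, synthesize.
# All names and replies here are invented for illustration.

def order_agent(request: str) -> str:
    # Stand-in for a specialist with order-domain action groups
    return "Order #1234 has been cancelled."

def payments_agent(request: str) -> str:
    # Stand-in for a specialist with payments-domain action groups
    return "A refund of $89.99 will be processed to your original payment method."

SPECIALISTS = {
    "order": order_agent,      # cancellations, shipping, status
    "refund": payments_agent,  # payments, refunds, billing
}

def supervisor(request: str) -> str:
    # Plan: decide which specialists the request needs
    tasks = [fn for keyword, fn in SPECIALISTS.items() if keyword in request.lower()]
    # Delegate: each specialist handles its part
    results = [fn(request) for fn in tasks]
    # Synthesize: combine the partial results into one answer
    return " ".join(results)

print(supervisor("Cancel my order #1234 and refund to my original payment method"))
```

The real system replaces the keyword check with LLM intent analysis and the dictionary with the collaborator associations you define in Terraform below.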
🧠 Two Collaboration Modes
| Mode | How It Works | Best For |
|---|---|---|
| Supervisor | Analyzes every request, breaks down complex tasks, coordinates multiple agents | Complex multi-step workflows |
| Supervisor with Routing | Routes simple requests directly to a specialist; falls back to full supervisor mode for complex queries | Mixed simple/complex traffic |
Supervisor with routing (set `agent_collaboration = "SUPERVISOR_ROUTER"`) is more cost-effective for production traffic where most requests are straightforward: the router sends simple queries straight to a specialist, skipping full orchestration and reducing latency and token usage.
🔧 Terraform: Build the Team
Step 1: Create Specialist (Collaborator) Agents
Each collaborator is a standalone agent with its own model, instructions, and tools:
```hcl
# agents/collaborators.tf

variable "collaborators" {
  description = "Map of specialist agents"
  type = map(object({
    instruction               = string
    collaboration_instruction = string
    model_id                  = string
  }))
}

resource "aws_bedrockagent_agent" "collaborator" {
  for_each = var.collaborators

  agent_name                  = "${var.environment}-${each.key}"
  agent_resource_role_arn     = aws_iam_role.agent.arn
  foundation_model            = each.value.model_id
  instruction                 = each.value.instruction
  prepare_agent               = true
  idle_session_ttl_in_seconds = var.idle_session_ttl
}

resource "aws_bedrockagent_agent_alias" "collaborator" {
  for_each = var.collaborators

  agent_alias_name = "${var.environment}-live"
  agent_id         = aws_bedrockagent_agent.collaborator[each.key].agent_id
}
```

Collaborators don't need a special `agent_collaboration` value — the API accepts `SUPERVISOR`, `SUPERVISOR_ROUTER`, and the default `DISABLED`, and an agent becomes a collaborator simply by being associated with a supervisor in Step 3. Each collaborator can have its own action groups, knowledge bases, and guardrails.
Step 2: Create the Supervisor Agent
```hcl
# agents/supervisor.tf

resource "aws_bedrockagent_agent" "supervisor" {
  agent_name                  = "${var.environment}-supervisor"
  agent_resource_role_arn     = aws_iam_role.supervisor.arn
  foundation_model            = var.supervisor_model.id
  instruction                 = var.supervisor_instruction
  agent_collaboration         = "SUPERVISOR"
  prepare_agent               = false # Prepare AFTER collaborators are associated
  idle_session_ttl_in_seconds = var.idle_session_ttl
}
```
prepare_agent = false is critical for supervisors. A supervisor agent cannot be prepared until collaborators are associated. If you set this to true, Terraform fails because the supervisor has no team members yet.
Step 3: Associate Collaborators with Supervisor
```hcl
# agents/associations.tf

resource "aws_bedrockagent_agent_collaborator" "this" {
  for_each = var.collaborators

  agent_id                   = aws_bedrockagent_agent.supervisor.agent_id
  collaboration_instruction  = each.value.collaboration_instruction
  collaborator_name          = each.key
  relay_conversation_history = "TO_COLLABORATOR"

  agent_descriptor {
    alias_arn = aws_bedrockagent_agent_alias.collaborator[each.key].agent_alias_arn
  }

  depends_on = [aws_bedrockagent_agent.supervisor]
}
```
`collaboration_instruction` tells the supervisor when to use each collaborator. This is the routing logic. Be specific: "Handle all questions about order status, cancellations, and shipping. Do not handle payment or refund questions."

`relay_conversation_history = "TO_COLLABORATOR"` shares the conversation history so collaborators have full context when handling their part.
Step 4: Prepare the Supervisor
After all collaborators are associated, prepare the supervisor:
```hcl
# agents/prepare.tf

resource "null_resource" "prepare_supervisor" {
  triggers = {
    collaborators = jsonencode(keys(var.collaborators))
    supervisor    = aws_bedrockagent_agent.supervisor.agent_id
  }

  provisioner "local-exec" {
    command = <<EOF
aws bedrock-agent prepare-agent \
  --agent-id ${aws_bedrockagent_agent.supervisor.agent_id} \
  --region ${var.region}
EOF
  }

  depends_on = [aws_bedrockagent_agent_collaborator.this]
}

resource "time_sleep" "wait_for_preparation" {
  depends_on      = [null_resource.prepare_supervisor]
  create_duration = "15s"
}

resource "aws_bedrockagent_agent_alias" "supervisor" {
  agent_alias_name = "${var.environment}-live"
  agent_id         = aws_bedrockagent_agent.supervisor.agent_id

  depends_on = [time_sleep.wait_for_preparation]
}
```
The time_sleep gives the supervisor time to reach the PREPARED state before creating the alias. In production, consider polling the agent status instead of a fixed delay.
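A minimal polling sketch in Python: `get_agent` and the `agent.agentStatus` field are the real `bedrock-agent` API shapes, but the exact set of transient statuses below is an assumption to verify against the docs. The client is passed in so the function works with any `boto3.client("bedrock-agent")`.

```python
import time

def wait_for_prepared(client, agent_id: str, timeout: int = 120, interval: int = 5) -> str:
    """Poll until the agent leaves transient states; return the final status.

    `client` is expected to behave like boto3.client("bedrock-agent"),
    whose get_agent response contains agent.agentStatus.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = client.get_agent(agentId=agent_id)["agent"]["agentStatus"]
        # Assumed transient statuses -- check the API reference for your region
        if status not in ("CREATING", "PREPARING", "UPDATING"):
            return status  # e.g. PREPARED, NOT_PREPARED, or FAILED
        time.sleep(interval)
    raise TimeoutError(f"Agent {agent_id} still preparing after {timeout}s")
```

You could call this from a script invoked by `local-exec` in place of the `time_sleep` resource, or run it in CI before creating the alias.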
📝 Environment Configuration
```hcl
# environments/dev.tfvars

supervisor_model = {
  id      = "anthropic.claude-sonnet-4-20250514-v1:0"
  display = "Claude Sonnet 4"
}

supervisor_instruction = <<-EOT
  You are a customer support supervisor. Analyze each request
  and delegate to the appropriate specialist. For complex requests
  that span multiple domains, coordinate between specialists and
  combine their results into a clear response.
EOT

collaborators = {
  order-agent = {
    model_id                  = "anthropic.claude-sonnet-4-20250514-v1:0"
    instruction               = "You handle order lookups, cancellations, and shipping status. You have access to the orders database via action groups."
    collaboration_instruction = "Handle all questions about order status, cancellations, returns, and shipping. Do not handle payment or billing questions."
  }
  payments-agent = {
    model_id                  = "anthropic.claude-3-5-haiku-20241022-v1:0"
    instruction               = "You handle payment processing, refunds, and billing inquiries. You have access to the payments API via action groups."
    collaboration_instruction = "Handle all questions about payments, refunds, billing, and payment methods. Do not handle order status questions."
  }
  policy-agent = {
    model_id                  = "anthropic.claude-3-5-haiku-20241022-v1:0"
    instruction               = "You answer questions about company policies, warranty terms, and return policies based on the knowledge base."
    collaboration_instruction = "Handle all questions about company policies, warranty, and return rules. Refer to the knowledge base for accurate answers."
  }
}
```
Model selection per agent. Use a powerful model (Sonnet) for the supervisor that needs to plan and coordinate. Use a faster, cheaper model (Haiku) for simple specialist agents that execute focused tasks. This optimizes cost without sacrificing quality.
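A back-of-the-envelope sketch of why this matters at scale. The per-token prices below are hypothetical placeholders, not published Bedrock pricing — plug in the current prices for your region and models:

```python
# Hypothetical per-1K-input-token prices -- NOT real Bedrock pricing.
# Substitute the published prices for your region/models.
PRICE_PER_1K_INPUT = {"sonnet": 0.003, "haiku": 0.0008}

def monthly_input_cost(requests: int, tokens_per_request: int, model: str) -> float:
    """Input-token cost for a month of traffic, in dollars."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1000 * PRICE_PER_1K_INPUT[model]

# 1M specialist invocations/month at ~2K input tokens each
sonnet = monthly_input_cost(1_000_000, 2000, "sonnet")
haiku = monthly_input_cost(1_000_000, 2000, "haiku")
print(f"Sonnet: ${sonnet:,.0f}/mo  Haiku: ${haiku:,.0f}/mo")
```

Even with made-up numbers, the shape of the result is the point: specialist traffic dominates volume, so routing it to a cheaper model is where the savings are.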
🧪 Invoke the Multi-Agent System
From your application, you invoke the supervisor. It handles delegation internally:
```python
import boto3

client = boto3.client("bedrock-agent-runtime")

# Invoke the supervisor; it delegates to specialists internally
response = client.invoke_agent(
    agentId=supervisor_agent_id,
    agentAliasId=supervisor_alias_id,
    sessionId="user-session-456",
    inputText="Cancel my order #1234 and refund to my original payment method",
)

# The response is an event stream; print text chunks as they arrive
for event in response["completion"]:
    if "chunk" in event:
        print(event["chunk"]["bytes"].decode(), end="")
```
Your application calls one endpoint. The supervisor decomposes the request, delegates to the order agent and payments agent, gathers results, and returns a unified response. Zero orchestration code in your application.
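To watch the delegation happen, pass `enableTrace=True` to `invoke_agent` and inspect trace events alongside the text chunks. A sketch — the `chunk` and `trace` event keys are the documented stream shapes, but the fields inside each trace vary by orchestration step, so log the raw payload while exploring:

```python
def consume_completion(events):
    """Split an invoke_agent event stream into answer text and raw trace parts.

    Each event is a dict with either a "chunk" (response bytes) or a
    "trace" (orchestration detail, present when enableTrace=True).
    """
    text_parts, traces = [], []
    for event in events:
        if "chunk" in event:
            text_parts.append(event["chunk"]["bytes"].decode())
        elif "trace" in event:
            traces.append(event["trace"])  # inspect to see which collaborator ran
    return "".join(text_parts), traces

# Usage against a real stream (requires AWS credentials):
# response = client.invoke_agent(..., enableTrace=True)
# answer, traces = consume_completion(response["completion"])
```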
⚠️ Gotchas and Tips
Prompt quality is everything. The most common failure mode is the supervisor trying to handle everything itself instead of delegating. Write clear supervisor instructions that explicitly say "delegate to specialists" and clear collaboration instructions that define non-overlapping responsibilities.
Sequential collaborator association. When adding multiple collaborators, each association triggers a preparation step. Adding them too fast can fail because the agent is in a PREPARING state. The depends_on chain in the Terraform handles this, but be aware of it for manual operations.
Mixed-model teams. Use expensive models for complex reasoning (supervisor, planning agents) and cheaper models for straightforward tasks (lookup agents, policy agents). The cost difference is significant at scale.
Conversation history sharing. Enable relay_conversation_history for collaborators that need context from previous turns. Disable it for stateless lookup agents where the context adds unnecessary tokens.
Adding new specialists. To add a new collaborator, add an entry to the collaborators variable and run terraform apply. The supervisor is re-prepared automatically and can immediately delegate to the new agent.
➡️ What's Next
This is Post 3 of the AWS AI Agents with Terraform series.
- Post 1: Deploy First Bedrock Agent 🤖
- Post 2: Action Groups - Connect to APIs 🔌
- Post 3: Multi-Agent Orchestration (you are here) 🧠
- Post 4: Agent + Knowledge Base Grounding
Your single agent is now a team. A supervisor that plans, specialists that execute, and Terraform that makes the whole team a variable change away from a new specialist. Scale your AI by scaling your team. 🧠
Found this helpful? Follow for the full AI Agents with Terraform series! 💬