Mohsin Sheikhani

Posted on Sep 29

Building Production Multi-Agent Systems: My Experience with Amazon Bedrock AgentCore and AWS Strands Agents

#aws #agentaichallenge #bedrock #ai

The 2 AM Hotel Booking Nightmare

Picture this: It's 2 AM, you're frantically booking a hotel room for a business trip that got moved up by three days. You find the perfect room, book it, then realize six hours later that your flight got cancelled.
Now you need to cancel the booking, but wait, what's the cancellation policy? Will there be fees? Can you modify instead of cancel? And why is customer service only available during business hours when your life operates on chaos-time?

If this sounds familiar, you've experienced the fundamental problem with traditional hotel booking systems: they treat complex, multi-step workflows as isolated transactions. But real life isn't transactional; it's conversational, contextual, and full of "what-ifs."

This is the story of how I built an AI system that doesn't just book hotels, it thinks like a seasoned concierge who remembers your preferences, understands hotel policies, and can handle the messy reality of travel planning.

Why Traditional Hotel Booking Systems Miss the Mark

Most hotel booking platforms treat you like you're ordering a pizza: pick your toppings (dates, location, price range), pay, and done. But hotel booking isn't pizza ordering, it's relationship management.

Think about what a great hotel concierge does:

Understands complex policies and explains them in plain English
Handles changes gracefully without making you start over
Coordinates multiple services (booking, modifications, cancellations, notifications)
Remembers your preferences from previous stays
Provides proactive guidance based on your specific situation

Traditional systems fail because they're built around databases and forms, not conversations and context. They force users to navigate separate interfaces for search, booking, modification, and support, each requiring you to re-explain your situation.

The result? Frustrated customers, abandoned bookings, and support teams drowning in "simple" requests that require human intelligence to resolve.

The Multi-Agent Vision

What if instead of building another booking form, I built a team of AI specialists that work together like a hotel's back-office staff?

Search Specialist who knows every hotel's availability and can compare options intelligently
Booking Specialist who understands reservation lifecycles and policy implications
Policy Advisor who can explain complex terms and check compliance before actions
Communications Specialist who handles confirmations and keeps everyone informed
Supervisor who orchestrates the team and maintains conversation context

This isn't just a technical architecture, it's a business model that scales human-like service.

Enter Amazon Bedrock AgentCore: The Foundation for Intelligent Orchestration

Here's where the vision meets reality. Amazon Bedrock AgentCore isn't just another AI service, it's the infrastructure that makes multi-agent systems production-ready.

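The challenge with building agent teams isn't creating individual agents (that's the easy part). The real complexity lies in keeping context across sessions, giving agents secure access to external tools, and running everything reliably in production.

AgentCore addresses these with three key services that work together: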

Memory Service: Persistent conversation context that survives sessions. Your booking agent remembers you mentioned you prefer ground floor rooms, even if you come back next week.

Gateway Service: Secure, managed access to external tools and APIs. No more wrestling with authentication, rate limiting, or connection management.

Runtime Service: Scalable agent execution with built-in observability. Your agents run reliably in production without you managing infrastructure.

But here's what makes it powerful: these services are designed to work together. Memory informs decision-making, Gateway enables action-taking, and Runtime orchestrates it all.

The Architecture That Changes Everything

Instead of building a monolithic booking system, I created a supervisor-agent architecture where specialized agents collaborate through AgentCore's components.

The supervisor doesn't just route requests: it maintains context, orchestrates multi-step workflows, and ensures policy compliance before any action is taken.
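To make the supervisor's role concrete, here is a minimal, framework-free sketch of the pattern. Everything in it is a hypothetical stand-in (the `Supervisor` class, the `Specialist` type, the keyword-based `classify` method): the real system lets a model decide the route and talks to remote agents, so treat this only as an illustration of routing plus shared context.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A specialist is modelled as a plain function: (request, context) -> reply.
Specialist = Callable[[str, Dict[str, str]], str]


@dataclass
class Supervisor:
    """Toy supervisor: picks a specialist and keeps shared conversation context."""
    specialists: Dict[str, Specialist]
    context: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def classify(self, request: str) -> str:
        # Hypothetical keyword routing; a real supervisor would let the model decide.
        text = request.lower()
        if "cancel" in text or "book" in text:
            return "reservation"
        if "policy" in text or "fee" in text:
            return "advisory"
        return "search"

    def handle(self, request: str) -> str:
        route = self.classify(request)
        self.history.append(route)          # remember which specialist was used
        reply = self.specialists[route](request, self.context)
        self.context["last_route"] = route  # context survives across requests
        return reply


def search(request: str, ctx: Dict[str, str]) -> str:
    return f"search results for: {request}"


def reservation(request: str, ctx: Dict[str, str]) -> str:
    return f"reservation handled (previous step: {ctx.get('last_route', 'none')})"


def advisory(request: str, ctx: Dict[str, str]) -> str:
    return "policy summary"


supervisor = Supervisor({"search": search, "reservation": reservation, "advisory": advisory})
first = supervisor.handle("Find hotels in Berlin")
second = supervisor.handle("Book the second one")
```

The second request only makes sense because the supervisor carried context forward from the first, which is the part a plain request router would lose.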

Building the Agent Team: From Search to Confirmation

Each agent in the system is a specialist, but they're not working in isolation. Here's how the team comes together:

The Search & Discovery Agent

Role: The research specialist who knows every hotel's availability and pricing.

Link To Repo

This agent doesn't just return search results, it understands context. When you ask for "hotels near the conference center," it factors in your previous preferences, compares pricing across dates, and highlights amenities that matter to business travelers.

class SearchDiscoveryAgent(BaseAgent):
    """Agent responsible for hotel search and discovery operations"""

    def __init__(self):
        super().__init__(port="9001")

    def get_agent_name(self) -> str:
        return "SearchDiscoveryAgent"

    def get_agent_description(self) -> str:
        return "Handles hotel search, availability checking, and price comparisons using AgentCore Lambda tools"

The Reservation Agent

Role: The booking specialist who manages the entire reservation lifecycle.

Link To Repo

But here's where it gets interesting, this agent is policy-aware. Before making any booking, modification, or cancellation, it checks with the Guest Advisory Agent to understand implications.

class ReservationAgent(BaseAgent):
    """Agent responsible for managing hotel reservations"""

    def __init__(self):
        super().__init__(port="9002")

    def get_agent_name(self) -> str:
        return "ReservationAgent"

    def get_agent_description(self) -> str:
        return "Handles the full lifecycle of hotel room reservations, including booking, updating, canceling, and fetching past reservations for a guest."

The Guest Advisory Agent

Role: The policy expert who bridges business rules and user actions.

Link To Repo

This agent has access to a knowledge base of hotel policies, cancellation terms, and booking rules. It doesn't just recite policies—it explains them in context and calculates real implications.

class GuestAdvisoryAgent(BaseAgent):
    """Agent responsible for providing hotel policies and advisory information"""

    def __init__(self):
        super().__init__(port="9003")

    def get_agent_name(self) -> str:
        return "GuestAdvisoryAgent"

    def get_agent_description(self) -> str:
        return "Provides hotel policies, rules, and advisory information including cancellation policies, check-in/out procedures, and general hotel guidelines."

The Notification Agent

Role: The communications specialist who keeps everyone informed.

Link To Repo

Every booking, modification, or cancellation triggers appropriate notifications. But it's not just email templates, it's contextual communication that includes next steps and relevant details.

class NotificationAgent(BaseAgent):
    """Agent responsible for handling booking notifications and communications"""

    def __init__(self):
        super().__init__(port="9004")

    def get_agent_name(self) -> str:
        return "NotificationAgent"

    def get_agent_description(self) -> str:
        return "Handles booking confirmations, modifications, cancellations, and other communication needs for hotel reservations."
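All four agents subclass a `BaseAgent` that isn't shown in this post. As a rough sketch, assuming it is an abstract base class that stores the port and requires the two getters, it could look like the following. The `host` argument and the `describe` helper are my own illustrative additions, not taken from the repo.

```python
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Hypothetical shape of the shared base class; the real one lives in the repo."""

    def __init__(self, port: str, host: str = "localhost"):
        self.host = host
        self.port = port

    @abstractmethod
    def get_agent_name(self) -> str:
        """Unique name other agents use to address this one."""

    @abstractmethod
    def get_agent_description(self) -> str:
        """Human-readable summary a supervisor can use when choosing an agent."""

    def describe(self) -> dict:
        # Illustrative helper (not from the post): metadata a supervisor could discover.
        return {
            "name": self.get_agent_name(),
            "description": self.get_agent_description(),
            "url": f"http://{self.host}:{self.port}",
        }


class NotificationAgent(BaseAgent):
    def __init__(self):
        super().__init__(port="9004")

    def get_agent_name(self) -> str:
        return "NotificationAgent"

    def get_agent_description(self) -> str:
        return "Handles booking confirmations, modifications, and cancellations."


card = NotificationAgent().describe()
```

Making the getters abstract means a new specialist cannot be instantiated until it declares its name and description, which keeps the team's metadata complete.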

The Supervisor's Secret Weapon: Memory

Here's what makes this system truly intelligent—the supervisor agent uses AgentCore's Memory service to maintain conversation context across sessions.

Link To Repo

class MemoryManager:
    """Manages Bedrock AgentCore memory operations"""

    def __init__(self, region_name: str, memory_name: str):
        self.client = MemoryClient(region_name=region_name)
        self.memory_name = memory_name
        self.memory_id = None

    def initialize_memory(self) -> str:
        """Initialize or retrieve existing memory"""

class MemoryHookProvider(HookProvider):
    """Provides memory hooks for agent lifecycle events"""

    def __init__(self, memory_client: MemoryClient, memory_id: str):
        self.memory_client = memory_client
        self.memory_id = memory_id

    def on_agent_initialized(self, event: AgentInitializedEvent):
        """Load recent conversation history when agent starts"""
        try:
            ...
        except Exception as e:
            logger.error(f"Memory load error: {e}")

    def on_message_added(self, event: MessageAddedEvent):
        """Store messages in memory"""
        try:
            ...
        except Exception as e:
            logger.error(f"Memory save error: {e}")

    def register_hooks(self, registry: HookRegistry):
        """Register memory hooks"""
        registry.add_callback(MessageAddedEvent, self.on_message_added)
        registry.add_callback(AgentInitializedEvent, self.on_agent_initialized)

# Conversation context persists across sessions
# Agents remember preferences, previous bookings, ongoing requests

When you return to continue a booking conversation from yesterday, the system doesn't just remember what you said—it remembers the context, your preferences, and where you left off in the process.

The Magic of Agent Orchestration: How Complex Workflows Become Simple Conversations

Here's where the supervisor agent earns its keep. It doesn't just route requests, it orchestrates intelligent workflows that would normally require multiple customer service interactions.

A Real Workflow in Action

User: "I need to cancel my booking for next week, but I'm worried about fees."

Traditional System: Navigate to "My Bookings" → Find booking → Click cancel → Read policy wall of text → Call customer service for clarification → Wait on hold → Explain situation → Get transferred → Explain again...

Our Multi-Agent System:

Supervisor identifies this as a policy-sensitive cancellation request
Guest Advisory Agent retrieves specific cancellation policy for this booking
Reservation Agent calculates exact fees and alternatives
Supervisor presents clear options: "Cancelling now incurs a $50 fee, but modifying dates is free until tomorrow"
User chooses, Reservation Agent executes, Notification Agent confirms

All in one conversation. No navigation, no transfers, no re-explaining.
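The core rule in that workflow is "check the policy, show the consequences, act only after confirmation." Here is a small sketch of that rule in plain Python. The `CancellationPolicy` fields and the thresholds are hypothetical; only the $50 fee comes from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CancellationPolicy:
    """Hypothetical policy record; real policies come from the knowledge base."""
    free_cancel_hours: int   # cancelling is free if check-in is at least this far away
    late_fee: float          # flat fee charged inside that window
    free_modify_hours: int   # date changes are free if check-in is at least this far away


def cancellation_options(policy: CancellationPolicy, check_in: datetime, now: datetime) -> dict:
    """Advisory step: work out what each choice would cost right now."""
    hours_left = (check_in - now).total_seconds() / 3600
    return {
        "cancel_fee": 0.0 if hours_left >= policy.free_cancel_hours else policy.late_fee,
        "modify_is_free": hours_left >= policy.free_modify_hours,
    }


def cancel_booking(options: dict, confirmed: bool) -> str:
    """Reservation step: refuses to act without explicit user confirmation."""
    if not confirmed:
        return f"Cancelling incurs a ${options['cancel_fee']:.0f} fee. Proceed?"
    return f"Cancelled. Fee charged: ${options['cancel_fee']:.0f}"


policy = CancellationPolicy(free_cancel_hours=168, late_fee=50.0, free_modify_hours=72)
now = datetime(2025, 9, 29, 2, 0)
options = cancellation_options(policy, check_in=now + timedelta(days=5), now=now)
prompt = cancel_booking(options, confirmed=False)
result = cancel_booking(options, confirmed=True)
```

Keeping the fee calculation separate from the action means the supervisor can present options without any risk of triggering the cancellation itself.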

The Orchestration Code That Makes It Possible

# Supervisor's intelligent routing with context awareness
system_prompt = f"""
You are the Supervisor Agent for a multi-agent hotel booking system.

Policy-aware behavior:
- Before performing any booking, modification, or cancellation, check relevant hotel policies (using the Guest Advisory Agent or Knowledge Base) to determine if there are penalties, restrictions, or special conditions.
- Present the user with a summary of what will happen and the relevant policy (e.g., "Cancelling within 24 hours will incur a 20% fee").
- Ask for explicit confirmation before taking action.
"""

# The supervisor maintains context through AgentCore Memory
agent = Agent(
    model=bedrock_model,
    tools=provider.tools, # A2A communication tools
    system_prompt=self._get_system_prompt(),
    hooks=[MemoryHookProvider(memory_manager.client, memory_id)],
    state={
        "actor_id": self.settings.actor_id, 
        "session_id": self.settings.session_id
    },
)

Link To Repo

Why This Architecture Scales

Each agent runs independently on its own port and communicates with the others over the Agent-to-Agent (A2A) protocol. This means:

Fault isolation: One agent's issues don't cascade
Independent scaling: High-demand agents can scale separately
Easy updates: Modify one agent without touching others
Clear responsibilities: Each agent has a single, well-defined purpose

Production Reality: Memory, Monitoring, and the Details That Matter

Building a demo is one thing. Building something that handles real customer conversations at 3 AM when your infrastructure is under load? That's where AgentCore's production features become essential.

Memory That Actually Works

The breakthrough isn't just that the system remembers, it's how it remembers intelligently.

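# Memory hooks that maintain context across sessions
class MemoryHookProvider(HookProvider):
    def on_agent_initialized(self, event: AgentInitializedEvent):
        # Session identifiers come from the state passed to Agent(...)
        actor_id = event.agent.state.get("actor_id")
        session_id = event.agent.state.get("session_id")
        # Load last 10 conversation turns when agent starts
        recent_turns = self.memory_client.get_last_k_turns(
            memory_id=self.memory_id,
            actor_id=actor_id,
            session_id=session_id,
            k=10
        )
        # Flatten the turns into "Role: text" lines (message shape assumed)
        context = "\n".join(
            f"{m['role'].title()}: {m['content']['text']}"
            for turn in recent_turns for m in turn
        )
        # Inject context into agent's system prompt
        event.agent.system_prompt += f"\n\nRecent conversation:\n{context}"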

When a customer returns after a week to modify their booking, the system doesn't just remember the booking ID, it remembers they mentioned preferring ground floor rooms, that they're traveling for a conference, and that they were concerned about cancellation policies.
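If you want to try the "load the last k messages, inject them into the prompt" idea without an AWS account, a local stand-in is enough. The `InMemoryStore` class and `build_system_prompt` function below are hypothetical and only mimic the shape of the pattern. They count individual messages rather than conversation turns and keep nothing on disk.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Message = Dict[str, str]  # {"role": ..., "text": ...}


class InMemoryStore:
    """Hypothetical local stand-in for a managed memory service."""

    def __init__(self) -> None:
        # Messages are keyed by (actor_id, session_id), oldest first.
        self._events: Dict[Tuple[str, str], List[Message]] = defaultdict(list)

    def save(self, actor_id: str, session_id: str, role: str, text: str) -> None:
        self._events[(actor_id, session_id)].append({"role": role, "text": text})

    def last_k(self, actor_id: str, session_id: str, k: int) -> List[Message]:
        return self._events[(actor_id, session_id)][-k:]


def build_system_prompt(base: str, store: InMemoryStore, actor_id: str,
                        session_id: str, k: int = 10) -> str:
    """Append recent history to the base prompt, mirroring the hook above."""
    recent = store.last_k(actor_id, session_id, k)
    if not recent:
        return base
    lines = "\n".join(f"{m['role'].title()}: {m['text']}" for m in recent)
    return f"{base}\n\nRecent conversation:\n{lines}"


store = InMemoryStore()
store.save("guest-1", "trip-42", "user", "I prefer ground floor rooms.")
store.save("guest-1", "trip-42", "assistant", "Noted, ground floor it is.")

prompt = build_system_prompt("You are the Supervisor Agent.", store, "guest-1", "trip-42")
```

Capping the history at `k` entries keeps the prompt bounded, which matters once conversations run for weeks.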

Error Handling That Prevents Disasters

In production, things fail. APIs time out, agents crash, networks hiccup. The difference between a good system and a great one is graceful degradation:

Link To Repo

# Supervisor logs agent failures and re-raises so the caller can decide how to degrade
async def process_request(self, question: str):
    """Process a user request through the supervisor agent"""
    try:
        logger.info(f"Processing request: {question}")
        response = await self.agent.invoke_async(question)
        logger.info("Request processed successfully")
        return response
    except Exception as e:
        logger.error(f"Error processing request: {e}")
        raise
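The snippet above logs the error and re-raises it, which leaves the actual degradation to the caller. One way a caller might handle that is a timeout, a bounded retry, and a safe fallback message. This is a sketch under assumptions: `call_with_fallback`, the fake agents, and the fallback wording are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

FALLBACK = ("I can't reach the booking service right now. "
            "Your request was not processed; please try again shortly.")


async def call_with_fallback(
    call_agent: Callable[[str], Awaitable[str]],
    question: str,
    timeout_s: float = 2.0,
    retries: int = 1,
) -> str:
    """Try the agent, retry on failure, then degrade to a safe message."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(call_agent(question), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            # Back off briefly before the next attempt (skipped after the last one).
            if attempt < retries:
                await asyncio.sleep(0.01 * (attempt + 1))
    return FALLBACK


async def slow_agent(question: str) -> str:
    await asyncio.sleep(1.0)  # simulates a hung downstream agent
    return "too late"


async def healthy_agent(question: str) -> str:
    return f"handled: {question}"


ok = asyncio.run(call_with_fallback(healthy_agent, "cancel my booking"))
degraded = asyncio.run(call_with_fallback(slow_agent, "cancel my booking", timeout_s=0.05))
```

Automatic retries are only safe for idempotent steps such as searches and policy lookups. A booking or cancellation should not be retried blindly, because the first attempt may have succeeded before the timeout fired.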

Observability That Tells the Real Story

AgentCore's built-in observability means you can see exactly what's happening:

Which agent handled each request
How long each step took
Where conversations get stuck
Which policies cause the most confusion

This isn't just monitoring, it's business intelligence about how customers actually interact with your system.
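To show what "which agent handled each request" and "how long each step took" look like as data, here is a tiny in-process recorder. It is a hypothetical illustration, not AgentCore's observability API: `StepTimer` and its record fields are names I made up.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class StepTimer:
    """Hypothetical in-process recorder; a managed service would export traces instead."""

    def __init__(self) -> None:
        self.records: List[Dict[str, object]] = []

    @contextmanager
    def step(self, agent: str, action: str) -> Iterator[None]:
        start = time.perf_counter()
        status = "ok"
        try:
            yield
        except Exception:
            status = "error"
            raise
        finally:
            # Record the step whether it succeeded or failed.
            self.records.append({
                "agent": agent,
                "action": action,
                "status": status,
                "seconds": time.perf_counter() - start,
            })

    def slowest(self) -> Dict[str, object]:
        return max(self.records, key=lambda r: r["seconds"])


timer = StepTimer()
with timer.step("GuestAdvisoryAgent", "lookup_policy"):
    time.sleep(0.02)  # stands in for a knowledge base lookup
with timer.step("ReservationAgent", "calculate_fee"):
    pass

agents_seen = [r["agent"] for r in timer.records]
```

Once every step is recorded with an agent name and a duration, questions like "where do conversations get stuck" become simple queries over the records.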

The Authentication Reality

Real systems need real security. AgentCore handles the complexity:

Link To Repo

# Cognito integration for secure gateway access
class TokenManager:
    def get_fresh_token(self):
        # Client credentials flow with proper scoping
        # Automatic token refresh and error handling
        # Avoids the pitfalls of hand-rolled auth
        ...  # implementation elided in this excerpt

Business Impact: What This Actually Means for Hotels and Customers

The technical architecture is impressive, but let's talk about what really matters, business outcomes.

For Hotels: Operational Efficiency at Scale

Before: Customer calls about cancellation policy → Agent looks up booking → Checks policy document → Calculates fees → Explains options → Customer decides → Agent processes → Sends confirmation.
Average handling time: 8-12 minutes.

After: Customer asks about cancellation → System instantly knows booking details, policy implications, and alternatives → Presents clear options → Customer decides → Action executed automatically.
Average handling time: 1-2 minutes.

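That's not just efficiency; that's at least a 4x capacity increase with the same support team, even when you compare the fastest "before" call (8 minutes) with the slowest "after" one (2 minutes).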
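The 4x figure is the conservative end of what those handling times imply. A quick calculation, using only the numbers quoted above, shows the full range:

```python
# Handling times from the before/after comparison, in minutes.
before_min, before_max = 8, 12
after_min, after_max = 1, 2

# Capacity scales with the ratio of old to new handling time.
worst_case = before_min / after_max   # fastest "before" vs slowest "after"
best_case = before_max / after_min    # slowest "before" vs fastest "after"
midpoint = ((before_min + before_max) / 2) / ((after_min + after_max) / 2)

print(f"capacity multiplier: {worst_case:.0f}x to {best_case:.0f}x (midpoint {midpoint:.1f}x)")
# prints "capacity multiplier: 4x to 12x (midpoint 6.7x)"
```

These are estimates from my own comparison rather than measured production figures, so the lower bound is the safer number to plan around.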

For Customers: Intelligence That Feels Personal

The system doesn't just remember your booking, it remembers your story:

"I see you're traveling for the same conference as last year. Would you like the same room type?"
"Based on your previous concern about cancellation fees, I've found options with flexible policies."
"Your flight was delayed last time, should I book a late check-in for this trip?"

This isn't marketing personalization, it's operational intelligence that makes every interaction feel like talking to someone who actually knows you.

The Competitive Advantage

While competitors are still building better search interfaces, this system is solving the real problem: complex travel decisions require conversation, not just transactions.

Hotels using this approach can offer:

24/7 intelligent support without 24/7 staffing costs
Proactive policy guidance that prevents booking mistakes
Seamless modification workflows that retain customers instead of losing them
Contextual upselling based on actual preferences, not generic algorithms

ROI That Actually Matters

Support cost reduction: 4x or better efficiency gain on routine inquiries
Booking completion rates: Fewer abandoned bookings due to policy confusion
Customer retention: Seamless modification experience vs. starting over elsewhere
Upselling effectiveness: Context-aware recommendations vs. generic offers

The Future of Conversational Commerce: Where This Goes Next

This hotel booking system is just the beginning. The patterns we've established, supervisor orchestration, policy-aware workflows, persistent memory, apply to any complex business process that currently requires human intervention.

Beyond Hotel Booking

Imagine applying this architecture to:

Insurance claims processing with policy specialists and damage assessors
Financial planning with investment advisors and compliance experts
Healthcare coordination with specialists who understand your medical history
Enterprise procurement with budget analysts and vendor specialists

Each domain gets a team of AI specialists who collaborate intelligently, remember context, and handle complexity gracefully.

The AgentCore Foundation: Why This Changes Everything

Amazon Bedrock AgentCore solved the infrastructure problems that kill most multi-agent projects. Here's how each component transformed this hotel booking system from concept to reality.

AgentCore Gateway: Your Lambda Functions Become Agent Tools

The breakthrough moment came when I realized I could keep all my business logic in familiar AWS Lambda functions while making them accessible to agents through the Model Context Protocol (MCP).

The Problem: Agents need access to real business systems, hotel inventory databases, booking APIs, policy engines. Traditionally, this means complex authentication, API management, and security concerns.

AgentCore Gateway's Solution: Once registered as Gateway targets with a tool schema, your Lambda functions become agent tools.

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  console.log("Received event:", JSON.stringify(event, null, 2));

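  try {
    // `city` is read from the incoming event; that extraction is elided here
    let results;
    if (city) {
      const queryResp = await dynamo.send(
        new QueryCommand({
          TableName: tableName,
          KeyConditionExpression: ...,
          ExpressionAttributeValues: {
            ...
          },
        })
      );
      results = queryResp.Items;
    } else {
      const scanResp = await dynamo.send(
        new ScanCommand({
          TableName: tableName,
        })
      );
      results = scanResp.Items;
    }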

    return {
      statusCode: 200,
      body: JSON.stringify({ results }),
    };
  } catch (error) {
    console.error("Error:", error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: "Internal Server Error" }),
    };
  }
};

The Gateway handles:

Authentication and authorization with fine-grained access control
Rate limiting and throttling to protect your backend systems
Request/response transformation between agent protocols and your APIs
MCP protocol abstraction so your Lambdas don't need to know about agents

This means your hotel search, booking, and policy engines remain standard AWS Lambda functions, but agents can invoke them as naturally as calling any other tool.
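What sits between an agent's tool call and a Lambda is essentially a contract: a tool name, a description, and a schema for the arguments. The sketch below shows that idea in plain Python. The tool name `search_hotels`, the field names in the definition, and the `validate_arguments` helper are all assumptions for illustration; check the AgentCore Gateway documentation for the exact format it expects.

```python
from typing import Any, Dict, List

# Hypothetical tool definition in JSON Schema style. Field names are assumed;
# the exact format Gateway expects should come from its documentation.
SEARCH_HOTELS_TOOL: Dict[str, Any] = {
    "name": "search_hotels",
    "description": "Search hotels, optionally filtered by city.",
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": [],
    },
}

_PY_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def validate_arguments(tool: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    schema = tool["inputSchema"]
    problems = [f"missing required field: {name}"
                for name in schema.get("required", []) if name not in arguments]
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown field: {name}")
        elif not isinstance(value, _PY_TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


ok = validate_arguments(SEARCH_HOTELS_TOOL, {"city": "Berlin"})
bad = validate_arguments(SEARCH_HOTELS_TOOL, {"city": 42, "stars": 5})
```

With a contract like this in place, the Lambda can stay a normal function that receives plain arguments and never needs to know an agent was involved.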

AgentCore Identity: Security That Scales

Multi-agent systems create complex security challenges. When the Supervisor Agent needs to call the Reservation Agent, which then calls the Guest Advisory Agent, how do you maintain security context across the entire chain?

Traditional Approach: Custom authentication between every agent pair, token management nightmares, and security vulnerabilities.

AgentCore Identity: Centralized identity management with automatic token handling.

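from typing import Optional

# Agents fetch a scoped token and refresh it as it expires
class TokenManager:
    # Assumption: the ID is passed in; the original may read it from settings
    def __init__(self, resource_server_id: str):
        self.resource_server_id = resource_server_id
        self.scope_string = f"{self.resource_server_id}/gateway:read {self.resource_server_id}/gateway:write"

    def get_fresh_token(self) -> Optional[str]:
        # AgentCore handles the OAuth2 client credentials flow
        # Automatic token refresh and scope validation
        ...  # implementation elided in this excerpt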

The Identity service provides:

OAuth2 client credentials flow with automatic token refresh
Fine-grained scopes for different agent capabilities
Centralized policy management across your entire agent ecosystem
Audit trails for every agent interaction
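"Automatic token refresh" boils down to caching a token and fetching a new one shortly before it expires. Here is a minimal sketch of that logic. It is not the repo's `TokenManager`: the `CachedTokenManager` class, the injected `fetch` callable, and the 60-second margin are hypothetical choices that keep the example runnable without a real identity provider.

```python
import time
from typing import Callable, Optional, Tuple

# A fetcher performs the real OAuth2 call and returns (access_token, expires_in_seconds).
TokenFetcher = Callable[[], Tuple[str, float]]


class CachedTokenManager:
    """Hypothetical cache: reuse a token until shortly before it expires."""

    def __init__(self, fetch: TokenFetcher, refresh_margin_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._margin = refresh_margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get_fresh_token(self) -> str:
        now = self._clock()
        # Refresh when there is no token yet or it is inside the safety margin.
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, lifetime = self._fetch()
            self._expires_at = now + lifetime
        return self._token


# Demo with a fake fetcher and a fake clock, so no network or waiting is needed.
calls = []
fake_now = [0.0]


def fake_fetch() -> Tuple[str, float]:
    calls.append(1)
    return f"token-{len(calls)}", 3600.0


manager = CachedTokenManager(fake_fetch, clock=lambda: fake_now[0])
t1 = manager.get_fresh_token()
t2 = manager.get_fresh_token()      # served from cache
fake_now[0] = 3550.0                # inside the 60 s margin before expiry
t3 = manager.get_fresh_token()      # triggers a refresh
```

Refreshing a little early avoids the case where a token is valid when fetched from the cache but expires before the downstream call completes.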

AgentCore Memory: Context That Persists and Scales

Here's where the magic happens. Traditional chatbots lose context between sessions. AgentCore Memory makes agents truly intelligent by maintaining conversation context that survives sessions, scales across millions of users, and enables sophisticated reasoning.

What This Enables:

Cross-session continuity: Customer returns next week, agent remembers their preferences
Multi-agent context sharing: Reservation Agent knows what Search Agent discovered
Intelligent reasoning: "Based on your previous concern about cancellation fees..."
Scalable storage: Millions of conversations with configurable retention policies

The Business Impact: Customers don't repeat themselves. Agents make contextual decisions. Conversations feel natural, not transactional.

AgentCore Runtime: Production-Ready Agent Execution

Running one agent in development is easy. Running a team of agents in production, handling failures gracefully, scaling under load, and maintaining performance—that's where most multi-agent projects die.

AgentCore Runtime's Production Features:

from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

# Agents that handle production realities
@app.entrypoint
async def send_message(request):
    """Main entry point for the hotel booking system"""
    try:
        question = request.get("question")
        if not question:
            return {"error": "No question provided"}

        response = await supervisor.process_request(question)
        return response.message["content"]
    except Exception as e:
        logger.error(f"Failed to process request: {str(e)}")
        return {"error": f"Failed to process request: {str(e)}"}

Runtime Capabilities:

Automatic scaling based on demand
Health monitoring with automatic recovery
Resource management to prevent runaway processes
Load balancing across agent instances
Graceful degradation when components fail

The Result: Your agents run reliably in production without you managing infrastructure complexity.

AgentCore Observability: Intelligence You Can See

The most sophisticated multi-agent system is useless if you can't understand what's happening inside it. AgentCore Observability provides deep insights into agent behavior, performance, and business impact.

What You Can See:

Agent interaction flows: Which agent handled each step of a complex workflow
Performance metrics: Response times, success rates, resource utilization
Business intelligence: Which policies cause confusion, where customers get stuck
Error patterns: Systematic issues before they become customer problems

The Business Value: You don't just run agents, you optimize them based on real usage patterns and customer behavior.

Why This Architecture Scales

Each AgentCore service solves a specific production challenge:

Gateway: Tool access without security complexity
Identity: Authentication without custom code
Memory: Context without storage management
Runtime: Reliability without infrastructure management
Observability: Insights without custom monitoring

Together, they create a foundation where you focus on business logic while AgentCore handles the production complexity that typically kills multi-agent projects.

Building Your Own Agent Team

Ready to experiment? The architecture is surprisingly approachable:

Start with the supervisor pattern - One agent that orchestrates others
Add memory integration - Context that persists across sessions
Build specialized agents - Each with a single, clear responsibility
Use A2A communication - Agents that collaborate, not compete
Deploy on AgentCore - Production-ready from day one

The complete implementation is available on GitHub, including deployment scripts and documentation for getting started with your own multi-agent system.

What complex workflow in your domain could benefit from conversational intelligence? The tools are ready, the question is what you'll build with them.

GitHub Repo

Multi-Agent Hotel Assistant

This project was built for "AWS AI Engineering Month: Building with Agents". The combination of Amazon Bedrock AgentCore's Memory, Gateway, and Runtime services with the Strands Agents framework creates a powerful foundation for production multi-agent systems.
