
Building Multi-Agent AWS Operations Intelligence with Amazon Bedrock AgentCore

We set out to build something that didn't exist yet - a multi-agent AI system that could automatically discover AWS environments and provide real operations intelligence across entire organizations. The goal wasn't just to make another monitoring tool, but to create agents that could actually understand AWS infrastructure and costs in ways that existing tools couldn't.

What We Built

The AWS Operations Command Center is a three-agent system built on Amazon Bedrock AgentCore. Each agent specializes in a different aspect of AWS operations, but they work together to provide unified intelligence about your infrastructure.

The Cost Intelligence Agent handles financial analysis across multiple AWS accounts. It doesn't just pull billing data - it separates actual usage costs from credits to show you what you're really consuming. Most AWS billing tools show you net costs after credits, which hides your actual usage patterns. Our agent discovered $255.22 in real usage across our three-account organization that was almost completely offset by $255.21 in credits, leaving a final bill of just $0.01.
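
To illustrate the idea (a simplified sketch, not the agent's actual code), Cost Explorer can group charges by the RECORD_TYPE dimension, which splits real usage from credits:

import boto3

# Sketch: separate real usage from credits with Cost Explorer.
# RECORD_TYPE groups line items into categories such as "Usage" and "Credit".
ce = boto3.client('ce', region_name='us-east-1')

response = ce.get_cost_and_usage(
    TimePeriod={'Start': '2025-06-01', 'End': '2025-07-01'},  # example dates
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'RECORD_TYPE'}]
)

for group in response['ResultsByTime'][0]['Groups']:
    record_type = group['Keys'][0]
    amount = float(group['Metrics']['UnblendedCost']['Amount'])
    print(f"{record_type}: ${amount:.2f}")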

The Operations Intelligence Agent scans resources across your entire AWS organization. It uses cross-account roles to inventory everything from EC2 instances to S3 buckets to RDS databases. In our testing, it found 90 resources across our accounts that we could then analyze for security issues, performance bottlenecks, and optimization opportunities.
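
A simplified sketch of what one inventory pass can look like, assuming a boto3 session with read-only access is already in hand (the function name and resource selection are ours for illustration):

def inventory_account(session, regions=('us-east-1',)):
    """Collect a few common resource types from one account."""
    resources = []
    for region in regions:
        # EC2 instances are regional
        ec2 = session.client('ec2', region_name=region)
        for reservation in ec2.describe_instances()['Reservations']:
            for instance in reservation['Instances']:
                resources.append(('ec2', region, instance['InstanceId']))
        # RDS databases are regional too
        rds = session.client('rds', region_name=region)
        for db in rds.describe_db_instances()['DBInstances']:
            resources.append(('rds', region, db['DBInstanceIdentifier']))
    # S3 bucket listing is global, so do it once
    s3 = session.client('s3')
    for bucket in s3.list_buckets()['Buckets']:
        resources.append(('s3', 'global', bucket['Name']))
    return resources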

The Infrastructure Intelligence Agent generates architecture recommendations and scores existing setups for security and reliability. It's not just templating - it analyzes your actual infrastructure and suggests improvements based on AWS best practices.

The AgentCore Implementation

Getting these agents to work with Amazon Bedrock AgentCore was more challenging than expected. The documentation makes it look straightforward, but the reality involved a lot of trial and error.

Each agent uses the AgentCore runtime framework. The basic structure looks like this:

import uuid
from datetime import datetime

from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

# memory_service and agent are initialized elsewhere in the module
@app.entrypoint
def invoke(payload):
    session_id = payload.get('session_id', str(uuid.uuid4()))

    # Store session context in memory service
    memory_service.store_context(session_id, {
        'agent': 'cost_intelligence',
        'timestamp': datetime.now().isoformat(),
        'payload': payload
    })

    # Process the request
    result = agent.get_multi_account_costs()

    # Add AgentCore metadata
    result.update({
        'services_used': ['runtime', 'memory', 'gateway'],
        'session_id': session_id
    })

    return result

if __name__ == "__main__":
    app.run()

The memory service integration was crucial for getting agents to share context. When the cost agent discovers expensive resources, that information needs to be available to the operations agent for deeper analysis. We implemented a simple but effective memory service that stores session data with TTL expiration.
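
A minimal sketch of that memory service, with assumed TTL and size defaults (our production version adds more error handling, discussed later):

import json
import time

class MemoryService:
    """Session context store with TTL expiration and a size cap."""

    def __init__(self, ttl_seconds=3600, max_bytes=256_000):
        self._store = {}
        self.ttl_seconds = ttl_seconds
        self.max_bytes = max_bytes

    def store_context(self, session_id, context):
        payload = json.dumps(context, default=str)
        if len(payload.encode('utf-8')) > self.max_bytes:
            raise ValueError(f'context for {session_id} exceeds {self.max_bytes} bytes')
        self._store[session_id] = (time.time() + self.ttl_seconds, payload)

    def get_context(self, session_id):
        entry = self._store.get(session_id)
        if entry is None:
            return None
        expires_at, payload = entry
        if time.time() > expires_at:
            del self._store[session_id]  # lazy expiration on read
            return None
        return json.loads(payload)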

Agent coordination happens through an orchestrator that can invoke agents either locally or through the AgentCore gateway. The orchestrator handles partial failures gracefully - if one agent fails, the others continue working and provide what insights they can.
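
The failure-isolation pattern at the heart of the orchestrator looks roughly like this (a sketch with hypothetical names, not the full orchestrator):

def run_all_agents(payload, agents):
    """Invoke each agent independently so one failure doesn't block the rest."""
    results, errors = {}, {}
    for name, invoke_fn in agents.items():
        try:
            results[name] = invoke_fn(payload)
        except Exception as exc:
            errors[name] = str(exc)  # record the failure and keep going
    return {'results': results, 'errors': errors, 'partial': bool(errors)}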

Self-Discovery Architecture

One of the biggest challenges was making the agents work in any AWS environment without manual configuration. Most tools require you to specify account IDs, regions, and access patterns upfront. We wanted agents that could figure out their environment automatically.

The solution was environment discovery during agent initialization. Each agent detects whether it's running in an AWS Organizations management account by trying to call the Organizations API. If that succeeds, it knows it can do cross-account analysis. If not, it falls back to single-account mode.

import boto3
from botocore.exceptions import ClientError

def _discover_environment(self):
    try:
        sts = boto3.client('sts')
        self.current_account_id = sts.get_caller_identity()['Account']

        try:
            self.organizations = boto3.client('organizations')
            # These calls fail outside an Organizations management account
            self.organizations.describe_organization()
            accounts = self.organizations.list_accounts()

            self.is_org_account = True
            self.org_accounts = [
                {'id': acc['Id'], 'name': acc['Name'], 'status': acc['Status']}
                for acc in accounts['Accounts'] if acc['Status'] == 'ACTIVE'
            ]
        except ClientError:
            # No Organizations access: fall back to single-account mode
            self.is_org_account = False
            self.org_accounts = [{'id': self.current_account_id, 'name': 'current', 'status': 'ACTIVE'}]
    except Exception:
        # No usable credentials at all: degrade gracefully
        self.current_account_id = None
        self.is_org_account = False
        self.org_accounts = []

This approach means you can deploy the same agent code anywhere and it adapts to its environment. Deploy it in a standalone account and it analyzes that account. Deploy it in an organization management account and it analyzes the entire organization.

Cross-Account Access Challenges

Getting true multi-account analysis working was the hardest part. AWS has good security boundaries between accounts, which is great for isolation but challenging when you want unified operations intelligence.

We solved this with cross-account IAM roles deployed via CloudFormation. Each member account gets an OrganizationAccountAccessRole that the management account can assume. The role has read-only permissions and requires an external ID for additional security.

The deployment process uses the configured AWS profiles to push CloudFormation templates to each account:

def deploy_role_to_profile(profile_name, account_name):
    session = boto3.Session(profile_name=profile_name, region_name='us-east-1')
    cf = session.client('cloudformation')

    # template_body holds the cross-account role template, loaded elsewhere
    print(f'Deploying role stack to {account_name} via profile {profile_name}')
    response = cf.create_stack(
        StackName='aws-operations-command-center-role',
        TemplateBody=template_body,
        Capabilities=['CAPABILITY_NAMED_IAM']
    )

    waiter = cf.get_waiter('stack_create_complete')
    waiter.wait(StackName='aws-operations-command-center-role')

Once the roles are deployed, agents can assume them to access resources in member accounts. The cost agent uses this to get billing data from each account separately, then aggregates it for organization-wide analysis. The operations agent uses it to scan resources across all accounts.
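
The assume-role step itself is standard STS. A minimal sketch using the role name and external ID described above (the helper name is ours for illustration):

import boto3

def assume_member_role(account_id, external_id,
                       role_name='OrganizationAccountAccessRole'):
    """Return a boto3 Session scoped to a member account's read-only role."""
    sts = boto3.client('sts')
    creds = sts.assume_role(
        RoleArn=f'arn:aws:iam::{account_id}:role/{role_name}',
        RoleSessionName='ops-command-center',
        ExternalId=external_id
    )['Credentials']
    return boto3.Session(
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken']
    )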

Real-World Challenges

The AgentCore CLI had reliability issues. The agentcore deploy command failed more often than it worked, usually with cryptic IAM errors. Even with admin permissions, we'd hit edge cases where role creation would partially succeed, leaving broken deployments that were hard to clean up.

The memory service had undocumented size limits that caused silent failures. Context data would just disappear without error messages, making debugging difficult. We ended up implementing our own memory service with proper error handling and size checks.

Gateway service communication was unreliable under load, with intermittent connection failures. We built a fallback mechanism that uses local agent invocation when the gateway isn't available.
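
Conceptually the fallback is simple; gateway_invoke and local_invoke here are hypothetical stand-ins for our two invocation paths:

def invoke_agent(agent_name, payload):
    """Prefer the gateway path, fall back to in-process invocation."""
    try:
        # Remote call through the AgentCore gateway
        return gateway_invoke(agent_name, payload)
    except ConnectionError:
        # Same agent code, invoked locally
        return local_invoke(agent_name, payload)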

Version compatibility between the AgentCore CLI and SDK was problematic. Upgrading would break existing configurations in ways that weren't well documented. We had to pin to specific versions and test thoroughly before any updates.

Production-Ready Error Handling

One thing we learned quickly was that AWS APIs fail in many different ways, and you need to handle each failure mode appropriately. We built a comprehensive error handling system that categorizes failures and responds accordingly.

import time

from botocore.exceptions import ClientError, NoCredentialsError

def safe_aws_call(func, *args, max_retries=3, **kwargs):
    for attempt in range(max_retries):
        try:
            result = func(*args, **kwargs)
            return {'success': True, 'data': result}
        except ClientError as e:
            error_code = e.response['Error']['Code']
            if error_code in ['Throttling', 'RequestLimitExceeded']:
                time.sleep(2**attempt)
                continue
            elif error_code == 'AccessDenied':
                return {'success': False, 'error': 'access_denied'}
            else:
                return {'success': False, 'error': str(e)}
        except NoCredentialsError:
            return {'success': False, 'error': 'no_credentials'}

    return {'success': False, 'error': 'max_retries_exceeded'}

This wrapper handles throttling with exponential backoff, distinguishes between different types of access errors, and provides meaningful error messages that agents can act on.
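
For example, wrapping an EC2 describe call:

import boto3

ec2 = boto3.client('ec2')
result = safe_aws_call(ec2.describe_instances, MaxResults=50)
if result['success']:
    print(len(result['data']['Reservations']), 'reservations returned')
else:
    print('describe_instances failed:', result['error'])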

What We Learned

Building multi-agent systems with AgentCore taught us that the framework is powerful but still rough around the edges. The core concepts are solid - the runtime, memory, and gateway services provide a good foundation for agent coordination. But the tooling needs work, especially around deployment and debugging.

The self-discovery approach worked better than expected. Having agents that adapt to their environment makes deployment much simpler and reduces configuration errors. It also makes the system more resilient - agents continue working even when they can't access all the resources they'd like to.

Cross-account analysis is valuable but complex. The security boundaries that make AWS safe also make unified operations intelligence challenging. The IAM role approach works, but it requires careful setup and ongoing maintenance.

Results

The final system provides genuine business value. It discovered $255.22 in AWS usage across our organization that was hidden by credits, giving us visibility into our actual consumption patterns. It inventoried 90 resources across multiple accounts, providing a unified view of our infrastructure that we didn't have before.

More importantly, it demonstrates that multi-agent AI systems can solve real operational problems. The agents work together to provide insights that none of them could generate alone. The cost agent finds expensive resources, the operations agent analyzes them for optimization opportunities, and the infrastructure agent suggests architectural improvements.

The system is production-ready with comprehensive error handling, structured logging, and graceful degradation when services are unavailable. It's not just a demo - it's a working operations intelligence platform that scales from single accounts to entire AWS organizations.

