TL;DR
Learn how to build an AI agent that autonomously optimises your AWS costs by analysing CloudWatch metrics, identifying underutilised resources, and making intelligent decisions—all while maintaining production safety through AWS X-Ray observability and human-in-the-loop approval workflows.
What you'll learn:
- How to build a Bedrock Agent with custom action groups
- Implementing X-Ray tracing to audit AI decision-making
- Production safety patterns for autonomous infrastructure agents
- Human-in-the-loop approval workflows for high-risk actions
Prerequisites:
- AWS account with Bedrock access (ap-southeast-2 recommended for Australia)
- Python 3.9+ and AWS CLI configured
- Basic understanding of Lambda, EC2, and CloudWatch
- Estimated cost: under $20/month for development use
The Problem
We have all been there. You spin up a p3.2xlarge instance for a quick Friday afternoon experiment. You go to happy hour, the weekend hits, and you forget about it. Two weeks later, the AWS bill arrives, and panic sets in.
For years, we solved this with "dumb" scripts—Cron jobs that shut down everything tagged dev at 7 PM. But scripts lack context. They kill long-running training jobs just as often as they save money.
We don't need a script. We need an Agent.
In this post, I’ll walk through how to build an Autonomous FinOps Agent using Amazon Bedrock and Python. More importantly, I will show you how to use AWS X-Ray to "audit the brain" of the agent, ensuring it never deletes production resources by mistake.
The Difference: Automation vs. Agentic AI
Why use an Agent instead of a Lambda function triggered by a CloudWatch Alarm?
Automation (The Script): "If CPU < 5% for 1 hour, terminate instance."
Risk: It kills a critical database waiting for connections.

Agentic AI (The CFO): "I see this instance has low CPU. Let me check the tags. It belongs to the 'Data Science' team. Let me check the git logs on the attached volume. It seems inactive. I will Slack the owner, and if they don't reply in 24 hours, I will snapshot and terminate it."
The Agent adds reasoning to the automation.
The Architecture
We will use the Amazon Bedrock Agents framework, which simplifies the orchestration of tools and provides built-in reasoning capabilities.
Component Overview:
- The Brain: Amazon Bedrock (Model: Claude 3.5 Sonnet) - Handles reasoning and decision-making
- The Hands: Python Lambda function (Action Group) - Executes AWS API calls via Boto3
- The Eyes: AWS X-Ray + CloudWatch Logs - Traces every decision and API call
- The Safety Net: SNS notifications for human approval on destructive actions
```
┌──────────────────────────────────────────────────────────┐
│                        User Query                        │
│    "Find under-utilised resources in ap-southeast-2"     │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │ Amazon Bedrock Agent  │
         │  (Claude 3.5 Sonnet)  │──── X-Ray Tracing
         └───────────┬───────────┘
                     │
                     │ Invokes Action Group
                     ▼
         ┌───────────────────────┐
         │    Lambda Function    │
         │    (Action Router)    │──── CloudWatch Logs
         └───────────┬───────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
         ▼                       ▼
    ┌─────────┐             ┌─────────┐
    │ EC2 API │             │ CW API  │
    └─────────┘             └─────────┘
```
Step 1: Building the Action Group Lambda
In Bedrock, an "Action Group" is an OpenAPI schema that maps natural language intents to Lambda functions. The Lambda acts as a central router that executes different tools based on what the agent decides.
Critical: We use `aws_xray_sdk` to patch all Boto3 calls. In Agentic AI, observability isn't optional—it's the only way to debug hallucinations and verify the agent's reasoning.
The Lambda Handler (Action Router)
```python
import json
import logging

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Automatically trace all AWS API calls
patch_all()

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ec2_client = boto3.client('ec2')


def lambda_handler(event, context):
    """
    Central router for Bedrock Agent action groups.
    Receives function name + parameters, executes the tool, returns structured response.
    """
    function_name = event['function']
    parameters = {p['name']: p['value'] for p in event.get('parameters', [])}

    # Start X-Ray subsegment to track this specific tool execution
    subsegment = xray_recorder.begin_subsegment(f"ToolExecution:{function_name}")
    subsegment.put_annotation("Agent", event['agent']['name'])

    try:
        if function_name == 'analyse_underutilised_resources':
            response_body = check_cpu_metrics(parameters.get('region', 'us-east-1'))
        elif function_name == 'stop_resource':
            # CRITICAL SAFETY CHECK: Never stop production instances
            resource_id = parameters['resource_id']
            if is_production(resource_id):
                response_body = {
                    "status": "DENIED",
                    "reason": "Production resources require manual approval"
                }
                logger.warning(f"Agent attempted to stop PROD: {resource_id}")
            else:
                response_body = stop_ec2_instance(resource_id)
        else:
            response_body = {"error": f"Unknown function: {function_name}"}
    except Exception as e:
        logger.error(f"Tool execution failed: {e}")
        subsegment.put_metadata("Error", str(e))
        response_body = {"error": "Internal tool error"}
    finally:
        xray_recorder.end_subsegment()

    # Return in Bedrock's expected format
    return {
        'messageVersion': '1.0',
        'response': {
            'actionGroup': event['actionGroup'],
            'function': function_name,
            'functionResponse': {
                'responseBody': {'TEXT': {'body': json.dumps(response_body)}}
            }
        }
    }
```
Key Design Patterns:
- Subsegments for granular tracing - Each tool execution gets its own X-Ray subsegment
- Production safety guard - Tag-based checks prevent accidental destruction
- Structured responses - JSON format allows the agent to reason about results
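The `is_production()` guard referenced in the handler isn't shown above; here is a minimal sketch of one way to implement it. The `PROTECTED_ENV_VALUES` set and the split into a pure check plus a Boto3 wrapper are assumptions for this sketch, chosen so the decision logic can be unit-tested without AWS credentials:

```python
# Tag values treated as protected (an assumption for this sketch)
PROTECTED_ENV_VALUES = {'production', 'prod'}


def tags_indicate_production(tags: dict) -> bool:
    """Pure check: does the Environment tag mark this resource as production?"""
    return tags.get('Environment', '').lower() in PROTECTED_ENV_VALUES


def is_production(instance_id: str) -> bool:
    """Looks up the instance's tags, then applies the pure check."""
    import boto3  # imported here so the pure check above stays testable offline
    ec2 = boto3.client('ec2')
    reservations = ec2.describe_instances(InstanceIds=[instance_id])['Reservations']
    instance = reservations[0]['Instances'][0]
    tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
    return tags_indicate_production(tags)
```

Keeping the tag logic pure also means the same function can back both the Lambda guard and the IAM policy review in Step 2 of the deployment section.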
Step 2: The OpenAPI Schema (Connecting Agent to Tools)
Before the agent can call your Lambda, you need to define an OpenAPI schema that describes the available tools. This is what Bedrock uses to understand when and how to invoke your functions.
```yaml
openapi: 3.0.0
info:
  title: FinOps Agent Tools
  version: 1.0.0
paths:
  /analyse:
    post:
      summary: Analyse underutilised EC2 resources
      description: Scans a region for instances with low CPU utilisation over the past 7 days
      operationId: analyse_underutilised_resources
      parameters:
        - name: region
          in: query
          required: false
          schema:
            type: string
            default: ap-southeast-2
      responses:
        '200':
          description: Analysis results with list of underutilised instances
  /stop:
    post:
      summary: Stop an EC2 instance
      description: Stops a non-production EC2 instance to save costs
      operationId: stop_resource
      parameters:
        - name: resource_id
          in: query
          required: true
          description: The EC2 instance ID to stop
          schema:
            type: string
      responses:
        '200':
          description: Instance stop status
```
How it works:
- User asks: "Find idle instances in ap-southeast-2"
- Bedrock maps this to `analyse_underutilised_resources` with `region=ap-southeast-2`
- Lambda receives the function name + parameters
- Executes `check_cpu_metrics('ap-southeast-2')`
- Returns structured JSON back to Bedrock
- Agent reasons about the results and responds to the user
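To make that contract concrete, here is a trimmed, illustrative payload in the shape the handler from Step 1 expects, along with the same parameter extraction it performs. (Note: the handler reads `event['function']`, which matches Bedrock's function-details format; when an action group is registered with an OpenAPI schema, the event identifies the tool via `apiPath` and `httpMethod` instead. The field values below are made up for illustration.)

```python
# Illustrative event, shaped like Bedrock's function-details invocation payload.
sample_event = {
    'messageVersion': '1.0',
    'agent': {'name': 'FinOpsAgent-v1'},
    'actionGroup': 'FinOpsTools',
    'function': 'analyse_underutilised_resources',
    'parameters': [
        {'name': 'region', 'type': 'string', 'value': 'ap-southeast-2'},
    ],
}

# Same extraction as in lambda_handler: list of dicts -> flat name/value map.
params = {p['name']: p['value'] for p in sample_event.get('parameters', [])}
print(sample_event['function'], params)
```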
Step 3: Implementing the Cost Analysis Tool
Now let's implement the check_cpu_metrics() function that does the actual CloudWatch analysis:
from datetime import datetime, timedelta
def check_cpu_metrics(region):
"""
Analyses EC2 instances for low CPU utilisation.
Returns actionable insights for cost optimisation.
"""
cw_client = boto3.client('cloudwatch', region_name=region)
ec2_client = boto3.client('ec2', region_name=region)
underutilised = []
# Get all running instances
instances = ec2_client.describe_instances(
Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
)
# Analyse last 7 days of metrics
end_time = datetime.aestnow()
start_time = end_time - timedelta(days=7)
for reservation in instances['Reservations']:
for instance in reservation['Instances']:
instance_id = instance['InstanceId']
# Query CloudWatch for CPU metrics
metrics = cw_client.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilisation',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start_time,
EndTime=end_time,
Period=86400, # Daily aggregation
Statistics=['Average']
)
if metrics['Datapoints']:
avg_cpu = sum(dp['Average'] for dp in metrics['Datapoints']) / len(metrics['Datapoints'])
# Flag instances below 10% CPU
if avg_cpu < 10.0:
tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
underutilised.append({
'instance_id': instance_id,
'instance_type': instance['InstanceType'],
'avg_cpu_percent': round(avg_cpu, 2),
'environment': tags.get('Environment', 'Unknown'),
'recommendation': 'Consider downsizing or stopping'
})
return {
'status': 'success',
'underutilised_instances': underutilised,
'potential_monthly_savings': f"${len(underutilised) * 50}"
}
Why this approach works:
- 7-day analysis window catches weekend/holiday idle time
- Daily aggregation reduces API costs while maintaining accuracy
- Tag extraction gives the agent context about ownership and environment
- Structured output allows the agent to present findings naturally
**Pro Tip:** For production, add network I/O metrics and disk usage to avoid flagging batch processing instances that are I/O bound but CPU-light.
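The averaging and threshold logic above can be factored into pure helpers, which makes the 10% cutoff easy to unit-test without calling CloudWatch. The function names and the threshold default are illustrative, not part of the AWS API:

```python
def average_cpu(datapoints):
    """Mean of CloudWatch 'Average' datapoints; None when there is no data."""
    if not datapoints:
        return None
    return sum(dp['Average'] for dp in datapoints) / len(datapoints)


def is_underutilised(datapoints, threshold=10.0):
    """True when the average CPU over the window is below the threshold."""
    avg = average_cpu(datapoints)
    return avg is not None and avg < threshold


# Example: three daily datapoints averaging 4.0% -> flagged
sample = [{'Average': 3.0}, {'Average': 5.0}, {'Average': 4.0}]
print(is_underutilised(sample))  # True
```

Treating "no datapoints" as not-underutilised is a deliberate choice here: a brand-new instance with no metric history shouldn't be flagged for shutdown.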
Step 4: X-Ray Observability - Auditing the Agent's Brain
When you deploy a chatbot, users give feedback via thumbs up/down. When you deploy an infrastructure agent, the "feedback" might be your production database going offline.
We need deep observability to audit every decision.
What X-Ray Gives You
By using patch_all() and custom subsegments, we generate a full trace of the agent's decision-making process:
```
User Query
  ↓
Bedrock Agent (Reasoning: "Low CPU detected, checking tags...")
  ↓
Lambda Action Group (Executing: analyse_underutilised_resources)
  ↓
CloudWatch API (Fetching metrics)
  ↓
EC2 API (Reading instance tags)
  ↓
Response back to agent
```
Why This Matters
If the agent fails to stop an instance, CloudWatch Logs alone won't tell you why. You need to know:
- Did the tool fail? (e.g., a Boto3 `AccessDenied` error)
- Did the agent fail to reason correctly? (e.g., the agent concluded "CPU at 10% is actually high load for this workload" and didn't call the stop function)
X-Ray lets you overlay the Reasoning Trace (Bedrock) with the Execution Trace (Lambda/AWS APIs).
Example X-Ray Insight:
```json
{
  "subsegment": "ToolExecution:stop_resource",
  "annotation": {
    "Agent": "FinOpsAgent-v1",
    "resource_id": "i-0123456789",
    "decision": "DENIED - Production tag detected"
  },
  "duration_ms": 245
}
```
This tells you the agent correctly refused to stop a production instance—critical for compliance audits.
Step 5: The "Human-in-the-Loop" Safety Net
The biggest fear with Agentic AI is the "Runaway Robot" scenario—what if the agent misinterprets data and terminates a critical database?
The solution: Don't give the agent destructive permissions. Instead, use an approval workflow.
Implementation: Request Approval Tool
Add a third function to your Lambda that sends termination requests to humans:
```python
import boto3

sns_client = boto3.client('sns')
SNS_TOPIC_ARN = 'arn:aws:sns:ap-southeast-2:123456789:FinOpsApprovals'


def request_termination_approval(instance_id, justification):
    """
    Requests human approval before terminating an instance.
    Publishes to an SNS topic monitored by the DevOps team.
    """
    # Get instance details for context
    ec2 = boto3.client('ec2')
    instance = ec2.describe_instances(InstanceIds=[instance_id])['Reservations'][0]['Instances'][0]
    tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
    instance_type = instance['InstanceType']

    # Calculate estimated savings
    # Rough estimates: t3.medium=$30/mo, t3.large=$60/mo, etc.
    hourly_cost = get_instance_cost(instance_type)
    monthly_savings = hourly_cost * 730  # hours/month

    message = f"""
FinOps Agent Termination Request

Instance: {instance_id}
Type: {instance_type}
Name: {tags.get('Name', 'N/A')}
Environment: {tags.get('Environment', 'Unknown')}

AI Analysis:
{justification}

Estimated Monthly Savings: ${monthly_savings:.2f}

Actions:
✅ Approve: Reply with "APPROVE {instance_id}"
❌ Deny: Reply with "DENY {instance_id}"
⏸️ Snooze 7 days: Reply with "SNOOZE {instance_id}"

View X-Ray Trace: https://console.aws.amazon.com/xray/...
"""

    sns_client.publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject=f'FinOps Agent: Approval needed for {instance_id}',
        Message=message
    )

    return {
        'status': 'approval_requested',
        'instance_id': instance_id,
        'message': 'Notification sent to DevOps team'
    }
```
Update the OpenAPI Schema
Add this to your schema:
```yaml
  /request-termination:
    post:
      summary: Request approval to terminate an instance
      operationId: request_termination_approval
      parameters:
        - name: instance_id
          in: query
          required: true
          schema:
            type: string
        - name: justification
          in: query
          required: true
          description: AI-generated explanation for why termination is recommended
          schema:
            type: string
```
The Agent's New Workflow
Now when the agent finds a zombie instance, it:
- Analyses the metrics (CPU, network, disk)
- Reasons about the context (tags, uptime, cost)
- Requests approval instead of acting immediately
- Includes the X-Ray trace link so humans can audit the decision
- Waits for human confirmation before taking action
This gives you the speed of AI analysis with the safety of human judgment.
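Closing the loop requires parsing the human's reply, for example in a Lambda subscribed to inbound email or a Slack handler. Here is a minimal sketch of the reply parser, assuming the APPROVE/DENY/SNOOZE format shown in the notification message (the function name and regex are my own, not part of any AWS API):

```python
import re

# Matches replies like "APPROVE i-0abc123", per the notification template above
REPLY_PATTERN = re.compile(r'^(APPROVE|DENY|SNOOZE)\s+(i-[0-9a-f]+)$', re.IGNORECASE)


def parse_approval_reply(text: str):
    """Returns (action, instance_id), or None when the reply doesn't match."""
    match = REPLY_PATTERN.match(text.strip())
    if not match:
        return None
    return match.group(1).upper(), match.group(2)


print(parse_approval_reply("approve i-0abc123"))  # ('APPROVE', 'i-0abc123')
```

The strict pattern is deliberate: an ambiguous reply returns None and results in no action, which is the safe default for a destructive workflow.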
Deployment & Testing
Step 1: Create the Lambda Function
```bash
# Package dependencies
pip install aws-xray-sdk boto3 -t ./package
cd package && zip -r ../lambda.zip . && cd ..
zip -g lambda.zip lambda_function.py

# Deploy to AWS
aws lambda create-function \
  --function-name FinOpsAgentTools \
  --runtime python3.11 \
  --role arn:aws:iam::YOUR_ACCOUNT:role/FinOpsLambdaRole \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://lambda.zip \
  --timeout 60 \
  --memory-size 256 \
  --tracing-config Mode=Active
```
Step 2: Create IAM Policy for Lambda
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EC2ReadOnly",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeTags"
      ],
      "Resource": "*"
    },
    {
      "Sid": "EC2StopNonProd",
      "Effect": "Allow",
      "Action": "ec2:StopInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringNotEquals": {
          "ec2:ResourceTag/Environment": "Production"
        }
      }
    },
    {
      "Sid": "CloudWatchMetrics",
      "Effect": "Allow",
      "Action": "cloudwatch:GetMetricStatistics",
      "Resource": "*"
    },
    {
      "Sid": "SNSPublish",
      "Effect": "Allow",
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:*:*:FinOpsApprovals"
    },
    {
      "Sid": "XRayTracing",
      "Effect": "Allow",
      "Action": [
        "xray:PutTraceSegments",
        "xray:PutTelemetryRecords"
      ],
      "Resource": "*"
    }
  ]
}
```
Key Security Features:
- EC2 stop is tag-restricted - Cannot stop instances with `Environment=Production`
- Read-only for metrics - No write access to CloudWatch
- SNS scope limited - Can only publish to the approval topic
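To build intuition for how the `StringNotEquals` condition in the policy behaves, here is a small simulation of that one statement (purely illustrative; real IAM evaluation involves many more rules than this sketch):

```python
def can_stop_instance(tags: dict) -> bool:
    """
    Simulates the EC2StopNonProd statement: ec2:StopInstances is allowed
    unless the instance carries the tag Environment=Production.
    """
    return tags.get('Environment') != 'Production'


print(can_stop_instance({'Environment': 'Dev'}))         # True
print(can_stop_instance({'Environment': 'Production'}))  # False
print(can_stop_instance({}))                             # True: untagged instances are NOT protected
```

Note the last case: negated operators like `StringNotEquals` match when the tag is absent, so untagged instances are still stoppable under this policy. If you want deny-by-default, enforce tagging on all instances or add an explicit Deny covering the missing-tag case.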
Step 3: Create the Bedrock Agent
```bash
# Create the agent
aws bedrock-agent create-agent \
  --agent-name FinOpsAgent \
  --foundation-model anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --instruction "You are a FinOps specialist. Analyse AWS costs and recommend optimisations. Always request approval before suggesting destructive actions." \
  --idle-session-ttl-in-seconds 600

# Create action group with the OpenAPI schema (inlined as the payload)
aws bedrock-agent create-agent-action-group \
  --agent-id YOUR_AGENT_ID \
  --agent-version DRAFT \
  --action-group-name FinOpsTools \
  --action-group-executor lambda=arn:aws:lambda:ap-southeast-2:ACCOUNT:function:FinOpsAgentTools \
  --api-schema payload="$(cat openapi-schema.yaml)"

# Prepare the agent and create an alias
aws bedrock-agent prepare-agent --agent-id YOUR_AGENT_ID
aws bedrock-agent create-agent-alias \
  --agent-id YOUR_AGENT_ID \
  --agent-alias-name prod
```
Step 4: Test the Agent
Test Query 1: Analyse Resources
User: "Find underutilised instances in ap-southeast-2 that are costing us money"
Agent Response:
"I found 3 underutilised instances in ap-southeast-2:
1. i-0abc123 (t3.large) - Avg CPU: 3.2% - Est. savings: $60/mo
2. i-0def456 (m5.xlarge) - Avg CPU: 7.8% - Est. savings: $140/mo
3. i-0ghi789 (t3.medium) - Avg CPU: 5.1% - Est. savings: $30/mo
Total potential monthly savings: $230
Would you like me to request approval to stop these instances?"
Test Query 2: Safety Check
User: "Stop instance i-prod-db-001"
Agent Response:
"I cannot stop i-prod-db-001 because it has the Environment=Production tag.
Production resources require manual approval through the change management process."
Step 5: Monitor with X-Ray
Navigate to the AWS X-Ray console and filter traces by:
```
annotation.Agent = "FinOpsAgent-v1"
```
You'll see traces showing:
- Which tools the agent invoked
- How long each API call took
- Whether any errors occurred
- The full decision path from user query → response
Conclusion
Building Agentic AI on AWS is about more than just Prompt Engineering—it's about Reliability Engineering.
By treating the agent's "thoughts" as loggable events (X-Ray traces) and wrapping its "hands" (tools) in strict safety checks (tag validation, human approvals), we can build a FinOps assistant that is not only smart but trustworthy.
Key Takeaways
- Agents > Scripts - Reasoning capabilities allow context-aware decisions
- Observability is mandatory - X-Ray tracing is the only way to audit AI decisions
- Safety through architecture - Use IAM policies and human-in-the-loop for destructive actions
- Structure matters - Well-designed tool outputs enable better agent reasoning
The era of static Cron jobs is ending. The era of the Cloud CFO Agent has begun.
Cost Estimate
Running this setup for development/testing:
| Service | Monthly Cost |
|---|---|
| Bedrock Agent (Claude 3.5 Sonnet) | ~$5 (100 queries) |
| Lambda invocations | ~$0.20 (1000 invocations) |
| CloudWatch API calls | ~$0.10 (detailed monitoring) |
| X-Ray tracing | ~$2 (10,000 traces) |
| SNS notifications | ~$0.50 (100 emails) |
| Total | ~$8/month |
For production use with 10,000 queries/month: ~$50-75/month
Potential savings identified: Hundreds to thousands per month depending on environment size.
Next Steps & Resources
Enhancements to Consider
- Multi-region support - Extend analysis to all AWS regions
- Additional metrics - Network I/O, disk usage, memory utilisation
- Historical trending - Track savings over time in DynamoDB
- Slack integration - Send reports to team channels instead of email
- Auto-remediation - After 30 days of approvals, enable auto-stop for specific tags
