DEV Community

Roopa Venkatesh

From Chatbot to Cloud CFO: Building an Autonomous FinOps Agent with Amazon Bedrock

TL;DR

Learn how to build an AI agent that autonomously optimises your AWS costs by analysing CloudWatch metrics, identifying underutilised resources, and making intelligent decisions—all while maintaining production safety through AWS X-Ray observability and human-in-the-loop approval workflows.

What you'll learn:

  • How to build a Bedrock Agent with custom action groups
  • Implementing X-Ray tracing to audit AI decision-making
  • Production safety patterns for autonomous infrastructure agents
  • Human-in-the-loop approval workflows for high-risk actions

Prerequisites:

  • AWS account with Bedrock access (ap-southeast-2 recommended for Australia)
  • Python 3.9+ and AWS CLI configured
  • Basic understanding of Lambda, EC2, and CloudWatch
  • Estimated cost: under $10/month for development use (see the Cost Estimate section)

The Problem

We have all been there. You spin up a p3.2xlarge instance for a quick Friday afternoon experiment. You go to happy hour, the weekend hits, and you forget about it. Two weeks later, the AWS bill arrives, and panic sets in.

For years, we solved this with "dumb" scripts—Cron jobs that shut down everything tagged dev at 7 PM. But scripts lack context. They kill long-running training jobs just as often as they save money.

We don't need a script. We need an Agent.

In this post, I’ll walk through how to build an Autonomous FinOps Agent using Amazon Bedrock and Python. More importantly, I will show you how to use AWS X-Ray to "audit the brain" of the agent, ensuring it never deletes production resources by mistake.

The Difference: Automation vs. Agentic AI

Why use an Agent instead of a Lambda function triggered by a CloudWatch Alarm?

  • Automation (The Script): "If CPU < 5% for 1 hour, terminate instance."
    Risk: It kills a critical database waiting for connections.

  • Agentic AI (The CFO): "I see this instance has low CPU. Let me check the tags. It belongs to the 'Data Science' team. Let me check the git logs on the attached volume. It seems inactive. I will Slack the owner, and if they don't reply in 24 hours, I will snapshot and terminate it."

The Agent adds reasoning to the automation.

The Architecture

We will use the Amazon Bedrock Agents framework, which simplifies the orchestration of tools and provides built-in reasoning capabilities.

Component Overview:

  • The Brain: Amazon Bedrock (Model: Claude 3.5 Sonnet) - Handles reasoning and decision-making
  • The Hands: Python Lambda function (Action Group) - Executes AWS API calls via Boto3
  • The Eyes: AWS X-Ray + CloudWatch Logs - Traces every decision and API call
  • The Safety Net: SNS notifications for human approval on destructive actions
┌──────────────────────────────────────────────────────────┐
│                     User Query                           │
│        "Find under-utilised resources in ap-southeast-2" │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  Amazon Bedrock Agent │
         │  (Claude 3.5 Sonnet)  │──── X-Ray Tracing
         └───────────┬───────────┘
                     │
                     │ Invokes Action Group
                     ▼
         ┌───────────────────────┐
         │   Lambda Function     │
         │  (Action Router)      │──── CloudWatch Logs
         └───────────┬───────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
         ▼                       ▼
    ┌─────────┐           ┌─────────┐
    │ EC2 API │           │ CW API  │
    └─────────┘           └─────────┘

Step 1: Building the Action Group Lambda

In Bedrock, an "Action Group" maps natural-language intents to Lambda functions, described either by an OpenAPI schema or by simpler function definitions. The Lambda acts as a central router that executes different tools based on what the agent decides.

Critical: We use aws_xray_sdk to patch all Boto3 calls. In Agentic AI, observability isn't optional—it's the only way to debug hallucinations and verify the agent's reasoning.

The Lambda Handler (Action Router)

import json
import logging
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Automatically trace all AWS API calls made through Boto3
patch_all()

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ec2_client = boto3.client('ec2')

def lambda_handler(event, context):
    """
    Central router for Bedrock Agent action groups.
    Receives function name + parameters, executes the tool, returns structured response.
    """
    function_name = event['function']
    parameters = {p['name']: p['value'] for p in event.get('parameters', [])}

    # Start X-Ray subsegment to track this specific tool execution
    subsegment = xray_recorder.begin_subsegment(f"ToolExecution:{function_name}")
    subsegment.put_annotation("Agent", event['agent']['name'])

    try:
        if function_name == 'analyse_underutilised_resources':
            response_body = check_cpu_metrics(parameters.get('region', 'ap-southeast-2'))

        elif function_name == 'stop_resource':
            # CRITICAL SAFETY CHECK: never stop production instances
            resource_id = parameters['resource_id']
            if is_production(resource_id):
                response_body = {
                    "status": "DENIED",
                    "reason": "Production resources require manual approval"
                }
                logger.warning(f"Agent attempted to stop PROD: {resource_id}")
            else:
                response_body = stop_ec2_instance(resource_id)

        else:
            response_body = {"error": f"Unknown function: {function_name}"}

    except Exception as e:
        logger.error(f"Tool execution failed: {e}")
        subsegment.put_metadata("Error", str(e))
        response_body = {"error": "Internal tool error"}
    finally:
        xray_recorder.end_subsegment()

    # Return in Bedrock's expected format
    return {
        'messageVersion': '1.0',
        'response': {
            'actionGroup': event['actionGroup'],
            'function': function_name,
            'functionResponse': {
                'responseBody': {'TEXT': {'body': json.dumps(response_body)}}
            }
        }
    }

Key Design Patterns:

  1. Subsegments for granular tracing - Each tool execution gets its own X-Ray subsegment
  2. Production safety guard - Tag-based checks prevent accidental destruction
  3. Structured responses - JSON format allows the agent to reason about results
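The handler above calls two helpers, is_production() and stop_ec2_instance(), that aren't shown. Here is a minimal sketch of what they might look like; the tag-check logic is pure so it can be tested without AWS, and the Environment=Production convention matches the IAM policy later in the post. The exact tag names are an assumption you should adapt to your own tagging standard:

```python
def has_production_tag(tags):
    """Pure check: True if the tag dict marks a resource as production."""
    return tags.get('Environment', '').lower() == 'production'

def is_production(resource_id):
    """Looks up an instance's tags and applies the production check."""
    import boto3  # imported here so the pure logic above stays testable offline
    ec2 = boto3.client('ec2')
    reservations = ec2.describe_instances(InstanceIds=[resource_id])['Reservations']
    instance = reservations[0]['Instances'][0]
    tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
    return has_production_tag(tags)

def stop_ec2_instance(resource_id):
    """Stops a (non-production) instance and returns a structured result."""
    import boto3
    ec2 = boto3.client('ec2')
    ec2.stop_instances(InstanceIds=[resource_id])
    return {'status': 'stopping', 'instance_id': resource_id}
```

Keeping the tag check as its own pure function means your safety logic is unit-testable without mocking AWS.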

Step 2: The OpenAPI Schema (Connecting Agent to Tools)

Before the agent can call your Lambda, you need to define an OpenAPI schema that describes the available tools. This is what Bedrock uses to understand when and how to invoke your functions.

openapi: 3.0.0
info:
  title: FinOps Agent Tools
  version: 1.0.0
paths:
  /analyse:
    post:
      summary: Analyse underutilised EC2 resources
      description: Scans a region for instances with low CPU utilisation over the past 7 days
      operationId: analyse_underutilised_resources
      parameters:
        - name: region
          in: query
          required: false
          schema:
            type: string
            default: ap-southeast-2
      responses:
        '200':
          description: Analysis results with list of underutilised instances

  /stop:
    post:
      summary: Stop an EC2 instance
      description: Stops a non-production EC2 instance to save costs
      operationId: stop_resource
      parameters:
        - name: resource_id
          in: query
          required: true
          schema:
            type: string
          description: The EC2 instance ID to stop
      responses:
        '200':
          description: Instance stop status

How it works:

  1. User asks: "Find idle instances in ap-southeast-2"
  2. Bedrock maps this to analyse_underutilised_resources with region=ap-southeast-2
  3. Lambda receives the function name + parameters
  4. Executes check_cpu_metrics('ap-southeast-2')
  5. Returns structured JSON back to Bedrock
  6. Agent reasons about the results and responds to the user
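When Bedrock invokes the Lambda, the event carries the chosen function and its parameters. A trimmed illustration of what the router receives (field values here are examples, not real IDs):

```json
{
  "messageVersion": "1.0",
  "agent": { "name": "FinOpsAgent", "id": "AGENT123" },
  "actionGroup": "FinOpsTools",
  "function": "analyse_underutilised_resources",
  "parameters": [
    { "name": "region", "type": "string", "value": "ap-southeast-2" }
  ]
}
```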

Step 3: Implementing the Cost Analysis Tool

Now let's implement the check_cpu_metrics() function that does the actual CloudWatch analysis:

from datetime import datetime, timedelta, timezone

def check_cpu_metrics(region):
    """
    Analyses EC2 instances for low CPU utilisation.
    Returns actionable insights for cost optimisation.
    """
    cw_client = boto3.client('cloudwatch', region_name=region)
    ec2_client = boto3.client('ec2', region_name=region)

    underutilised = []

    # Get all running instances (pagination omitted for brevity)
    instances = ec2_client.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )

    # Analyse the last 7 days of metrics
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=7)

    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']

            # Query CloudWatch for CPU metrics (note: AWS uses the US spelling)
            metrics = cw_client.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=start_time,
                EndTime=end_time,
                Period=86400,  # Daily aggregation
                Statistics=['Average']
            )

            if metrics['Datapoints']:
                avg_cpu = sum(dp['Average'] for dp in metrics['Datapoints']) / len(metrics['Datapoints'])

                # Flag instances below 10% average CPU
                if avg_cpu < 10.0:
                    tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
                    underutilised.append({
                        'instance_id': instance_id,
                        'instance_type': instance['InstanceType'],
                        'avg_cpu_percent': round(avg_cpu, 2),
                        'environment': tags.get('Environment', 'Unknown'),
                        'recommendation': 'Consider downsizing or stopping'
                    })

    return {
        'status': 'success',
        'underutilised_instances': underutilised,
        # Rough placeholder: assumes ~$50/month saved per flagged instance
        'potential_monthly_savings': f"${len(underutilised) * 50}"
    }

Why this approach works:

  • 7-day analysis window catches weekend/holiday idle time
  • Daily aggregation reduces API costs while maintaining accuracy
  • Tag extraction gives the agent context about ownership and environment
  • Structured output allows the agent to present findings naturally

**Pro Tip:** For production, add network I/O and disk metrics to avoid flagging batch-processing instances that are I/O-bound but CPU-light.
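As a sketch of that tip: a pure helper that only flags an instance as idle when both CPU and network are quiet. The 10% CPU and 5 MB/day network thresholds are illustrative assumptions, not AWS recommendations:

```python
def is_idle(cpu_datapoints, network_out_datapoints,
            cpu_threshold=10.0, network_threshold_bytes=5_000_000):
    """
    Decide idleness from CloudWatch 'Average' datapoints.
    cpu_datapoints: list of average CPU percentages (one per day)
    network_out_datapoints: list of average NetworkOut bytes (one per day)
    """
    if not cpu_datapoints:
        return False  # no data: don't flag

    avg_cpu = sum(cpu_datapoints) / len(cpu_datapoints)
    avg_net = (sum(network_out_datapoints) / len(network_out_datapoints)
               if network_out_datapoints else 0)

    # Only idle if BOTH signals are low: protects I/O-bound batch jobs
    return avg_cpu < cpu_threshold and avg_net < network_threshold_bytes
```

With this check, a batch instance at 3% CPU but pushing gigabytes of NetworkOut would no longer be flagged.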


Step 4: X-Ray Observability - Auditing the Agent's Brain

When you deploy a chatbot, users give feedback via thumbs up/down. When you deploy an infrastructure agent, the "feedback" might be your production database going offline.

We need deep observability to audit every decision.

What X-Ray Gives You

By using patch_all() and custom subsegments, we generate a full trace of the agent's decision-making process:

User Query
    ↓
Bedrock Agent (Reasoning: "Low CPU detected, checking tags...")
    ↓
Lambda Action Group (Executing: analyse_underutilised_resources)
    ↓
CloudWatch API (Fetching metrics)
    ↓
EC2 API (Reading instance tags)
    ↓
Response back to agent

Why This Matters

If the agent fails to stop an instance, CloudWatch Logs alone won't tell you why. You need to know:

  1. Did the tool fail? (e.g., Boto3 AccessDenied error)
  2. Did the agent fail to reason correctly? (e.g., The agent concluded "CPU at 10% is actually high load for this workload" and didn't call the stop function)

X-Ray lets you overlay the Reasoning Trace (Bedrock) with the Execution Trace (Lambda/AWS APIs).

Example X-Ray Insight:

{
  "subsegment": "ToolExecution:stop_resource",
  "annotations": {
    "Agent": "FinOpsAgent-v1",
    "resource_id": "i-0123456789",
    "decision": "DENIED - Production tag detected"
  },
  "duration_ms": 245
}

This tells you the agent correctly refused to stop a production instance—critical for compliance audits.


Step 5: The "Human-in-the-Loop" Safety Net

The biggest fear with Agentic AI is the "Runaway Robot" scenario—what if the agent misinterprets data and terminates a critical database?

The solution: Don't give the agent destructive permissions. Instead, use an approval workflow.

Implementation: Request Approval Tool

Add a third function to your Lambda that sends termination requests to humans:

import boto3

sns_client = boto3.client('sns')
SNS_TOPIC_ARN = 'arn:aws:sns:ap-southeast-2:123456789:FinOpsApprovals'

# Rough on-demand hourly prices; in production, query the AWS Pricing API instead
INSTANCE_HOURLY_COST = {
    't3.medium': 0.042,
    't3.large': 0.083,
    'm5.xlarge': 0.192,
}

def get_instance_cost(instance_type):
    """Returns an estimated hourly cost, defaulting to $0.10/hr for unknown types."""
    return INSTANCE_HOURLY_COST.get(instance_type, 0.10)

def request_termination_approval(instance_id, justification):
    """
    Requests human approval before terminating an instance.
    Publishes to an SNS topic monitored by the DevOps team.
    """
    # Get instance details for context
    ec2 = boto3.client('ec2')
    instance = ec2.describe_instances(InstanceIds=[instance_id])['Reservations'][0]['Instances'][0]

    tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
    instance_type = instance['InstanceType']

    # Calculate estimated savings
    hourly_cost = get_instance_cost(instance_type)
    monthly_savings = hourly_cost * 730  # hours/month

    message = f"""
FinOps Agent Termination Request

Instance: {instance_id}
Type: {instance_type}
Name: {tags.get('Name', 'N/A')}
Environment: {tags.get('Environment', 'Unknown')}

AI Analysis:
{justification}

Estimated Monthly Savings: ${monthly_savings:.2f}

Actions:
✅ Approve: Reply with "APPROVE {instance_id}"
❌ Deny: Reply with "DENY {instance_id}"
⏸️ Snooze 7 days: Reply with "SNOOZE {instance_id}"

View X-Ray Trace: https://console.aws.amazon.com/xray/...
    """

    sns_client.publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject=f'FinOps Agent: Approval needed for {instance_id}',
        Message=message
    )

    return {
        'status': 'approval_requested',
        'instance_id': instance_id,
        'message': 'Notification sent to DevOps team'
    }

Update the OpenAPI Schema

Add this to your schema:

  /request-termination:
    post:
      summary: Request approval to terminate an instance
      operationId: request_termination_approval
      parameters:
        - name: instance_id
          in: query
          required: true
          schema:
            type: string
        - name: justification
          in: query
          required: true
          schema:
            type: string
          description: AI-generated explanation for why termination is recommended

The Agent's New Workflow

Now when the agent finds a zombie instance, it:

  1. Analyses the metrics (CPU, network, disk)
  2. Reasons about the context (tags, uptime, cost)
  3. Requests approval instead of acting immediately
  4. Includes the X-Ray trace link so humans can audit the decision
  5. Waits for human confirmation before taking action

This gives you the speed of AI analysis with the safety of human judgment.
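The post stops at sending the request; here is a minimal sketch of the other half, an approval-processing Lambda, assuming replies are routed back as plain text such as "APPROVE i-0abc123" (for example via SES inbound email or a Slack slash command — both routing mechanisms are assumptions, not covered above). The reply parsing is pure and testable, and the instance-ID regex is a simplification:

```python
import re

APPROVAL_PATTERN = re.compile(r'^(APPROVE|DENY|SNOOZE)\s+(i-[0-9a-f]+)$', re.IGNORECASE)

def parse_approval_reply(text):
    """Parse a human reply into (action, instance_id), or None if it doesn't match."""
    match = APPROVAL_PATTERN.match(text.strip())
    if not match:
        return None
    return match.group(1).upper(), match.group(2)

def handle_approval(reply_text):
    """Dispatch an approval decision. Only APPROVE triggers any AWS action."""
    parsed = parse_approval_reply(reply_text)
    if parsed is None:
        return {'status': 'ignored', 'reason': 'unrecognised reply'}

    action, instance_id = parsed
    if action == 'APPROVE':
        import boto3  # imported lazily so the parsing stays testable offline
        ec2 = boto3.client('ec2')
        # Stop rather than terminate here; snapshot-then-terminate is a further step
        ec2.stop_instances(InstanceIds=[instance_id])
        return {'status': 'stopped', 'instance_id': instance_id}
    return {'status': action.lower(), 'instance_id': instance_id}
```

Because the parser is strict (anchored regex, explicit verbs), a garbled or malicious reply falls through to "ignored" rather than triggering an action.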


Deployment & Testing

Step 1: Create the Lambda Function

# Package dependencies
pip install aws-xray-sdk boto3 -t ./package
cd package && zip -r ../lambda.zip . && cd ..
zip -g lambda.zip lambda_function.py

# Deploy to AWS
aws lambda create-function \
  --function-name FinOpsAgentTools \
  --runtime python3.11 \
  --role arn:aws:iam::YOUR_ACCOUNT:role/FinOpsLambdaRole \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://lambda.zip \
  --timeout 60 \
  --memory-size 256 \
  --tracing-config Mode=Active

Step 2: Create IAM Policy for Lambda

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EC2ReadOnly",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeTags"
      ],
      "Resource": "*"
    },
    {
      "Sid": "EC2StopNonProd",
      "Effect": "Allow",
      "Action": "ec2:StopInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringNotEquals": {
          "ec2:ResourceTag/Environment": "Production"
        }
      }
    },
    {
      "Sid": "CloudWatchMetrics",
      "Effect": "Allow",
      "Action": "cloudwatch:GetMetricStatistics",
      "Resource": "*"
    },
    {
      "Sid": "SNSPublish",
      "Effect": "Allow",
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:*:*:FinOpsApprovals"
    },
    {
      "Sid": "XRayTracing",
      "Effect": "Allow",
      "Action": [
        "xray:PutTraceSegments",
        "xray:PutTelemetryRecords"
      ],
      "Resource": "*"
    }
  ]
}

Key Security Features:

  • EC2 stop is tag-restricted - Cannot stop instances with Environment=Production
  • Read-only for metrics - No write access to CloudWatch
  • SNS scope limited - Can only publish to approval topic

Step 3: Create the Bedrock Agent

# Create the agent (assumes a Bedrock service role already exists)
aws bedrock-agent create-agent \
  --agent-name FinOpsAgent \
  --agent-resource-role-arn arn:aws:iam::YOUR_ACCOUNT:role/BedrockAgentServiceRole \
  --foundation-model anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --instruction "You are a FinOps specialist. Analyse AWS costs and recommend optimisations. Always request approval before suggesting destructive actions." \
  --idle-session-ttl-in-seconds 600

# Create action group with OpenAPI schema
aws bedrock-agent create-agent-action-group \
  --agent-id YOUR_AGENT_ID \
  --action-group-name FinOpsTools \
  --action-group-executor lambda=arn:aws:lambda:ap-southeast-2:ACCOUNT:function:FinOpsAgentTools \
  --api-schema file://openapi-schema.yaml

# Prepare and create alias
aws bedrock-agent prepare-agent --agent-id YOUR_AGENT_ID
aws bedrock-agent create-agent-alias \
  --agent-id YOUR_AGENT_ID \
  --agent-alias-name prod

Step 4: Test the Agent

Test Query 1: Analyse Resources

User: "Find underutilised instances in ap-southeast-2 that are costing us money"

Agent Response:
"I found 3 underutilised instances in ap-southeast-2:

1. i-0abc123 (t3.large) - Avg CPU: 3.2% - Est. savings: $60/mo
2. i-0def456 (m5.xlarge) - Avg CPU: 7.8% - Est. savings: $140/mo
3. i-0ghi789 (t3.medium) - Avg CPU: 5.1% - Est. savings: $30/mo

Total potential monthly savings: $230

Would you like me to request approval to stop these instances?"

Test Query 2: Safety Check

User: "Stop instance i-prod-db-001"

Agent Response:
"I cannot stop i-prod-db-001 because it has the Environment=Production tag.
Production resources require manual approval through the change management process."
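To run these queries programmatically rather than in the console, you can call the agent through the Bedrock runtime API. A sketch using boto3 (the agent ID and alias ID are placeholders; invoke_agent streams its answer back as chunk events):

```python
import uuid

def ask_agent(prompt, agent_id, agent_alias_id, region='ap-southeast-2'):
    """Send one prompt to a Bedrock agent and return the assembled text reply."""
    import boto3  # imported lazily; requires AWS credentials at call time
    client = boto3.client('bedrock-agent-runtime', region_name=region)
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=str(uuid.uuid4()),  # a fresh session per question
        inputText=prompt,
    )
    # The reply arrives as a stream of 'chunk' events; concatenate them
    parts = []
    for event in response['completion']:
        if 'chunk' in event:
            parts.append(event['chunk']['bytes'].decode('utf-8'))
    return ''.join(parts)
```

Example usage: print(ask_agent("Find underutilised instances in ap-southeast-2", "YOUR_AGENT_ID", "YOUR_ALIAS_ID"))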

Step 5: Monitor with X-Ray

Navigate to AWS X-Ray console and filter by:

annotation.Agent = "FinOpsAgent-v1"

You'll see traces showing:

  • Which tools the agent invoked
  • How long each API call took
  • Whether any errors occurred
  • The full decision path from user query → response

Conclusion

Building Agentic AI on AWS is about more than just Prompt Engineering—it's about Reliability Engineering.

By treating the agent's "thoughts" as loggable events (X-Ray traces) and wrapping its "hands" (tools) in strict safety checks (tag validation, human approvals), we can build a FinOps assistant that is not only smart but trustworthy.

Key Takeaways

  1. Agents > Scripts - Reasoning capabilities allow context-aware decisions
  2. Observability is mandatory - X-Ray tracing is the only way to audit AI decisions
  3. Safety through architecture - Use IAM policies and human-in-the-loop for destructive actions
  4. Structure matters - Well-designed tool outputs enable better agent reasoning

The era of static Cron jobs is ending. The era of the Cloud CFO Agent has begun.


Cost Estimate

Running this setup for development/testing:

Service                                Monthly Cost
Bedrock Agent (Claude 3.5 Sonnet)      ~$5     (100 queries)
Lambda invocations                     ~$0.20  (1,000 invocations)
CloudWatch API calls                   ~$0.10  (detailed monitoring)
X-Ray tracing                          ~$2     (10,000 traces)
SNS notifications                      ~$0.50  (100 emails)
Total                                  ~$8/month

For production use with 10,000 queries/month: ~$50-75/month

Potential savings identified: Hundreds to thousands per month depending on environment size.


Next Steps & Resources

Enhancements to Consider

  1. Multi-region support - Extend analysis to all AWS regions
  2. Additional metrics - Network I/O, disk usage, memory utilisation
  3. Historical trending - Track savings over time in DynamoDB
  4. Slack integration - Send reports to team channels instead of email
  5. Auto-remediation - After 30 days of approvals, enable auto-stop for specific tags

Further Reading
