Rahul Pandya

Posted on Jun 15

Building an AWS FinOps Agent: How I Taught AI to Actually Care About My Cloud Bill

#agents #ai #automation #aws

Description: "A deep dive into building an AI-powered FinOps agent on AWS that proactively monitors, analyzes, and optimizes cloud costs — using Bedrock, Lambda, Cost Explorer, and a healthy dose of real-world frustration."

Tags: aws, finops, ai, cloudcost

"Why is my AWS bill $4,200 this month when I expected $800?"
— me, at 11 PM on a Tuesday, staring at my Cost Explorer dashboard in quiet horror.

That one moment changed how I think about cloud cost management forever.

I had forgotten to shut down a cluster. A dev environment. Running full-blast. For 18 days. The kind of thing that a competent FinOps process would have caught on day one — but I didn't have a FinOps process. I had a vague intention to "check the bill at the end of the month."

That was the wake-up call. And instead of just setting a billing alarm and calling it done, I decided to build something smarter: an AWS FinOps Agent — an AI-powered system that doesn't just alert you when you've already spent the money, but actually understands your infrastructure, learns your patterns, and proactively recommends action before the damage is done.

This is the full story of how I built it, what I learned, and how you can build one too.

What Even Is a FinOps Agent?

Before we write a single line of code, let's be clear about what we're building — because "FinOps Agent" can mean a lot of things.

FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending. It's not just about cutting costs — it's about making informed decisions. Spend more where it drives value. Spend less where it doesn't.

A FinOps Agent in our context is an AI-driven system that:

Monitors your AWS cost and usage data continuously
Analyzes anomalies, trends, and patterns
Reasons about why costs changed, not just that they changed
Recommends specific, actionable optimizations
Takes action autonomously when configured to do so (with appropriate guardrails)

The key word is agent. It doesn't just pull a report. It thinks, reasons, and acts — like a cloud financial analyst who never sleeps and has read every AWS pricing page ever published.

The Architecture: What We're Building

Here's the high-level view of what the system looks like:

┌─────────────────────────────────────────────────────────────────┐
│                        AWS FinOps Agent                         │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │ Data Sources │    │  Agent Core  │    │   Action Layer   │  │
│  │              │    │              │    │                  │  │
│  │ Cost Explorer│───▶│   Bedrock    │───▶│ Auto Remediation │  │
│  │ CloudWatch   │    │   Agent      │    │ SNS Notifications│  │
│  │ Trusted Adv. │    │   (Claude)   │    │ Slack/Email      │  │
│  │ Resource Tags│    │              │    │ Cost Reports     │  │
│  └──────────────┘    └──────────────┘    └──────────────────┘  │
│         │                   │                      │           │
│         └──────────────────▼──────────────────────┘           │
│                    ┌──────────────┐                            │
│                    │  Orchestrator │                            │
│                    │  (Lambda)    │                            │
│                    └──────────────┘                            │
│                           │                                    │
│                    ┌──────▼───────┐                            │
│                    │  DynamoDB    │                            │
│                    │  (Memory &   │                            │
│                    │   History)   │                            │
│                    └──────────────┘                            │
└─────────────────────────────────────────────────────────────────┘

Key AWS services used:

Amazon Bedrock — The AI brain (Claude 3 Sonnet / Haiku via Bedrock Agents)
AWS Cost Explorer API — Cost and usage data
AWS Lambda — Serverless orchestration
Amazon DynamoDB — Agent memory and history
Amazon CloudWatch — Metrics and alarms, with Amazon EventBridge for scheduled triggers
AWS Trusted Advisor — Cost optimization recommendations
Amazon SNS — Notifications and alerts
AWS Systems Manager — Secure parameter storage

Phase 1: Setting Up the Data Foundation

An agent is only as good as its data. The first thing we need is a clean, reliable pipeline that pulls cost data and makes it understandable.

Step 1: Enable Cost Explorer and Set Up the API

First, make sure Cost Explorer is enabled in your account. If it's not, go to Billing → Cost Explorer → Enable Cost Explorer. The console view is free, but the Cost Explorer API is billed at $0.01 per paginated request. For our use case that's negligible.
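To make "negligible" concrete, here is a rough back-of-the-envelope estimate. It assumes every request is billed at the $0.01 rate quoted above and that one collector run makes three requests (one per Cost Explorer call in the collector that follows); the function name and defaults are illustrative, not part of any AWS SDK.

```python
# Rough Cost Explorer API spend estimate (illustrative helper, not an AWS API).
CE_PRICE_PER_REQUEST_USD = 0.01  # rate quoted in the text above


def estimate_monthly_api_cost(requests_per_run, runs_per_day, days_per_month=30):
    """Return the estimated monthly API charge in USD."""
    total_requests = requests_per_run * runs_per_day * days_per_month
    return round(total_requests * CE_PRICE_PER_REQUEST_USD, 2)


# One run a day versus one run every five minutes (288 runs a day).
daily = estimate_monthly_api_cost(requests_per_run=3, runs_per_day=1)
every_five_minutes = estimate_monthly_api_cost(requests_per_run=3, runs_per_day=288)
print(f"daily: ${daily:.2f}/month, every 5 min: ${every_five_minutes:.2f}/month")
```

Under these assumptions a daily run costs about $0.90 a month, while polling every five minutes costs about $259.20, which is why the collector is scheduled rather than run continuously.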

Now let's create a Lambda function that pulls our cost data:

# cost_data_collector.py
import boto3
import json
from datetime import datetime, timedelta
from decimal import Decimal
import os

ce_client = boto3.client('ce', region_name='us-east-1')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['COST_TABLE_NAME'])

def get_daily_costs(days_back=30):
    """
    Fetch daily cost breakdown by service for the past N days.
    We get granular service-level data so the agent can reason
    about *which* services are driving cost changes.
    """
    end_date = datetime.now().strftime('%Y-%m-%d')
    start_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d')

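    request = {
        'TimePeriod': {'Start': start_date, 'End': end_date},
        'Granularity': 'DAILY',
        'Metrics': ['UnblendedCost', 'UsageQuantity'],
        'GroupBy': [
            {'Type': 'DIMENSION', 'Key': 'SERVICE'},
            {'Type': 'DIMENSION', 'Key': 'REGION'}
        ]
    }

    # Grouping by service and region over 30 days can span several pages,
    # so follow NextPageToken until the API stops returning one.
    results = []
    while True:
        response = ce_client.get_cost_and_usage(**request)
        results.extend(response['ResultsByTime'])
        token = response.get('NextPageToken')
        if not token:
            break
        request['NextPageToken'] = token

    return results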


def get_cost_anomalies():
    """
    Pull AWS Cost Anomaly Detection results.
    This gives us AI-detected spikes so our agent can
    cross-reference its own analysis with AWS's built-in detection.
    """
    end_date = datetime.now().strftime('%Y-%m-%d')
    start_date = (datetime.now() - timedelta(days=14)).strftime('%Y-%m-%d')

    response = ce_client.get_anomalies(
        DateInterval={
            'StartDate': start_date,
            'EndDate': end_date
        },
        TotalImpact={
            'NumericOperator': 'GREATER_THAN',
            'StartValue': 50  # Only anomalies > $50 impact
        }
    )

    return response.get('Anomalies', [])


def get_rightsizing_recommendations():
    """
    Pull EC2 rightsizing recommendations from Cost Explorer.
    The agent will use these as starting points, then reason
    about whether they're actually safe to apply given context.
    """
    response = ce_client.get_rightsizing_recommendation(
        Service='AmazonEC2',
        Configuration={
            'RecommendationTarget': 'SAME_INSTANCE_FAMILY',
            'BenefitsConsidered': True
        }
    )

    return response.get('RightsizingRecommendations', [])


def lambda_handler(event, context):
    """Main handler — collects all cost data and stores it for the agent."""

    print("Starting cost data collection...")

    # Collect all our data sources
    daily_costs = get_daily_costs(days_back=30)
    anomalies = get_cost_anomalies()
    rightsizing = get_rightsizing_recommendations()

    # Build a summary for the agent to work with
    summary = {
        'collected_at': datetime.now().isoformat(),
        'daily_costs': daily_costs,
        'anomalies': anomalies,
        'rightsizing_recommendations': rightsizing,
        'total_anomalies_count': len(anomalies),
        'total_rightsizing_opportunities': len(rightsizing)
    }

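    # Store in DynamoDB with TTL (we keep 90 days of history).
    # Note: a DynamoDB item is capped at 400 KB, so a large account may need
    # to trim or compress this payload (or store it in S3) before writing.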
    ttl = int((datetime.now() + timedelta(days=90)).timestamp())

    table.put_item(
        Item={
            'pk': f"cost_snapshot#{datetime.now().strftime('%Y-%m-%d')}",
            'sk': datetime.now().isoformat(),
            'data': json.dumps(summary, default=str),
            'ttl': ttl
        }
    )

    print(f"Collected {len(anomalies)} anomalies, "
          f"{len(rightsizing)} rightsizing recommendations")

    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Cost data collected successfully',
            'anomalies_found': len(anomalies),
            'rightsizing_opportunities': len(rightsizing)
        })
    }

💡 A note on cost: Calling the Cost Explorer API frequently adds up. I schedule this collector to run once daily via EventBridge. For real-time anomaly alerts, I rely on the AWS Cost Anomaly Detection service's native notifications, which are free.
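The once-a-day schedule mentioned above could be wired up roughly as follows. This is a sketch rather than the author's setup: the rule name, target ID, and the 06:00 UTC cron expression are assumptions, and the clients are injectable so the function can be exercised with stubs.

```python
def schedule_daily_collector(function_arn, rule_name="finops-daily-collector",
                             cron="cron(0 6 * * ? *)",
                             events=None, lambda_client=None):
    """Create an EventBridge rule that invokes the collector Lambda daily."""
    if events is None or lambda_client is None:
        import boto3  # imported lazily so stubs can be used without AWS access
        events = events or boto3.client("events")
        lambda_client = lambda_client or boto3.client("lambda")

    # 1. The schedule itself (06:00 UTC every day by default).
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression=cron,
        State="ENABLED",
        Description="Runs the FinOps cost data collector once a day",
    )["RuleArn"]

    # 2. Allow EventBridge to invoke the function. Re-running this with the
    #    same StatementId raises a conflict error, so treat it as one-time setup.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # 3. Point the rule at the function.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "cost-data-collector", "Arn": function_arn}],
    )
    return rule_arn
```

In a production stack the same rule would more naturally live in the infrastructure template; the boto3 version is mainly useful for experimenting.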

Step 2: Enrich the Data with Resource Context

Raw cost numbers aren't enough. The agent needs context — what are these resources actually for? This is where tagging becomes essential.

# resource_context_collector.py
import boto3
import json
from datetime import datetime, timedelta  # used by get_idle_resources

def get_tagged_resources():
    """
    Pull all tagged resources so we can map costs to
    business context: team, environment, project, owner.

    This is the single most impactful thing you can do for
    FinOps — without tags, costs are just numbers. With tags,
    they become stories with accountable owners.
    """
    tagging_client = boto3.client('resourcegroupstaggingapi')

    paginator = tagging_client.get_paginator('get_resources')

    resources = []
    for page in paginator.paginate(
        TagFilters=[],  # Get everything
        ResourceTypeFilters=[
            'ec2:instance',
            'ec2:volume',
            'rds:db',
            'lambda:function',
            'ecs:cluster',
            'eks:cluster',
            's3:bucket'
        ]
    ):
        resources.extend(page['ResourceTagMappingList'])

    # Build a lookup: resource ARN → tags
    tag_map = {}
    for resource in resources:
        arn = resource['ResourceARN']
        tags = {tag['Key']: tag['Value'] for tag in resource['Tags']}
        tag_map[arn] = tags

    return tag_map


def get_idle_resources():
    """
    Find potentially idle resources using CloudWatch metrics.
    The agent will factor these into its recommendations.
    """
    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')

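    # Get all running instances. describe_instances is paginated, so walk
    # every page rather than reading only the first response.
    paginator = ec2.get_paginator('describe_instances')
    reservations = []
    for page in paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    ):
        reservations.extend(page['Reservations'])

    idle_candidates = []

    for reservation in reservations: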
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']

            # Check average CPU over the last 14 days
            cpu_response = cloudwatch.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=datetime.now() - timedelta(days=14),
                EndTime=datetime.now(),
                Period=86400,  # Daily granularity
                Statistics=['Average']
            )

            if cpu_response['Datapoints']:
                avg_cpu = sum(
                    d['Average'] for d in cpu_response['Datapoints']
                ) / len(cpu_response['Datapoints'])

                # Flag instances with < 5% average CPU as idle candidates
                if avg_cpu < 5.0:
                    idle_candidates.append({
                        'instance_id': instance_id,
                        'instance_type': instance['InstanceType'],
                        'avg_cpu_percent': round(avg_cpu, 2),
                        'tags': {
                            tag['Key']: tag['Value']
                            for tag in instance.get('Tags', [])
                        },
                        'launch_time': instance['LaunchTime'].isoformat()
                    })

    return idle_candidates
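Since the value of the tag map depends on tags actually being present, a small check for tagging gaps is a natural companion to get_tagged_resources. The sketch below is an illustration: the required keys are a hypothetical tagging policy, and the input is assumed to be the ARN-to-tags dictionary that get_tagged_resources returns.

```python
# Hypothetical tagging policy; adjust to whatever your organization requires.
REQUIRED_TAG_KEYS = ("team", "environment", "owner")


def find_tagging_gaps(tag_map, required_keys=REQUIRED_TAG_KEYS):
    """Return {arn: [missing tag keys]} for resources that violate the policy.

    Tag keys are compared case-insensitively, so 'Team' satisfies 'team'.
    """
    gaps = {}
    for arn, tags in tag_map.items():
        present = {key.lower() for key in tags}
        missing = [key for key in required_keys if key.lower() not in present]
        if missing:
            gaps[arn] = missing
    return gaps


example = {
    "arn:aws:ec2:us-east-1:123456789012:instance/i-0aaa": {
        "Team": "payments", "Environment": "dev", "Owner": "rahul",
    },
    "arn:aws:ec2:us-east-1:123456789012:instance/i-0bbb": {"Name": "scratch-box"},
}
print(find_tagging_gaps(example))
```

One caveat: as far as I know, the Resource Groups Tagging API reports only resources that carry (or previously carried) tags, so resources that were never tagged need a separate inventory, for example from describe_instances.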

Phase 2: Building the AI Agent Core with Amazon Bedrock

This is where things get interesting. We're going to use Amazon Bedrock Agents, which let us define a set of tools (called Action Groups) that the AI can call to investigate and act on cost data.

Step 3: Define the Bedrock Agent

First, let's set up the agent. You can do this in the AWS Console or CLI; the code here uses boto3. More importantly, let's design the system prompt: this is the agent's personality and operating instructions, and it matters more than most people realize.

# agent_setup.py
import boto3
import json

bedrock_agent = boto3.client('bedrock-agent', region_name='us-east-1')

AGENT_INSTRUCTION = """
You are a senior AWS FinOps Engineer and cloud cost optimization specialist.
Your role is to help engineering teams understand, analyze, and optimize
their AWS cloud spending.

## Your Core Responsibilities

1. ANALYZE cost data critically — don't just summarize numbers, explain
   *why* costs changed and what's driving them.

2. PRIORITIZE ruthlessly — rank recommendations by impact ($ saved) and
   effort (implementation complexity). A team with limited bandwidth needs
   to know what to do FIRST.

3. CONTEXTUALIZE your findings — always consider the business context.
   A production database running 24/7 is SUPPOSED to be expensive.
   A dev environment running at 3 AM on a Sunday is NOT.

4. BE SPECIFIC — never say "consider rightsizing your instances." Say
   "Instance i-0abc123 (m5.4xlarge, $0.768/hr) has averaged 3.2% CPU
   over 14 days. Downsizing to m5.xlarge ($0.192/hr) would save $415/month
   with minimal risk given its workload pattern."

5. ACCOUNT FOR RISK — always flag the risk level of each recommendation.
   Rightsizing a non-critical dev server is LOW risk. Rightsizing a
   production database during peak season is HIGH risk.

## Your Tone

You are direct, data-driven, and practical. You don't sugarcoat waste
but you also don't catastrophize. You give teams confidence to act.
You sound like a trusted colleague who has seen a lot of cloud bills,
not like a vendor trying to sell something.

## What You Always Do

- When you spot an anomaly, find its root cause before recommending action
- When you recommend savings, quantify them in monthly AND annual terms
- When multiple options exist, present them as a ranked list with tradeoffs
- When data is insufficient, say so and explain what additional context
  you need to give a confident recommendation
"""

def create_finops_agent():
    """Create the Bedrock Agent with our custom instructions."""

    # First, we need an IAM role for the agent
    # (assumed to already exist — see IAM setup section)
    agent_role_arn = "arn:aws:iam::YOUR_ACCOUNT:role/BedrockFinOpsAgentRole"

    response = bedrock_agent.create_agent(
        agentName='finops-cost-optimizer',
        agentResourceRoleArn=agent_role_arn,
        description='AI-powered AWS cost analysis and optimization agent',
        foundationModel='anthropic.claude-3-sonnet-20240229-v1:0',
        instruction=AGENT_INSTRUCTION,
        idleSessionTTLInSeconds=3600,  # 1 hour session timeout
    )

    agent_id = response['agent']['agentId']
    print(f"Created agent: {agent_id}")
    return agent_id
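A freshly created agent cannot be invoked yet: it has to be prepared and given an alias, and the alias ID is what the orchestrator later reads from BEDROCK_AGENT_ALIAS_ID. The following is a minimal sketch of that step, assuming the alias name "live" and a simple polling loop (both my choices, not something the original setup specifies). Any action groups from Step 4 should be attached before running it.

```python
import time


def prepare_and_alias(agent_id, alias_name="live", client=None,
                      poll_seconds=5, max_polls=24):
    """Prepare the DRAFT agent, wait until it is ready, then create an alias."""
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        client = boto3.client("bedrock-agent", region_name="us-east-1")

    client.prepare_agent(agentId=agent_id)

    # Preparation is asynchronous; poll the agent status until it settles.
    for _ in range(max_polls):
        status = client.get_agent(agentId=agent_id)["agent"]["agentStatus"]
        if status == "PREPARED":
            break
        if status == "FAILED":
            raise RuntimeError(f"Agent {agent_id} failed to prepare")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Agent {agent_id} was not PREPARED in time")

    alias = client.create_agent_alias(agentId=agent_id, agentAliasName=alias_name)
    return alias["agentAlias"]["agentAliasId"]
```

Typical usage would be `alias_id = prepare_and_alias(create_finops_agent())`, with the two IDs then stored as environment variables for the orchestrator.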

Step 4: Create Action Groups (The Agent's Tools)

Action Groups are the tools the agent can call. Think of them as the agent's hands — the ways it can reach out and get data or take action. We'll define these as Lambda functions.
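Before the Lambda can be called, the agent has to be told which functions exist and what parameters they take. A minimal sketch of that registration, using the function-details style of action group, might look like this. The group name "finops-tools", the descriptions, and the subset of three functions are my assumptions; the function and parameter names mirror the handlers defined in the Lambda that follows.

```python
def build_function_schema():
    """Describe a subset of the agent's tools in Bedrock's function-details format."""
    days_param = {
        "type": "integer",
        "description": "Number of days to look back (default 30)",
        "required": False,
    }
    return {
        "functions": [
            {
                "name": "get_cost_summary",
                "description": "Total spend for a period versus the preceding period",
                "parameters": {"days": days_param},
            },
            {
                "name": "get_cost_by_service",
                "description": "Top services by cost for a period",
                "parameters": {
                    "days": days_param,
                    "top_n": {
                        "type": "integer",
                        "description": "How many services to return (default 10)",
                        "required": False,
                    },
                },
            },
            {
                "name": "get_savings_opportunities",
                "description": "Prioritized rightsizing and Savings Plans opportunities",
                "parameters": {},
            },
        ]
    }


def register_action_group(agent_id, lambda_arn, client=None):
    """Attach the tool schema and its Lambda executor to the agent's DRAFT version."""
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        client = boto3.client("bedrock-agent", region_name="us-east-1")

    return client.create_agent_action_group(
        agentId=agent_id,
        agentVersion="DRAFT",
        actionGroupName="finops-tools",
        actionGroupExecutor={"lambda": lambda_arn},
        functionSchema=build_function_schema(),
    )
```

The Lambda also needs a resource-based policy that lets bedrock.amazonaws.com invoke it; without that, the agent's tool calls fail with an access error.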

# agent_actions.py — the Lambda that handles all agent tool calls

import boto3
import json
import os
from datetime import datetime, timedelta

def lambda_handler(event, context):
    """
    Router for all agent action calls.
    Bedrock sends us the action name and parameters;
    we dispatch to the right function and return the result.
    """

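    action_group = event.get('actionGroup')
    action = event.get('function')  # set for function-details action groups

    # Bedrock passes parameters as a list of {name, type, value} dicts;
    # flatten it so handlers can call params.get('days') and friends.
    parameters = {
        p['name']: p['value'] for p in (event.get('parameters') or [])
    }

    print(f"Agent called action: {action_group}/{action}")
    print(f"Parameters: {json.dumps(parameters)}")

    # Route to the correct handler
    result = dispatch_action(action, parameters)

    # Return in the format Bedrock expects: the function result is wrapped
    # in a top-level 'response' next to 'messageVersion'.
    return {
        'messageVersion': event.get('messageVersion', '1.0'),
        'response': {
            'actionGroup': action_group,
            'function': action,
            'functionResponse': {
                'responseBody': {
                    'TEXT': {
                        'body': json.dumps(result, default=str)
                    }
                }
            }
        }
    }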


def dispatch_action(action, parameters):
    """Route action calls to handler functions."""

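    # Every function referenced here must be defined in (or imported into)
    # this module; a missing one raises NameError when the dict is built.
    handlers = {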
        'get_cost_summary': get_cost_summary,
        'get_cost_by_service': get_cost_by_service,
        'get_cost_anomalies': get_cost_anomalies,
        'get_idle_resources': get_idle_resources,
        'get_savings_opportunities': get_savings_opportunities,
        'get_reserved_instance_coverage': get_ri_coverage,
        'send_alert': send_alert,
        'create_budget': create_budget,
    }

    handler = handlers.get(action)
    if not handler:
        return {'error': f'Unknown action: {action}'}

    return handler(parameters)


def get_cost_summary(params):
    """
    Get a high-level cost summary for a given time period.
    The agent uses this as its starting point for any analysis session.
    """
    ce = boto3.client('ce', region_name='us-east-1')

    days = int(params.get('days', 30))
    end = datetime.now().strftime('%Y-%m-%d')
    start = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')

    # Current period cost
    current = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},
        Granularity='MONTHLY',
        Metrics=['UnblendedCost']
    )

    # Preceding period of the same length (for comparison)
    prev_end = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
    prev_start = (datetime.now() - timedelta(days=days*2)).strftime('%Y-%m-%d')

    previous = ce.get_cost_and_usage(
        TimePeriod={'Start': prev_start, 'End': prev_end},
        Granularity='MONTHLY',
        Metrics=['UnblendedCost']
    )

    # Calculate totals
    current_total = sum(
        float(period['Total']['UnblendedCost']['Amount'])
        for period in current['ResultsByTime']
    )

    previous_total = sum(
        float(period['Total']['UnblendedCost']['Amount'])
        for period in previous['ResultsByTime']
    )

    pct_change = ((current_total - previous_total) / previous_total * 100
                  if previous_total > 0 else 0)

    return {
        'period': f'{start} to {end}',
        'current_total_usd': round(current_total, 2),
        'previous_period_total_usd': round(previous_total, 2),
        'change_percent': round(pct_change, 1),
        'trend': 'increasing' if pct_change > 5 else 
                 'decreasing' if pct_change < -5 else 'stable',
        'daily_average_usd': round(current_total / days, 2),
        'projected_monthly_usd': round((current_total / days) * 30, 2)
    }


def get_cost_by_service(params):
    """
    Break down costs by service with month-over-month comparison.
    This tells the agent WHICH services are the culprits.
    """
    ce = boto3.client('ce', region_name='us-east-1')

    days = int(params.get('days', 30))
    end = datetime.now().strftime('%Y-%m-%d')
    start = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')

    response = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )

    # Aggregate by service
    service_costs = {}
    for period in response['ResultsByTime']:
        for group in period['Groups']:
            service = group['Keys'][0]
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            service_costs[service] = service_costs.get(service, 0) + cost

    # Sort by cost, return top N
    top_n = int(params.get('top_n', 10))
    sorted_services = sorted(
        service_costs.items(),
        key=lambda x: x[1],
        reverse=True
    )[:top_n]

    return {
        'period': f'{start} to {end}',
        'top_services': [
            {
                'service': service,
                'cost_usd': round(cost, 2),
                'percentage_of_total': round(
                    cost / sum(service_costs.values()) * 100, 1
                ) if service_costs else 0
            }
            for service, cost in sorted_services
        ]
    }


def get_savings_opportunities(params):
    """
    Aggregate all savings opportunities into one prioritized list.
    This is the agent's primary output for cost optimization sessions.
    """
    ce = boto3.client('ce', region_name='us-east-1')

    opportunities = []

    # 1. Rightsizing recommendations
    try:
        rightsizing = ce.get_rightsizing_recommendation(
            Service='AmazonEC2',
            Configuration={
                'RecommendationTarget': 'SAME_INSTANCE_FAMILY',
                'BenefitsConsidered': True
            }
        )

        for rec in rightsizing.get('RightsizingRecommendations', []):
            # The API returns the type in upper case: 'MODIFY' or 'TERMINATE'
            if rec['RightsizingType'] == 'MODIFY':
                current_monthly = float(
                    rec['CurrentInstance']['MonthlyCost']
                )
                for mod in rec.get('ModifyRecommendationDetail', {}).get(
                    'TargetInstances', []
                ):
                    target_monthly = float(mod['EstimatedMonthlyCost'])
                    savings = current_monthly - target_monthly

                    if savings > 10:  # Only significant savings
                        opportunities.append({
                            'type': 'rightsizing',
                            'resource_id': rec['CurrentInstance']['ResourceId'],
                            'current_instance_type': rec['CurrentInstance']['ResourceDetails']['EC2ResourceDetails']['InstanceType'],
                            'recommended_instance_type': mod['ResourceDetails']['EC2ResourceDetails']['InstanceType'],
                            'monthly_savings_usd': round(savings, 2),
                            'annual_savings_usd': round(savings * 12, 2),
                            'risk': 'medium',
                            'action': 'Resize EC2 instance'
                        })
    except Exception as e:
        print(f"Rightsizing lookup failed: {e}")

    # 2. Savings Plans recommendations
    try:
        sp_recs = ce.get_savings_plans_purchase_recommendation(
            SavingsPlansType='COMPUTE_SP',
            TermInYears='ONE_YEAR',
            PaymentOption='PARTIAL_UPFRONT',
            LookbackPeriodInDays='THIRTY_DAYS'
        )

        for rec in sp_recs.get('SavingsPlansPurchaseRecommendation', {}).get(
            'SavingsPlansPurchaseRecommendationDetails', []
        ):
            monthly_savings = float(
                rec.get('EstimatedMonthlySavingsAmount', 0)
            )
            if monthly_savings > 50:
                opportunities.append({
                    'type': 'savings_plan',
                    'recommended_hourly_commitment': rec.get('HourlyCommitmentToPurchase'),
                    'monthly_savings_usd': round(monthly_savings, 2),
                    'annual_savings_usd': round(monthly_savings * 12, 2),
                    'estimated_savings_percent': rec.get('EstimatedSavingsPercentage'),
                    'risk': 'low',
                    'action': 'Purchase Compute Savings Plan'
                })
    except Exception as e:
        print(f"Savings Plans lookup failed: {e}")

    # Sort by annual savings impact
    opportunities.sort(key=lambda x: x.get('annual_savings_usd', 0), reverse=True)

    total_potential_annual_savings = sum(
        o.get('annual_savings_usd', 0) for o in opportunities
    )

    return {
        'total_opportunities': len(opportunities),
        'total_potential_annual_savings_usd': round(total_potential_annual_savings, 2),
        'opportunities': opportunities[:20]  # Top 20
    }


def send_alert(params):
    """Send a cost alert notification via SNS."""
    sns = boto3.client('sns')

    topic_arn = os.environ.get('ALERT_SNS_TOPIC_ARN')
    if not topic_arn:
        return {'error': 'SNS topic not configured'}

    message = params.get('message', 'FinOps Agent Alert')
    severity = params.get('severity', 'INFO')

    subject = f"[FinOps Agent | {severity}] AWS Cost Alert"

    sns.publish(
        TopicArn=topic_arn,
        Subject=subject,
        Message=message
    )

    return {'status': 'alert_sent', 'severity': severity}
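The dispatcher above refers to get_ri_coverage, which is not shown in this file. One possible shape for it is sketched below, as an assumption about what the handler might do rather than the author's implementation. It uses the Cost Explorer get_reservation_coverage and get_savings_plans_coverage calls, reads only the first page of each, and accepts an optional client so it can be tried against a stub.

```python
from datetime import datetime, timedelta


def get_ri_coverage(params, ce=None):
    """Summarize Reserved Instance and Savings Plans coverage for a period."""
    if ce is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        ce = boto3.client("ce", region_name="us-east-1")

    days = int(params.get("days", 30))
    end = datetime.now().strftime("%Y-%m-%d")
    start = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
    period = {"Start": start, "End": end}

    # Share of running instance hours covered by Reserved Instances.
    ri_pct = None
    try:
        ri = ce.get_reservation_coverage(TimePeriod=period)
        hours = ri.get("Total", {}).get("CoverageHours", {})
        ri_pct = round(float(hours.get("CoverageHoursPercentage") or 0), 1)
    except Exception as e:
        print(f"RI coverage lookup failed: {e}")

    # Share of eligible spend covered by Savings Plans, summed over the period.
    sp_pct = None
    try:
        sp = ce.get_savings_plans_coverage(TimePeriod=period)
        covered = total = 0.0
        for item in sp.get("SavingsPlansCoverages", []):
            coverage = item.get("Coverage", {})
            covered += float(coverage.get("SpendCoveredBySavingsPlans") or 0)
            total += float(coverage.get("TotalCost") or 0)
        sp_pct = round(covered / total * 100, 1) if total else 0.0
    except Exception as e:
        print(f"Savings Plans coverage lookup failed: {e}")

    return {
        "period": f"{start} to {end}",
        "ri_coverage_hours_percent": ri_pct,
        "savings_plans_coverage_percent": sp_pct,
    }
```

A value of None means the lookup failed (for example, because the account has no commitments to report on), which lets the agent distinguish "no data" from "0% covered".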

Phase 3: The Orchestrator — Bringing It All Together

The orchestrator is the Lambda that runs on a schedule and drives the agent through a structured analysis workflow. Think of it as the FinOps analyst's morning routine.

# orchestrator.py
import boto3
import json
import os
from datetime import datetime

bedrock_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')
dynamodb = boto3.resource('dynamodb')
sns = boto3.client('sns')

AGENT_ID = os.environ['BEDROCK_AGENT_ID']
AGENT_ALIAS_ID = os.environ['BEDROCK_AGENT_ALIAS_ID']
HISTORY_TABLE = os.environ['HISTORY_TABLE']
REPORT_TOPIC_ARN = os.environ['REPORT_TOPIC_ARN']


def invoke_agent(prompt, session_id):
    """
    Send a prompt to the Bedrock Agent and collect the full response.
    Bedrock streams events, so we need to consume them all.
    """
    response = bedrock_runtime.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        inputText=prompt
    )

    # Consume the streaming response
    full_response = ""
    for event in response.get('completion', []):
        if 'chunk' in event:
            chunk = event['chunk']
            if 'bytes' in chunk:
                full_response += chunk['bytes'].decode('utf-8')

    return full_response


def run_daily_analysis():
    """
    The main daily FinOps analysis workflow.

    We break this into multiple focused agent calls rather than
    one giant prompt. This gives the agent better context for each
    question and produces more actionable, specific output.
    """

    session_id = f"daily-analysis-{datetime.now().strftime('%Y%m%d')}"

    print(f"Starting daily FinOps analysis. Session: {session_id}")

    # ── Step 1: Cost Overview ──────────────────────────────────────────
    print("Step 1: Getting cost overview...")

    overview_prompt = """
    Start my daily FinOps briefing.

    1. Call get_cost_summary for the last 30 days
    2. Call get_cost_by_service for the top 10 services
    3. Give me a concise executive summary: current spend rate,
       trend direction, and the 3 most important things I should
       know about this month's costs so far.

    Be specific with numbers. Keep it under 300 words.
    """

    overview = invoke_agent(overview_prompt, session_id)

    # ── Step 2: Anomaly Investigation ─────────────────────────────────
    print("Step 2: Investigating anomalies...")

    anomaly_prompt = """
    Now let's look at anomalies.

    Call get_cost_anomalies and investigate any anomalies found.
    For each anomaly:
    - What service/region is affected?
    - What's the financial impact?
    - What's your hypothesis for the root cause?
    - What should I check first to confirm or rule out that hypothesis?

    If there are no anomalies, say so clearly and move on.
    """

    anomaly_analysis = invoke_agent(anomaly_prompt, session_id)

    # ── Step 3: Optimization Opportunities ────────────────────────────
    print("Step 3: Finding optimization opportunities...")

    optimization_prompt = """
    Now let's find savings opportunities.

    Call get_savings_opportunities and get_idle_resources.
    Then give me a prioritized action list:

    Format each recommendation as:
    [PRIORITY #X | $Y/month savings | RISK: Low/Medium/High]
    What to do: <specific action>
    Why now: <business justification>
    How to do it: <concrete first steps>

    Focus on the top 5 highest-impact, lowest-risk opportunities.
    Include total potential annual savings at the end.
    """

    optimizations = invoke_agent(optimization_prompt, session_id)

    # ── Step 4: Reserved Instance Health Check ─────────────────────────
    print("Step 4: RI/Savings Plan coverage check...")

    ri_prompt = """
    Call get_reserved_instance_coverage.

    How is our Reserved Instance and Savings Plan coverage?
    Are we leaving discount money on the table?
    Give me a simple coverage score (% of compute spend covered by commitments)
    and one clear recommendation if we should buy more.
    """

    ri_analysis = invoke_agent(ri_prompt, session_id)

    # ── Compile the Full Report ────────────────────────────────────────
    report = compile_report(overview, anomaly_analysis, optimizations, ri_analysis)

    # Store in DynamoDB
    store_report(session_id, report)

    # Send via SNS
    send_report(report)

    print("Daily analysis complete.")
    return report


def compile_report(overview, anomalies, optimizations, ri_analysis):
    """Compile all sections into a formatted daily report."""

    today = datetime.now().strftime('%B %d, %Y')

    report = f"""
=================================================================
  🤖 AWS FinOps Agent — Daily Cost Intelligence Report
  {today}
=================================================================

📊 COST OVERVIEW
-----------------------------------------------------------------
{overview}

🚨 ANOMALY ANALYSIS
-----------------------------------------------------------------
{anomalies}

💡 OPTIMIZATION OPPORTUNITIES
-----------------------------------------------------------------
{optimizations}

🏷️ COMMITMENT COVERAGE (RIs & SAVINGS PLANS)
-----------------------------------------------------------------
{ri_analysis}

=================================================================
  Generated by AWS FinOps Agent | Powered by Amazon Bedrock
=================================================================
"""
    return report


def store_report(session_id, report):
    """Store the report in DynamoDB for historical reference."""
    table = dynamodb.Table(HISTORY_TABLE)

    table.put_item(
        Item={
            'pk': f"report#{datetime.now().strftime('%Y-%m')}",
            'sk': datetime.now().isoformat(),
            'session_id': session_id,
            'report': report,
            'generated_at': datetime.now().isoformat()
        }
    )


def send_report(report):
    """Send the compiled report via SNS (email, Slack, etc.)."""
    # SNS subjects must be plain ASCII, so use a hyphen rather than a dash.
    sns.publish(
        TopicArn=REPORT_TOPIC_ARN,
        Subject=f"Daily FinOps Report - {datetime.now().strftime('%b %d, %Y')}",
        Message=report
    )


def lambda_handler(event, context):
    """Entry point. Can be triggered by EventBridge schedule or manually."""

    trigger_type = event.get('trigger_type', 'scheduled')

    if trigger_type == 'scheduled':
        report = run_daily_analysis()
        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': 'Daily analysis complete',
                'report_length': len(report)
            })
        }

    elif trigger_type == 'ad_hoc':
        # Handle ad-hoc queries from users
        question = event.get('question', '')
        if not question:
            return {'statusCode': 400, 'body': 'No question provided'}

        session_id = f"adhoc-{datetime.now().strftime('%Y%m%d%H%M%S')}"
        response = invoke_agent(question, session_id)

        return {
            'statusCode': 200,
            'body': json.dumps({'response': response})
        }

    return {'statusCode': 400, 'body': 'Unknown trigger type'}
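The ad_hoc branch of the handler makes it possible to ask the agent one-off questions from a script or notebook. Here is a small client-side sketch; the function name "finops-orchestrator" is a placeholder for whatever the orchestrator Lambda is actually called, and the payload shape follows the handler above.

```python
import json


def ask_finops_agent(question, function_name="finops-orchestrator",
                     lambda_client=None):
    """Invoke the orchestrator synchronously with an ad-hoc question."""
    if lambda_client is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        lambda_client = boto3.client("lambda")

    result = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the agent's answer
        Payload=json.dumps(
            {"trigger_type": "ad_hoc", "question": question}
        ).encode("utf-8"),
    )

    # The handler returns {'statusCode': ..., 'body': ...}; on success the
    # body is a JSON string with a 'response' key.
    payload = json.loads(result["Payload"].read())
    if payload.get("statusCode") != 200:
        raise RuntimeError(f"Agent call failed: {payload.get('body')}")
    return json.loads(payload["body"])["response"]
```

For example, `ask_finops_agent("Why did EC2 spend jump last week?")` would return the agent's text answer. Agent calls can take a while, so both the Lambda timeout and the client's read timeout may need raising.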

Phase 4: Infrastructure as Code

Let's wire everything together with a clean CloudFormation template. I'm a big believer in IaC for anything you plan to run in production.

# finops-agent-stack.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS FinOps Agent — Complete Infrastructure Stack'

Parameters:
  AlertEmail:
    Type: String
    Description: Email address for FinOps alerts and reports

  Environment:
    Type: String
    Default: production
    AllowedValues: [development, staging, production]

Resources:

  # ── DynamoDB for agent memory and history ────────────────────────
  CostDataTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: !Sub 'finops-cost-data-${Environment}'
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: pk
          AttributeType: S
        - AttributeName: sk
          AttributeType: S
      KeySchema:
        - AttributeName: pk
          KeyType: HASH
        - AttributeName: sk
          KeyType: RANGE
      TimeToLiveSpecification:
        AttributeName: ttl
        Enabled: true
      Tags:
        - Key: Project
          Value: FinOpsAgent
        - Key: Environment
          Value: !Ref Environment

  ReportHistoryTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: !Sub 'finops-report-history-${Environment}'
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: pk
          AttributeType: S
        - AttributeName: sk
          AttributeType: S
      KeySchema:
        - AttributeName: pk
          KeyType: HASH
        - AttributeName: sk
          KeyType: RANGE

  # ── SNS Topics ───────────────────────────────────────────────────
  AlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub 'finops-alerts-${Environment}'
      Subscription:
        - Protocol: email
          Endpoint: !Ref AlertEmail

  ReportTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub 'finops-reports-${Environment}'
      Subscription:
        - Protocol: email
          Endpoint: !Ref AlertEmail

  # ── IAM Role for Lambda Functions ───────────────────────────────
  FinOpsLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub 'finops-lambda-role-${Environment}'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: FinOpsAgentPermissions
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
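              # Cost Explorer access (read-only)
              # Note: creating budgets is a write operation in the AWS Budgets
              # service, not Cost Explorer; grant it in its own statement
              # only if the agent needs it.
              - Effect: Allow
                Action:
                  - ce:GetCostAndUsage
                  - ce:GetRightsizingRecommendation
                  - ce:GetSavingsPlansPurchaseRecommendation
                  - ce:GetAnomalies
                  - ce:GetReservationCoverage
                Resource: '*'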

              # CloudWatch metrics (read-only)
              - Effect: Allow
                Action:
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:ListMetrics
                Resource: '*'

              # Resource tagging API (read-only)
              - Effect: Allow
                Action:
                  - tag:GetResources
                Resource: '*'

              # EC2 describe (read-only)
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeVolumes
                Resource: '*'

              # DynamoDB for memory and history
              - Effect: Allow
                Action:
                  - dynamodb:PutItem
                  - dynamodb:GetItem
                  - dynamodb:Query
                  - dynamodb:Scan
                Resource:
                  - !GetAtt CostDataTable.Arn
                  - !GetAtt ReportHistoryTable.Arn

              # SNS for notifications
              - Effect: Allow
                Action:
                  - sns:Publish
                Resource:
                  - !Ref AlertTopic
                  - !Ref ReportTopic

              # Bedrock for AI invocation
              - Effect: Allow
                Action:
                  - bedrock:InvokeAgent
                  - bedrock:InvokeModel
                Resource: '*'

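  # ── EventBridge Schedule (runs daily at 7 AM UTC) ────────────────
  # Assumes OrchestratorLambda is defined as an AWS::Lambda::Function
  # elsewhere in this template; without it, the references below fail.
  DailyAnalysisSchedule:
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub 'finops-daily-analysis-${Environment}'
      Description: 'Trigger daily FinOps analysis'
      ScheduleExpression: 'cron(0 7 * * ? *)'  # 7 AM UTC daily
      State: ENABLED
      Targets:
        - Id: OrchestratorLambda
          Arn: !GetAtt OrchestratorLambda.Arn
          Input: '{"trigger_type": "scheduled"}'

  # EventBridge needs explicit permission to invoke the function;
  # without this the rule fires but the invocation is denied.
  DailyAnalysisPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref OrchestratorLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt DailyAnalysisSchedule.Arn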

Outputs:
  CostDataTableName:
    Value: !Ref CostDataTable
    Export:
      Name: !Sub '${AWS::StackName}-CostTable'

  AlertTopicArn:
    Value: !Ref AlertTopic
    Export:
      Name: !Sub '${AWS::StackName}-AlertTopic'

Real Results: What This Agent Caught

I want to be honest here — I didn't build this to write a blog post. I built it because I kept getting surprised by my AWS bill. Here's what it actually found in the first month of running:

Week 1: Flagged an m5.2xlarge running in us-west-2 (a region I thought I had deprovisioned after a project ended three months earlier) with 0.8% average CPU. Monthly cost: $274. Status: terminated.

Week 2: Identified that we were paying for 47 unattached EBS volumes across multiple accounts. Total monthly waste: $183. Status: cleaned up in one hour.

Week 3: Caught a Lambda function stuck in an infinite retry loop that was hammering a DynamoDB table, generating 8 million extra reads in 36 hours. Cost spike: $340 above baseline. The agent flagged it in hour 3. Without the agent, I would have seen it at month-end.

Week 4: Recommended Compute Savings Plans at a $0.80/hr commitment based on our consistent baseline usage pattern. Estimated annual savings: $3,100. Status: purchased.

Month 1 total savings identified: ~$6,800 annualized. Setup time: about two weekends.
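
The Week 2 finding is the easiest one to reproduce by hand. Here is a minimal sketch of the idea, not the agent's actual tool code: it treats any volume in the `available` state as unattached and prices it at a flat $0.08 per GB-month. That price is an assumption (real EBS pricing varies by volume type and region), and the helper names `estimate_unattached_cost` and `list_volumes` are hypothetical.

```python
# Assumed flat price; real EBS pricing depends on volume type and region.
ASSUMED_PRICE_PER_GB_MONTH = 0.08


def estimate_unattached_cost(volumes, price_per_gb=ASSUMED_PRICE_PER_GB_MONTH):
    """Return (unattached_volumes, estimated_monthly_cost_usd).

    `volumes` is a list of dicts shaped like the entries in the
    `Volumes` list returned by EC2 DescribeVolumes.
    """
    # A volume in the 'available' state is not attached to any instance.
    unattached = [v for v in volumes if v.get('State') == 'available']
    monthly_cost = sum(v['Size'] * price_per_gb for v in unattached)
    return unattached, round(monthly_cost, 2)


def list_volumes(region_name):
    """Fetch all EBS volumes in one region (requires boto3 and credentials)."""
    import boto3  # imported lazily so the estimator works without AWS access

    ec2 = boto3.client('ec2', region_name=region_name)
    volumes = []
    for page in ec2.get_paginator('describe_volumes').paginate():
        volumes.extend(page['Volumes'])
    return volumes


if __name__ == '__main__':
    # Hypothetical sample data standing in for a real DescribeVolumes call.
    sample = [
        {'VolumeId': 'vol-aaa', 'Size': 100, 'State': 'available'},
        {'VolumeId': 'vol-bbb', 'Size': 50, 'State': 'in-use'},
        {'VolumeId': 'vol-ccc', 'Size': 20, 'State': 'available'},
    ]
    orphaned, cost = estimate_unattached_cost(sample)
    print([v['VolumeId'] for v in orphaned], cost)
```

Keeping the estimator separate from the boto3 call makes it easy to run the same check per region or per account and sum the results.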

What I'd Do Differently

Building this taught me a few hard lessons worth sharing:

1. Start with tagging, not with AI. Seriously. The agent is only as smart as the data you feed it. If your resources aren't tagged, the agent can tell you costs went up but can't tell you whose costs went up. Spend a week getting your tagging strategy right before anything else.

2. Don't let the agent act without human approval (at first). I originally wired the agent to auto-terminate idle resources. It almost deleted a "dev" instance that was actually running a critical overnight batch job. Gate all destructive actions behind an approval step until you've built enough trust in the agent's judgment.

3. Multi-account is harder than you think. If you're running AWS Organizations, Cost Explorer can aggregate across accounts, but the IAM permissions get complex fast. Use a management account role and cross-account assume-role patterns. Don't try to run a separate agent per account.

4. The system prompt is your most important code. I spent more time iterating on the agent's instructions than on any Lambda function. A vague system prompt produces vague analysis. Specific, opinionated instructions produce specific, actionable recommendations.
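
Lesson 2 is worth making concrete. One way to gate destructive actions is a small check that sits between the agent's recommendation and whatever executes it. The sketch below is an illustration under assumptions, not the orchestrator code from this post: `ProposedAction`, `gate_action`, and the action names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical action names; adjust to whatever your agent can actually do.
DESTRUCTIVE_ACTIONS = {'terminate_instance', 'delete_volume', 'stop_instance'}


@dataclass
class ProposedAction:
    action: str            # e.g. 'terminate_instance'
    resource_id: str       # e.g. 'i-0123456789abcdef0'
    reason: str            # the agent's justification, shown to the approver
    approved_by: str = ''  # empty until a human signs off


def gate_action(proposed):
    """Decide whether a proposed action may run now.

    Returns 'execute' for non-destructive actions and for destructive
    actions that carry an approver, and 'needs_approval' otherwise.
    """
    if proposed.action not in DESTRUCTIVE_ACTIONS:
        return 'execute'
    if proposed.approved_by:
        return 'execute'
    return 'needs_approval'


if __name__ == '__main__':
    idle = ProposedAction('terminate_instance', 'i-0123456789abcdef0',
                          'Average CPU below 1% for 14 days')
    print(gate_action(idle))   # needs_approval
    idle.approved_by = 'oncall@example.com'
    print(gate_action(idle))   # execute
```

Anything that comes back as `needs_approval` can be published to the alert topic with the agent's reason attached, so the approver sees why the action was proposed before signing off.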

What's Next

This is v1. Here's where I'm planning to take it next:

Slack bot interface — Let engineers ask ad-hoc questions in natural language: "Why did our ECS costs go up this week?"
Multi-account support — Federate across AWS Organizations member accounts
Forecasting — Use historical patterns to predict next month's bill and flag if we're trending over budget
Auto-tagging — Use the agent to infer tags for untagged resources based on naming conventions and context
Team showbacks — Automatically generate per-team cost reports and post them to team Slack channels
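
For the forecasting item, the simplest possible starting point is a run-rate projection: divide month-to-date spend by the days elapsed and multiply by the days in the month. The sketch below assumes exactly that linear model; the function names and the dollar figures are hypothetical, and it ignores seasonality and one-off charges, which a real forecast would need to handle.

```python
import calendar
from datetime import date


def project_month_end(month_to_date_spend, today):
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_run_rate = month_to_date_spend / today.day
    return round(daily_run_rate * days_in_month, 2)


def is_trending_over(month_to_date_spend, today, budget):
    """True when the projected month-end spend exceeds the budget."""
    return project_month_end(month_to_date_spend, today) > budget


if __name__ == '__main__':
    # Hypothetical numbers: $450 spent by June 15 against an $800 budget.
    today = date(2024, 6, 15)
    print(project_month_end(450.0, today))               # 900.0
    print(is_trending_over(450.0, today, budget=800.0))  # True
```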

Getting Started Today

If you want to spin this up, here's the fastest path:

# 1. Clone the repo (or create the files from this post)
# 2. Deploy the CloudFormation stack
aws cloudformation deploy \
  --template-file finops-agent-stack.yaml \
  --stack-name finops-agent \
  --parameter-overrides AlertEmail=you@yourdomain.com \
  --capabilities CAPABILITY_NAMED_IAM

# 3. Create the Bedrock Agent (via console or CLI)
# See the agent_setup.py script above

# 4. Deploy the Lambda functions
# (package and deploy cost_data_collector.py,
#  agent_actions.py, and orchestrator.py)

# 5. Add the agent ID and alias to your Lambda env vars
# 6. Wait for the first daily run, or trigger manually
#    (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload):
aws lambda invoke \
  --function-name finops-orchestrator \
  --cli-binary-format raw-in-base64-out \
  --payload '{"trigger_type": "ad_hoc", "question": "What are my top 5 cost savings opportunities right now?"}' \
  response.json

cat response.json
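
One thing to watch when reading `response.json`: the `body` field is itself a JSON string, because `lambda_handler` calls `json.dumps` on it. Getting to the agent's answer therefore takes two decoding steps. A small sketch, assuming the response shape returned by the handler shown earlier; `extract_answer` is a hypothetical helper, not part of the repo.

```python
import json


def extract_answer(raw_response):
    """Pull the agent's answer out of the orchestrator's Lambda response."""
    outer = json.loads(raw_response)
    if outer.get('statusCode') != 200:
        # Error bodies may be plain strings rather than JSON, so report as-is.
        raise RuntimeError(
            f"Orchestrator returned {outer.get('statusCode')}: {outer.get('body')}"
        )
    inner = json.loads(outer['body'])  # second decode: body is a JSON string
    return inner['response']


if __name__ == '__main__':
    # Hypothetical sample mirroring the handler's ad_hoc success response.
    sample = json.dumps({
        'statusCode': 200,
        'body': json.dumps({'response': 'Top opportunity: unattached EBS volumes.'}),
    })
    print(extract_answer(sample))
```

Against a real invocation, pass in the file contents: `extract_answer(open('response.json').read())`.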

Final Thoughts

Cloud costs are one of those things that everyone knows matter but few teams manage proactively. The reason isn't lack of willpower — it's that cost analysis is genuinely complex, requires cross-referencing multiple data sources, and needs to happen continuously, not just at month-end.

That's exactly what AI agents are good at.

This FinOps agent won't replace your finance team or your cloud architects. But it will make sure that the 3 AM forgotten cluster, the idle dev environment, and the infinite Lambda retry loop get caught before they show up as line items on a bill that's already gone out.

Start small. Get the data pipeline working. Let the agent run in read-only mode for a week. Trust it gradually. The ROI will be obvious pretty quickly.

If you build something similar or extend this in an interesting direction, I'd genuinely love to hear about it. Drop a comment or reach out — this is one of those areas where the community has a lot to teach each other.

Thanks for reading. If this was useful, consider following me here on dev.to — I write about building real things on AWS, with the rough edges included.

Tags: #aws #finops #bedrock #cloudcost #serverless #aiagents #lambda #costoptimization

Top comments (1)

Trigops • Jun 19

The "tool design matters more than model choice" lesson is one I'd double down on. In FinOps agents specifically, the hardest part isn't the LLM — it's deciding the right granularity for your tools. Too coarse and you get answers like "EC2 costs went up"; too fine and the model drowns in noise before it can reason across services.

One pattern worth adding to your next iteration: separate the detection tools (what changed?) from the attribution tools (why did it change?). Your anomaly detector flags the signal; a second tool layer pulls resource tags, launch events, and linked account context to explain it. The model chains them naturally and the final answer lands with enough context to actually act on — not just "SageMaker costs spiked" but "untagged training job, team X, started after the Tuesday deploy."

Also appreciate the incremental autonomy framing. Advisory mode first, then write access with guardrails — that's exactly the right trust ramp for anything touching production spend.