DEV Community

Rick Wise

How I Built an "Agentic" AWS Cost Optimizer (That Doesn't Break Production)

I’ve spent 25 years in the software industry, and I’ve learned one universal truth: engineers are terrified of deleting things.

We all have that one EC2 instance named test-do-not-delete-final that has been running for 3 years. We know it’s probably waste. The dashboard says it’s waste. But nobody deletes it. Why?

Because the risk of breaking production is infinite, and the reward of saving $50/month is zero.

This is the "Fear Tax." And it’s why most FinOps tools fail. They give you a list of 1,000 "optimization opportunities," and you ignore them all because you don't have time to manually verify safety for each one.

I built CloudWise Agentic Tier to solve this. It’s an agent that doesn't just find waste—it safely removes it after explicit approval, with a rollback guarantee.

Here is the technical deep dive on how I built the safety architecture using Python, Boto3, and Cross-Account IAM Roles.


The Architecture: "Safety First"

The core design philosophy is Reversibility. Every destructive action must be reversible. If it can't be undone, the agent isn't allowed to touch it.
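
One way to enforce that rule in code is an allowlist that pairs each destructive action with the action that undoes it, and rejects everything else by default. The sketch below is hypothetical: `ROLLBACK_FOR`, `is_reversible`, and `filter_plan` are illustrative names rather than CloudWise internals, and the listed pairs are examples only.

```python
# Hypothetical allowlist: destructive action -> action that restores it.
# Entries are illustrative; a real table would be curated per service.
ROLLBACK_FOR = {
    ("ec2", "delete_volume"): ("ec2", "create_volume"),    # restore from a snapshot
    ("ec2", "stop_instances"): ("ec2", "start_instances"),
    ("ec2", "release_address"): None,                       # the same IP may not come back
}


def is_reversible(service, action):
    """An action is allowed only if a rollback action is registered for it."""
    return ROLLBACK_FOR.get((service, action)) is not None


def filter_plan(steps):
    """Split plan steps into (allowed, rejected) by reversibility."""
    allowed, rejected = [], []
    for step in steps:
        bucket = allowed if is_reversible(step["service"], step["action"]) else rejected
        bucket.append(step)
    return allowed, rejected
```

Unknown actions fall into `rejected` because they have no entry, so adding support for a new action means writing its rollback first.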

The Workflow

  1. Scan: Identify idle resources (e.g., EBS volumes unattached > 7 days).
  2. Pre-Check: Run read-only calls to verify the resource state and resolve dependencies.
  3. Snapshot: Take a final backup (e.g., CreateSnapshot).
  4. Dry Run: Simulate the deletion (for example with EC2's DryRun flag) to confirm the IAM permissions are in place before anything is changed.
  5. Execute: Perform the destructive action.
  6. Rollback (Optional): If anything breaks, one-click restore.
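
For a single unattached EBS volume, steps 2 through 6 can be sketched roughly as follows. This is a minimal illustration, not the CloudWise implementation: the function names are hypothetical, `ec2` is assumed to be a boto3 EC2 client, and the dry run relies on EC2 reporting a `DryRunOperation` error code when the call would have succeeded and `UnauthorizedOperation` when it would not.

```python
def dry_run_permits(call, **params):
    """Return True if EC2 reports that the call would succeed under DryRun."""
    try:
        call(DryRun=True, **params)
    except Exception as exc:  # botocore's ClientError carries a .response dict
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "DryRunOperation":
            return True        # permitted; nothing was changed
        if code == "UnauthorizedOperation":
            return False       # the role lacks the permission
        raise                  # anything else is a genuine failure
    return False               # a dry run should never return normally


def snapshot_then_delete(ec2, volume_id):
    """Pre-check, snapshot, dry run, delete. Returns what rollback needs."""
    # Pre-check: refuse to touch a volume that is attached or in transition
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    if volume["State"] != "available":
        raise RuntimeError(f"{volume_id} is {volume['State']}, refusing to delete")

    # Snapshot: take the final backup and wait until it is complete
    snapshot_id = ec2.create_snapshot(
        VolumeId=volume_id, Description=f"Final backup of {volume_id}"
    )["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Dry run: confirm the delete is permitted before executing it
    if not dry_run_permits(ec2.delete_volume, VolumeId=volume_id):
        raise PermissionError(f"not allowed to delete {volume_id}")

    # Execute
    ec2.delete_volume(VolumeId=volume_id)
    return {"snapshot_id": snapshot_id, "availability_zone": volume["AvailabilityZone"]}


def restore_volume(ec2, record):
    """Rollback: recreate a volume from the snapshot (it gets a new volume ID)."""
    return ec2.create_volume(
        SnapshotId=record["snapshot_id"],
        AvailabilityZone=record["availability_zone"],
    )["VolumeId"]
```

Note that the restored volume has a new ID, so anything that referenced the old ID would need to be re-pointed as part of the rollback.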

[Diagram: CloudWise Agentic Architecture]


The "Secret Sauce": Pre-Checks & Placeholders

Most tools just run boto3.client('ec2').delete_volume(). That’s dangerous.

My agent uses a Pre-Check Phase to verify the resource state before generating the execution plan. It also resolves dynamic placeholders.

1. The Pre-Check Logic

Before we even think about deleting, we run a read-only probe.

def _execute_pre_checks(session, pre_checks):
    """
    Run read-only API calls to verify resource state.
    """
    results = []
    for check in pre_checks:
        # e.g. service="ec2", action="describe_volumes", params={"VolumeIds": ["vol-123"]}
        try:
            method = getattr(session.client(check['service']), check['action'])
            response = method(**check['params'])
            results.append({"success": True, "response": response})
        except Exception as e:
            results.append({"success": False, "error": str(e)})
    return results
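
The list returned by the pre-check phase can then gate the execution plan. The sketch below assumes the pre-checks were `describe_volumes` calls and uses a hypothetical helper name, `all_volumes_detached`; the real agent may apply different rules.

```python
def all_volumes_detached(pre_check_results):
    """True only if every pre-check succeeded and every volume is unattached."""
    if not pre_check_results:
        return False  # no evidence means no deletion
    for result in pre_check_results:
        if not result.get("success"):
            return False
        volumes = result["response"].get("Volumes", [])
        if not volumes:
            return False
        for volume in volumes:
            # An unattached EBS volume reports State "available" and no attachments
            if volume.get("State") != "available" or volume.get("Attachments"):
                return False
    return True
```

Failing closed (returning False on an empty or failed check) keeps an API error from being mistaken for a green light.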

2. Dynamic Placeholder Resolution

The planner doesn't always know the ID of the snapshot it will create. So I implemented a placeholder system.

The plan might look like this:

  1. ec2:CreateSnapshot (Target: vol-123)
  2. ec2:DeleteVolume (Target: vol-123)

But if we need to restore, we need the Snapshot ID that hasn't been created yet.

The system captures the output of Step 1 and injects it into the Rollback Plan using a placeholder like SNAPSHOT_ID_FROM_PRECHECK.

import json

def _resolve_placeholders(api_calls, pre_check_results):
    """
    Resolve dynamic placeholders like VOLUME_ID_FROM_PRECHECK
    using data from the pre-check phase.
    """
    # Maps each placeholder name to the value that should replace it
    lookup = _build_precheck_lookup(pre_check_results)

    resolved_calls = []
    for call in api_calls:
        params_str = json.dumps(call['params'])

        # Replace placeholders with actual values; str() is needed because
        # str.replace() raises TypeError on non-string values such as ints
        for key, value in lookup.items():
            params_str = params_str.replace(key, str(value))

        # Build a new dict so the caller's plan is not mutated
        resolved_calls.append({**call, 'params': json.loads(params_str)})

    return resolved_calls
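
String replacement on the serialized JSON can only ever produce string values. An alternative, sketched here under the assumption that placeholders always occupy a whole parameter value, walks the parameter structure instead and keeps each resolved value's type. `resolve_params` and the `SIZE_FROM_PRECHECK` placeholder are hypothetical names used for illustration.

```python
def resolve_params(params, lookup):
    """Return a copy of params with whole-string placeholders replaced."""
    if isinstance(params, dict):
        return {key: resolve_params(value, lookup) for key, value in params.items()}
    if isinstance(params, list):
        return [resolve_params(item, lookup) for item in params]
    if isinstance(params, str) and params in lookup:
        return lookup[params]  # keeps the looked-up value's type (str, int, ...)
    return params


lookup = {"SNAPSHOT_ID_FROM_PRECHECK": "snap-0abc", "SIZE_FROM_PRECHECK": 100}
rollback_params = {
    "SnapshotId": "SNAPSHOT_ID_FROM_PRECHECK",
    "Size": "SIZE_FROM_PRECHECK",
    "TagSpecifications": [
        {"Tags": [{"Key": "restored-from", "Value": "SNAPSHOT_ID_FROM_PRECHECK"}]}
    ],
}
resolved = resolve_params(rollback_params, lookup)
```

The trade-off is that a placeholder embedded inside a longer string is left alone, which the string-replace approach would have handled.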

Security: The "2-Hop" IAM Chain

Security is the biggest blocker for SaaS tools. I use a 2-Hop IAM Architecture to ensure strict isolation.

  1. Hop 1 (Service Role): The Lambda function assumes a CloudWiseServiceRole in my account. This acts as a bastion.
  2. Hop 2 (Customer Role): The Service Role assumes the CloudWiseRemediationRole in the customer's account.

Why 2 Hops?

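It allows me to rotate the internal Lambda roles without asking 100 customers to update their trust policies. The customer's trust policy names one static principal on my side (in the policy below, my AWS account), and which role sits behind it is my concern, not theirs.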
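
The two hops can be sketched as a pair of chained `sts:AssumeRole` calls. This is a minimal illustration with hypothetical names (`assume_role_chain`, `sts_client_factory`, the session name); the factory is injected so that the chain can be exercised without real credentials. One thing to keep in mind is that AWS caps a role-chained session at one hour.

```python
def assume_role_chain(sts_client_factory, service_role_arn, customer_role_arn, external_id):
    """Assume the service role, then use its credentials to assume the customer role.

    sts_client_factory(**credentials) must return an STS client, e.g.
    lambda **kw: boto3.client("sts", **kw)
    """
    def _assume(sts, role_arn, **extra):
        creds = sts.assume_role(
            RoleArn=role_arn,
            RoleSessionName="cloudwise-remediation",  # hypothetical session name
            **extra,
        )["Credentials"]
        # Rename to the keyword arguments boto3 clients and sessions accept
        return {
            "aws_access_key_id": creds["AccessKeyId"],
            "aws_secret_access_key": creds["SecretAccessKey"],
            "aws_session_token": creds["SessionToken"],
        }

    # Hop 1: the caller's own identity (the Lambda role) assumes the service role
    hop1 = _assume(sts_client_factory(), service_role_arn)
    # Hop 2: the service role assumes the customer's role, presenting the ExternalId
    return _assume(sts_client_factory(**hop1), customer_role_arn, ExternalId=external_id)
```

The returned dict can be passed straight to `boto3.Session(**credentials)` to build the session that the pre-checks and execution steps run under.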

The Customer Trust Policy

This is the only thing the customer installs. It trusts my AWS Account, not a specific user, but enforces an ExternalId to prevent "Confused Deputy" attacks.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "AWS": "arn:aws:iam::MY_PROD_ACCOUNT_ID:root" },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": { "sts:ExternalId": "CUSTOMER_UNIQUE_ID" }
            }
        }
    ]
}

Note: Using the account root as the principal delegates the decision to my account. IAM policies on my side control which role is allowed to assume the customer's role, so I can change that role without touching the customer's trust policy.
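
Since the ExternalId should be unique per customer and generated on the provider's side, the policy document is easy to produce programmatically. The helpers below are a hypothetical sketch (`new_external_id` and `build_trust_policy` are not from the CloudWise codebase), and the choice of a UUID as the ID format is an assumption.

```python
import json
import uuid


def new_external_id():
    """Generate a random per-customer ExternalId (UUID format is an assumption)."""
    return str(uuid.uuid4())


def build_trust_policy(provider_account_id, external_id):
    """Build the customer-side trust policy as a JSON-serializable dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{provider_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


external_id = new_external_id()
policy_json = json.dumps(build_trust_policy("123456789012", external_id), indent=4)
```

Here `123456789012` is a dummy account ID; the resulting JSON is what the customer would paste into the role's trust relationship.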


Handling Edge Cases (The "In The Trenches" Stuff)

CloudFront is Weird

You can't just update a CloudFront distribution. You need the current ETag (version ID) to prove you aren't overwriting someone else's changes.

My agent handles this automatically:

def _prepare_cloudfront_update(client, parameters):
    # 1. Fetch current config to get the ETag
    dist = client.get_distribution(Id=parameters['Id'])
    etag = dist['ETag']

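    # 2. Merge our changes. dict.update() is a shallow merge: a nested key
    #    in parameters['DistributionConfig'] replaces the whole sub-dict
    current_config = dist['Distribution']['DistributionConfig']
    current_config.update(parameters['DistributionConfig'])

    # 3. Return the payload with the ETag (the kwargs for update_distribution)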
    return {
        "Id": parameters['Id'],
        "DistributionConfig": current_config,
        "IfMatch": etag
    }
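
Because CloudFront expects the full configuration back on update, a shallow `dict.update()` can drop sibling keys whenever a change targets a nested field. A recursive merge avoids that. The helper below is a hypothetical sketch (`deep_merge` is not part of boto3 or the original agent), and the sample config is trimmed to a few fields for illustration.

```python
def deep_merge(base, changes):
    """Return a new dict with `changes` applied onto `base`, recursing into dicts.

    Lists and scalars in `changes` replace the value in `base` wholesale.
    """
    merged = dict(base)  # shallow copy so `base` itself is never modified
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


current = {
    "Enabled": True,
    "PriceClass": "PriceClass_All",
    "Logging": {"Enabled": True, "Bucket": "logs.s3.amazonaws.com", "Prefix": "cf/"},
}
changes = {"PriceClass": "PriceClass_100", "Logging": {"Enabled": False}}
merged = deep_merge(current, changes)
```

With a shallow update, `Logging` would have been reduced to just `{"Enabled": False}`; here `Bucket` and `Prefix` survive.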

The "Agentic" Future

The term "Agentic" is getting thrown around a lot, but in infrastructure, it has a specific meaning to me: Software that does the work, not just the analysis.

For FinOps to mature, we have to stop treating "Cost Optimization" as a homework assignment for engineers. It should be a garbage collection process that runs in the background—safe, reversible, and automated.

If you want to see this in action (or critique my code/architecture), I’m building this in public. You can check out the live tool at cloudcostwise.io.


I’m Rick, a solo founder building CloudWise. I write about AWS, Python, and the psychology of engineering.
