Mursal Furqan Kumbhar for AWS Community Builders

Posted on Jul 5

I Let a Bedrock Agent Watch My AWS Bill for 30 Days. Here Is Everything It Caught, Missed, and Made Up

#aws #machinelearning #devops #tutorial

Ciao Amici 👋

Pull up a chair, because I have a story for you. It involves a forgotten GPU endpoint, a robot accountant I built over two evenings, a NAT gateway that quietly ate my wallet, and one genuinely absurd moment where my own creation reported itself to me as suspicious activity. For thirty days I let an Amazon Bedrock Agent read my AWS bill every morning and tell me what it thought, and by the end of this story you will have every line of code, every chart, every dollar figure, and every embarrassing mistake I made along the way. Some of those mistakes were mine. Two of them, delightfully, were the agent's. Make some chai, this is a long one, and I promise it earns its length.

The morning it all started

Last November I opened my AWS billing dashboard and felt that very specific cold feeling in the stomach that every cloud engineer knows. A SageMaker real-time endpoint I had spun up for a quick demo had been running for eleven days. Eleven days of me paying for a GPU-backed instance to serve exactly zero requests.

The endpoint cost me about $38. Not catastrophic. But here is the thing: it was the third time in a year this movie had played. Before that, an Elastic IP sat unattached for a month. Before that, a NAT gateway I had completely forgotten was part of a VPC I built for a workshop. Each time, the same script: I discover the charge weeks later, feel briefly stupid, delete the thing, and promise myself I will check the console more often. I never do, because manually checking a billing console is exactly the kind of chore human brains are engineered to avoid.

And yes, I had CloudWatch billing alerts configured, like a responsible adult. But a threshold alert only tells you that you are spending money. It never tells you why, and it definitely never tells you what to do about it. An alert saying "your bill crossed $80" at 2 AM is not insight. It is anxiety with a timestamp.

So, staring at that dead endpoint, an idea landed. What if I stopped being my own billing alarm entirely? What if, every morning, someone else read the bill for me, someone tireless, slightly paranoid, and immune to the phrase "I will deal with it after the weekend"?

That someone became a Bedrock Agent. For 30 days I gave it read-only access to Cost Explorer and a handful of describe APIs, scheduled it to run every morning, and told it to behave like a cautious FinOps consultant. Every day at 9 AM Karachi time it would analyze my spend, flag anything weird, estimate savings, and email me a report with the subject line "The Accountant," because if a robot is going to lecture me about money, it should at least have a title.

This is the full story of that month. Spoiler: I am still running it. Longer spoiler: it once found a database that does not exist.

Why an agent, and not just budgets and alarms?

A colleague asked me this on day two, with a raised eyebrow that suggested he already knew part of my answer would be "because it sounded fun."

It did sound fun. But there is a real gap here, and it decides whether this whole pattern is a toy or a tool, so let me make the case properly.

AWS Budgets, billing alarms, even Cost Anomaly Detection: these are threshold and statistics tools. They compare numbers to lines. They are excellent at "did I spend more than X" and reasonable at "is today statistically unusual." What they cannot do is answer the questions that actually eat your time when a bill goes wrong:

Why did my bill go up 40% this week, in plain language?
Is this NAT gateway data processing charge normal for my workload, or a routing misconfiguration?
Which of my resources are technically running but functionally idle?
If I keep going at this rate, what does month-end look like, and which line items are driving it?
Is this spike a problem, or the training run I kicked off on purpose yesterday?

That last one matters more than people admit. Anomaly detection without context produces alerts you learn to ignore, and an ignored alert is worse than no alert. The difference between "spend increased" and "spend is wrong" is a judgment call, and judgment calls require reasoning over context. That is exactly the shape of problem LLM agents are decent at, provided you feed them structured data and keep them on a very short leash.

The short leash part is not a figure of speech. It is the entire safety story of this project, and you will see it earn its keep on day 16.

There was a quieter motivation too. I spend a lot of my community time around ML tooling, and I kept seeing agent demos that book flights or order pizza, tasks nobody actually wants automated by a nondeterministic system. Billing analysis felt like the opposite case: tedious, recurring, data-rich, and low-stakes as long as the agent can only read. If agents are useful anywhere in a personal AWS account, they should be useful here. Consider the next five thousand words the test of that hypothesis.

Meet The Accountant

Here is the whole system. It is deliberately boring, and every boring choice was on purpose.

Five pieces:

EventBridge Scheduler fires once a day at 04:00 UTC, which is 9 AM for me in Karachi. Daily is the right cadence for a personal account. Hourly would multiply the watcher's own cost by 24 for almost no extra insight, because Cost Explorer data can lag by up to a day anyway.
A small trigger Lambda invokes the Bedrock Agent with a fixed daily prompt, collects the streamed response, and forwards the final report.
The Bedrock Agent (I used Claude via Bedrock, though the pattern works with any agent-capable model on the platform) reads its instructions, decides which tools it needs, calls them, and reasons over the results.
An action group Lambda is the only component with AWS API access. It exposes a small menu of tools: get cost and usage, get cost forecast, list SageMaker endpoints, list EC2 instances, describe NAT gateways, find unattached EBS volumes and Elastic IPs, and describe RDS instances.
The agent's final report lands in my inbox via Amazon SES, formatted as HTML with a color-coded status banner, and impossible for me to claim I did not see. Email was a deliberate choice over chat tools: I check my inbox every morning anyway, it keeps the whole stack inside AWS, and a month of reports becomes a searchable archive for free.
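A minimal sketch of the daily trigger, assuming the schedule is created through EventBridge Scheduler's create_schedule call. The schedule name, the ARNs, and the JSON payload are hypothetical placeholders, not the configuration used in the experiment. The helper only builds the arguments, and a second function double-checks the 04:00 UTC to 9 AM Karachi conversion.

```python
import json
from datetime import datetime, timedelta, timezone

def build_daily_schedule(lambda_arn, role_arn, name="the-accountant-daily"):
    """Keyword arguments for scheduler.create_schedule(**kwargs)."""
    return {
        "Name": name,  # hypothetical schedule name
        "ScheduleExpression": "cron(0 4 * * ? *)",  # 04:00 every day
        "ScheduleExpressionTimezone": "UTC",
        "FlexibleTimeWindow": {"Mode": "OFF"},  # fire on time, no flex window
        "Target": {
            "Arn": lambda_arn,    # the trigger Lambda
            "RoleArn": role_arn,  # role the scheduler assumes to invoke it
            "Input": json.dumps({"source": "daily-schedule"}),  # placeholder payload
        },
    }

def utc_hour_to_karachi(hour_utc):
    """Karachi is UTC+5 with no daylight saving, so a fixed offset is enough."""
    pkt = timezone(timedelta(hours=5))
    moment = datetime(2024, 1, 1, hour_utc, tzinfo=timezone.utc)
    return moment.astimezone(pkt).hour
```

With real ARNs, the dictionary would be passed as boto3.client("scheduler").create_schedule(**kwargs). Keeping the builder separate from the API call makes the cron expression easy to check without touching AWS.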

Now the tattoo-worthy design decision, the one I would print inside the cover of every agent tutorial if publishers let me: the agent has zero write permissions. It cannot stop, terminate, resize, delete, or modify anything. It can only look and complain. I wanted a consultant, not an intern with root access.

Hold that thought. It becomes the hero of this story later.

Building my robot accountant

Let me walk you through the build in the order I actually did it, including the parts that wasted an evening, because the wasted evenings are where the useful knowledge lives.

Step 1: The IAM policy

Everything the action group Lambda can do fits in one tight policy. Read it, and notice what is absent:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CostVisibility",
      "Effect": "Allow",
      "Action": [
        "ce:GetCostAndUsage",
        "ce:GetCostForecast",
        "ce:GetDimensionValues"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ResourceVisibility",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:DescribeNatGateways",
        "ec2:DescribeAddresses",
        "sagemaker:ListEndpoints",
        "sagemaker:DescribeEndpoint",
        "rds:DescribeDBInstances"
      ],
      "Resource": "*"
    }
  ]
}

No ec2:StopInstances. No sagemaker:DeleteEndpoint. Nothing with a verb that changes state. Whenever the agent recommended killing something, a human (me, holding coffee) did the killing after a ten-second sanity check.

One small subtlety so you do not repeat my twenty wasted minutes: ce:* actions only work against Resource: "*" because Cost Explorer is account-scoped, so do not bother trying to narrow it.
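If you want a guard against a write permission sneaking in later, a small check like the one below can run in CI. It is a sketch built on a naming heuristic (allowed verbs must start with Get, List, or Describe), which is an assumption of this example rather than an AWS guarantee. The function name mutating_actions is hypothetical.

```python
import json

READ_ONLY_PREFIXES = ("Get", "List", "Describe")

def mutating_actions(policy_json):
    """Return every allowed action whose verb is not Get/List/Describe."""
    policy = json.loads(policy_json)
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a lone statement may be a bare object
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            verb = action.split(":", 1)[-1]  # "ec2:DescribeVolumes" -> "DescribeVolumes"
            if not verb.startswith(READ_ONLY_PREFIXES):
                flagged.append(action)  # wildcards such as "ec2:*" land here too
    return flagged

SAMPLE_POLICY = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["ce:GetCostAndUsage", "ec2:DescribeVolumes", "sagemaker:ListEndpoints"],
         "Resource": "*"},
    ],
})
```

An empty list from mutating_actions(SAMPLE_POLICY) means every allowed verb looks read-only. Anything else is worth a second look before deploying.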

Step 2: The action group Lambda

Bedrock Agents call your Lambda with a structured event describing which tool the model wants and with what parameters. Your job is to run the right boto3 calls and hand back JSON. Here is a trimmed version of my handler; the full version just has more entries in the tool table and better logging.

import boto3
import json
from datetime import datetime, timedelta

ce = boto3.client("ce")
sm = boto3.client("sagemaker")
ec2 = boto3.client("ec2")

def get_daily_costs(days=14):
    """Cost per service per day for the last N days, one API call."""
    end = datetime.utcnow().date()
    start = end - timedelta(days=days)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": str(start), "End": str(end)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    out = []
    for day in resp["ResultsByTime"]:
        for grp in day["Groups"]:
            amount = float(grp["Metrics"]["UnblendedCost"]["Amount"])
            if amount > 0.01:  # skip fractional-cent noise
                out.append({
                    "date": day["TimePeriod"]["Start"],
                    "service": grp["Keys"][0],
                    "usd": round(amount, 2),
                })
    return out

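def get_forecast():
    """Forecast spend from tomorrow through the end of the current month."""
    today = datetime.utcnow().date()
    start = today + timedelta(days=1)
    # Cost Explorer treats End as exclusive, so use the 1st of next month.
    end = (today.replace(day=28) + timedelta(days=4)).replace(day=1)
    if start >= end:  # last day of the month: nothing left to forecast
        return {"projected_remaining_usd": 0.0}
    resp = ce.get_cost_forecast(
        TimePeriod={"Start": str(start), "End": str(end)},
        Metric="UNBLENDED_COST",
        Granularity="MONTHLY",
    )
    return {"projected_remaining_usd": round(float(resp["Total"]["Amount"]), 2)}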

def list_sagemaker_endpoints():
    eps = sm.list_endpoints()["Endpoints"]
    return [
        {
            "name": e["EndpointName"],
            "status": e["EndpointStatus"],
            "created": e["CreationTime"].isoformat(),
        }
        for e in eps
    ]

def find_unattached_volumes():
    vols = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]
    return [
        {"id": v["VolumeId"], "size_gb": v["Size"], "type": v["VolumeType"]}
        for v in vols
    ]

def find_unattached_eips():
    addrs = ec2.describe_addresses()["Addresses"]
    return [
        {"ip": a["PublicIp"]}
        for a in addrs if "AssociationId" not in a
    ]

TOOLS = {
    "get_daily_costs": lambda p: get_daily_costs(int(p.get("days", 14))),
    "get_forecast": lambda p: get_forecast(),
    "list_sagemaker_endpoints": lambda p: list_sagemaker_endpoints(),
    "find_unattached_volumes": lambda p: find_unattached_volumes(),
    "find_unattached_eips": lambda p: find_unattached_eips(),
}

def lambda_handler(event, context):
    func = event["function"]
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    body = TOOLS.get(func, lambda p: {"error": f"unknown function {func}"})(params)

    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": func,
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(body)}}
            },
        },
    }

Two lessons from this file, each purchased with part of an evening:

Cost Explorer calls cost money. Each GetCostAndUsage request bills at $0.01. That sounds like nothing until your agent decides to be thorough and makes forty calls in one session, fetching one day at a time. I added an explicit instruction telling it to fetch 14 days in a single call. After that, the system settled at six to eight Cost Explorer calls per day.

Empty results need to say they are empty. Early on, my tools returned a bare [] when nothing was found. Later I changed them to return {"found": 0, "items": []} with an explicit count. Remember this detail. It is Chekhov's empty list, and it goes off on day 16.
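The explicit-count shape is easy to apply uniformly with a tiny wrapper. This is a sketch of one way to do it. The helper names with_count and tool_response_body are hypothetical and do not appear in the handler above.

```python
import json

def with_count(items):
    """Wrap a tool result so 'nothing found' is stated, not implied."""
    items = list(items)
    return {"found": len(items), "items": items}

def tool_response_body(items):
    """JSON string suitable for the TEXT body of an action group response."""
    return json.dumps(with_count(items))
```

Wrapping each entry in the tool table (for example, returning with_count(find_unattached_volumes())) gives the model a literal zero to read instead of an empty array to interpret.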

Step 3: The agent itself

In the Bedrock console: create an agent, pick a model, paste instructions, attach the Lambda as an action group, and define each function with its parameters. The console generates the function schema for you these days, which removed most of the friction this setup used to have.

The instruction prompt matters more than any other component in this system. Mine went through five versions, and the diffs between them tell the real story of the month. This is where it ended up, lightly condensed:

You are a cautious FinOps analyst reviewing a personal AWS account.
The owner is a solo developer running ML experiments, a portfolio
site, and occasional demos. Typical monthly spend is 60 to 120 USD.

Every day you will:
1. Fetch the last 14 days of daily costs grouped by service,
   in a single call.
2. Compare today's trajectory against the trailing average.
3. Check idle-prone resources: SageMaker endpoints, unattached
   EBS volumes and Elastic IPs, NAT gateways, and stopped EC2
   instances with attached storage.
4. Fetch the month-end forecast.
5. Report findings in this exact structure:
   STATUS: green / yellow / red
   HEADLINE: one sentence
   ANOMALIES: bullet list, or "none"
   RECOMMENDATIONS: bullet list with estimated monthly savings
   as a range for each, or "none"
   FORECAST: projected month-end total
   WATCHER OVERHEAD: today's estimated cost of running this
   analysis (Bedrock + Cost Explorer calls)

Rules:
- Never recommend an action you cannot support with data you
  fetched in this session. If a tool returned zero items, there
  is nothing to recommend about those items.
- If you are unsure whether a resource is intentional, say so
  and ask. Do not assume.
- Distinguish between "spend increased" and "spend is anomalous".
  A planned training run is not an anomaly.
- State every savings estimate as a range, never a point figure.
- Your own Bedrock and Lambda usage is expected. Report it under
  WATCHER OVERHEAD, never under ANOMALIES.

If that prompt reads like it was written by someone who got burned, good instincts. Every rule after the numbered list was added mid-experiment in response to a specific failure. Scar tissue, in prompt form. The stories behind those scars are coming.
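Because the report format is fixed, the receiving side can verify it. The parser below is a sketch, assuming the agent emits each field name at the start of a line followed by a colon, as the prompt requests. The function names are hypothetical.

```python
FIELDS = ("STATUS", "HEADLINE", "ANOMALIES", "RECOMMENDATIONS",
          "FORECAST", "WATCHER OVERHEAD")

def parse_report(text):
    """Split a report into {FIELD: text}; fields the agent omitted are absent."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        matched = False
        for field in FIELDS:
            if stripped.upper().startswith(field + ":"):
                current = field
                sections[current] = stripped[len(field) + 1:].strip()
                matched = True
                break
        # Continuation lines (bullets) are appended to the most recent field.
        if not matched and current and stripped:
            sections[current] = (sections[current] + "\n" + stripped).strip()
    return sections

def missing_fields(text):
    """List the required fields that the report does not contain."""
    parsed = parse_report(text)
    return [f for f in FIELDS if f not in parsed]
```

A non-empty result from missing_fields is a cheap signal that the model drifted from the format, worth flagging before the report is trusted.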

Step 4: The trigger Lambda and email delivery

The trigger Lambda is short. It invokes the agent, collects the streamed chunks, and sends the report through Amazon SES as an HTML email with a status-colored banner at the top:

import boto3
import os
from html import escape

agent_rt = boto3.client("bedrock-agent-runtime")
ses = boto3.client("ses")

COLORS = {"green": "#2eb67d", "yellow": "#ecb22e", "red": "#e01e5a"}

def lambda_handler(event, context):
    resp = agent_rt.invoke_agent(
        agentId=os.environ["AGENT_ID"],
        agentAliasId=os.environ["AGENT_ALIAS_ID"],
        sessionId=context.aws_request_id,
        inputText="Run the daily billing review.",
    )

    # The response is an event stream; the report text arrives in "chunk" events.
    report = ""
    for chunk in resp["completion"]:
        if "chunk" in chunk:
            report += chunk["chunk"]["bytes"].decode()

    # Default to yellow if the STATUS line is missing or not a known value.
    status = "yellow"
    for line in report.splitlines():
        if line.strip().upper().startswith("STATUS:"):
            parsed = line.split(":", 1)[1].strip().lower()
            if parsed in COLORS:
                status = parsed
            break

    # Escape the report so stray < or & in model output cannot break the HTML.
    html = f"""
    <div style="font-family:Menlo,Consolas,monospace;max-width:640px">
      <div style="background:{COLORS.get(status, '#ecb22e')};
                  color:#fff;padding:10px 16px;border-radius:6px;
                  font-weight:bold;font-size:16px">
        The Accountant &middot; {status.upper()}
      </div>
      <pre style="white-space:pre-wrap;font-size:13px;
                  line-height:1.55;padding:8px 4px">{escape(report)}</pre>
    </div>
    """

    ses.send_email(
        Source=os.environ["FROM_ADDRESS"],
        Destination={"ToAddresses": [os.environ["TO_ADDRESS"]]},
        Message={
            "Subject": {
                "Data": f"The Accountant · {status.upper()} · daily review"
            },
            "Body": {
                "Html": {"Data": html},
                "Text": {"Data": report},
            },
        },
    )
    return {"status": status}

The Lambda's execution role needs bedrock:InvokeAgent to call the agent, plus one extra permission, ses:SendEmail, and that is the only write-adjacent permission in the whole system. It writes to my inbox, nothing else.

One SES gotcha before you build: new SES accounts start in sandbox mode, meaning both sender and recipient addresses must be verified identities. For this project that is actually fine, even pleasant. Verify your own address once in the SES console, click the confirmation link, done. You never need production access because you are only ever emailing yourself. The rare case where the sandbox is a feature.

The status word in the subject line turned out to be an underrated detail. After a week my inbox itself became a dashboard: a column of GREEN subjects I could skim past, with the occasional YELLOW or RED that actually earned a tap. And unlike a chat message, the reports pile up into a searchable archive. Mid-experiment I searched "NAT" in my mail app and had the full paper trail of an incident in three seconds.

Note the sessionId: fresh one per day, so each morning is a blank slate. I considered giving it memory of previous reports and decided against it for the experiment, because I wanted every finding to be reproducible from that day's data alone. Version two revisits this, and I will tell you why at the end.

Prefer WhatsApp instead?

Two people asked me this before the experiment even ended, which around here is not surprising; WhatsApp is where life happens. It works, with one honesty caveat: the AWS-native route, AWS End User Messaging Social, requires linking a WhatsApp Business Account, and Meta's business verification is the tedious part, not the code. Once linked, the swap is a few lines:

import json
import os

import boto3

# Replaces the ses.send_email call; `status` and `report` are the
# variables already built earlier in the trigger Lambda.
social = boto3.client("socialmessaging")

social.send_whatsapp_message(
    originationPhoneNumberId=os.environ["WA_PHONE_ID"],
    metaApiVersion="v20.0",
    message=json.dumps({
        "messaging_product": "whatsapp",
        "to": os.environ["MY_NUMBER"],
        "type": "text",
        "text": {"body": f"The Accountant · {status.upper()}\n\n{report}"},
    }).encode(),
)

If you just want it working in ten minutes for personal use, the Twilio WhatsApp sandbox is the pragmatic shortcut: one POST request from the same Lambda, no Meta paperwork. I stayed with email because it kept the entire stack inside one AWS account and gave me the archive, but a RED alert buzzing your phone on WhatsApp is a legitimately better interrupt for genuine emergencies. The architecture does not care; delivery is the last five lines of one Lambda, so pick the channel you actually look at.
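For the Twilio route, the "one POST request" could look roughly like this. It is a sketch using only the standard library, assuming Twilio's Messages endpoint with HTTP basic auth and whatsapp:-prefixed numbers. The function name and every argument value are placeholders, and the request is only built here, not sent.

```python
import base64
import urllib.parse
import urllib.request

def build_twilio_whatsapp_request(account_sid, auth_token, from_number, to_number, body):
    """Build (but do not send) a POST request for a WhatsApp text via Twilio."""
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Messages.json"
    data = urllib.parse.urlencode({
        "From": f"whatsapp:{from_number}",  # the sandbox sender number
        "To": f"whatsapp:{to_number}",      # your own number, joined to the sandbox
        "Body": body,
    }).encode()
    token = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    req = urllib.request.Request(url, data=data, method="POST")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req
```

Sending is then urllib.request.urlopen(req) inside the Lambda, with the SID and token read from environment variables or a secrets store rather than hardcoded.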

Total build time, including the prompt fiddling and the email formatting I was unreasonably fussy about: around six hours across two evenings. Then I scheduled it, went to sleep, and waited for my new employee's first day.

The diary

Now the part you actually came for. I kept notes every day of the experiment. Here are the highlights, week by week. All figures come from my account during the experiment window; your account will tell its own story, which is rather the point.

Week 1: The agent finds my dirty laundry immediately

Day 1. First report lands in my inbox at 9:02 AM, YELLOW glaring from the subject line. It flagged a SageMaker endpoint named vitb16-demo-endpoint that had been running for six days. Estimated waste: $28 to $34 per month. I knew this one. It was serving a palmprint recognition demo I had shown a colleague exactly once, weeks after building the model, and I had told myself I would take it down "after the weekend." The agent did not care about my intentions. I killed the endpoint before finishing my tea. First blood to the robot, on its first morning, in under two minutes of compute.

Day 2. Green. The report was three lines long. I felt an entirely irrational sense of approval, like passing a surprise inspection.

Day 3. Green status, but with a line I did not expect: "Note: 3 unattached gp2 volumes totaling 240 GB, approximately $24/month. These may be remnants of terminated training instances." They were exactly that. Leftovers from my thesis experiments, volumes from terminated GPU instances where DeleteOnTermination had been false. I genuinely did not know these existed. They do not appear anywhere you casually look; they just sit in the EC2 console under a tab nobody visits, quietly billing. I deleted two immediately and snapshotted the third before deleting it, in case it held a dataset I cared about. It held nothing. It never does.
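The $24 figure is simple to reproduce from the tool output. This sketch assumes us-east-1 list prices per GB-month, which you should check for your own region. The three individual volume sizes in the example are made up so that they total the 240 GB from the report.

```python
# Assumed us-east-1 list prices in USD per GB-month; verify for your region.
PRICE_PER_GB_MONTH = {"gp2": 0.10, "gp3": 0.08}

def monthly_volume_cost(volumes):
    """Estimate the monthly storage cost of volumes shaped like find_unattached_volumes()."""
    total = 0.0
    for v in volumes:
        # Unknown volume types fall back to the gp2 rate as a rough guess.
        total += v["size_gb"] * PRICE_PER_GB_MONTH.get(v["type"], 0.10)
    return round(total, 2)

# Hypothetical breakdown of the 240 GB mentioned in the report.
example_volumes = [
    {"id": "vol-a", "size_gb": 100, "type": "gp2"},
    {"id": "vol-b", "size_gb": 100, "type": "gp2"},
    {"id": "vol-c", "size_gb": 40, "type": "gp2"},
]
```

Calling monthly_volume_cost(example_volumes) lands on the same roughly $24 a month the agent quoted.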

Day 5. The agent noted my Route 53 hosted zone and domain charges as "consistent and expected for a portfolio site" and then never mentioned them again all month. Small thing, but this is the moment I started trusting it. It had built a sense of what normal looked like and stopped crying wolf about it. Threshold alarms cannot do this. They have no concept of "expected."

Day 7. End of week one. Total recurring waste identified: roughly $55 to $60 per month, on an account averaging around $107. Sit with that for a second. More than half my baseline spend was waste, and an LLM found essentially all of it in its first three mornings. I had audited this account myself two months earlier and missed the volumes completely. My pride recovered. Eventually.

Week 2: The NAT gateway incident

Day 8. Status yellow, trending toward alarmed. "EC2-Other charges increased significantly day-over-day."

Day 9. Status red. Headline: "EC2-Other charges increased 6x over the trailing average, driven by NAT gateway data processing."

Here is the real-life situation behind it, and it is the kind of mistake that is obvious only in hindsight. I had deployed a scraper into a private subnet for a dataset collection job, pulling Sindhi news articles for a language benchmark project I have been building on the side. Private subnet felt like the responsible default. But private subnet means every byte of outbound traffic transits the NAT gateway, and NAT gateways charge $0.045 per GB processed on top of their hourly rate. My scraper was pulling gigabytes of HTML through that gateway like the meter did not exist, because I did not know the meter existed.

The agent did not just flag the spike. It named the mechanism and proposed the fix: "If a workload in a private subnet is downloading large volumes of external data, evaluate whether it requires private networking. For S3-bound traffic, a gateway VPC endpoint processes data at no charge."

I moved the scraper to a public subnet with a security group locked to egress only, since it held no credentials or sensitive data, and added the S3 gateway endpoint for the upload path. Charges collapsed the next day. Estimated savings for the rest of the scraping job: about $19. More valuable than the money was the education; I now check the network path of every batch workload before launching it, a habit purchased for the price of two red mornings.
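To see how quickly the meter runs, here is the NAT gateway arithmetic as a sketch. The per-GB rate is the one quoted above. The hourly rate is an assumed us-east-1 list price, and the traffic volumes in the usage note are illustrative, not measurements from the scraper.

```python
NAT_PER_GB = 0.045    # USD per GB processed, as quoted above
NAT_PER_HOUR = 0.045  # USD per hour, assumed us-east-1 list price

def nat_gateway_cost(gb_processed, hours):
    """Data processing charge plus the hourly charge for one NAT gateway."""
    return round(gb_processed * NAT_PER_GB + hours * NAT_PER_HOUR, 2)
```

An idle gateway costs nat_gateway_cost(0, 24), about $1.08 a day. Pushing a hypothetical 100 GB through it in the same day raises that to about $5.58, which is the kind of jump that turns a green morning red.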

Day 11. Green. The agent noted the EC2-Other line had "returned to baseline following the routing change," meaning it connected this morning's data to the anomaly it flagged two days earlier, within a single fresh session, purely because the 14-day window contained both. Emergent context from a dumb design decision. I will take it.
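The "Nx over the trailing average" comparison can also be computed deterministically from the same 14-day window, which is handy for sanity-checking the agent's headline. This is a sketch. spike_ratios is a hypothetical helper that expects rows shaped like the output of get_daily_costs.

```python
from collections import defaultdict

def spike_ratios(rows, min_baseline=0.01):
    """Return {service: latest day's cost / mean of the earlier days}."""
    by_service = defaultdict(dict)
    for r in rows:
        by_service[r["service"]][r["date"]] = r["usd"]
    dates = sorted({r["date"] for r in rows})  # ISO dates sort chronologically
    if len(dates) < 2:
        return {}
    latest, history = dates[-1], dates[:-1]
    ratios = {}
    for service, per_day in by_service.items():
        # Days with no row for a service count as zero spend.
        baseline = sum(per_day.get(d, 0.0) for d in history) / len(history)
        if baseline >= min_baseline:  # skip services with no meaningful baseline
            ratios[service] = round(per_day.get(latest, 0.0) / baseline, 2)
    return ratios
```

A ratio near 1.0 means business as usual. A ratio of 6.0 matches the kind of day-9 headline the agent produced, and a ratio falling back toward 1.0 is the "returned to baseline" signal.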

Day 12. A quiet day, but the forecast line caught my eye: "Projected month-end: $71 to $79, versus $107 trailing-month baseline." Watching that forecast walk downward over the month became weirdly motivating, like a fitness tracker for infrastructure.

Week 3: The agent lies to me twice

Every honest story about an LLM experiment needs this chapter, so here is mine, in full, no flattering edits.

Day 16. Status yellow. The agent claimed my RDS instance was "idle with minimal connection activity" and recommended stopping it, estimated savings $12 to $15 per month. Sensible advice with one flaw: I do not have an RDS instance. Never did, in this account.

I pulled the session trace to see what happened. The describe_db_instances tool had returned an empty list, my early-version bare []. Somewhere between that empty array and the final report, the model's reasoning turned "no instances found" into "an idle instance was found." It is a fascinating failure if you stare at it long enough: the concepts of idle and absent live close together in the model's semantic space, and nothing in my tooling or prompt forced the distinction. Made worse, of course, by the confident dollar range attached to the phantom.

Two fixes went in that afternoon. The tools started returning explicit counts ({"found": 0, "items": []}, Chekhov's empty list finally fixed), and the instructions gained the rule you saw earlier: never recommend an action unsupported by data fetched in this session; if a tool returned zero items, there is nothing to recommend about those items. The phantom never returned. Fourteen subsequent days, zero fabricated resources.

Day 19. It happened again, in a form I find funnier every time I think about it. The agent reported Amazon Bedrock charges as an anomaly and recommended I "review unexpected Bedrock usage in the account."

The Bedrock usage was the agent itself.

It had discovered its own operating costs in the billing data and reported itself to me as suspicious activity. I laughed out loud, screenshotted the email, and then realized this is a genuine and slightly deep problem: a watcher built on metered infrastructure will eventually observe its own footprint, and unless it has a concept of self, it will classify itself as an intruder. My fix was prosaic rather than philosophical: a WATCHER OVERHEAD line in the report format, and an instruction that its own Bedrock and Lambda charges are expected and belong there. Somewhere in that incident there is a lovely essay about self-models that I am choosing not to write today.

The generalizable lesson from both failures, and the one thing I hope you quote from this whole piece: agent reliability is mostly a tool design and prompt engineering problem, not a model problem. Every failure I saw in 30 days traced back to an ambiguity I had left lying around, either in what the tools returned or in what the instructions permitted. Fix the ambiguity and the failure mode dies. I never once had to change the model.

Day 21. Green, uneventful, and the first day the new report format ran clean end to end. WATCHER OVERHEAD: $0.17. The accountant now accounts for itself.

Week 4: Steady state

The final week was almost boring, which is the whole point of the system.

Day 24. Yellow, correctly. An Elastic IP had been sitting unattached for two days after I terminated a test instance, at the newer per-hour idle pricing. About $3.60 a month. I released it from my phone while waiting for a haircut, which is exactly the level of effort billing hygiene should require.

Day 25. Yellow, and this one made me smile: "Elevated SageMaker training charges, consistent with an intentional training workload. Not classified as anomalous." I had kicked off a fine-tuning run the night before. The agent saw a spike, checked its character (training jobs, not endpoints; bounded, not recurring), and filed it exactly as instructed under the distinction between increased and anomalous. This is the report that alarms and statistical detectors cannot write, and it is the single best argument for the whole approach.

Day 28. The agent noticed something I never asked it to look for: nonzero inter-region data transfer, on an account where everything nominally lives in one region. It suggested a cross-region S3 access, and it was right; an old script of mine was reading a bucket in another region, a leftover from an experiment two years ago. Savings of maybe a dollar a month. The dollar is irrelevant. The fact that it went looking is not.

Day 30. Green. Final forecast: $76 to $78. Actual month-end: $76.10. I bought myself a decent lunch with a fraction of the difference and considered it a performance bonus paid to the wrong party.

The receipts

The month in one table. "Baseline" is my average across the previous three months.

Metric	Baseline (avg)	Experiment month	Change
Total monthly spend	$107.40	$76.10	-29%
Idle/waste spend (est.)	~$58	~$6	-90%
Billing surprises	1 to 2 per month	0	gone
Time I spent on billing	~45 min/month	~10 min/month	-78%
Watcher overhead	$0	$4.87	new cost

And the full recommendations log, verdicts included:

Day	Finding	Est. monthly impact	Verdict
1	Idle SageMaker endpoint	$28 to $34	Real. Killed it.
3	3 unattached EBS volumes	~$24	Real. Deleted.
9	NAT gateway data processing spike	~$19 (one-off)	Real. Rearchitected.
16	"Idle RDS instance"	$12 to $15	Hallucinated. No RDS exists.
19	"Unexpected Bedrock usage"	n/a	The agent found itself.
24	Unattached Elastic IP	~$3.60	Real. Released.
25	SageMaker training spike	n/a	Correctly labeled intentional.
28	Cross-region S3 access	~$1	Real. Fixed the script.

Five correct findings, one correct non-finding, two false ones. Both false ones were eliminated by tool and prompt fixes and never recurred. I will take that ratio in any system I run.

What the watcher itself costs

Full transparency, because a cost optimization story that hides its own costs would be a comedy:

Component	Monthly cost
Bedrock model invocations (30 daily sessions)	~$2.40
Cost Explorer API calls (~220 total)	~$2.20
Lambda, EventBridge, SES, CloudWatch Logs	under $0.30
Total	~$4.87

Call it five dollars to find fifty-five.
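The arithmetic behind that line, as a sketch. The per-request price is the Cost Explorer rate mentioned earlier. The savings and overhead figures are the ones from the tables above, and the function names are hypothetical.

```python
CE_PRICE_PER_REQUEST = 0.01  # USD per Cost Explorer API request

def cost_explorer_bill(calls):
    """Monthly Cost Explorer charge for a given number of API requests."""
    return round(calls * CE_PRICE_PER_REQUEST, 2)

def payback_ratio(monthly_savings, monthly_overhead):
    """Dollars saved per dollar spent on the watcher."""
    return round(monthly_savings / monthly_overhead, 1)
```

cost_explorer_bill(220) reproduces the $2.20 line, and payback_ratio(55, 4.87) comes out at roughly 11 to 1. One careless session making forty single-day calls costs cost_explorer_bill(40), or forty cents, which is why the single-call instruction matters.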

On a personal account the return is comfortable. On a company account with real spend and real sprawl, this is not even a conversation. The same architecture pointed at an organization's consolidated billing, with per-team tags in the grouping dimension, would pay for itself before lunch on day one.

Questions friends asked me along the way

A few of these came from colleagues, a couple from the AWS Community Builders Slack when I mentioned what I was up to.

Why not Cost Anomaly Detection? It is free.
Use both; they are not competitors. Anomaly Detection is a smoke detector: statistical, fast, and context-free. The agent is the person who walks into the kitchen, sees the smoke, and tells you it is the toaster and not the house. On day 25 Anomaly Detection would have flagged my training run. The agent explained it and moved on.

Why fresh sessions daily instead of giving it memory?
For the experiment, reproducibility. Every claim in every report had to be derivable from that day's tool calls alone, or the failures would have been impossible to debug. In steady-state operation, memory of past findings is clearly useful (it could track whether recommendations were acted on), and that is the headline feature of my version two.

Could the model just be pattern-matching generic FinOps advice rather than analyzing my data?
The NAT gateway diagnosis and the cross-region catch were both specific to numbers it fetched in-session, with the mechanism named correctly. The RDS hallucination is actually the counter-evidence in my favor: when it did drift into generic pattern-matching ("accounts usually have an underused database"), it produced garbage, and the fix was forcing every claim to bind to fetched data. The discipline is enforceable.

What about a multi-account org?
Same shape, bigger tools. Point Cost Explorer at the payer account, group by linked account and tag, and give the report a per-team section. The read-only constraint becomes more important there, not less.
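A sketch of what that grouping might look like as get_cost_and_usage arguments, run from the payer account. The tag key "team" is an assumption, and it only works if that tag has been activated as a cost allocation tag. org_cost_query is a hypothetical helper name.

```python
def org_cost_query(start, end, team_tag="team"):
    """Arguments for ce.get_cost_and_usage(**kwargs), grouped by account and team tag."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # End is exclusive
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        # Cost Explorer accepts at most two GroupBy entries per request.
        "GroupBy": [
            {"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"},
            {"Type": "TAG", "Key": team_tag},
        ],
    }
```

Each result group then carries two keys, the linked account ID and the tag value, which map naturally onto a per-team section in the report.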

Did you worry about sending billing data to a model?
The data stays inside my own AWS account boundary; Bedrock invocations are not used to train the underlying models, per AWS's service terms. For a personal account I was comfortable. A regulated enterprise should read those terms themselves rather than take a blog post's word for it, including mine.

So, should you hire a robot accountant?

The honest answer depends on what kind of AWS user you are.

Build it if you run experiments, demos, side projects, or anything with a lifecycle shorter than your attention span. The idle-resource problem is fundamentally a problem of forgetting, and machines do not forget. Six hours of building bought me a permanent employee who works for five dollars a month and has no opinion about weekends.

Skip it if your account runs one stable production workload that never changes. A static bill needs a budget alert, not an analyst. This tool earns its keep in accounts with churn.

And absolutely do not give the agent write permissions, no matter how elegant the fully autonomous version looks in your head. My agent hallucinated a database on day 16 and priced its imaginary savings with a straight face. Imagine that reasoning chain ending in a stop_db_instance call aimed at whatever resource it decided best fit its imaginary finding. Read-only is not a limitation of this design. It is the design. The human approves; the human executes; the agent never touches anything.
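One cheap way to keep yourself honest here is a pre-deploy check over the IAM actions you are about to grant the agent's role. The sketch below is a naming heuristic, not a security audit: it assumes read-only actions start with `Get`, `List`, or `Describe`, and it flags everything else, wildcards included. The `write_actions` helper is hypothetical.

```python
# Operation-name prefixes treated as read-only (a heuristic, by assumption).
READ_ONLY_PREFIXES = ("Get", "List", "Describe")


def write_actions(actions):
    """Return the IAM actions that do not look read-only by name."""
    flagged = []
    for action in actions:
        # "rds:StopDBInstance" -> service "rds", operation "StopDBInstance"
        _service, _, operation = action.partition(":")
        if not operation.startswith(READ_ONLY_PREFIXES):
            flagged.append(action)
    return flagged


if __name__ == "__main__":
    granted = ["ce:GetCostAndUsage", "ec2:DescribeNatGateways", "rds:StopDBInstance"]
    # Anything printed here should not be on the agent's role.
    print(write_actions(granted))
```

An empty list is the only acceptable result for this design. Anything else means the agent could act on a finding it made up.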

Three smaller lessons I would press into your hands before you go build:

Make the agent state ranges, not points. "Savings of $28 to $34" invites verification. "Savings of $31.20" invites false trust. Precision is a costume that confidence wears.
Teach it your normal in the instructions, including typical spend and known steady charges. Anomaly detection without a baseline is just complaining with extra steps.
Log every session and read the traces when something is off. The trace of which tools were called and what they returned was the only way to tell whether a bad report came from bad data or bad reasoning. In my month it was never the model and always my instructions. Debug yourself first.
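The first lesson can also be enforced outside the prompt, by formatting any savings figure as a range before it reaches the report. A minimal sketch follows; the `savings_range` name and the 10% uncertainty band are hypothetical choices, not values from my experiment.

```python
def savings_range(estimate: float, uncertainty: float = 0.10) -> str:
    """Format a point estimate as a whole-dollar range."""
    low = estimate * (1 - uncertainty)
    high = estimate * (1 + uncertainty)
    return f"${low:.0f} to ${high:.0f}"


if __name__ == "__main__":
    print(savings_range(31.20))  # prints "$28 to $34"
```

The rounding is deliberate: whole dollars signal an estimate, while cents signal a precision the agent does not have.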

Wrapping up

So, amici, that is the story of my month with a robot accountant. It began with a dead GPU endpoint and a familiar sting of embarrassment, and it ended with a 29% smaller bill, an inbox full of color-coded morning reports, and a considerably sharper understanding of where agents genuinely earn their keep. The Accountant found real money, explained real mechanisms, correctly shrugged at my intentional spikes, and got measurably smarter every time I tightened its instructions.

It also invented a database out of thin air and once reported its own existence to me as a suspicious charge, which is why it will remain, forever, a colleague I listen to carefully and never hand the keys. Version two is already half-built: weekly deep-dives, per-project cost attribution through tags, and persistence so it can nag me about recommendations I have ignored.

If you build your own version, and I hope this post leaves you no excuse not to, I would genuinely love to hear what your agent catches in its first week. Find me on Dev.to or LinkedIn and tell me the story. The weirdest finding gets a shout-out in the follow-up post. Until then, keep your endpoints deleted and your subnets public where they can afford to be.

#HappyCoding 👋
