Let’s be honest: S3 is cheap… until it isn’t.
One fine day, I was staring at my AWS bill and realized S3 storage had quietly been eating up a sizeable chunk of it. With buckets spread across teams, projects, and experiments that nobody remembered, I had a problem:
- How do I figure out which buckets are costing me the most?
- And how do I make sure I get a report every Monday without manually pulling Athena queries (and pretending to be productive)?
So, like any lazy engineer who doesn’t want to do repetitive work, I decided to automate it.😎
Step 1: Inventory, Inventory, Inventory 📦
First, I enabled S3 Inventory reports in Parquet format (CSV would’ve been too bulky to query efficiently). These reports contain all the details about objects, storage classes, and sizes. Perfect for Athena queries.
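If you want to wire this up yourself, enabling an inventory is a single boto3 call per bucket. Here’s a minimal sketch; the bucket names, inventory Id, prefix, and optional fields are placeholders, not my exact setup:

import boto3

s3 = boto3.client("s3")

# Ship a daily Parquet inventory of "my-bucket-1" into a central reports bucket.
s3.put_bucket_inventory_configuration(
    Bucket="my-bucket-1",                       # placeholder source bucket
    Id="cost-report-inventory",
    InventoryConfiguration={
        "Id": "cost-report-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "StorageClass", "IntelligentTieringAccessTier"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::my-inventory-reports",   # placeholder destination
                "Format": "Parquet",
                "Prefix": "inventory",
            }
        },
    },
)

Repeat for every bucket you care about, and be patient: the first report can take up to 48 hours to show up.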
Once enabled, the reports started landing in a dedicated S3 bucket. From there, I built multiple Athena tables pointing to these Parquet files, because SQL > endless scrolling through the S3 console.
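Each table is just a CREATE EXTERNAL TABLE over the inventory’s hive/ symlink folder. Here’s a sketch run through Athena, trimmed to the columns this post actually queries (the full column list depends on which optional fields your inventory includes; the S3 Inventory docs have the complete DDL), with placeholder database, table name, and S3 locations:

import boto3

athena = boto3.client("athena")

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_bucket_1 (
  bucket string,
  key string,
  size bigint,
  storage_class string,
  intelligent_tiering_access_tier string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://my-inventory-reports/inventory/my-bucket-1/cost-report-inventory/hive/'
"""

athena.start_query_execution(
    QueryString=DDL,
    QueryExecutionContext={"Database": "s3_inventory"},          # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Once the table exists, run "MSCK REPAIR TABLE my_bucket_1" the same way
# so Athena picks up the dt partitions the inventory keeps adding.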
Step 2: Writing Athena Queries Like a Detective 🕵️
The idea was simple:
- For each table (inventory per bucket/prefix), calculate storage size.
- Estimate cost based on storage class pricing.
- Find the top N offenders (the buckets hogging most of the bill).
Here’s a flavour of one Athena query I used in my code to get the cost and size (per storage class) of a bucket/table:
WITH filtered_data AS (
    SELECT
        storage_class,
        intelligent_tiering_access_tier,
        SUM(size) AS total_size
    FROM my_bucket_1
    WHERE dt = '2025-08-19-01-00'  -- the inventory snapshot (dt partition) to report on
    GROUP BY storage_class, intelligent_tiering_access_tier
)
SELECT
    storage_class,
    intelligent_tiering_access_tier,
    total_size,
    -- hard-coded per-GB-month prices per storage class; bytes -> GB via 1024^3
    CASE
        WHEN storage_class = 'STANDARD' THEN total_size * 0.0210 / (1024 * 1024 * 1024)
        WHEN storage_class = 'GLACIER' THEN total_size * 0.0036 / (1024 * 1024 * 1024)
        WHEN storage_class = 'DEEP_ARCHIVE' THEN total_size * 0.00099 / (1024 * 1024 * 1024)
        ELSE 0
    END AS estimated_cost_usd
FROM filtered_data;
Basically, I’m asking Athena:
💡 “Hey buddy, tell me which storage class is silently burning my credits.”
Step 3: Lambda + Boto3 = Automation ❤️
Now came the fun part—wrapping it up in a Lambda function so I don’t have to run queries by hand.
What the Lambda does:
- For each S3 Inventory Athena table, it runs a single SQL query that returns per-prefix and per-storage-class (and Intelligent-Tiering access tier) metrics: object_count, total_size (bytes), and estimated_cost_usd (I hard-coded the per-GB prices for STANDARD, GLACIER, DEEP_ARCHIVE, and the Intelligent-Tiering access tiers right in the query).
- It polls Athena (the classic get_query_execution loop with time.sleep(5)) until each query finishes, then fetches the CSV result written to S3 (there’s a trimmed sketch of this right after the list).
- It aggregates the query rows into a Python structure keyed by (table, prefix) — summing object counts, bytes and the estimated cost, and keeping a breakdown list for each storage class / tier.
- Once all tables are processed, the Lambda computes the total cost per (table, prefix) and sorts descending to pick the top N (configurable; I used 15).
- It then builds a detailed CSV report with a parent summary row per ranked prefix (total cost, total size, total objects) followed by the breakdown rows for each storage class/tier (object_count, size_GB, cost_USD).
- The CSV is uploaded to S3 (so the report is versioned, shareable, and easy to inspect).
- Finally, the Lambda sends a short SES email summary (top lines + sizes/costs) and includes the S3 path to the CSV, so every Monday morning (via an EventBridge schedule) I get an actionable report I can forward to teams or use to trigger cleanups.
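To make that concrete, here’s a trimmed sketch of the Lambda core. Everything below is placeholder-level: made-up table names, database, output locations, and SES addresses, a simplified per-storage-class query instead of the full per-prefix one, and a much flatter CSV than the real report.

import csv
import io
import time
import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")
ses = boto3.client("ses")

ATHENA_DB = "s3_inventory"                  # placeholder database name
ATHENA_OUTPUT = "s3://my-athena-results/"   # where Athena writes query results
REPORT_BUCKET = "my-cost-reports"           # placeholder bucket for the weekly CSV
TABLES = ["my_bucket_1", "my_bucket_2"]     # one Athena table per inventoried bucket
TOP_N = 15

COST_QUERY = """
SELECT storage_class, SUM(size) AS total_size,
       SUM(CASE
             WHEN storage_class = 'STANDARD' THEN size * 0.0210
             WHEN storage_class = 'GLACIER' THEN size * 0.0036
             WHEN storage_class = 'DEEP_ARCHIVE' THEN size * 0.00099
             ELSE 0
           END) / power(1024, 3) AS estimated_cost_usd
FROM {table}
WHERE dt = '2025-08-19-01-00'
GROUP BY storage_class
"""

def run_query(sql):
    # Start the query, then the classic get_query_execution + sleep(5) poll loop.
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": ATHENA_DB},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {qid} finished in state {state}")
    return qid

def fetch_rows(qid):
    # Athena drops <QueryExecutionId>.csv under the configured output location.
    bucket, prefix = ATHENA_OUTPUT.replace("s3://", "").split("/", 1)
    body = s3.get_object(Bucket=bucket, Key=f"{prefix}{qid}.csv")["Body"].read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(body)))

def lambda_handler(event, context):
    totals = {}
    for table in TABLES:
        for row in fetch_rows(run_query(COST_QUERY.format(table=table))):
            agg = totals.setdefault(table, {"size": 0, "cost": 0.0})
            agg["size"] += int(row["total_size"])
            agg["cost"] += float(row["estimated_cost_usd"])

    # Rank buckets by estimated cost and keep the top N offenders.
    top = sorted(totals.items(), key=lambda kv: kv[1]["cost"], reverse=True)[:TOP_N]
    lines = [
        f"Table: {t} | Size: {v['size'] / 1024 ** 4:.1f} TB | Cost: ${v['cost']:.2f}"
        for t, v in top
    ]

    # Upload the report CSV so it is versioned, shareable, and easy to inspect...
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["table", "total_size_bytes", "estimated_cost_usd"])
    writer.writerows([t, v["size"], round(v["cost"], 2)] for t, v in top)
    report_key = "reports/s3-cost-report.csv"
    s3.put_object(Bucket=REPORT_BUCKET, Key=report_key, Body=buf.getvalue().encode("utf-8"))

    # ...and send the short SES summary pointing at it.
    ses.send_email(
        Source="s3-reports@example.com",                  # must be verified in SES
        Destination={"ToAddresses": ["me@example.com"]},
        Message={
            "Subject": {"Data": "Weekly S3 cost report"},
            "Body": {"Text": {"Data": "\n".join(lines) + f"\n\nFull report: s3://{REPORT_BUCKET}/{report_key}"}},
        },
    )

That’s the whole trick: Athena does the heavy lifting, the Lambda just orchestrates and formats.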
Check out the full Python code in my GitHub repo and give it a shot. 💪
Now, instead of me digging into Athena on Mondays, I just sip coffee and open my inbox ☕📧.
Step 4: Scheduling with EventBridge (aka my weekend butler)
To make this truly hands-off, I scheduled the Lambda to run every Sunday at midnight via EventBridge. That way, when I walk into Monday standup, I already know which buckets are the villains of the week.
(And yes, this occasionally makes me look more prepared than I actually am. 😎)
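If you’re scripting the schedule with boto3 as well, it looks roughly like this; the rule name, function name, region, and account ID are placeholders, and I’m reading “Sunday midnight” as 00:00 UTC going into Monday:

import boto3

events = boto3.client("events")
awslambda = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:s3-cost-report"  # placeholder

# Fire at 00:00 UTC in the night between Sunday and Monday; tweak the cron to taste.
rule_arn = events.put_rule(
    Name="weekly-s3-cost-report",
    ScheduleExpression="cron(0 0 ? * MON *)",
    State="ENABLED",
)["RuleArn"]

events.put_targets(
    Rule="weekly-s3-cost-report",
    Targets=[{"Id": "s3-cost-report-lambda", "Arn": FUNCTION_ARN}],
)

# Let EventBridge invoke the function.
awslambda.add_permission(
    FunctionName="s3-cost-report",
    StatementId="eventbridge-weekly-s3-cost-report",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)

(The same thing is a couple of clicks in the EventBridge console if you’d rather not script it.)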
Final Report Look 📊
The email I get looks something like this:
- Table: my_bucket_1 | Bucket: my-bucket-1 | Size: 12.3 TB | Cost: $257.23
- Table: my_bucket_2 | Bucket: my.bucket.2 | Size: 9.1 TB | Cost: $198.72
- … and so on.
Neat, simple, actionable.
Why This Matters
- Visibility: S3 is easy to ignore, but costs add up fast.
- Automation: No manual queries, no “oops I forgot.”
- Cost Savings: Spot old projects, test buckets, or misconfigured storage classes early.
What started as a weekend experiment has now become my weekly cost sanity check.
It’s not rocket science, but it saves $$$ and brain cycles.
If you’re also haunted by “mystery buckets” on your AWS bill, give this approach a shot. You’ll thank yourself every Monday morning.
Have you tried something similar to tame your S3 costs? Or maybe you’ve found an even better way? Drop a comment—I’d love to learn (and maybe steal your idea 😅).