Germán Neironi

Building in Public: The Technical Decisions Behind an AWS Cost Optimization Tool

I'm going to share something most founders don't: the actual technical journey of building a product from scratch.

No "we raised $10M and hired 50 engineers." Just one developer and a lot of coffee.

This is how I built CloudPruneAI - an AWS cost optimization tool that scans accounts and generates infrastructure-as-code to fix waste.

The Problem I Wanted to Solve

After years of managing AWS infrastructure, I kept seeing the same pattern:

  1. Company grows fast
  2. Engineers spin up resources "temporarily"
  3. Nobody cleans them up
  4. CFO asks "why is our AWS bill so high?"
  5. Everyone panics and manually audits for a week

The existing tools fell short in one of three ways:

  • Cost $45K+/year (CloudHealth, Cloudability)
  • Only showed dashboards without actionable fixes
  • Required a dedicated FinOps team to operate

I wanted something that a solo developer or small team could use: scan, see waste, get code to fix it. Done.

The Stack Decisions

Backend: FastAPI + Python

Why FastAPI?

  • Async by default (critical when you're making dozens of AWS API calls per scan)
  • Auto-generated OpenAPI docs (saves time during frontend integration)
  • Type hints with Pydantic (catches bugs before they reach production)
  • Easy to deploy on Lambda with Mangum (one handler, done)

Python was the natural choice because AWS SDKs, CDK, and most infrastructure tooling lives in the Python ecosystem.

Frontend: Next.js 14 + Material UI

Why Next.js 14?

  • App Router is finally stable
  • Server components reduce client bundle
  • Easy deployment on AWS Amplify (SSR support)

Material UI gave me a professional-looking dashboard without spending weeks on design. For an MVP, speed matters more than pixel-perfect custom UI.

Infrastructure: AWS CDK

Why CDK over Terraform?

  • Same language as backend (Python) — one less context switch
  • Better AWS integration for the services I needed
  • And honestly... I'm building a tool that generates CDK, so I should use it myself

Database: PostgreSQL on RDS

Simple choice. Relational data, need transactions, familiar with it. Used SQLAlchemy 2.0 with async support to keep everything non-blocking.
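The "need transactions" point is the crux: a scan row and its findings have to commit or roll back together. A minimal sketch of that pattern, using stdlib sqlite3 as a stand-in (the real stack is PostgreSQL via async SQLAlchemy; the table and column names here are hypothetical):

```python
import sqlite3

# In-memory stand-in for RDS PostgreSQL; the schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE findings (scan_id INTEGER, monthly_savings REAL)")

def record_scan(findings):
    # Either the scan row and all its findings commit together, or nothing does.
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            cur = conn.execute("INSERT INTO scans (status) VALUES ('done')")
            scan_id = cur.lastrowid
            conn.executemany(
                "INSERT INTO findings (scan_id, monthly_savings) VALUES (?, ?)",
                [(scan_id, f) for f in findings],
            )
        return scan_id
    except sqlite3.Error:
        return None

scan_id = record_scan([12.5, 40.0])
total = conn.execute(
    "SELECT SUM(monthly_savings) FROM findings WHERE scan_id = ?", (scan_id,)
).fetchone()[0]
```

With SQLAlchemy 2.0 the same guarantee comes from `async with session.begin():`, which keeps the write path non-blocking.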

The Core: How the Scanner Works

The scanner is the heart of the product. At a high level:

  1. User connects their AWS account (read-only access via IAM role)
  2. The scanner runs multiple analyzers in parallel — each one specialized for a different AWS service
  3. Each analyzer calls AWS APIs, checks utilization metrics via CloudWatch, and identifies waste patterns
  4. Results are aggregated with estimated monthly savings per recommendation

The key insight was making each analyzer independent and atomic. They don't know about each other. This means I can add new ones without touching existing code, and a failure in one doesn't break the others.
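The fan-out with failure isolation can be sketched with plain asyncio. The analyzer names and findings below are invented for illustration; real analyzers would call AWS APIs (boto3/aioboto3) and CloudWatch, not return canned data:

```python
import asyncio

# Each analyzer is independent: it only knows how to inspect one service.
async def analyze_ebs():
    return [{"resource": "vol-123", "monthly_savings": 8.0}]

async def analyze_elastic_ips():
    raise RuntimeError("simulated API throttling")

async def analyze_rds():
    return [{"resource": "db-test", "monthly_savings": 45.0}]

async def run_scan():
    analyzers = [analyze_ebs(), analyze_elastic_ips(), analyze_rds()]
    # return_exceptions=True isolates failures: a broken analyzer surfaces
    # as an Exception result instead of cancelling its siblings.
    results = await asyncio.gather(*analyzers, return_exceptions=True)
    findings = []
    for result in results:
        if isinstance(result, Exception):
            continue  # log and skip the failed analyzer; the rest still count
        findings.extend(result)
    return findings

findings = asyncio.run(run_scan())
total = sum(f["monthly_savings"] for f in findings)
```

Adding a new analyzer is just appending one more coroutine to the list; nothing existing changes.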

The Differentiator: AI-Powered Code Generation

Most cost tools stop at "you're wasting money on X." CloudPruneAI goes further: for each recommendation, you can generate a CDK project that implements the fix.

I use Claude's API for code generation. The tricky part wasn't the AI — it was the infrastructure around it:

  • Generation can take 10-30 seconds
  • API Gateway has a 30-second timeout
  • Users don't want to stare at a spinner

Solution: Async processing with a queue. User clicks "Generate," gets an immediate response with a task ID, and polls for completion. A separate worker processes the queue and stores results. When ready, the user downloads a ZIP with a complete CDK project.

This pattern (queue + worker + polling) solved the timeout issue cleanly and scales naturally.
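The shape of the pattern, sketched with stdlib stand-ins (a `queue.Queue` plays the role of SQS, a dict plays the task store, and the `sleep` stands in for the 10-30 second Claude call; all names are illustrative):

```python
import queue
import threading
import time
import uuid

tasks = {}                   # task_id -> status/result (a real store in production)
work_queue = queue.Queue()   # stand-in for the SQS queue

def submit_generation(recommendation):
    """API handler: enqueue the job and return a task ID immediately."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "pending", "result": None}
    work_queue.put((task_id, recommendation))
    return task_id  # well inside any API Gateway timeout

def worker():
    """Separate worker: pulls jobs off the queue and stores results."""
    while True:
        task_id, recommendation = work_queue.get()
        time.sleep(0.01)  # stands in for the slow code-generation call
        tasks[task_id] = {
            "status": "done",
            "result": f"cdk-project-for-{recommendation}.zip",
        }
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

task_id = submit_generation("rightsize-rds")  # returns instantly
work_queue.join()  # the real client polls GET /tasks/{task_id} instead
status = tasks[task_id]["status"]
```

In production the worker is its own Lambda triggered by SQS, so the two sides scale independently.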

Mistakes I Made

1. Underestimating Auth Complexity

I thought "just add Auth0" would take a day. It took a week.

Problems:

  • Session handling between SSR and client components
  • JWT verification on the backend
  • Logout not actually logging out (cached sessions)

Lesson: Auth is never "just add a library."
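To see why "JWT verification on the backend" hides edge cases, here is an illustrative stdlib sketch of the verification steps (split, check signature, check expiry). It uses HS256 with a shared secret purely for demonstration; Auth0 actually issues RS256 tokens verified against a JWKS endpoint, so do not use this in production:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"shared-secret"  # illustrative only; Auth0 uses RS256 public keys

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def b64url_decode(data: str) -> bytes:
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def make_token(payload: dict) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(SECRET, header + b"." + body, hashlib.sha256).digest())
    return b".".join([header, body, sig]).decode()

def verify(token: str):
    header, body, sig = token.split(".")
    expected = b64url(hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(expected.decode(), sig):
        return None  # tampered or wrongly signed token
    payload = json.loads(b64url_decode(body))
    if payload.get("exp", 0) < time.time():
        return None  # expired -- exactly the kind of case cached sessions hide
    return payload

token = make_token({"sub": "user-1", "exp": time.time() + 60})
claims = verify(token)
```

Every branch that returns `None` is a distinct failure mode your API has to handle, which is a big part of why "just add Auth0" took a week.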

2. Not Setting Up Environments Early

I built everything in staging first. When it was time for production, I had to:

  • Duplicate all CDK stacks
  • Create separate databases
  • Configure separate Auth0 applications
  • Debug environment variable issues for days

Should have done staging + production from day one. The marginal effort is tiny compared to retrofitting later.

3. Overthinking the Business Model

First idea: SaaS subscription ($99/month)
Second idea: One-time payment per scan
Third idea: Gainshare (percentage of savings)
Final: Gainshare + optional subscription

I spent too much time on pricing before having users. Should have just launched with something simple and iterated.

What Worked Well

1. Monorepo with Turborepo

Frontend, backend, infrastructure — all in one repo. One command to run everything locally. One PR for full-stack changes. No version mismatches between services.

2. Early User Testing

I scanned friends' AWS accounts before the UI was even done, and found significant waste across multiple accounts.

That validated the core value prop before I wrote a single line of frontend code. If the scanner hadn't found real savings, I would have known early, not after building the entire product.

3. Building the Hard Part First

Scanner + code generation were the riskiest parts. I built those first.

If the AI couldn't generate production-quality CDK code, the whole product wouldn't work. Better to find out early.

Architecture Overview

Request flow:

User → Route 53 (DNS) → splits into two paths:

Frontend path:

  • AWS Amplify (Next.js SSR) → Auth0 (authentication)

Backend path:

  • API Gateway (HTTP API) → Lambda (FastAPI) → RDS PostgreSQL + DynamoDB (cache)

Code generation flow:

  • Lambda → SQS queue → CDK Worker Lambda → stores result → user downloads ZIP

Everything serverless except the database. Two isolated environments (staging + production) with separate infrastructure.

Costs

Running this in production:

| Service | Monthly cost |
| --- | --- |
| RDS (db.t4g.micro) | ~$15 |
| Lambda | ~$5 |
| Amplify | ~$5 |
| API Gateway | ~$1 |
| DynamoDB | ~$1 |
| **Total** | **~$27** |

Serverless keeps costs near zero until you have real traffic. That's important when you're bootstrapping.

What's Next

Currently looking for beta testers. If you have an AWS account spending >$1K/month and want a free scan, sign up here.

In exchange, I just need 15 minutes of feedback on what works and what doesn't.


Key Takeaways

  1. Start with the risky parts — validate core tech before building UI
  2. Set up environments early — staging + production from day one
  3. Async processing for slow operations — don't fight API timeouts
  4. Auth is always harder than you think — budget extra time
  5. Test with real users early — even before the product is "ready"

Questions about the stack or decisions? Drop them in the comments — happy to go deeper on any part.
