I'm going to share something most founders don't: the actual technical journey of building a product from scratch.
No "we raised $10M and hired 50 engineers." Just one developer and a lot of coffee.
This is how I built CloudPruneAI - an AWS cost optimization tool that scans accounts and generates infrastructure-as-code to fix waste.
The Problem I Wanted to Solve
After years of managing AWS infrastructure, I kept seeing the same pattern:
- Company grows fast
- Engineers spin up resources "temporarily"
- Nobody cleans them up
- CFO asks "why is our AWS bill so high?"
- Everyone panics and manually audits for a week
The tools that existed either:
- Cost $45K+/year (CloudHealth, Cloudability)
- Only showed dashboards without actionable fixes
- Required a dedicated FinOps team to operate
I wanted something that a solo developer or small team could use: scan, see waste, get code to fix it. Done.
The Stack Decisions
Backend: FastAPI + Python
Why FastAPI?
- Async by default (critical when you're making dozens of AWS API calls per scan)
- Auto-generated OpenAPI docs (saves time during frontend integration)
- Type hints with Pydantic (catches bugs before they reach production)
- Easy to deploy on Lambda with Mangum (one handler, done)
Python was the natural choice because AWS SDKs, CDK, and most infrastructure tooling lives in the Python ecosystem.
Frontend: Next.js 14 + Material UI
Why Next.js 14?
- App Router is finally stable
- Server components reduce client bundle
- Easy deployment on AWS Amplify (SSR support)
Material UI gave me a professional-looking dashboard without spending weeks on design. For an MVP, speed matters more than pixel-perfect custom UI.
Infrastructure: AWS CDK
Why CDK over Terraform?
- Same language as backend (Python) — one less context switch
- Better AWS integration for the services I needed
- And honestly... I'm building a tool that generates CDK, so I should use it myself
Database: PostgreSQL on RDS
Simple choice. Relational data, need transactions, familiar with it. Used SQLAlchemy 2.0 with async support to keep everything non-blocking.
The Core: How the Scanner Works
The scanner is the heart of the product. At a high level:
- User connects their AWS account (read-only access via IAM role)
- The scanner runs multiple analyzers in parallel — each one specialized for a different AWS service
- Each analyzer calls AWS APIs, checks utilization metrics via CloudWatch, and identifies waste patterns
- Results are aggregated with estimated monthly savings per recommendation
The key insight was making each analyzer independent and atomic. They don't know about each other. This means I can add new ones without touching existing code, and a failure in one doesn't break the others.
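That independence maps naturally onto `asyncio.gather` with `return_exceptions=True`. A minimal sketch of the pattern — analyzer names and the simulated findings are illustrative, not the real analyzers:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Finding:
    service: str
    resource_id: str
    monthly_savings: float

# Each analyzer is an independent coroutine that knows nothing about the others.
async def analyze_ebs() -> list[Finding]:
    # A real version would list volumes via boto3 and flag unattached ones
    return [Finding("ebs", "vol-0abc", 8.0)]

async def analyze_ec2() -> list[Finding]:
    # A real version would pull CloudWatch CPU metrics per instance
    return [Finding("ec2", "i-0def", 35.0)]

async def analyze_broken() -> list[Finding]:
    raise RuntimeError("simulated analyzer failure")

async def run_scan() -> list[Finding]:
    analyzers = [analyze_ebs(), analyze_ec2(), analyze_broken()]
    # return_exceptions=True isolates failures: one broken analyzer
    # doesn't abort the whole scan.
    results = await asyncio.gather(*analyzers, return_exceptions=True)
    findings: list[Finding] = []
    for r in results:
        if isinstance(r, Exception):
            continue  # log and move on in the real scanner
        findings.extend(r)
    return findings
```

Adding a new analyzer is just appending another coroutine to the list — no existing code changes.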
The Differentiator: AI-Powered Code Generation
Most cost tools stop at "you're wasting money on X." CloudPruneAI goes further: for each recommendation, you can generate a CDK project that implements the fix.
I use Claude's API for code generation. The tricky part wasn't the AI — it was the infrastructure around it:
- Generation can take 10-30 seconds
- API Gateway has a 30-second timeout
- Users don't want to stare at a spinner
Solution: Async processing with a queue. User clicks "Generate," gets an immediate response with a task ID, and polls for completion. A separate worker processes the queue and stores results. When ready, the user downloads a ZIP with a complete CDK project.
This pattern (queue + worker + polling) solved the timeout issue cleanly and scales naturally.
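The pattern boils down to three small pieces. Here's a runnable local simulation — in production the dict would be DynamoDB or Postgres and the list an SQS queue; the function names are my own:

```python
import uuid
from enum import Enum

class Status(str, Enum):
    PENDING = "pending"
    DONE = "done"

# In-memory stand-ins for the task store and the queue
TASKS: dict[str, dict] = {}
QUEUE: list[str] = []

def submit_generation(recommendation_id: str) -> str:
    """Handler for the 'Generate' click: enqueue and return a task ID immediately."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": Status.PENDING, "recommendation": recommendation_id}
    QUEUE.append(task_id)
    return task_id

def poll(task_id: str) -> dict:
    """Polling endpoint: the client checks this until status is DONE."""
    return TASKS[task_id]

def worker_step() -> None:
    """One iteration of the queue worker (an SQS-triggered Lambda in production)."""
    if not QUEUE:
        return
    task_id = QUEUE.pop(0)
    task = TASKS[task_id]
    # The real worker calls the LLM, builds the CDK project, and uploads a ZIP to S3
    task["result_url"] = f"s3://generated-projects/{task_id}.zip"
    task["status"] = Status.DONE
```

Because the API handler only enqueues, it responds in milliseconds regardless of how long generation takes — the 30-second API Gateway limit never comes into play.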
Mistakes I Made
1. Underestimating Auth Complexity
I thought "just add Auth0" would take a day. It took a week.
Problems:
- Session handling between SSR and client components
- JWT verification on the backend
- Logout not actually logging out (cached sessions)
Lesson: Auth is never "just add a library."
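For the backend JWT verification piece, the core lesson was to validate everything explicitly. A simplified sketch with PyJWT — note that Auth0 actually uses RS256 with public keys fetched from its JWKS endpoint, so the shared secret and HS256 here are stand-ins to keep the example self-contained:

```python
import time
import jwt  # PyJWT

SECRET = "dev-only-secret"  # stand-in; Auth0 uses RS256 keys from its JWKS endpoint

def verify_token(token: str, audience: str = "https://api.example.com") -> dict:
    """Decode and validate a JWT; raises jwt.PyJWTError on any failure."""
    return jwt.decode(
        token,
        SECRET,
        algorithms=["HS256"],  # pin the algorithm — never trust the token header
        audience=audience,     # reject tokens issued for a different API
    )
```

Pinning `algorithms` and checking `audience` are the two validations that are easiest to forget and most dangerous to skip.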
2. Not Setting Up Environments Early
I built everything in staging first. When it was time for production, I had to:
- Duplicate all CDK stacks
- Create separate databases
- Configure separate Auth0 applications
- Debug environment variable issues for days
Should have done staging + production from day one. The marginal effort is tiny compared to retrofitting later.
3. Overthinking the Business Model
First idea: SaaS subscription ($99/month)
Second idea: One-time payment per scan
Third idea: Gainshare (percentage of savings)
Final: Gainshare + optional subscription
I spent too much time on pricing before having users. Should have just launched with something simple and iterated.
What Worked Well
1. Monorepo with Turborepo
Frontend, backend, infrastructure — all in one repo. One command to run everything locally. One PR for full-stack changes. No version mismatches between services.
2. Early User Testing
I scanned a friend's AWS account before the UI was even done. Found significant waste across multiple accounts.
That validated the core value prop before I wrote a single line of frontend code. If the scanner hadn't found real savings, I would have known early, not after building the entire product.
3. Building the Hard Part First
Scanner + code generation were the riskiest parts. I built those first.
If the AI couldn't generate production-quality CDK code, the whole product wouldn't work. Better to find out early.
Architecture Overview
Request flow:
User → Route 53 (DNS) → splits into two paths:
Frontend path:
- AWS Amplify (Next.js SSR) → Auth0 (authentication)
Backend path:
- API Gateway (HTTP API) → Lambda (FastAPI) → RDS PostgreSQL + DynamoDB (cache)
Code generation flow:
- Lambda → SQS queue → CDK Worker Lambda → stores result → user downloads ZIP
Everything serverless except the database. Two isolated environments (staging + production) with separate infrastructure.
Costs
Running this in production:
| Service | Monthly Cost |
|---|---|
| RDS (db.t4g.micro) | ~$15 |
| Lambda | ~$5 |
| Amplify | ~$5 |
| API Gateway | ~$1 |
| DynamoDB | ~$1 |
| Total | ~$27/month |
Serverless keeps costs near zero until you have real traffic. That's important when you're bootstrapping.
What's Next
Currently looking for beta testers. If you have an AWS account spending >$1K/month and want a free scan, sign up here.
In exchange, I just need 15 minutes of feedback on what works and what doesn't.
Key Takeaways
- Start with the risky parts — validate core tech before building UI
- Set up environments early — staging + production from day one
- Async processing for slow operations — don't fight API timeouts
- Auth is always harder than you think — budget extra time
- Test with real users early — even before the product is "ready"
Questions about the stack or decisions? Drop them in the comments — happy to go deeper on any part.