Three months ago, I was the co-founder burning through our seed round on infrastructure costs.
Today, we're serving 50K users on a VPS that costs less than a Netflix subscription.
This is the story of how over-engineering almost bankrupted us, and why sometimes the "best" solution is the one that actually ships.
The Beginning: When Everything Was "Scalable"
Year 1, we had 47 users.
Our infrastructure:
- Kubernetes cluster (3 nodes)
- Separate staging, dev, and prod environments
- PostgreSQL RDS (Multi-AZ)
- Redis ElastiCache
- S3 + CloudFront CDN
- Elasticsearch cluster
- DataDog monitoring
- WAF + Shield
Monthly AWS bill: $4,200
Monthly revenue: $0
I justified it with: "But when we scale..."
Narrator: We didn't scale.
The Wake-Up Call
Month 14: Our runway was shrinking faster than our user count was growing.
Co-founder sat me down:
"We have 6 months of runway left. Our AWS bill is 40% of our burn rate. We have 300 users. What are we doing?"
Me (defensively): "But the architecture is clean! We're prepared for scale! It's best practices!"
Him: "We're prepared for a problem we don't have. We're going to die 'prepared.'"
That hurt. Mainly because he was right.
The Audit That Changed Everything
I spent a weekend analyzing what we actually used:
What We Paid For:
Kubernetes Cluster: $2,100/month
- Running 3 nodes for... 2 services
- 90% idle capacity
- "But it auto-scales!" (to handle traffic we didn't have)
RDS PostgreSQL: $850/month
- Multi-AZ for "high availability"
- Serving ~100 queries/second
- Could run on a Raspberry Pi
Redis ElastiCache: $400/month
- Caching data for 300 users
- Hit rate: 23% (terrible)
- Most cache entries: never accessed
Elasticsearch: $600/month
- For "advanced search"
- 90% of searches: simple text match
- PostgreSQL full-text would've worked
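For the curious, here's roughly what that Postgres replacement looks like. A hedged sketch: the `posts` table and its columns are hypothetical, but `to_tsvector`, `plainto_tsquery`, and GIN indexes all ship with stock Postgres:

```sql
-- Hypothetical posts(title, body) table; full-text search is built into Postgres
CREATE INDEX idx_posts_search
    ON posts USING gin (to_tsvector('english', title || ' ' || body));

-- "Simple text match" queries, ranked by relevance
SELECT id, title
  FROM posts
 WHERE to_tsvector('english', title || ' ' || body)
       @@ plainto_tsquery('english', 'user query here')
 ORDER BY ts_rank(to_tsvector('english', title || ' ' || body),
                  plainto_tsquery('english', 'user query here')) DESC
 LIMIT 20;
```

Not Elasticsearch-grade relevance tuning, but for simple text matching it's one index away and $600/month cheaper.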
CloudFront CDN: $120/month
- Serving static assets to 300 users
- Mostly in one geographic region
Monitoring & Logs: $230/month
- Collecting metrics nobody looked at
- Log retention: 90 days (why?)
What We Actually Needed:
- A web server
- A database
- Some static file hosting
- Basic monitoring
Total waste: ~$4,000/month
The "Stupid" Solution
Against every instinct in my senior-engineer brain, I did something radical:
I migrated everything to a single $8/month Hetzner VPS.
# The entire migration script
# (I wish I was joking)
# 1. Provision VPS
ssh root@new-vps
# 2. Install essentials
apt update && apt install -y docker.io docker-compose postgresql nginx certbot python3-certbot-nginx
# 3. Dump production data
pg_dump $DATABASE_URL > dump.sql
# 4. Copy everything
scp -r ./app root@new-vps:/opt/
scp dump.sql root@new-vps:/tmp/
# 5. Restore DB
createdb app && psql app < /tmp/dump.sql
# 6. Start services
cd /opt/app && docker-compose up -d
# 7. Setup SSL
certbot --nginx -d yourdomain.com
# Done. Seriously.
The New Stack:
# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    restart: always
    environment:
      # "postgres" is the service name below, reachable on the compose network
      DATABASE_URL: postgresql://postgres:${DB_PASSWORD}@postgres:5432/app
    ports:
      - "3000:3000"

  postgres:
    image: postgres:15-alpine
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: app
      POSTGRES_PASSWORD: ${DB_PASSWORD}

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/ssl
    depends_on:
      - app

volumes:
  pgdata:
That's it. That's the whole infrastructure.
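One piece that compose file references but doesn't show is `nginx.conf`. A minimal reverse-proxy version might look like this; the domain, cert paths, and upstream port are assumptions chosen to match the setup above:

```nginx
events {}

http {
    server {
        listen 80;
        server_name yourdomain.com;
        # Redirect plain HTTP to HTTPS once certbot has issued a cert
        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl;
        server_name yourdomain.com;
        ssl_certificate     /etc/ssl/fullchain.pem;
        ssl_certificate_key /etc/ssl/privkey.pem;

        location / {
            # "app" resolves to the app service on the compose network
            proxy_pass http://app:3000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
```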
The Results (Prepare to Cringe)
Before (AWS):
- Monthly cost: $4,200
- Users: 300
- Avg response time: 180ms
- Deployment time: 12 minutes
- Downtime incidents: 3/month (Kubernetes "updates")
- Mental overhead: Constant
- Infrastructure code: 3,000+ lines of Terraform
After (VPS):
- Monthly cost: $8
- Users: 50,000 (yes, we actually grew)
- Avg response time: 120ms (faster!)
- Deployment time: 30 seconds
- Downtime incidents: 0 in 3 months
- Mental overhead: Nearly zero
- Infrastructure code: 40 lines of docker-compose
We're 525x cheaper and somehow performing better.
What I Learned (The Hard Way)
1. "Best Practices" Are Contextual
// What I thought engineering was:
if (buildsCharacter) {
  overEngineer();
  prepareForScale();
  addMoreMicroservices();
}

// What engineering actually is:
if (solvesUserProblem && shipsQuickly) {
  useSimplestSolution();
  iterate();
}
Best practices for Google ≠ best practices for your 300-user startup
2. Premature Scaling Is Real
We spent 14 months preparing for problems we never had:
- "What if we go viral?" (We didn't)
- "What if we need to scale horizontally?" (We don't)
- "What if one region goes down?" (Our users are 90% in one city)
- "What if we need advanced search?" (Text search works fine)
You know what actually happened?
We almost ran out of money before finding product-market fit.
3. Complexity Is a Tax
Every layer of abstraction costs:
# Cognitive load of our old setup
def deploy_feature():
    update_terraform()       # 30 min
    plan_and_apply()         # 15 min
    wait_for_k8s_rollout()   # 10 min
    check_5_dashboards()     # 10 min
    pray_nothing_broke()     # 5 min
    # Total: 70 min per deploy

# Cognitive load now
def deploy_feature():
    git push origin main
    # Done. 30 seconds.
The best code is the code you don't write.
The best infrastructure is the infrastructure you don't manage.
4. Boring Technology Wins
New stack technologies:
- PostgreSQL (released 1996)
- Nginx (released 2004)
- Docker (released 2013)
Why?
- Battle-tested
- Abundant documentation
- Easy to debug
- Won't randomly break on updates
My Kubernetes cluster was so cutting-edge it cut me constantly.
5. Your Users Don't Care About Your Stack
Real conversation with a user:
User: "I love the new feature!"
Me: "Thanks! Also, we migrated off Kubernetes..."
User: "What's Kubernetes?"
Me: "Never mind."
Your architecture is your problem, not your users' problem.
When You SHOULD Use Complex Infrastructure
I'm not saying never use AWS/Kubernetes. Use them when:
- You have millions of users
- You have actual scaling problems
- You have a dedicated DevOps team
- You've raised enough money that $10K/month doesn't matter
- You have compliance requirements that demand it
We had none of these.
The Hetzner Reality Check
Common objection: "But a VPS doesn't scale!"
Reality check:
- Single VPS can handle 10K+ concurrent users
- You can vertically scale to massive specs
- When you outgrow one box, you probably have money for proper infrastructure
- Spoiler: You'll probably never outgrow it
Our current VPS specs:
- 8GB RAM (using 3GB)
- 4 vCPUs (averaging 15% usage)
- 160GB SSD (using 12GB)
We could 10x our users before needing an upgrade.
The Mental Health Impact
Unexpected benefit: I sleep better.
Old life:
3 AM: *phone buzzes*
PagerDuty: "Kubernetes node went unhealthy"
Me: *laptop opens* *troubleshoots for 2 hours*
Me: "It was a random AWS network blip"
New life:
3 AM: *silence*
Me: *sleeping*
Nginx: *happily serving requests*
Uptime is better. I'm happier. Users don't notice.
The Migration Guide
If you're in the same boat, here's how to escape:
Week 1: Audit
# List everything running
kubectl get all --all-namespaces
# Check actual resource usage
kubectl top nodes
kubectl top pods
# Review AWS bill line by line
# Ask: "Do we need this right now?"
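To make the bill review concrete, here's a sketch of the exercise. The CSV format and numbers are illustrative (a real Cost Explorer export has more columns), but the idea, rank line items and question the top of the list, is the whole audit:

```shell
# Illustrative export: service,monthly_cost (replace with your real bill data)
cat > bill.csv <<'EOF'
EKS,2100
RDS,850
Elasticsearch,600
ElastiCache,400
Monitoring,230
CloudFront,120
EOF

# Rank line items by cost, highest first
sort -t, -k2 -rn bill.csv

# Total monthly spend
total=$(awk -F, '{ s += $2 } END { print s }' bill.csv)
echo "Total: \$${total}/month"
```

For each line at the top of that list, ask the question above: "Do we need this right now?"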
Week 2: Simplify
- Consolidate microservices into a monolith (yes, really)
- Replace managed services with self-hosted (PostgreSQL instead of RDS)
- Remove unused features (that "advanced analytics" nobody looks at)
Week 3: Migrate
- Provision a VPS (Hetzner, DigitalOcean, Linode: all work)
- Docker-compose your app
- Test thoroughly in staging
- Migrate data (pg_dump is your friend)
- Update DNS
- Monitor closely for a week
Week 4: Celebrate
- Cancel AWS account
- Take your team to dinner with the money you're saving
- Ship features instead of managing infrastructure
The Actual Code
Want to see our entire production setup?
# Our complete production infrastructure
# (This is not a joke)
version: '3.8'

services:
  app:
    image: ghcr.io/yourcompany/app:latest
    restart: unless-stopped
    environment:
      DATABASE_URL: ${DATABASE_URL}
      SECRET_KEY: ${SECRET_KEY}
    ports:
      - "3000:3000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 512M

  db:
    image: postgres:15-alpine
    restart: unless-stopped
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: ${DB_NAME}
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  nginx:
    image: nginx:alpine
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
      - ./static:/usr/share/nginx/html:ro
    depends_on:
      - app

  backup:
    image: postgres:15-alpine
    restart: unless-stopped
    depends_on:
      - db
    volumes:
      - ./backups:/backups
    environment:
      # pg_dump connects to db over the network, so it needs credentials
      PGPASSWORD: ${DB_PASSWORD}
    entrypoint: |
      sh -c 'while true; do
        pg_dump -h db -U ${DB_USER} ${DB_NAME} > /backups/backup_$$(date +%Y%m%d_%H%M%S).sql
        find /backups -name "backup_*.sql" -mtime +7 -delete
        sleep 86400
      done'

volumes:
  postgres_data:
Deployment script:
#!/bin/bash
# deploy.sh
set -e

echo "Deploying..."

# Pull the latest image
docker-compose pull

# Restart only the app container
docker-compose up -d --no-deps app

# Health check
sleep 5
curl -f http://localhost:3000/health || exit 1

echo "Deployed successfully"
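One tweak worth considering: a fixed `sleep 5` is a guess at startup time. A small retry loop is more forgiving on slow cold starts. A sketch, where the `/health` endpoint is the same assumption the script makes:

```shell
# Retry the health check instead of guessing at startup time.
wait_for_health() {
  local url=$1 tries=${2:-10}
  local i
  for i in $(seq 1 "$tries"); do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 2
  done
  echo "service never became healthy" >&2
  return 1
}

# Usage in deploy.sh:
# wait_for_health http://localhost:3000/health || exit 1
```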
That's it. 50K users. $8/month.
Common Questions
"What about backups?"
# Automated daily backups
pg_dump $DATABASE_URL > backup_$(date +%Y%m%d).sql

# Upload to S3 (costs ~$0.50/month for retention)
aws s3 cp backup_$(date +%Y%m%d).sql s3://backups/

# Sleep well
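The retention side of backups is the same `find -mtime` trick the compose backup service uses: drop dumps older than 7 days. A self-contained sketch with simulated files so it runs anywhere (paths and filenames are illustrative):

```shell
# Simulated backup directory (illustrative path)
BACKUP_DIR=./backups
mkdir -p "$BACKUP_DIR"

# One fresh dump and one 10-day-old dump
touch "$BACKUP_DIR/backup_new.sql"
touch -d "10 days ago" "$BACKUP_DIR/backup_old.sql"

# Retention: delete dumps last modified more than 7 days ago
find "$BACKUP_DIR" -name "backup_*.sql" -mtime +7 -delete

ls "$BACKUP_DIR"
```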
"What about monitoring?"
# Free tier of Uptime Robot
# Pings every 5 minutes
# Alerts if down
# Total cost: $0
# Or self-hosted Prometheus + Grafana
# Adds ~500MB RAM usage
# Still fits in $8 VPS
"What if the VPS goes down?"
Reality: In 3 months, our VPS has had 0 minutes of downtime.
Our old Kubernetes cluster? Had 3 outages from:
- AWS zone maintenance
- Helm chart breaking changes
- Me pushing a bad config at 2 AM
The VPS is more reliable.
"What about CI/CD?"
# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Deploy
        run: |
          ssh deploy@vps 'cd /opt/app && git pull && docker-compose up -d --build'
Cost: $0 (GitHub Actions free tier)
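One hardening note if you wire CI up this way: on the VPS, the deploy key can be locked to that single command with an OpenSSH forced command, so a leaked key can't open an interactive shell. An illustrative `authorized_keys` entry (key and paths are placeholders):

```
# ~/.ssh/authorized_keys for the "deploy" user on the VPS (illustrative)
command="cd /opt/app && git pull && docker-compose up -d --build",no-port-forwarding,no-pty ssh-ed25519 AAAA...your-ci-public-key... github-actions
```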
The Controversial Take
You're probably over-engineering too.
Signs you're me-three-months-ago:
- Your staging environment costs more than your SaaS subscription
- You can't deploy without anxiety
- Your infra code is longer than your app code
- You're preparing for scale you don't have
- You spend more time on DevOps than features
- Your AWS bill makes you sad
Solution: Simplify. Radically.
What We're Doing With the Savings
The $4,200/month we're saving:
- Extended runway by 10 months
- Hired another developer
- Spent on actual growth (ads, SEO)
- Reduced founder stress
Trade-off: We have a "less impressive" tech stack
Reality: Nobody cares. Users are happier. We're shipping faster.
The Lesson
Perfect is the enemy of shipped.
Scalable is the enemy of survival.
Complex is the enemy of actually building something.
Three months ago, I was the smartest engineer with the most sophisticated architecture and no users.
Today, I'm running a "basic" setup serving 50K users who love our product.
I'll take "basic but alive" over "sophisticated but dead" every time.
Your Move
If you're burning money on infrastructure you don't need:
- Audit your bill TODAY
- Question every line item
- Ask: "Do we need this right now?"
- Be ruthlessly honest
- Simplify
Your startup's life might depend on it.
Ours did.
Quick poll: What's your monthly infrastructure bill, and how many users do you have? Drop it in the comments; I'm curious if I was the only one over-engineering.
Also, if you're considering a similar migration and want to chat through the details, my DMs are open. I've helped 3 other founders do this already.
Follow me for more "things I learned the hard way so you don't have to."
Tech Stack
- VPS: Hetzner CX21 ($8/month)
- OS: Ubuntu 22.04 LTS
- Reverse Proxy: Nginx
- App: Node.js (could be anything)
- Database: PostgreSQL 15
- Monitoring: Uptime Robot (free)
- Backups: S3 ($0.50/month)
- SSL: Let's Encrypt (free)
Total monthly cost: $8.50
Users served: 50,000+
Downtime: 0 minutes in 90 days
Engineer happiness: up 1000%
Previously: Built over-engineered systems at BigTech. Now: Building scrappy profitable startups. The irony is not lost on me.