Sherin Joseph Roy

I Spent $127K on AWS Before Realizing We Could Run on $8/Month

Three months ago, I was the co-founder burning through our seed round on infrastructure costs.

Today, we're serving 50K users on a VPS that costs less than a Netflix subscription.

This is the story of how over-engineering almost bankrupted us, and why sometimes the "best" solution is the one that actually ships.

The Beginning: When Everything Was "Scalable"

Year 1, we had 47 users.

Our infrastructure:

  • βš™οΈ Kubernetes cluster (3 nodes)
  • πŸ“Š Separate staging, dev, and prod environments
  • πŸ—„οΈ PostgreSQL RDS (Multi-AZ)
  • πŸš€ Redis ElastiCache
  • πŸ“¦ S3 + CloudFront CDN
  • πŸ” Elasticsearch cluster
  • πŸ“ˆ DataDog monitoring
  • πŸ›‘οΈ WAF + Shield

Monthly AWS bill: $4,200

Monthly revenue: $0

I justified it with: "But when we scale..."

Narrator: We didn't scale.

The Wake-Up Call

Month 14: Our runway was shrinking faster than our user count was growing.

Co-founder sat me down:

"We have 6 months of runway left. Our AWS bill is 40% of our burn rate. We have 300 users. What are we doing?"

Me (defensively): "But the architecture is clean! We're prepared for scale! It's best practices!"

Him: "We're prepared for a problem we don't have. We're going to die 'prepared.'"

That hurt. Mainly because he was right.

The Audit That Changed Everything

I spent a weekend analyzing what we actually used:

What We Paid For:

Kubernetes Cluster: $2,100/month
  - Running 3 nodes for... 2 services
  - 90% idle capacity
  - "But it auto-scales!" (to handle traffic we didn't have)

RDS PostgreSQL: $850/month
  - Multi-AZ for "high availability"
  - Serving ~100 queries/second
  - Could run on a Raspberry Pi

Redis ElastiCache: $400/month
  - Caching data for 300 users
  - Hit rate: 23% (terrible)
  - Most cache entries: never accessed

Elasticsearch: $600/month
  - For "advanced search"
  - 90% of searches: simple text match
  - PostgreSQL full-text would've worked

CloudFront CDN: $120/month
  - Serving static assets to 300 users
  - Mostly in one geographic region

Monitoring & Logs: $230/month
  - Collecting metrics nobody looked at
  - Log retention: 90 days (why?)
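For reference, this is the kind of Postgres full-text query that could have replaced Elasticsearch for those simple text matches. A sketch against a hypothetical `posts` table (the heredoc just writes the SQL to a file):

```shell
# Write the full-text search query we could have used instead of Elasticsearch.
# "posts" and the search terms are hypothetical; adapt to your schema.
cat > search.sql <<'EOF'
-- Simple ranked text search, no extra infrastructure required
SELECT id, title
FROM posts
WHERE to_tsvector('english', title || ' ' || body)
      @@ plainto_tsquery('english', 'vps migration')
ORDER BY ts_rank(to_tsvector('english', title || ' ' || body),
                 plainto_tsquery('english', 'vps migration')) DESC
LIMIT 20;
EOF
echo "run with: psql \$DATABASE_URL -f search.sql"
```

In production you'd precompute the `tsvector` into a generated column with a GIN index, but even the naive version covers "90% of searches: simple text match."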

What We Actually Needed:

  • A web server
  • A database
  • Some static file hosting
  • Basic monitoring

Total waste: ~$4,000/month

The "Stupid" Solution

Against every instinct in my senior-engineer brain, I did something radical:

I migrated everything to a single $8/month Hetzner VPS.

# The entire migration script
# (I wish I was joking)

# 1. Provision VPS and log in
ssh root@new-vps

# 2. Install essentials (on the VPS; the Ubuntu package is docker.io, not docker)
apt update && apt install -y docker.io docker-compose postgresql nginx certbot python3-certbot-nginx

# 3. Dump production data (from the old environment)
pg_dump $DATABASE_URL > dump.sql

# 4. Copy everything over
scp -r ./app root@new-vps:/opt/
scp dump.sql root@new-vps:/tmp/

# 5. Restore the DB (on the VPS; create the database first)
sudo -u postgres createdb app
sudo -u postgres psql app < /tmp/dump.sql

# 6. Start services
cd /opt/app && docker-compose up -d

# 7. Set up SSL
certbot --nginx -d yourdomain.com

# Done. Seriously.
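One thing the script above skips: basic hardening. Before pointing DNS at a fresh VPS, I'd at least lock down the firewall and disable password SSH. A sketch that writes a `harden.sh` script (assumes Ubuntu with ufw; review it before running as root, and make sure your SSH key works first):

```shell
# Write a minimal first-boot hardening script (not executed here).
# Assumes Ubuntu with ufw and key-based SSH already working.
cat > harden.sh <<'EOF'
#!/bin/sh
set -e
# Only SSH, HTTP, and HTTPS reach the box
ufw default deny incoming
ufw default allow outgoing
ufw allow OpenSSH
ufw allow 80/tcp
ufw allow 443/tcp
ufw --force enable
# Keys only; no password logins
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl reload ssh
EOF
chmod +x harden.sh
```

Ten lines of hardening buys you most of what the old WAF + Shield line item was billing for at this scale.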

The New Stack:

# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    restart: always
    environment:
      # "postgres" is the compose service name below; localhost won't reach it
      DATABASE_URL: postgresql://postgres:${DB_PASSWORD}@postgres:5432/app
    depends_on:
      - postgres
    ports:
      - "3000:3000"

  postgres:
    image: postgres:15-alpine
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: app
      POSTGRES_PASSWORD: ${DB_PASSWORD}

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/ssl
    depends_on:
      - app

volumes:
  pgdata:
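The compose file mounts an `./nginx.conf` that isn't shown. A minimal version that just proxies to the app container might look like this (domain is a placeholder; the heredoc writes the config file):

```shell
# Write a minimal nginx.conf matching the compose file above.
# "app" resolves to the app container on the compose network;
# yourdomain.com is a placeholder.
cat > nginx.conf <<'EOF'
events {}
http {
  server {
    listen 80;
    server_name yourdomain.com;
    location / {
      proxy_pass http://app:3000;
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
  }
}
EOF
```

Certbot's nginx plugin bolts the 443 server block on top of this when you run it.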

That's it. That's the whole infrastructure.

The Results (Prepare to Cringe)

Before (AWS):

  • 💰 Monthly cost: $4,200
  • 👥 Users: 300
  • ⚡ Avg response time: 180ms
  • 🔥 Deployment time: 12 minutes
  • 😰 Downtime incidents: 3/month (Kubernetes "updates")
  • 🧠 Mental overhead: Constant
  • 📚 Infrastructure code: 3,000+ lines of Terraform

After (VPS):

  • 💰 Monthly cost: $8
  • 👥 Users: 50,000 (yes, we actually grew)
  • ⚡ Avg response time: 120ms (faster!)
  • 🔥 Deployment time: 30 seconds
  • 😰 Downtime incidents: 0 in 3 months
  • 🧠 Mental overhead: Nearly zero
  • 📚 Infrastructure code: 40 lines of docker-compose

We're 525x cheaper and somehow performing better.

What I Learned (The Hard Way)

1. "Best Practices" Are Contextual

// What I thought engineering was:
if (buildsCharacter) {
  overEngineer();
  prepareForScale();
  addMoreMicroservices();
}

// What engineering actually is:
if (solvesUserProblem && shipsQuickly) {
  useSimplestSolution();
  iterate();
}

Best practices for Google ≠ best practices for your 300-user startup.

2. Premature Scaling Is Real

We spent 14 months preparing for problems we never had:

  • ❌ "What if we go viral?" (We didn't)
  • ❌ "What if we need to scale horizontally?" (We don't)
  • ❌ "What if one region goes down?" (Our users are 90% in one city)
  • ❌ "What if we need advanced search?" (Text search works fine)

You know what actually happened?

We almost ran out of money before finding product-market fit.

3. Complexity Is a Tax

Every layer of abstraction costs:

# Cognitive load of our old setup
def deploy_feature():
    update_terraform()      # 30 min
    plan_and_apply()        # 15 min
    wait_for_k8s_rollout()  # 10 min
    check_5_dashboards()    # 10 min
    pray_nothing_broke()    # 5 min
    # Total: 70 min per deploy

# Cognitive load now
def deploy_feature():
    os.system("git push origin main")
    # Done. 30 seconds.

The best code is the code you don't write.

The best infrastructure is the infrastructure you don't manage.

4. Boring Technology Wins

New stack technologies:

  • PostgreSQL (released 1996)
  • Nginx (released 2004)
  • Docker (released 2013)

Why?

  • ✅ Battle-tested
  • ✅ Abundant documentation
  • ✅ Easy to debug
  • ✅ Won't randomly break on updates

My Kubernetes cluster was so cutting-edge it cut me constantly.

5. Your Users Don't Care About Your Stack

Real conversation with a user:

User: "I love the new feature!"

Me: "Thanks! Also, we migrated off Kubernetes..."

User: "What's Kubernetes?"

Me: "Never mind."

Your architecture is your problem, not your users' problem.

When You SHOULD Use Complex Infrastructure

I'm not saying never use AWS/Kubernetes. Use them when:

✅ You have millions of users
✅ You have actual scaling problems
✅ You have a dedicated DevOps team
✅ You've raised enough money that $10K/month doesn't matter
✅ You have compliance requirements that demand it

We had none of these.

The Hetzner Reality Check

Common objection: "But a VPS doesn't scale!"

Reality check:

  • Single VPS can handle 10K+ concurrent users
  • You can vertically scale to massive specs
  • When you outgrow one box, you probably have money for proper infrastructure
  • Spoiler: You'll probably never outgrow it

Our current VPS specs:

  • 8GB RAM (using 3GB)
  • 4 vCPUs (averaging 15% usage)
  • 160GB SSD (using 12GB)

We could 10x our users before needing an upgrade.
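Back-of-envelope on that claim, with assumed numbers (roughly 5% of users online at any moment, each making a request every ~10 seconds):

```shell
# Rough capacity math; ACTIVE_PCT and REQ_INTERVAL are assumptions,
# not measurements from our actual traffic
USERS=50000
ACTIVE_PCT=5       # assume ~5% of users online at once
REQ_INTERVAL=10    # assume one request per active user every ~10s
CONCURRENT=$((USERS * ACTIVE_PCT / 100))
RPS=$((CONCURRENT / REQ_INTERVAL))
echo "~$CONCURRENT concurrent users, ~$RPS req/s" | tee capacity.txt
# prints: ~2500 concurrent users, ~250 req/s
```

A few hundred requests per second is nothing for nginx plus one app process on a 4-vCPU box, which is exactly what the 15% CPU average shows.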

The Mental Health Impact

Unexpected benefit: I sleep better.

Old life:

3 AM: *phone buzzes*
PagerDuty: "Kubernetes node went unhealthy"
Me: *laptop opens* *troubleshoots for 2 hours*
Me: "It was a random AWS network blip"

New life:

3 AM: *silence*
Me: *sleeping*
Nginx: *happily serving requests*

Uptime is better. I'm happier. Users don't notice.

The Migration Guide

If you're in the same boat, here's how to escape:

Week 1: Audit

# List everything running
kubectl get all --all-namespaces

# Check actual resource usage
kubectl top nodes
kubectl top pods

# Review AWS bill line by line
# Ask: "Do we need this right now?"

Week 2: Simplify

  • Consolidate microservices into a monolith (yes, really)
  • Replace managed services with self-hosted (PostgreSQL instead of RDS)
  • Remove unused features (that "advanced analytics" nobody looks at)

Week 3: Migrate

  • Provision a VPS (Hetzner, DigitalOcean, Linode: all work)
  • Docker-compose your app
  • Test thoroughly in staging
  • Migrate data (pg_dump is your friend)
  • Update DNS
  • Monitor closely for a week

Week 4: Celebrate

  • Cancel AWS account
  • Take your team to dinner with the money you're saving
  • Ship features instead of managing infrastructure

The Actual Code

Want to see our entire production setup?

# Our complete production infrastructure
# (This is not a joke)

version: '3.8'

services:
  app:
    image: ghcr.io/yourcompany/app:latest
    restart: unless-stopped
    environment:
      DATABASE_URL: ${DATABASE_URL}
      SECRET_KEY: ${SECRET_KEY}
    ports:
      - "3000:3000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 512M

  db:
    image: postgres:15-alpine
    restart: unless-stopped
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./backups:/backups
    environment:
      POSTGRES_DB: ${DB_NAME}
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  nginx:
    image: nginx:alpine
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
      - ./static:/usr/share/nginx/html:ro
    depends_on:
      - app

  backup:
    image: postgres:15-alpine
    restart: unless-stopped
    environment:
      # pg_dump connects over the network, so it needs the password
      PGPASSWORD: ${DB_PASSWORD}
    volumes:
      - ./backups:/backups
    # Daily dump, keep 7 days. Alpine images ship sh, not bash.
    entrypoint: |
      sh -c 'while true; do
        pg_dump -h db -U ${DB_USER} ${DB_NAME} > /backups/backup_$$(date +%Y%m%d_%H%M%S).sql
        find /backups -name "backup_*.sql" -mtime +7 -delete
        sleep 86400
      done'

volumes:
  postgres_data:

Deployment script:

#!/bin/bash
# deploy.sh

set -e

echo "🚀 Deploying..."

# Pull the latest image
docker-compose pull

# Recreate the app container (a few seconds of blip at most;
# not truly zero-downtime, and nobody has noticed)
docker-compose up -d --no-deps app

# Health check
sleep 5
curl -f http://localhost:3000/health || exit 1

echo "✅ Deployed successfully"

That's it. 50K users. $8/month.

Common Questions

"What about backups?"

# Automated daily backups
FILE=backup_$(date +%Y%m%d).sql
pg_dump $DATABASE_URL > $FILE

# Upload to S3 (~$0.50/month for retention)
# aws s3 cp doesn't expand globs; give it the exact filename
aws s3 cp $FILE s3://backups/

# Sleep well
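A backup you've never restored is a hope, not a backup. At minimum, sanity-check that the latest dump exists and isn't empty. A sketch (it creates a stand-in dump so the check has something to inspect; in real life, point it at your backups directory and actually restore into a scratch database now and then):

```shell
# Stand-in dump so this demo check has something to look at;
# in production the backup job produces these files
mkdir -p backups
printf 'CREATE TABLE users (id int);\n' > backups/backup_demo.sql

# The check itself: newest dump must exist and be non-empty
LATEST=$(ls -t backups/*.sql | head -n 1)
[ -s "$LATEST" ] || { echo "backup missing or empty"; exit 1; }
echo "latest backup: $LATEST ($(wc -c < "$LATEST") bytes)"
```

Cron this alongside the dump and alert when it fails; silent backup rot is how "we had backups" stories end badly.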

"What about monitoring?"

# Free tier of Uptime Robot
# Pings every 5 minutes
# Alerts if down
# Total cost: $0

# Or self-hosted Prometheus + Grafana
# Adds ~500MB RAM usage
# Still fits in the $8 VPS
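If you want alerting with no third party at all, a cron-able curl check is enough. A sketch that writes the script (the webhook URL is a placeholder for whatever alert channel you use):

```shell
# Write a minimal health-check script to run from cron.
# ALERT_URL is a placeholder; point it at Slack, ntfy, etc.
cat > healthcheck.sh <<'EOF'
#!/bin/sh
URL="http://localhost:3000/health"
ALERT_URL="https://example.com/your-webhook"
if ! curl -fsS --max-time 10 "$URL" > /dev/null; then
  curl -fsS -X POST -d "app health check failed" "$ALERT_URL" || true
fi
EOF
chmod +x healthcheck.sh

# crontab entry, every 5 minutes:
# */5 * * * * /opt/app/healthcheck.sh
```

This obviously can't tell you the whole box is down (it runs on the box), which is why the external Uptime Robot ping stays.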

"What if the VPS goes down?"

Reality: In 3 months, our VPS has had 0 minutes of downtime.

Our old Kubernetes cluster? Had 3 outages from:

  • AWS zone maintenance
  • Helm chart breaking changes
  • Me pushing a bad config at 2 AM

The VPS is more reliable.

"What about CI/CD?"

# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # A deploy key lives in repo secrets; without this step,
      # the ssh call below has no way to authenticate
      - name: Set up SSH
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.DEPLOY_KEY }}" > ~/.ssh/id_ed25519
          chmod 600 ~/.ssh/id_ed25519
          ssh-keyscan vps >> ~/.ssh/known_hosts

      - name: Deploy
        run: |
          ssh deploy@vps 'cd /opt/app && git pull && docker-compose up -d --build'

Cost: $0 (GitHub Actions free tier)

The Controversial Take

You're probably over-engineering too.

Signs you're me-three-months-ago:

  • ❌ Your staging environment costs more than your SaaS subscription
  • ❌ You can't deploy without anxiety
  • ❌ Your infra code is longer than your app code
  • ❌ You're preparing for scale you don't have
  • ❌ You spend more time on DevOps than features
  • ❌ Your AWS bill makes you sad

Solution: Simplify. Radically.

What We're Doing With the Savings

The $4,200/month we're saving:

  • 💰 Extended runway by 10 months
  • 🎯 Hired another developer
  • 📈 Spent on actual growth (ads, SEO)
  • 🧘‍♂️ Reduced founder stress

Trade-off: We have a "less impressive" tech stack.

Reality: Nobody cares. Users are happier. We're shipping faster.

The Lesson

Perfect is the enemy of shipped.

Scalable is the enemy of survival.

Complex is the enemy of actually building something.

Three months ago, I was the smartest engineer with the most sophisticated architecture and no users.

Today, I'm running a "basic" setup serving 50K users who love our product.

I'll take "basic but alive" over "sophisticated but dead" every time.

Your Move

If you're burning money on infrastructure you don't need:

  1. Audit your bill TODAY
  2. Question every line item
  3. Ask: "Do we need this right now?"
  4. Be ruthlessly honest
  5. Simplify

Your startup's life might depend on it.

Ours did.


Quick poll: What's your monthly infrastructure bill, and how many users do you have? Drop it in the comments; I'm curious if I was the only one over-engineering.

Also, if you're considering a similar migration and want to chat through the details, my DMs are open. I've helped 3 other founders do this already.

Follow me for more "things I learned the hard way so you don't have to."


Tech Stack

  • VPS: Hetzner CX21 ($8/month)
  • OS: Ubuntu 22.04 LTS
  • Reverse Proxy: Nginx
  • App: Node.js (could be anything)
  • Database: PostgreSQL 15
  • Monitoring: Uptime Robot (free)
  • Backups: S3 ($0.50/month)
  • SSL: Let's Encrypt (free)

Total monthly cost: $8.50

Users served: 50,000+

Downtime: 0 minutes in 90 days

Engineer happiness: ↑ 1000%


Previously: Built over-engineered systems at BigTech. Now: Building scrappy profitable startups. The irony is not lost on me.
